I am trying to develop a machine learning classifier that can automatically detect fast or slow speech given the output of a speech-to-text engine.
I have created a dataset of automatic speech recognition (ASR) transcriptions in which each word carries the following data:
{"word":"the", "start":0.0, "end":0.50, "pro":"dh ih s", "class":"O"}
Assume time is in seconds, "pro" is the phonetic transcription of the word (I used my own grapheme-to-phoneme converter to get that), and "class" is the label I annotated. There are three possible classes for words:
C = [SLOW, O, FAST]
"O" is basically a word that's normal in terms of speed. SLOW and FAST are exactly what they seem.
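To make the question concrete, here is the kind of feature that can be derived from these records alone, without touching the audio. This is only a minimal sketch: the function name `word_rate_features`, the `window` parameter, and the feature names are made up for illustration, and it assumes "pro" is a space-separated phone string as in the example record.

```python
from statistics import mean, pstdev


def word_rate_features(words, window=2):
    """Derive simple speaking-rate features from ASR word records.

    `words` is a list of dicts with "start", "end" (seconds) and "pro"
    (space-separated phones). `window` is the number of neighbouring
    words on each side used as local context (an illustrative choice).
    """
    # Seconds per phone for every word; 0.0 if the phone string is empty.
    rates = []
    for w in words:
        duration = w["end"] - w["start"]
        n_phones = len(w["pro"].split())
        rates.append(duration / n_phones if n_phones else 0.0)

    feats = []
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        context = rates[lo:hi]  # includes the word itself
        mu = mean(context)
        sigma = pstdev(context)
        feats.append({
            "duration": w["end"] - w["start"],
            "n_phones": len(w["pro"].split()),
            "sec_per_phone": rates[i],
            # Rate relative to the local average: > 1 means slower than
            # the neighbours, < 1 means faster.
            "rel_rate": rates[i] / mu if mu > 0 else 1.0,
            "z_rate": (rates[i] - mu) / sigma if sigma > 0 else 0.0,
        })
    return feats
```

Normalizing against the neighbouring words is a guess at what might matter here: a speaker who is fast overall would otherwise have every word look FAST.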
I have roughly 15,000 words annotated for speed. About half are O, and the other half is either SLOW or FAST.
I ran a baseline conditional random field (CRF) sequence classifier (trained and tested on the data, no cross-validation yet) that has an overall F1 of about 83%. SLOW words have a recall of only about 66%, whereas FAST has good recall and precision (both around 80%). O also has a good F1, at around 86%.
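For reference, the per-class numbers above are the usual precision, recall, and F1 computed per label. A minimal sketch of that computation, assuming gold and predicted labels are available as two equal-length lists (the function name `per_class_scores` is hypothetical, not part of any toolkit):

```python
from collections import Counter


def per_class_scores(gold, pred, labels=("SLOW", "O", "FAST")):
    """Return precision, recall and F1 for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed a true g

    scores = {}
    for c in labels:
        n_pred = tp[c] + fp[c]
        n_gold = tp[c] + fn[c]
        precision = tp[c] / n_pred if n_pred else 0.0
        recall = tp[c] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Since the model was scored on data it was trained on, these figures are probably optimistic compared with what held-out data would show.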
I haven't done any true evaluation yet because I have no idea what my features should be. I am not a speech scientist or a signal processing engineer. My hope is to not have to use actual features from the recorded audio, just the ASR output.
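When I do get to a proper evaluation, I assume the folds should be contiguous blocks of words rather than random samples, because a sequence model depends on neighbouring words. A rough sketch of such a split, with `contiguous_folds` and `k` being illustrative names rather than anything from a library:

```python
def contiguous_folds(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Each test fold is an unbroken block of positions, so word order
    inside a fold is preserved for sequence models.
    """
    # Fold boundaries, spread as evenly as possible over the items.
    bounds = [round(i * n_items / k) for i in range(k + 1)]
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n_items))
        yield train, test
```

If the 15,000 words come from separate recordings, splitting on recording boundaries instead would likely be cleaner, so that no recording ends up in both train and test.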
Given this information, what tools or algorithms can I use to help me discover features automatically?
Should I be using a conditional random field? Are there better approaches to this problem?
Thank you.
[Machine Learning] Finding fast and slow speech automatically