Speech recognition of interviews & videos

Jonathan Tregear recently posted some comments/questions on speech recognition of interviews (in response to a brief discussion I had with Scoble in my Channel 9 interview a couple of months back). I looped in Frank Seide from Microsoft Research Asia to answer these, since he's done a lot of work in this field.

Jonathan:

Late in your interview with Robert Scoble, he asked you about the possibility of using speech recognition to produce transcripts of his interviews, for example. Your answer was that the results he would get would not be very good unless the speech engine was trained on each of the speakers' voices.

Frank:

This is true in our experience. In interviews, several problems come together. The first is the spontaneous speech style: speech is irregular, with interspersed self-corrections, stutters, "uhms", and so on. This can throw off a recognizer. Second, recognizers always have a predefined vocabulary. For generic interviews, topics can be very broad, and important words like names and special terminology might not be included in the vocabulary.
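To make the vocabulary problem concrete, here is a minimal sketch that measures how many words of a transcript fall outside a recognizer's vocabulary (the out-of-vocabulary rate). The function name `oov_rate` and the toy word lists are made up for illustration; a real recognizer's vocabulary would hold tens of thousands of entries.

```python
def oov_rate(transcript_words, vocabulary):
    """Fraction of word tokens that are missing from the vocabulary."""
    vocab = {w.lower() for w in vocabulary}
    tokens = [w.lower() for w in transcript_words]
    if not tokens:
        return 0.0
    missing = [w for w in tokens if w not in vocab]
    return len(missing) / len(tokens)


# Hypothetical toy data: the name "scoble" is not in the vocabulary.
vocabulary = ["the", "interview", "with", "was", "about", "speech", "recognition"]
reference = "the interview with scoble was about speech recognition".split()

print(oov_rate(reference, vocabulary))
```

A word that is missing from the vocabulary can never appear in the output, so every such token is a guaranteed error no matter how good the acoustic match is.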

Jonathan:

I've heard about ASR engines produced by companies like Autonomy/Virage that claim to be able to do a decent job of speaker-independent, unconstrained-domain voice recognition for similar uses like indexing and searching newscasts. Do you have experience with, or an opinion about, how good those engines are?

Frank:

If the transcript needs to be searchable but not necessarily readable, lower accuracy will be sufficient. Errors occur disproportionately in short words like "is", which are needed for readability but do not matter for searching. The downside, however, is that for searching, the topic-specific words like names and terminology are often the ones you are interested in. If your vocabulary contains them, you are fine; otherwise word-based search will not find them, and you will need to resort to phonetic search techniques.
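As a rough illustration of what a phonetic search can look like, the sketch below looks for a query phoneme sequence inside a recognized phoneme stream using approximate substring matching (edit distance where the match may start anywhere). This is one simple technique chosen for illustration, not a description of any particular product; the function name and the phoneme symbols are assumptions.

```python
def best_phonetic_match(query, stream):
    """Return (end_position, distance) of the best approximate match.

    end_position counts how many stream phones the match covers up to,
    so the match ends at stream[end_position - 1].
    """
    # Row for the empty query: it matches anywhere in the stream at cost 0.
    prev = [0] * (len(stream) + 1)
    for i, q in enumerate(query, start=1):
        # Matching i query phones against nothing costs i.
        curr = [i] + [0] * len(stream)
        for j, s in enumerate(stream, start=1):
            substitute = prev[j - 1] + (q != s)  # 0 if the phones agree
            skip_query_phone = prev[j] + 1
            skip_stream_phone = curr[j - 1] + 1
            curr[j] = min(substitute, skip_query_phone, skip_stream_phone)
        prev = curr
    end = min(range(len(prev)), key=prev.__getitem__)
    return end, prev[end]


# Hypothetical phoneme symbols; the recognizer heard "p" where the name has "b".
query = ["s", "k", "ow", "b", "ah", "l"]
stream = ["dh", "ah", "s", "k", "ow", "p", "ah", "l", "s", "eh", "d"]

print(best_phonetic_match(query, stream))
```

Because the comparison happens on phonemes rather than words, a name can be found even when it is not in the recognizer's vocabulary, at the price of more false matches as the allowed distance grows.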

I do not know how good the mentioned engines are. State-of-the-art research systems achieve ~90% accuracy for broadcast news transcription.

News is, however, a rather narrow domain, and a broadcast news recognizer will not work well for, say, entertainment, sports, or talk shows.
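For reference, accuracy figures like this are commonly derived from the word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The sketch below assumes "accuracy" means 1 minus WER, which may not be exactly how the figure above was measured; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first j hypothesis words into
    # the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j - 1] + (r != h),  # substitution, or a match at cost 0
                prev[j] + 1,             # reference word missing (deletion)
                curr[j - 1] + 1,         # extra hypothesis word (insertion)
            )
        prev = curr
    return prev[-1] / len(ref)


ref_text = "this is a test of speech recognition on broadcast news"
hyp_text = "this a test of speech recognition on broadcast news"  # "is" was dropped

wer = word_error_rate(ref_text, hyp_text)
print(f"WER = {wer:.2f}, accuracy = {1 - wer:.2f}")
```

One dropped short word out of ten gives a WER of 0.10, which shows how a transcript can score around 90% and still read awkwardly.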

Jonathan:

Related to this is another question I've been wondering about: Suppose you pointed an engine at a video like the interview example above, but instead of using it to produce a transcript of the interview, you were only interested in finding instances of a well-defined list of keywords.

This would be useful in indexing and searching libraries of audio content also. Would that be an easier problem to solve for speaker independent (i.e. untrained) speech recognition?

Frank:

You need to distinguish between frequent (common) and infrequent (uncommon) keywords. Frequent words are often rather short, and recognizers can only recognize them reliably by also considering the words surrounding them. For those, a keyword spotter as you describe would not work well. Infrequent words, on the other hand, are often longer, and the surrounding word context is less helpful because those words are rare and reliable statistics are therefore not available anyway. For those, the method can work well.
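The distinction can be turned into a simple screening step for a keyword list. The sketch below splits keywords into those that look suitable for spotting (long and rare) and those that probably need surrounding word context (short or common). The function name, the thresholds, and the use of character count as a stand-in for spoken length are all assumptions for illustration, not values from any real system.

```python
from collections import Counter


def split_keywords(keywords, corpus_words, max_relative_freq=0.01, min_length=6):
    """Return (spot_candidates, need_context) for a keyword list.

    corpus_words is any sample of text used to estimate word frequency.
    """
    counts = Counter(w.lower() for w in corpus_words)
    total = sum(counts.values()) or 1  # avoid dividing by zero on an empty corpus
    spot_candidates, need_context = [], []
    for keyword in keywords:
        key = keyword.lower()
        relative_freq = counts[key] / total
        # Hypothetical rule: long and rare words are candidates for spotting.
        if len(key) >= min_length and relative_freq <= max_relative_freq:
            spot_candidates.append(keyword)
        else:
            need_context.append(keyword)
    return spot_candidates, need_context


# Made-up frequency sample dominated by short function words.
corpus = ("is the is a the of is " * 100).split() + ["tregear"]
keywords = ["is", "Tregear", "recognition"]

print(split_keywords(keywords, corpus))
```

A keyword that never occurs in the frequency sample is treated as rare, which matches the situation described above: there are no reliable statistics for it in the first place.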