After the launch event this Monday, I spent the rest of my week deviating a bit from my normal schedule, participating in the Microsoft BizSpark Incubation Week for Windows 7 at the Microsoft Technology Center (MTC) in Waltham. Sanjay Jain did a great job orchestrating the event and capturing the day-by-day progress of the five start-ups involved, so I’d like to focus this post on the geekier parts of my week.
As one of the advisors, I ended up floating between two teams, Panopto and Answer Point Medical Systems. They are two completely different applications, at different stages of development, serving distinct audiences, but they ended up sharing a need for speech-to-text capabilities. Of course, I was aware there was some built-in speech recognition in Windows, but had written it off as an accessibility feature where you speak stilted English to (hopefully) make applications do what you want. With a little digging, though, I found that it’s a pretty rich capability with an unmanaged (and even a managed!) API for handling speech – rivaling some of the third-party speech recognition products.
The built-in speech recognition capabilities (look for Windows Speech Recognition in the Windows Vista or Windows 7 Start menu) actually go a long way toward enabling you to incorporate speech into your application. When you run Windows Speech Recognition for the first time, you’re encouraged to complete the tutorial, which actually doubles as training the recognizer to your voice.
There’s additional training you can do as well, including spelling and recording individual words and phrases. This latter capability came in pretty handy as my colleague, Bob Familiar, was working with the Answer Point folks to capture a doctor’s verbal diagnosis in a medical record. As you might expect, words like hematoma don’t come out the way you want… I got ‘he told FEMA’ and ‘seem a Tacoma’, for example, before training the recognizer on the word ‘hematoma’.
The built-in speech recognizer automatically outputs to whatever text-enabled application you happen to be in. In this case, Answer Point was building a WPF (and touch-enabled!) application, so the text was output to a FlowDocument element. And if your application isn’t text-enabled, as it appears Visual Studio is not, Windows 7 introduces the dictation scratchpad, which captures your text and has a simple interface to enable you to copy the text into the document.
The guys from Panopto, though, had a completely different requirement for speech-to-text capability. Their application captures live events, like training and presentations, via multiple feeds – video, slides, audio, and transcript – and coalesces the content into a single viewing experience. But here’s the cool part: you can search the content, and the viewer will automatically bring you to the relevant portion of the recording – again, keeping the slideware, video, and audio all in sync. They’ve been relying on a third-party manual transcription service, so while at Incubation Week we took a shot at automating this. Here’s roughly what we did:
Armed with this, the guys from Panopto made a pretty good foray into speech indexing, especially given the timeframe. Looking forward, we’re going to take a further look at what Microsoft Research has been doing with project MAVIS in terms of audio indexing.
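At its core, the search-and-seek behavior described above amounts to mapping each recognized word to the times it was spoken, so a search hit can jump the viewer to that point in the recording. Here’s a toy sketch in Python of that indexing step; the (word, timestamp) pairs and the sample transcript below are assumptions for illustration – in a real pipeline they would come out of the speech recognizer:

```python
# Toy inverted index for a timed transcript: word -> sorted timestamps.
# Assumes the recognizer emits (word, seconds-from-start) pairs; the
# sample transcript here is invented purely for illustration.
from collections import defaultdict

def build_index(timed_words):
    """Map each lowercased word to the list of times it was spoken."""
    index = defaultdict(list)
    for word, seconds in timed_words:
        index[word.lower()].append(seconds)
    return index

def search(index, term):
    """Return the timestamps (in seconds) at which the term occurs."""
    return index.get(term.lower(), [])

# Hypothetical recognizer output for a short recording.
transcript = [("patient", 1.2), ("presented", 1.6), ("with", 1.9),
              ("a", 2.0), ("hematoma", 2.3), ("hematoma", 41.7)]

index = build_index(transcript)
print(search(index, "Hematoma"))  # -> [2.3, 41.7]
```

A viewer would then seek the video and slide feeds to the returned timestamps, keeping everything in sync the way Panopto’s player does.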
Anyway, it was a great week that let me go a little deeper with customers than I normally get to, and in a technology area to which I had been completely oblivious!