
Win big with an accessible user interface design (and save the planet)

If you're a student with cool ideas about accessible user interfaces - using speech recognition or synthesis, for example - consider the special award at the Imagine Cup.

At this year's competition, Microsoft is calling on young programmers, artists and technologists around the world to "imagine a world where technology enables a sustainable environment." And there's an Interface Design challenge to create user interfaces that are unique and forward-thinking, including the special award if your UI solution targets accessibility users and makes the best use of at least 2 features from the following list:

    • Microsoft Accessibility Architecture
      • (UI Automation, Magnification APIs)
    • Windows Vista Accessibility Parameters
      • (e.g. high contrast, show sounds)
    • Windows Vista Features in the Ease of Access Control Panel
    • Microsoft Speech Technologies
      • (speech recognition, synthesis)
    • Internet Explorer accessibility
      • (e.g. zoom, personal style sheets)

This is a great opportunity to explore the future. I see accessibility technology as a stealth engine of mainstream progress. When you look behind the kimono of accessibility, there's some really exciting stuff in the platform - speech, magnification, the automation APIs. And these components are becoming powerful enablers of user interface advances across much broader scenarios.

How? The fundamental design challenge. Addressing accessibility requirements means thinking about user interactions in a different way. It means considering different modalities of input and output; separating content from presentation; and putting user interactions in an intentional context. And, interestingly, these kinds of considerations are becoming fundamental to the increasing demands of today's digital lifestyle. We're not just expecting more in terms of content and functionality, we are interacting with a greater number of devices in very different environments - and the UI challenges that arise can often be met by solutions that leverage accessibility platforms or technologies.

And ('and'...! like you needed 'and'...) in addition to the cool factor, you get to multiply the noble and worthy feelings of making a real difference in people's lives with the total kudos of saving the environment.

Due date for online submissions is May 2, 2008.  Detailed guidelines here. And let me know if you make a submission, I'd love to hear about it.

What a year for speech recognition at Microsoft

Yeah, yeah, the year in review, what a crushingly unoriginal idea for a post. But wait - this is worth it. 2007 was a huge year for speech recognition products at Microsoft. I think we'll look back on it as a real turning point. Here's how it shaped up.

(Going into the year, Exchange Server 2007 had just shipped with Unified Messaging, including Outlook Voice Access that gives you access over the phone to email, calendar and other useful features. It's a significant integration of speech technology into the heart of a high-volume server product. Huge posters had been up on campus for months, and inside Exchange, they called it 'the sizzle on the steak'.

Meanwhile the teams in Windows, Speech Server, automotive and core technology are hard at work... )

January 2007

Windows Vista ships with Windows Speech Recognition built into the operating system in eight different languages. Now this is a significant investment in the voice user interface as a means of command and dictation for desktop users. The entire desktop is speech-enabled under the 'say what you see' metaphor; correction and selection are easy; and the system adapts to your voice and your typical word usage as time goes on. Since the release of WSR, many media reviews have been overwhelmingly positive - check out the blog of Rob Chambers (one of the driving forces behind speech in Vista) for links and discussions.

 

March 2007

Microsoft announces intent to acquire Tellme Networks. Steve Ballmer says it all:

“Speech is universal, simple and holds incredible promise as a key interface for computing. Tellme brings to Microsoft the talent, technology and proven experience in speech that will enable us to deliver a new wave of products and revolutionize human-computer interaction.”

(Incidentally, CNET has a nice inside look at the discussions in Building 34 on Superbowl day between Steve Ballmer and Mike McCue, Tellme CEO, that led up to the deal.)

Also in March - Microsoft Response Point is launched out of Microsoft Research. Response Point is a new way for small business to manage their phone systems - inexpensive, easy to set up and easy to use. All thanks to VoIP and the speech technology that underlies the user interface.

May 2007

The acquisition of Tellme closes.

September 2007

In the mobile space, Tellme announces a deal with Sprint to incorporate Tellme's voice search technology with Live Search into certain phones.

Meanwhile, the first Ford cars hit the market in the USA with Sync - hands-free speech technology for voice dialing, messaging and media control within the car.

October 2007

Office Communications Server 2007 is released as the flagship of Microsoft's Unified Communications strategy. Bundled with OCS 2007 is the latest version of Microsoft Speech Server - now called Office Communications Server 2007 Speech Server (oh yes). It's a significant upgrade from Speech Server 2004, including native VoIP support, graphical dialog editing, conversational grammars, and rich data mining and tuning tools.

And - what a month - Live Search for Windows Mobile goes live with speech recognition. The speech team blog has more details of the kinds of searches possible. And you don't even need a mobile phone to make free 411 calls using the Live Search speech technology. Insider details from Long Zheng's interview with Program Manager Oliver Scholz.

So what's to come in 2008?

Let me say only that we have not been sitting around (well, actually, that's not quite true, I have been sitting around for the last month, since I was out on paternity leave. Only it wasn't really sitting around, there was a lot to do in terms of coping with the newborn's data streams and all that, but I wasn't building software, that's what I meant, now let me rescue my point) - all the teams behind these releases have been planning and executing on the next waves since even before the dates above, so huge momentum has already built in a number of areas, old and new, and we'll start to see evidence of this as the year progresses.  

And - did I mention that we're hiring in a number of speech technology-related areas? Please contact me for details if you're interested.

Getting attached

According to a report in the Seattle Times yesterday, 21 out of 30 serious users of the Roomba vacuum-cleaning robot give their machine a name. More than half assign it a gender (male) and others have been known to dress it up.

What kind of human-machine relationship is going on here? The researchers behind the report believe "emotional design" is fundamental. But there has to be much more to it than that. I know a lot of people who have developed strong emotional attachments to probably the most popular recent icon of emotional design - Apple's iPod - but none of them has assigned their machine a name or a gender. (And let's face it, the Roomba, while not unattractive in a squashed espresso-maker sort of way, is no Galatea.)

So is it functional - is the machine simply so useful that it becomes an indispensable part of the family? I'd put the fridge at the top of the utility list, closely followed by the stove, et cetera, et cetera, and I've never been tempted to anthropomorphize over my kitchen machines. Even the computer - which can be on the receiving end of the other side of emotional attachment - only gets a name because the OS needs one.

Perhaps a simpler thing going on here is that the Roomba has become a virtual pet. Users project the same kind of feelings onto it as they do pets. It scoots around the floor, doing its business, content and unburdening in its own little world of floor navigation. You can put your feet up and watch it at play - and take comfort or solace from its unquestioning devotion as it cleans for you. (Imagine if it just sat in the middle of the room and sucked up a maelstrom of dust from the floor without moving. Impressive, but not emotional.)

This is a very different kind of response from gadget-love. Mp3 players and cell phones tend to serve as fashion objects, showcases of their owners' pride in their tastes. The vacuum robot is a more homely, introverted object of affection, and the attachment seems to run much deeper. There's a lesson here for software design - especially for those of us working in artificial intelligence technologies - anthropomorphism isn't just for humans.

Posted by Stephen Potter | 5 Comments

Extracting session audio from OCS 2007 Speech Server logs

The ability to extract the audio for an entire call (both prompts and recognitions) from the Speech Server 2007 logs is a really useful feature for a number of analysis and tuning scenarios. Since the topic has surfaced a few times on the Beta newsgroups, here's a summary of how to do it.

1. Ensure your logging parameters for the server are set (via the Trace Logging tab in the Administration Console) to log Application Events and All audio for: 100% of calls.

2. When you import the log data into the database with the MssLogToDatabase utility, be sure to specify the flag /audio:session.

3. In Analytics and Tuning Studio, connect to the database and in the Session List View or Session Detail View select the call that you want to hear. Hit the Play Session Audio button to play it back using your default .wav player, or hit the Export Session Audio button to export to file.

Some notes:

1. The logging of prompt audio is not available with TIM deployments of Speech Server (i.e. you need to be running a VoIP gateway).

2. The prompt audio as recorded does not take account of any dynamic changes to timing or volume. (So if, for example, the caller accelerated or decelerated playback while the prompt was underway, you won't hear the changes, and you may notice timing mismatches on the concatenated output file. You can check for these cases by looking for the relevant event signalling the prompt rate change in the session details.)

3. Some scenarios may call for access to the audio data in environments outside Analytics and Tuning Studio. For some sample code that illustrates how to extract the audio programmatically from an OCS 2007 database, see this article. (Extracting session audio is not possible with the command line utility MssContentExtract.exe, which can be used only for extracting recognition audio.)
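
If you do end up with per-turn audio outside Analytics and Tuning Studio, stitching it back into a single session file is the easy part. Here is a minimal Python sketch, assuming you already have one .wav file per prompt or recognition, in call order, all in the same format. It is not the sample code from the article mentioned above, it does not touch the Speech Server database, and the function name and file names are made up for illustration.

```python
import wave


def concatenate_wavs(segment_paths, output_path):
    """Join WAV segments end to end, in the order given.

    Every segment must share the same channel count, sample width and
    sample rate; a ValueError is raised otherwise.
    Returns the number of audio frames written.
    """
    if not segment_paths:
        raise ValueError("no segments to concatenate")

    audio_format = None  # (channels, sample width in bytes, frame rate)
    chunks = []
    for path in segment_paths:
        with wave.open(str(path), "rb") as seg:
            fmt = (seg.getnchannels(), seg.getsampwidth(), seg.getframerate())
            if audio_format is None:
                audio_format = fmt
            elif fmt != audio_format:
                raise ValueError(f"{path}: format {fmt} differs from {audio_format}")
            chunks.append(seg.readframes(seg.getnframes()))

    # Write only after every segment has been read and checked.
    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(audio_format[0])
        out.setsampwidth(audio_format[1])
        out.setframerate(audio_format[2])
        for chunk in chunks:
            out.writeframes(chunk)

    bytes_per_frame = audio_format[0] * audio_format[1]
    return sum(len(chunk) for chunk in chunks) // bytes_per_frame
```

A call might look like `concatenate_wavs(["turn01_prompt.wav", "turn01_reco.wav"], "session.wav")` (hypothetical file names). Like the exported session audio described in note 2, a plain concatenation knows nothing about mid-prompt rate or volume changes.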

Posted by Stephen Potter | 2 Comments

Bored medical student impressed by speech recognition

Sullen student, bored by x-rays of terrible chest diseases, is mesmerised by speech recognition, mutters that's so cool.

Can't argue with that.

(And can't resist real stories with a whiff of The Onion.)

Posted by Stephen Potter | 1 Comments

Thinking aloud

Speech recognition in Vista works well for me. Dictation accuracy is very high - especially since I flicked the switch to train it on my emails and documents - and the correction experience is smooth and efficient. But I hardly use it. I find it very difficult to dictate to my computer.

WTF? Typing is easier than speaking?

Yes. I can't speak the way I write. If I'm typing, I'll begin a sentence without knowing how it's going to end, future phrases will form as I finish typing previous ones. I'll pause, go back, select text, delete it, write again. It feels almost as if hitting the keys is a direct extension of the thought process. 

If I'm dictating, there's no such flow. I'll begin a sentence (having got over the minor panic triggered by the Listening microphone UI over a blank document) and the phrase will appear - correctly - on my page, but saying it out loud has corrupted the thoughts that would have helped complete it. I have to step back and think hard about what should come next. Every phrase forces a little mental reboot.

I think what's going on - in addition to my inability to think in complete sentences - is that my thought processes while typing have habituated themselves to the gated speeds of my motor operations. The extra time is valuable and they use it to do the work of sounding out the current phrase in context, thinking up the next phrase, and so on. So typing actually greases the wheels of forming the right words and sentences. That's missing when I dictate, and I find it quite paralyzing.

Surely many keyboard users, even very slow typists, will be reluctant to move to speech recognition systems - even high accuracy systems - because of the advantages of the thinking on-the-side that we seem to do while hitting keys. The standard words-per-minute (wpm) metric for text input is usually measured over copying pre-existing texts, so the text creation part is assumed to cost the same whichever input method you use. But my wpm drops dramatically when I dictate, because the creation processes that work with typing are unavailable to me. What does the wpm look like when you do have them available?

Many users of speech recognition (and of transcription-taking secretaries) have obviously overcome this. Most visibly, Richard Powers, a novelist who wrote the 2006 National Book Award winner using speech recognition on his TabletPC, trained himself after years of typing:

I needed weeks to get over the oddness of auditioning myself in an empty room, to trust to the flow of speech, to learn to hear myself think all over again. 

He broke through - and argues now that typing is the obstacle to the thought process:

What could be less conducive to thought’s cadences than stopping every time your short-term memory fills to pass those large-scale musical phrases through your fingers, one tedious letter at a time? You’d be hard-pressed to invent a greater barrier to cognitive flow.

Is the grass really greener over there? I'd love to know. If anyone has done it, please send a comment. I'm also going to try it out for myself. Over the next few weeks, I'll be training myself to think aloud.

Posted by Stephen Potter | 3 Comments

How to punish a speech recognition system

We've all had frustrating experiences with speech recognition systems, and as a race we're not beyond punishing virtual beings the same way we would punish people. So, what to do when that voicebot won't behave? Teach it a lesson! Here are some tips on how to get your own back on a telephony speech recognition system.

1. Play loud noise in the background. Music, car engines, crowded bar noise... all good.  Systems typically calibrate background noise levels at the start of the call as a baseline against which to separate the speech signal. Blasting noise right up its input channel at start-up is going to give the system such a distorted view of your audio world, it won't have a hope at picking out your voice. For extra points, play loud music and get the song recognized instead of your voice: (How may I help you?)  I can't get no... (I think you said 'account get new', is that right?) ...Satisfaction... (Got it!)

2. Speak long utterances without a pause. Great way to tie up system resources! Speech recognition doesn't come cheap in terms of CPU, and the longer you can make it process your big shiny audio, the sweeter your revenge. Pick up a newspaper, start reading and keep going without taking a breath. Keep it up for long enough and the system will eventually bail with a 'babble timeout' - you win.

3. Stay silent. The stealth-mode way to confuse the system. There it is, listening hard, straining at the lowest levels of the audio stack for your voice - but don't speak or make a noise. (Tip: put the phone on mute.) You might be tempted to chuckle during the silences, but keep your nerve, and laugh inside at every "I didn't catch that". It won't be long before the system just hangs up in perplexity.

4. Shout as loud as you can. This causes 'clipping' to the audio - basically, you're exceeding the expected amplitude of a bunch of frequencies in your signal, which flattens the waveform and introduces all kinds of distortion. Recognize that!

5. Pretend you're different people as the session progresses. Bit subtle this one, but in order to improve accuracy, speech recognizers like to decide early on what kind of speaker you are - male/female, child/adult, etc., and assume that you won't change. Nice try, reco-bot. This futile assumption can be wiped on the floor simply by first pretending to be a middle-aged man and then suddenly a twelve-year old girl! (You might want to practice voices beforehand.)  A fun variant of this is to get different kinds of people together, and hand the phone between them at each dialog turn - great party game.

6. Play "Dialog-Turn-the-Tables". This one is not only very satisfactory to do a number of times in a single call, it also has the potential to mislead the underlying data analysis algorithms that try to improve accuracy. The idea is to answer the system's questions with some information (so you might say for example I'm in Seattle), but then when the system tries to confirm it (Am I right with Seattle?), you can triumphantly say No! if it's right, and Yes! if it's wrong. You are messing with that heap of code, big time.

7. Chirp DTMF. DTMF (a.k.a. 'touch-tone') chirping is a skill that requires simultaneously humming and whistling a pair of different tones in order to mimic a keypress. This takes a lot of practice, but stick at it - the payoffs are big. Imagine: the system asks you to "Press or say '1'..."  but you do neither, you chirp #! Your voice just snubbed the SR engine and shoved it to the DTMF recognizer with a tone that was out-of-grammar! Beautiful!
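
For anyone curious what tip 7 is up against: each touch-tone key is just two sine waves added together, one from a "row" frequency and one from a "column" frequency, and '#' is 941 Hz plus 1477 Hz. Below is a rough Python sketch that synthesizes a key and then picks it out again with the Goertzel algorithm, a common way to measure energy at a handful of known frequencies. It is an illustration only, not how any particular speech platform does it; the function names, the 8 kHz sample rate and the 50 ms tone length are my assumptions, and a real detector would add thresholds and timing checks that this one skips.

```python
import math

# Standard DTMF keypad: each key is one row tone plus one column tone (Hz).
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477)
KEYPAD = ("123", "456", "789", "*0#")

SAMPLE_RATE = 8000  # assumed narrowband telephony rate


def dtmf_tone(key, duration=0.05, rate=SAMPLE_RATE):
    """Return samples (floats in [-1, 1]) for one keypress."""
    for r, row in enumerate(KEYPAD):
        c = row.find(key)
        if c >= 0:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[c]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(duration * rate)
    return [
        0.5 * math.sin(2 * math.pi * f_row * i / rate)
        + 0.5 * math.sin(2 * math.pi * f_col * i / rate)
        for i in range(n)
    ]


def goertzel_power(samples, freq, rate=SAMPLE_RATE):
    """Relative signal power near `freq`, via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_key(samples, rate=SAMPLE_RATE):
    """Pick the strongest row tone and the strongest column tone."""
    row = max(range(len(ROW_FREQS)),
              key=lambda r: goertzel_power(samples, ROW_FREQS[r], rate))
    col = max(range(len(COL_FREQS)),
              key=lambda c: goertzel_power(samples, COL_FREQS[c], rate))
    return KEYPAD[row][col]
```

So `detect_key(dtmf_tone("#"))` gives back `"#"`. To pull off the chirp yourself, you need to hum and whistle close enough to 941 Hz and 1477 Hz that those two come out strongest.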

Note: these techniques should be applied only when you have no interest in the outcome of your call (or in what an analyst of the audio logs of your call might think of you). If you want the system to provide information, conduct a transaction or put you through to an operator, don't do these things. Speech recognition engines are fragile, graceful things of beauty that will improve with love, patience, and lots of training data. Speak normally in a quiet environment, and do what you're told.

Posted by Stephen Potter | 6 Comments

The noise of email

I've just switched groups at Microsoft, and for the first few days it felt strangely quiet in my new office. The acoustic background hadn't changed much: I could hear voices in the offices nearby, people still passed by in the hallway chatting (some stopped to welcome me), my phone still rang. But there was an odd silence. I realized my email inbox had gone quiet.

I'd removed myself from my old group's mailing lists, and hadn't yet been added to any lists in the new group. Only mail addressed directly to me was coming in. It was like a soundtrack had stopped playing. The silence was broken only by my own dialogues.

After a while, team status reports began to appear, document reviews, meeting requests. A little later, I was getting broader, group-wide notifications about security training, server maintenance and building structure examinations. By then, I had propagated back up the mailing list hierarchy of a new team, group and division and was reading about acquisitions and executive shuffles. The background music had started up again.

Posted by Stephen Potter | 3 Comments

Speech recognition in 1968

Respect your elders - here's a video of the state of the art of speech recognition research at Stanford in 1968.

I thought it was a hoax at first, with the synth music and the board titles, the fuzzy waveforms, and somebody actually acting out "I scream" in contrast to a photo of "ice cream". Then a bearded bloke comes in and casually throws some French at the system! But the science is explained with a graphical demonstration of the cutting edge waveform segment mapping  technology, and it gets kind of gripping, in a quaint sort of way. The application of a control interface to the famous block-moving robots in the Stanford AI labs (with the 30-40 second latency between command and action) tops it off.

This was a generation ago. We've come a long way since then, eh?

Right?

Haven't we?

(Thanks to awesom-o at the Artificial Intelligence and Robotics blog.)

Goodbye Karen

Sad news from Cambridge this week - Karen Spärck Jones has died, aged 71. I took her classes on NL systems in my M.Phil., and she was the advisor on my thesis (a text generation program to describe images). Karen was active in developing real systems since the early days of A.I., and it's hard to find an NL or IE research paper these days that doesn't credit at least one of her papers as background.

I will never forget her reviewing a draft of my thesis, a couple of days before it was due. I had worked very hard on it, and when I emailed it to her, I was proud of its depth and clarity. Next morning, I walked into Karen's office. "Sit down," she said severely, "you've got a lot of work to do."  For the next two hours she took every paragraph of every chapter apart. Above all, I had too much discussion on the theoretical challenges of discourse generation, and not nearly enough on the practical merits and drawbacks of my program. Her feedback was incisive and profound (and bluntly delivered...). In the 48 hours remaining, I turned my paper from a ramble in the text generation landscape into a clean, focused description of my solution.

I still have that draft copy somewhere, the pages sliced and scarred by her pencil, every margin and white space battered by now-illegible exclamations. Genius.

Speech Server 2007 Public Beta available

(Tap-tap. Is this working? OK. Hh-hmm.)

We have shipped the Speech Server 2007 Public Beta!

It's available here: http://www.microsoft.com/downloads/details.aspx?FamilyId=4F4D3AA4-8223-406C-B74F-DB2DE928D8B2

Since Speech Server is now technically part of Microsoft Office Communications Server 2007, its shiny new name is Office Communications Server 2007 Speech Server. (I know. Let a marketeer post on that one. OCS2K7SS for short? Um, nice. Let's stick with Speech Server 2007, you and me.) In any case, this is a "Beta Refresh" of Speech Server that adds new features and bug fixes since the last (private) Beta in 2006.

More to come - especially on the analytics and tuning features that my team delivered in a milestone of intense coding and round-the-clock testing (literally), fuelled by team dinners, a high-latency caffeine machine and hourly refreshes of the bug glide path...

Posted by Stephen Potter | 1 Comments

The importance of your call

Continuing an irreverent dictionary of voicebot vocabulary:

yourcallisimportanttous, cl.

Obsequious attempt to make the caller feel better about a dud place in the cattle queue. Actually means the opposite. Often followed by pleasestayontheline and yourcallwillbehandledshortly, and repeated ad nauseam.

orig. (doublespeak) "Your call is important to us!"

 

Posted by Stephen Potter | 0 Comments

Investing in voice

Big time! This kind of brainpower and expertise, on this kind of scale... the future is here. Welcome, Tellme!

Decomposition

Reading Nicholas Carr's dissection of the blogosphere this morning as "a vast, earth-engirdling digestive track, breaking down the news of the day into ever finer particles of meaning (and ever more concentrated toxins)" I am inspired to do my bit as a bacterium here in my little crease of the bowel.

For customer service systems, the punters' views aren't getting any better: SR is the easy target again for comment rage here, and the over-casual persona of a major telephone company hits the wrong note with this customer. (These kinds of reactions are becoming a theme [1] [2].)

But we've also seen an uptake of interest in speech technology in the mainstream media - the New York Times has recently covered speech recognition in Windows Vista, audio search and voicemail transcription, and the Wall Street Journal also weighed in last week on video search start-ups.

So what? Acceptance. While we'll always laugh at recognition errors and rail at poor voice UIs, the foothold of speech technology is more firmly established than ever before in a growing number of environments. This is why I'm excited about what Speech Server can do within the rising tide of unified communications - at the end of the day, this is an enabling technology that allows people and systems to connect. Without it, there would be a phone ringing in an empty room, words of wisdom lost in unsearched audio, a whole class of users disenfranchised from productivity. With it, there is the opportunity to build applications that simplify and enrich people's interactions with each other and with the data that the world runs on.

Anyway, that was the courtesy flush. As the next release of Speech Server 2007 approaches, I'll be blogging more frequently about features - especially dialog design and tuning tools - and also on application ideas that open new scenarios in communications. I'd love to get your comments along the way, online or off.

Posted by Stephen Potter | 1 Comments

Prompt 1: Take n

"He never works when he's in a bad mood..." says the Boston Globe in a profile of successful voice talent Tom Glynn. And given the amount of thought and work that he puts into every single prompt, it's hard to fault that.
