How to punish a speech recognition system

We've all had frustrating experiences with speech recognition systems, and as a race we're not beyond punishing virtual beings the same way we would punish people. So, what to do when that voicebot won't behave? Teach it a lesson! Here are some tips on how to get your own back on a telephony speech recognition system.

1. Play loud noise in the background. Music, car engines, crowded bar noise... all good. Systems typically calibrate background noise levels at the start of the call as a baseline against which to separate the speech signal. Blasting noise right up its input channel at start-up is going to give the system such a distorted view of your audio world, it won't have a hope of picking out your voice. For extra points, play loud music and get the song recognized instead of your voice: (How may I help you?) I can't get no... (I think you said 'account get new', is that right?) ...Satisfaction... (Got it!)

2. Speak long utterances without a pause. Great way to tie up system resources! Speech recognition doesn't come cheap in terms of CPU, and the longer you can make it process your big shiny audio, the sweeter your revenge. Pick up a newspaper, start reading and keep going without taking a breath. Keep it up for long enough and the system will eventually bail with a 'babble timeout' - you win.

3. Stay silent. The stealth-mode way to confuse the system. There it is, listening hard, straining at the lowest levels of the audio stack for your voice - but don't speak or make a noise. (Tip: put the phone on mute.) You might be tempted to chuckle during the silences, but keep your nerve, and laugh inside at every "I didn't catch that". It won't be long before the system just hangs up in perplexity.

4. Shout as loud as you can. This causes 'clipping' in the audio - basically, you're exceeding the maximum amplitude the input channel can represent, which flattens the peaks of the waveform and introduces all kinds of distortion. Recognize that!

5. Pretend you're different people as the session progresses. Bit subtle this one, but in order to improve accuracy, speech recognizers like to decide early on what kind of speaker you are - male/female, child/adult, etc. - and assume that you won't change. Nice try, reco-bot. You can wipe the floor with this futile assumption simply by first pretending to be a middle-aged man and then suddenly a twelve-year-old girl! (You might want to practice voices beforehand.) A fun variant of this is to get different kinds of people together, and hand the phone between them at each dialog turn - great party game.

6. Play "Dialog-Turn-the-Tables". This one is not only very satisfying to do a number of times in a single call, it also has the potential to mislead the underlying data analysis algorithms that try to improve accuracy. The idea is to answer the system's questions with some information (so you might say, for example, I'm in Seattle), but then when the system tries to confirm it (Am I right with Seattle?), you can triumphantly say No! if it's right, and Yes! if it's wrong. You are messing with that heap of code, big time.

7. Chirp DTMF. DTMF (a.k.a. 'touch-tone') chirping is a skill that requires simultaneously humming and whistling a pair of different tones in order to mimic a keypress. This takes a lot of practice, but stick at it - the payoffs are big. Imagine: the system asks you to "Press or say '1'..."  but you do neither, you chirp #! Your voice just snubbed the SR engine and shoved it to the DTMF recognizer with a tone that was out-of-grammar! Beautiful!
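
For the curious, here is roughly what your chirp is up against. A touch-tone keypress is a pair of simultaneous sine tones, one from a low-frequency group and one from a high-frequency group; '#' is 941 Hz plus 1477 Hz. The sketch below, in Python, synthesizes a keypress and picks out the pair with the Goertzel algorithm, a common choice for DTMF detection. It is a minimal illustration, not how any particular telephony platform does it: the function names, the 8 kHz sample rate and the 50 ms tone length are all assumptions made for the example.

```python
import math

# Standard DTMF frequency groups (Hz) for a 12-key pad.
LOW = (697, 770, 852, 941)      # one per keypad row
HIGH = (1209, 1336, 1477)       # one per keypad column
KEYS = ("123", "456", "789", "*0#")


def dtmf_tone(key, rate=8000, duration=0.05):
    """Synthesize the two-tone signal for one keypress (hypothetical helper)."""
    if len(key) != 1:
        raise ValueError("expected a single key")
    for row_index, row in enumerate(KEYS):
        col_index = row.find(key)
        if col_index >= 0:
            f_lo, f_hi = LOW[row_index], HIGH[col_index]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(rate * duration)
    return [
        0.5 * math.sin(2 * math.pi * f_lo * t / rate)
        + 0.5 * math.sin(2 * math.pi * f_hi * t / rate)
        for t in range(n)
    ]


def goertzel_power(samples, freq, rate):
    """Signal power at a single frequency, via the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_key(samples, rate=8000):
    """Pick the strongest low tone and the strongest high tone, then look up the key."""
    row = max(range(len(LOW)), key=lambda i: goertzel_power(samples, LOW[i], rate))
    col = max(range(len(HIGH)), key=lambda i: goertzel_power(samples, HIGH[i], rate))
    return KEYS[row][col]


print(detect_key(dtmf_tone("#")))  # prints "#"
```

So to chirp a convincing '#', your hum and your whistle need to land near 941 Hz and 1477 Hz at the same time. A real detector would also check that both tones are strong enough and last long enough before accepting a keypress; this sketch skips those checks.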

Note: these techniques should be applied only when you have no interest in the outcome of your call (or in what an analyst of the audio logs of your call might think of you). If you want the system to provide information, conduct a transaction or put you through to an operator, don't do these things. Speech recognition engines are fragile, graceful things of beauty that will improve with love, patience, and lots of training data. Speak normally in a quiet environment, and do what you're told.
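
Why "speak normally in a quiet environment" is sound advice can be seen by turning tips 1 and 4 around. The Python sketch below is a toy, assuming a simple energy-based speech detector that calibrates its noise floor from the opening frames of a call, and an input channel that clips at an amplitude of 1.0. The function names, the 3x margin and the use of pure sine waves as stand-ins for both "noise" and "voice" are assumptions for illustration; real recognizers are considerably more sophisticated.

```python
import math


def sine(amplitude, freq=200, rate=8000, n=160):
    """One 20 ms frame of a sine wave, standing in for noise or voice."""
    return [amplitude * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


def rms(frame):
    """Root-mean-square level of a frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def calibrate_noise_floor(opening_frames):
    """Average level of the first frames, assumed to contain no speech."""
    return sum(rms(f) for f in opening_frames) / len(opening_frames)


def is_speech(frame, noise_floor, margin=3.0):
    """Treat a frame as speech only if it is well above the calibrated floor."""
    return rms(frame) > margin * noise_floor


def clip(frame, limit=1.0):
    """Flatten any sample beyond what the channel can represent."""
    return [max(-limit, min(limit, x)) for x in frame]


voice = sine(0.3)

# Tip 1: a quiet start gives a low floor; a noisy start buries the same voice.
quiet_floor = calibrate_noise_floor([sine(0.01)] * 5)
noisy_floor = calibrate_noise_floor([sine(0.5)] * 5)
print(is_speech(voice, quiet_floor))   # True
print(is_speech(voice, noisy_floor))   # False

# Tip 4: shouting at three times full scale leaves most samples pinned at the limit.
shout = clip(sine(3.0))
flattened = sum(1 for x in shout if abs(x) == 1.0)
print(flattened > len(shout) // 2)     # True
```

The same voice frame is detected after a quiet start and missed after a noisy one, and the shouted frame arrives with most of its waveform flattened against the channel limit.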

Published Tuesday, June 12, 2007 5:45 PM by Stephen Potter
Comments

Thursday, June 14, 2007 3:21 PM by DuncanH

# re: How to punish a speech recognition system

The very title of this note speaks volumes - you can only punish something that has feeling - and speech rec systems don't.

If you are trying to tell people what they do that makes speech rec systems fail to work, I don't think your article is anywhere near complete. If you are trying to mock speech rec systems, it's often not hard - and, as ever, it's lazy designers who you should be castigating, not the systems themselves.

Your suggestions are like taking a left when your sat nav says turn right - the machine recalculates and plays the cool voice prompts that you interpret as the machine telling you when to turn next. Sure, great fun when you first miss a turn, and very easy to personify, but the novelty soon wears off.

Speech rec is a tool, and like all tools, users need to learn how to use it. Want it to work? Just try your hardest to answer the question in as few words as possible, don't try to help the machine, and practice, practice, practice. Still doesn't work for you? Maybe it never will.

DuncanH@tangent-telecom.com

Thursday, June 14, 2007 9:29 PM by Stephen Potter

# re: How to punish a speech recognition system

Hi Duncan - thanks for the comment!

This post wasn't intended to be taken seriously. What is interesting to me is that humans do punish things that don't have feelings - see the study where people deliberately tortured an avatar in the belief that it would learn from punishment: http://www.plosone.org/article/fetchArticle.action?articleURI=info%3Adoi%2F10.1371%2Fjournal.pone.0000039.

Consider this in the context of the rage that people often bring to automated telephony systems, and one's imagination will take it from there.

Stephen

Thursday, July 05, 2007 12:30 PM by Patricia Collins

# re: How to punish a speech recognition system

Perhaps some won't recognize your tongue-in-cheek/reverse-psychology blog, which in fact says, "If you're trying to get some real work done with a speech recognition system, here are some things to avoid...." With increasing frequency, people with repetitive stress injuries to their hands and wrists are relying on this technology and need all the help they can get making it work well for them. We're also getting (far too many) young disabled veterans who have served in Afghanistan or Iraq whose voices are their remaining primary means of communication with computers. They will depend on researchers to continue to improve speech recognition systems, and in the meantime to get the most out of the speech recognition tools we do have.

For those working on these problems, here's another difficulty that needs to be addressed: The U.S. has always been home to many first-generation immigrants and others whose English is heavily accented. When using telephones, these people can quickly become very frustrated trying to accomplish their tasks because there is no alternative to "speaking into the mike" and having the system ignorantly respond that it can't understand what the user is saying. Just try reaching a human being. Perhaps the accent is *delightful* to the ear, but just doesn't fit the limited variability in the training data. The distance between the speaker's pronunciation and the system's expectations is just too big.

Speech recognition has long been an area in which researchers felt they were a constant decade away from discovering solutions. For now, any implementation that doesn't allow for a human connection (i.e., someone with a good ear for understanding what someone else is saying) when recognition isn't going well (by the user's standards) is going to be a source of customer satisfaction problems (e.g., making appointments by telephone, getting to a technical support person via a labyrinth of selections).

Friday, July 06, 2007 11:00 AM by Máquinas Falantes

# Currente calamo (#7)

To wrap up this week's posts, I couldn't resist sharing with you two posts from two blogs that

Friday, July 06, 2007 11:52 AM by Noticias externas

# Currente calamo (#7)

To wrap up this week's posts, I couldn't resist sharing with you two posts from two blogs that

Friday, July 06, 2007 4:49 PM by Stephen Potter

# re: How to punish a speech recognition system

Hi Patricia, thanks for your observations. I agree that there are still considerable challenges in terms of meeting a broader set of needs, coping with broader variations in speech, and for systems to integrate human support much better than they do today. And I really think we will make progress on all those fronts in the next few years...

Stephen
