I recently returned from the Pacific Northwest Computing & Communications Conference, where (among many other technologies) the most frequently hawked piece of software was speech recognition. No fewer than three different companies were within cat-swinging distance of each other, each pitching the strengths of its own product. While the accuracy with which these systems put the spoken word on screen is really amazing, what is more interesting is where speech recognition has been, what you can do with it now and, sadly, what you can’t do with it.
Most people’s first introduction to a reasonably realistic computer that understood natural spoken instructions was the HAL 9000 from Stanley Kubrick’s 1968 film, 2001: A Space Odyssey. In it, two human crewmen interact with an overwhelmingly flexible computer. HAL is so perfect that it can simultaneously massage Frank Poole, go through the motions of playing Dave Bowman at chess, and maintain every system on the ship Discovery. But HAL has been programmed with instructions to lie to Frank and Dave about the true reason for their mission, and slowly goes homicidally berserk.
An earlier, less fanciful story about natural language computers was Robert A. Heinlein’s 1966 novel and Hugo Award winner, The Moon is a Harsh Mistress, which features the computer Mycroft Holmes, who progresses through the story from barely self-aware machine to completely sentient person. When his systems analyst, Manuel Davis, begins speaking to "Mike" he uses discrete speech instructions, where - every - word - must - be - separated - by - a - pause. Mike eventually progresses to natural language or "continuous" recognition, which sounds the way people speak in everyday conversation.
I’ve made this little foray into literature and film to demonstrate a very important point: This ain’t new, folks. The idea has been around for a long time. Only now have we reached the level of computing power that puts this type of technology within striking distance of the average user. While IBM was demonstrating limited vocabulary (zero through nine only!) discrete speech recognition with the Shoebox system a few years before Heinlein’s novel was published, the hardware required to drive such a device was overwhelming in the extreme. Now the hardware to meet the requirements can be purchased with your VISA card.
When first presented with this technology at the conference, my God-given skepticism kicked in. I harassed the salesman at his seminar on Dragon Systems’ NaturallySpeaking product until he finally confessed the system wasn’t yet "perfected to a level that science fiction buffs would find acceptable." Dragon’s NaturallySpeaking product boasted an accuracy rate of 95%, though a recent review by PC Magazine measured only 89% in their labs. My point to him, and to the audience, was that if I were to turn in an article to my editor that was only 95% accurate, I probably wouldn’t be writing for that publication for long. Pin a medal on me...anywhere.
But Mr. Salesman had apparently been through this mill before, because he shot right back, "How fast do you type, sir?" Caught off guard, I admitted about 60 words per minute. He then went in for the kill. "Does that 60 wpm include editing and correcting typos as you go?" Ouch. Pin the tale on the Marshall.
His point, now that my jackass ears have returned to normal length, is that speech recognition is not designed to take over for traditional interface devices such as keyboards and mice. It is meant to augment them, so that you can work faster by having another input device. Just as the overwhelming miscellany of controls on a Boeing 747 allows a pilot to set that behemoth down on a runway without giving you whiplash, an increase in the number of input/output devices allows more accurate interaction with a tool such as a computer. The human mind and nervous system vastly outstrip even the most advanced computer’s ability to produce answers based on speculative data. Anyone who has ever watched a daredevil juggle four roaring chain-saws has been witness to this fact.
Speech recognition for the PC/Windows/Intel platform falls into the two types of recognition discussed earlier: discrete and continuous. But what is done with that input data depends on the software’s "mode." In the speech recognition arena, these modes are: Dictation, Editing, and Command & Control.
Dictation is just what it sounds like: You talk, the computer types what it thinks you said onto your screen in a word processor or other application. Notice the weasel phrase in the previous sentence: "what it thinks you said." In the seminar, the salesman told the system to "Go to Sleep", that is, to stop recognizing what he was saying. The Dragon system didn’t turn off, however, because it had misinterpreted the phrase as a series of words instead of a command, which it printed in foot-high letters on the projection screen: "Gold two sheep." Embarrassing.
You’re laughing now, but don’t blame it solely on the unsophisticated nature of computers. When a friend of mine heard an ad for a movie about two years ago, she turned to me after it was over and commented that "Brain-fart" seemed like a pretty stupid name for a film. The movie was, of course, Braveheart. As this anecdote demonstrates, what is heard and what is said can be miles apart, even when both parties agree on the sound that carried the message between them. It matters not whether the parties are flesh and blood or silicon and steel.
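For the technically curious, the "Go to Sleep" mishap can be pictured with a few lines of Python. This is only a rough sketch of the general idea, not how Dragon's product actually works: the command table, the command names, and the handle_utterance function are all made up for illustration. Recognized text that exactly matches a known command gets executed; anything else is treated as dictation and typed.

```python
# Hypothetical command table. Real products define their own phrases
# and almost certainly use something smarter than exact matching.
COMMANDS = {
    "go to sleep": "MICROPHONE_OFF",
    "wake up": "MICROPHONE_ON",
}


def handle_utterance(recognized_text, document):
    """Run a command if the recognized text matches one; otherwise type it.

    Returns the (made-up) command name that fired, or None if the text
    was appended to the document as dictation.
    """
    key = recognized_text.strip().lower()
    if key in COMMANDS:
        return COMMANDS[key]
    document.append(recognized_text)
    return None


if __name__ == "__main__":
    document = []
    # What the salesman said, recognized correctly: treated as a command.
    print(handle_utterance("Go to Sleep", document))
    # What the recognizer thought he said: no command matches, so it is typed.
    print(handle_utterance("Gold two sheep", document))
    print(document)
```

The sketch shows why a single misheard word is enough to turn a command into text on the projection screen: the decision between "do it" and "type it" is made after recognition, on whatever words the recognizer came up with.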
Editing is when you use the voice recognition to rearrange, spell check, or otherwise "wordsmith" typed or recognized speech within a document. If I wanted to replace all of the occurrences of the word "computer" in this article with the word "microcomputer" I would say, "Replace Computer, Replace With Microcomputer. Replace All." The reason I have to keep using the word "replace" is that Microsoft Word’s dialog boxes and button names each have "replace" in them. What I said was the equivalent of taking my hand off the keyboard, pointing the mouse at the Edit drop-down menu, choosing "Replace", then moving my hands back to the keyboard, typing "computer", using the mouse again to click into the Replace With field, using the keyboard to type "microcomputer" and then going back to the mouse to click Replace All. Whew!
The argument can certainly be made that, through the use of hotkeys, I can do all of these commands from the keyboard. While that is true, it is usually a hard-core power user who knows all the hotkeys for a specific application. The speech driven editing capacity may be helpful for some, but for even the best power users, the increase in efficiency would probably be negligible. Thus editing with speech seems helpful only for those who can’t use a keyboard.
Finally, there is the Command and Control feature. This is where the voice recognition gets interesting and, if you are not careful, very addictive. Command and Control or "C&C" offers you the opportunity to invoke program navigation functions in any program, or within the Operating System. You simply say the commands as you would click them in the drop-down menus: "File, Save As, LEFTHANDSPANNER.DWG." If you really want your office-mates to do a double take, program a series of commands into a voice macro and say "Bad Dog!" Their eyes will bug out when your computer invokes an undo.
As someone who works with HTML code all day long, I could really dig a tool like this. I frequently must replace a certain phrase or name or bit of code within a hundred different documents. My editor, Allaire HomeSite, does most of the work, but if I could program a voice macro to step through the lengthy confirmation windows, it would really speed things up. Imagine: "Replace ‘Jumbojet’, With ‘SuperJumbo’, Replace All, Recurse All Subdirectories." Since these commands appear in different windows, the macro could handle it all. I could do batch preprocessing without even looking up from my crossword puzzle. (You’re not reading this, right, boss?)
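To make that wish concrete, here is roughly the job I would want such a voice macro to kick off, sketched in plain Python rather than in HomeSite (whose internals I am not describing here). The function name replace_in_tree, its parameters, and the "*.html" file pattern are my own assumptions for illustration.

```python
from pathlib import Path


def replace_in_tree(root, old, new, pattern="*.html"):
    """Replace every occurrence of `old` with `new` in matching files.

    Walks `root` and all of its subdirectories, rewrites only the files
    that actually contain `old`, and returns a dict mapping each changed
    file's path to the number of replacements made in it.
    """
    changed = {}
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        count = text.count(old)
        if count:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed[str(path)] = count
    return changed


# Example call (the directory name is hypothetical):
# report = replace_in_tree("site", "Jumbojet", "SuperJumbo")
# for name, count in report.items():
#     print(name, count)
```

The returned report stands in for all those confirmation windows: instead of clicking through them one at a time, you get a single list of what changed and how often. Try it on a copy of your files first, since it rewrites them in place.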
Now that we’ve covered what speech recognition can do for the average user, start thinking of ways you could use it, such as: "Turn all layers off. Show layer 3, layer 4, layer 8. Make layer 8 cyan. Zoom Extents." Got the idea?
I do feel compelled to offer a caveat concerning accuracy. The Dragon NaturallySpeaking product (which is considered the best of the best) won’t turn your computer into HAL or Mycroft Holmes any more than putting wheels on your grandma will make her a wagon. The present speech recognition technologies are a massive step forward from where we were just five years ago, but they are not Nirvana. And they are not for everyone. A fellow UNIX jockey in my group scoffed at this technology. "I can type faster," he sneered. "I can even write a script to do the same thing faster, more accurately, and provide feedback on what happened. Why do I need this piece of Flash Gordon tripe?"
I ignored him later that day when he came over to complain about how carpal tunnel syndrome was ruining his hands.
Peace,
WebWalker