Will Styler

Associate Teaching Professor of Linguistics at UC San Diego

Director of UCSD's Computational Social Science Program

A Speech Geek’s Review of Dragon Dictate 3

This was originally posted on my blog, Notes from a Linguistic Mystic, in 2012.

Before I go any further, I want to point out that this isn’t a paid review, nor do I have any horses in the speech recognition race. I’m just a nerd who’s followed speech recognition for a long time and wants to share a really good experience I’ve had.

So, I mentioned last week that I recently started using Dragon Dictate 3, the Mac OS version of Nuance’s Dragon suite. So far, I’ve been very impressed with it, partly because I’ve been following speech recognition on the Mac platform for a long time (starting with IBM ViaVoice around 1999 or 2000), and partly because I’m a phonetics geek who understands just how difficult this task is. Given the price ($199), I know that this software isn’t a small investment, so I figured that putting my thoughts out there might be useful for somebody else who’s on the fence.

Setup and Microphones

Purchase and download were relatively easy, as I opted for a digital edition, and setup was quite simple as well. The price is very high, but for somebody who writes as much as I do, and who was already starting to get a little bit of repetitive strain injury from constant typing, it is worth it. The initial setup took around five minutes, and reading the first text aloud to calibrate the microphone took another five minutes or so on top of that. At that point, I was dictating my first emails.

It’s worth noting that at this point, nearly all the dictation I’ve done with the software has been through the built-in microphone on my computer, which, apparently, has some noise canceling ability built in. Generally, I’ve worked in relatively quiet rooms, and the best results I’ve gotten were sitting in the soundproof booth in our phonetics lab, so I’m tempted to say that background noise is the biggest limitation of that microphone. I also used an older USB headset microphone which came with a much earlier generation of the same software, but in a quiet environment, the results from that microphone and the built-in microphone have been about the same.

Accuracy

Clearly, the most important factor with any speech recognition software is its accuracy, and Dragon 3 is the best performer I’ve dealt with in terms of accuracy. This isn’t to say it’s perfect, and in fact, in writing this review so far entirely by dictation, I’ve probably had to correct the output at least 15 times (to this point). However, one thing that this version has gotten right is the correction interface. Rather than having to manually type in the corrections, it allows you to edit everything entirely by voice, even spelling out words that are quite difficult to get across to the software otherwise. More importantly still, it really does seem to be learning from the corrections. Low-frequency words which I use commonly (things like “schwa”, “vowels”, “phoneme”, and the name of the building my office is in) are missed the first few times, but after a few corrections they’re reliably recognized. Similarly, there are few “repeat issues” that I’ve had, and usually correcting something once or twice through the program fixes the issue for good. It’s also able to learn vocabulary from any plaintext document, which is quite handy for improving recognition of odd words. The best kind of error is the kind that occurs only once, and the robust learning features made that a more frequent experience than I’ve ever had in the past.
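
I don’t know how Dragon actually learns vocabulary from a document, so the following is only a toy sketch of the general idea: scan a plaintext document for words that aren’t in a known word list and keep the ones that show up repeatedly. Everything here is an assumption made for illustration, including the `BASE_VOCAB` set, the `candidate_terms` function, and the sample text.

```python
import re
from collections import Counter

# Hypothetical stand-in for a recognizer's built-in vocabulary.
BASE_VOCAB = {
    "the", "a", "an", "is", "are", "in", "of", "and",
    "reduced", "often", "vowel", "unstressed",
}

def candidate_terms(text, base_vocab, min_count=2):
    """Return out-of-vocabulary words seen at least min_count times,
    most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in base_vocab)
    return [w for w, n in counts.most_common() if n >= min_count]

# Made-up sample text standing in for one of my phonetics papers.
sample = (
    "The schwa is a reduced vowel. A schwa is often unstressed. "
    "The phoneme inventory lists each phoneme."
)
print(candidate_terms(sample, BASE_VOCAB))
```

On this sample, the words that repeat and aren’t in the base list (“schwa” and “phoneme”) are the ones flagged as worth learning, while one-off words are ignored.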

One area that Nuance can definitely improve on is their handling of reduced forms in speech. “Want to” is reliably captured, but “wanna” is not. The same goes for “gonna”, and even other contractions like “we’re” or “which’re” are reliably missed or misidentified by the software. This means that in all likelihood, there will always be an element of unnaturalness to your speech, a degree of abstraction between how you talk to humans and how you talk to the computer, which leaves the process a little bit unsatisfying. Similarly, their handling of grammatical words (like “of”, “to”, or “from”) which are phonetically reduced in speech (made much shorter and less distinct) leaves something to be desired, as these things are often missed altogether. Given where we’re at with the technology and the speed with which the recognition happens, this is quite forgivable.

Paradoxically though, I’ve found that recognition accuracy tends to go up when I speak in longer, more connected sentences, rather than saying something relatively quickly, waiting for it to finish, and then proceeding on with the next chunk of the sentence. I understand that their models are likely trained on connected text, but given the remaining inaccuracies, I think it’s a very natural tendency to want to talk in small chunks, and correct little errors as they come along, rather than going for a large sentence and hoping everything will work out for the best. This is counterintuitive, and requires you to think a little more consciously about what you’re about to say before you say it. I’m still getting the hang of this, but as I use longer sentences, my accuracy is going up.

One other victory is that Dragon’s detection of non-speech noise (often called “biological sounds”) is much improved. I can cough or clear my throat with Dragon on and recording, and it remains silent and ready, whereas prior incarnations (MacSpeech Dictate, IBM ViaVoice) recorded coughs or throat-clears as “a”, or “we”, or something small like that. This seems like a small thing, but every correction can severely impact the flow of your thoughts.

Unfortunately, I’m becoming increasingly convinced that perfectly accurate speech recognition will likely require artificial intelligence, as it requires a great deal of knowledge of context and possibility. Distinguishing “I’m gonna take a walk from the Chinese restaurant” from “I’m gonna take a wok from the Chinese restaurant” is not a question of better recognizing speech, but better understanding language. As such, homophones like “who’s” and “whose”, “wont” and “want”, “there” and “they’re” will often be missed by Dragon. However, once again, the effort they put into the correction interface really shows, as the majority of the time, you can simply say “correct that” and pick whatever number corresponds to the proper version of the word. So, if your dream is uninterrupted dictation with absolute perfect accuracy, keep waiting, but they’re definitely getting better.
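
To make the walk/wok problem concrete, here’s a toy sketch of picking between homophones using only the neighboring words. This is not how Dragon works (I have no idea what’s inside it); the tiny corpus, the `bigram_counts` and `context_scores` functions, and the scoring rule are all made up for illustration.

```python
from collections import Counter

# Tiny hand-made corpus; purely illustrative, not real training data.
CORPUS = [
    "she fried the rice in a wok",
    "he bought a wok from the store",
    "we took a walk from the park",
    "they went for a walk by the river",
]

def bigram_counts(sentences):
    """Count adjacent word pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def context_scores(left, candidates, right, counts):
    """Score each candidate by how often it follows `left` and precedes `right`."""
    return {c: counts[(left, c)] + counts[(c, right)] for c in candidates}

counts = bigram_counts(CORPUS)

# "a ___ by": local context is enough to prefer one candidate.
print(context_scores("a", ["walk", "wok"], "by", counts))

# "a ___ from": both candidates fit the neighboring words equally well.
print(context_scores("a", ["walk", "wok"], "from", counts))
```

In the first case the neighboring words favor “walk”. In the second, the two candidates tie, because both “a walk from” and “a wok from” are perfectly good English. Breaking that tie means knowing what a person might plausibly do at a Chinese restaurant, and that’s a language understanding problem rather than a sound problem.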

Practical use

The majority of dictation I do is actually responding to emails, related to a variety of projects and answering student questions regarding my class. As such, the majority of my dictation happens in various windows spread around the operating system, and for the most part, that works fine. However, if you’re doing any editing using the voice interface, whether that be using corrections in their tool or their commands to move around through the text, the software will not infrequently lose its place in the text. Next time you try to correct something, or to move your cursor, it’ll go someplace you don’t expect, or worse still, it’ll overwrite a previous chunk of text with your newest correction.

There are also a variety of other strange bugs that are still going on with text entry. When you’re working outside of the built-in notepad program, it’ll often add extra characters, and even, on occasion, print whatever you just said without using the letter “I”, which makes very little sense. So, if I’m going to be doing any kind of longform dictation, I’ve gotten in the habit of doing it inside the editor window of the dictation program itself, and copying things back into whatever other window I was working in as needed. This is frustrating and feels more like a bug than a technical limitation. As such, in many ways, it’s much more difficult to forgive. Speech recognition is hard; making words appear in a window is easy. I wish that Nuance would’ve squished a few more of those bugs before release, but I hope that’ll be repaired in future versions.

However, these little issues aside, this is still the best speech recognition program I’ve used. It’s locally done, so you don’t have to wait for results to come back from the Internet, and you don’t have to send things off in chunks; you can talk continuously. This makes a major difference, both in the mindset of the dictator, and in the usability of the software. If it tells you anything, I find myself keeping Dragon open, and in any situation where there’s nobody nearby who I would disturb by talking aloud, I’m more likely to dictate an email than I am to type it out.

Not perfect, but remarkably good

The biggest failing of this software is the price. $199 is very steep, and is only something I considered both because it’s a personal interest of mine and because I don’t want my hands to fall off from the amount of typing I do. If you have any kind of RSI, or if you do a whole bunch of writing, this may be a price worth paying. But I still think that Nuance could do a whole lot better on the price, especially given that their products are showing up for free in so many other venues (Siri, among others).

The other major issue is that this, like all dictation software, is designed for prose dictation in the general domain. It’s likely prohibitively time consuming to use it for computer programming (with lots of symbols and non-English-word terms), and if you work with specialized vocabulary (like medicine or law), you may need to either buy a different version of the software or spend a long time training it. That said, I’ve had little trouble adapting it to the vocabulary of phonetics and phonology by giving it large papers I’ve written to train on.

Speech recognition is an incredibly hard task, and as a phonetics geek, I’m constantly amazed at how good modern software has gotten at it. I’m quite happy with Dragon Dictate 3, and if you need it (and are willing to pay the price), you might be happy too. Nuance has some work to do yet on the implementation, but if this is the state of the art now, I can’t wait to see what speech recognition software will look like in 10 more years.

That’s the serious part of this review. Tomorrow, I’ll talk about the kinds of awesome you can find when you’re not playing along. Arr ewe red E 4 sum fun?

Edit: See my Dragon Dictate 3 Followup Review, which details my continued feelings after using the software for almost a month.