Turns out voice recognition software has improved to the point where it is significantly faster and more accurate at producing text on a mobile device than we are at typing on its keyboard. That’s according to a new study by Stanford University, the University of Washington and Baidu, the Chinese Internet giant. The study ran tests in English and Mandarin Chinese.
Baidu chief scientist Andrew Ng says this should not feel like defeat. “Humanity was never designed to communicate by using our fingers to poke at a tiny little keyboard on a mobile phone. Speech has always been a much more natural way for humans to communicate with each other,” he says.
Researchers set up a competition, pitting a Baidu program called Deep Speech 2 against 32 humans, ages 19 to 32. The humans took turns saying and then typing short phrases into an iPhone — like “buckle up for safety” and “wear a crown with many jewels” and “this person is a disaster.” They found the voice recognition software was three times faster.
Stanford computer scientist James Landay did not expect that. “The surprise for me was that it was that much better: three times faster! You would think everyone would be flocking to use it if they knew how much better it actually was.”
Voice recognition still gets a bad rap. That could be because of how people use it. Apple’s Siri, the beloved and befuddled personal assistant, has a hard time answering basic questions.
The Stanford University-University of Washington-Baidu team didn’t test query skills. They zoomed in on voice recognition software’s ability to transcribe spoken words into text. In English, they found the software’s error rate was 20.4 percent lower than that of humans typing on a keyboard; in Mandarin Chinese, it was 63.4 percent lower.
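Speech-to-text accuracy is conventionally scored as word error rate (WER): the minimum number of word substitutions, insertions and deletions needed to turn the transcript into the reference phrase, divided by the reference length. The study did not publish its scoring code, so this is only a hedged sketch of how such a metric is typically computed, using one of the study's test phrases as sample data:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# A perfect transcript scores 0.0; one wrong word out of four scores 0.25.
print(word_error_rate("buckle up for safety", "buckle up for safety"))   # 0.0
print(word_error_rate("buckle up for safety", "buckle up four safety"))  # 0.25
```

A "20.4 percent lower" error rate means the software's WER was about four-fifths of the typists' WER on the same phrases, not that it was 20.4 points lower in absolute terms.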
Landay hopes these findings encourage people to revisit the idea of talking to their phone.
“People probably play with Siri and find oh, it didn’t give them the right answer. So they don’t think to use speech as a way to do their text messaging or their email or whatnot,” he says. “Using speech for those things is now working really well.”
Back in the 1990s, researchers found voice recognition tools were far less accurate than keyboard typing. Slang and ambient noise in a room tripped up the software.
In the last few years, that’s changed for a few reasons: Just like smartphone cameras with more megapixels can see us better, the built-in microphones can hear us better. Supercomputers are churning through data more effectively in a process called “deep learning.”
And there’s more training data to vacuum in and learn from. For example, Ng says, Baidu has five years’ worth of audio — unique recordings of people speaking that can play nonstop from now until 2021.
Last year, 65 percent of smartphone owners in the U.S. used voice assistants, according to the 2016 Internet Trends Report, a popular annual overview by tech investor Mary Meeker.
Many tech companies are betting that now is the inflection point and are hiring experts in the field of “natural language processing.” Google and Amazon are inviting developers to work on voice-driven products.
It’s easy to see how talking to your device would be far better than typing — say, when you’re driving.
Baidu’s Ng imagines another scenario. He does not have children yet. But, he says, he looks forward to the day when his future grandchild comes home and asks, “Is it really true that when you were young, if you came home and you said something to your microwave oven — did it really just sit there and ignore you? That’s just so rude of the microwave.”
His co-author Landay reins him in and notes there are many moments — in a meeting, in bed with your partner sleeping — when typing still makes more sense than talking to one’s devices.