[12/13/16] Amazon’s Echo has made tangible the promise of an artificially intelligent personal assistant in every home. Those who own the voice-activated gadget (known colloquially as Alexa, after its female interlocutor) are prone to proselytizing “her” charms, applauding Alexa’s ability to call an Uber, order pizza or check a 10th-grader’s math homework. The company says more than 5,000 people a day profess their love for Alexa.
On the other hand, Alexa devotees also know that unless you speak to her very clearly . . . and . . . slowly, she’s likely to say: Sorry, I don’t have the answer to that question. “I love her. I hate her, I love her,” one customer wrote on Amazon’s website, while still awarding Alexa five stars. “You will very quickly learn how to talk to her in a way that she will understand and it’s not unlike speaking to a small frustrating toddler.”
Voice recognition has come a long way in the past few years. But it’s still not good enough to popularize the technology for everyday use and usher in a new era of human-machine interaction, allowing us to talk with all our gadgets—cars, washing machines, televisions. Despite advances in speech recognition, most people continue to swipe, tap and click. And probably will for the foreseeable future.
What’s holding back progress? Partly, the artificial intelligence that powers the technology still has room to improve. There’s also a serious deficit of data: specifically, audio of human voices speaking in multiple languages, accents and dialects, often in noisy circumstances that can defeat the code.
So Amazon, Apple, Microsoft and China’s Baidu have embarked on a worldwide hunt for terabytes of human speech. Microsoft has set up mock apartments in cities around the globe to record volunteers speaking in a home setting. Every hour, Amazon uploads Alexa queries to a vast digital warehouse. Baidu is busily collecting every dialect in China. The companies then use all that data to teach their computers how to parse, understand and respond to commands and queries.
The challenge is finding a way to capture natural, real-world conversations. Even 95 percent accuracy isn’t enough, says Adam Coates, who runs Baidu’s artificial intelligence lab in Sunnyvale, California. “Our goal is to push the error rate down to 1 percent,” he says. “That’s where you can really trust the device to understand what you’re saying, and that will be transformative.”
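Coates does not say how the error rate is measured. One common yardstick in speech recognition is word error rate: the number of word substitutions, deletions and insertions needed to turn the machine's transcript into the correct one, divided by the number of words actually spoken. The Python sketch below is a minimal illustration under that assumption, not any company's method; the function name and the sample phrases are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped by the recognizer
                curr[j - 1] + 1,     # extra word inserted by the recognizer
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word in a five-word command.
spoken = "turn on the kitchen lights"
heard = "turn on the chicken lights"
print(word_error_rate(spoken, heard))  # 1 error / 5 words
```

By this measure, 95 percent accuracy means roughly one wrong word in every 20 spoken, and a 1 percent error rate means one in every 100.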
Not so long ago, voice recognition was comically rudimentary. An early version of Microsoft’s technology running in Windows transcribed “mom” as “aunt” during a 2006 demo before an auditorium of analysts and investors. When Apple debuted Siri five years ago, the personal assistant’s gaffes were widely mocked because it, too, routinely spat out incorrect results or didn’t hear the question correctly. When asked if Gillian Anderson is British, Siri provided a list of English restaurants. Now Microsoft says its speech engine makes no more errors than professional transcribers do, and sometimes fewer; Siri is winning grudging respect; and Alexa has given us a tantalizing glimpse of the future.
Much of that progress owes a debt to the magic of neural networks, a form of artificial intelligence based loosely on the architecture of the human brain. Neural networks learn without being explicitly programmed but generally require an enormous breadth and diversity of data. The more data a speech recognition engine consumes, the better it gets at understanding different voices and the closer it gets to the eventual goal of having a natural conversation in many languages and situations.