Google recently unveiled LaMDA, an experimental language model that the company claims could improve its conversational AI assistants and allow for more natural conversations. LaMDA aims to eventually converse naturally about almost any topic without being retrained for each new one. It’s one of a growing number of AI projects that could leave you wondering whether you are talking to a human being.

“My estimate is that within the next 12 months, users will start being exposed to and getting used to these new, more emotional voices,” James Kaplan, the CEO of MeetKai, a conversational AI virtual voice assistant and search engine, said in an email interview. “Once this happens, the synthesized speech of today will sound to users like the speech of the early 2000s sounds to us today.”
Voice Assistants With Character
Google’s LaMDA is built on Transformer, a neural network architecture invented by Google Research. Unlike other language models, Google’s LaMDA was trained on real dialogue.

Part of the challenge of making natural-sounding AI speech is the open-ended nature of conversations, Google’s Eli Collins wrote in a blog post. “A chat with a friend about a TV show could evolve into a discussion about the country where the show was filmed before settling on a debate about that country’s best regional cuisine,” he added.

Things are moving fast with robot speech. Eric Rosenblum, a managing partner at Tsingyuan Ventures, which invests in conversational AI, said that some of the most fundamental problems in computer-aided speech are virtually solved. For example, the accuracy rate in understanding speech is already extremely high in services such as transcriptions done by the software Otter.ai or medical notes taken by DeepScribe.

“The next frontier, though, is much more difficult,” he added. “Retaining understanding of context, which is a problem that goes well beyond natural language processing, and empathy, such as computers interacting with humans need to understand frustration, anger, impatience, etc. Both of these issues are being worked on, but both are quite far from satisfactory.”
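LaMDA itself isn’t publicly available, but the general recipe described above, a Transformer-based language model tuned on dialogue, can be sketched with openly released tools. The snippet below is a minimal illustration, not Google’s system; it assumes the Hugging Face transformers library and uses Microsoft’s DialoGPT, a dialogue-trained Transformer, purely as a stand-in.

```python
# A minimal sketch, not Google's LaMDA (which is not publicly released).
# DialoGPT, an openly available Transformer trained on conversations, is used
# here only to show how a dialogue-tuned language model produces a reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Encode the user's turn, ending with the end-of-sequence token the model expects.
user_turn = "What country was that show filmed in?"
input_ids = tokenizer.encode(user_turn + tokenizer.eos_token, return_tensors="pt")

# Generate a reply; sampling keeps the response from sounding too scripted.
output_ids = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens (everything after the user's turn).
reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)
```

Production assistants add many layers on top of this, such as safety filtering and grounding in facts, but the core loop of encoding a conversational turn and generating the next one is the same.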
Neural Networks Are the Key
To generate lifelike voices, companies are using technology like deep neural networks, a form of machine learning that classifies data through layers, Matt Muldoon, North American president at ReadSpeaker, a company that develops text-to-speech software, said in an email interview. “These layers refine the signal, sorting it into more complex classifications,” he added. “The result is synthetic speech that sounds uncannily like a human.”

Another technology under development is prosody transfer, which combines the sound of one text-to-speech voice with the speaking style of another, Muldoon said. There’s also transfer learning, which reduces the amount of training data needed to produce a new neural text-to-speech voice.

Kaplan said producing human-like speech also takes enormous amounts of processing power. Companies are developing neural accelerator chips, which are custom modules that work in conjunction with regular processors. “The next stage in this will be putting these chips into smaller hardware, as currently it is already done for cameras when AI for vision is required,” he added. “It will not be long before this type of computing capability is available in the headphones themselves.”

One challenge to developing AI-driven speech is that everyone talks differently, so computers tend to have a hard time understanding us. “Think Georgia vs. Boston vs. North Dakota accents, and whether or not English is your primary language,” Monica Dema, who works on voice search analytics at MDinc, said in an email. “Thinking globally, it’s costly to do this for all the regions of Germany, China, and India, but that does not mean it isn’t or can’t be done.”
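To make the layered-network and transfer-learning ideas above a little more concrete, here is a toy sketch in PyTorch. It is not ReadSpeaker’s pipeline or any real text-to-speech system; the network, data, and dimensions are invented for illustration. The point is simply that the lower layers of a model trained on one voice can be frozen and reused, so only a small top layer needs to be adapted with a small amount of new-speaker data.

```python
# A minimal sketch under invented assumptions, not a real TTS system.
# It illustrates (1) a deep network refining input through stacked layers and
# (2) transfer learning: freeze the shared lower layers learned from one voice,
# then fine-tune only the top layer on a little data from a new voice.
import torch
import torch.nn as nn

class TinyTTSNet(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        # Lower layers: general speech structure, shared across voices.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Top layer: characteristics of one specific voice.
        self.voice_head = nn.Linear(hidden, feat_dim)

    def forward(self, x):
        return self.voice_head(self.shared(x))

model = TinyTTSNet()
# Imagine `model` was already trained on many hours of a base voice here.

# Transfer learning: freeze the shared layers, adapt only the voice head.
for p in model.shared.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.voice_head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy "new voice" data: a few hundred frames instead of many hours of audio.
x_new = torch.randn(512, 80)
y_new = torch.randn(512, 80)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x_new), y_new)
    loss.backward()
    optimizer.step()

print(f"final adaptation loss: {loss.item():.4f}")
```

Real neural text-to-speech systems are far larger and operate on spectrograms and waveforms rather than random tensors, but the same freeze-and-fine-tune pattern is what lets a new voice be built from far less recorded speech.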