Meta said it has made significant progress in its effort to create more realistic AI-generated speech. The company’s AI team reported advances in modeling expressive vocalizations, such as laughter, yawning, and cries, as well as real-time “spontaneous chit-chat.” “In any given conversation, people exchange chock-full of nonverbal signals, like intonations, emotional expression, pauses, accents, rhythms—all of which are important to human interactions,” the team wrote in a recent blog post. “But today’s AI systems fail to capture these rich, expressive signals because they learn only from written text, which captures what we say but not how we say it.”
Smarter Speech
In the blog post, Meta’s AI team said it is working to overcome the limitations of traditional AI systems, which can’t pick up the non-verbal signals in speech, such as intonation, emotional expression, pauses, accents, and rhythms, because they learn only from written text. Meta’s work differs from previous efforts because its models apply natural language processing techniques directly to audio rather than to transcripts, letting them capture the full nature of spoken language. Meta researchers say the new models could allow AI systems to convey an intended sentiment, such as boredom or irony. “In the near future, we will focus on applying textless techniques to build useful downstream applications without requiring either resource-intensive text labels or automatic speech recognition systems (ASR), such as question answering (e.g., “How’s the weather?”),” the team wrote in the blog post. “We believe prosody in speech can help better parse a sentence, which in turn facilitates understanding the intent and improves the performance of question answering.”
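The core of the “textless” approach is that models consume audio directly instead of transcripts. As a rough sketch of the general idea, not a reproduction of Meta’s system, the Python snippet below uses a pretrained HuBERT encoder from torchaudio to turn a waveform into frame-level features, then quantizes those frames into discrete “pseudo-text” units with k-means. The file path and the cluster count are placeholder assumptions, and real systems fit the quantizer on a large corpus rather than a single clip.

```python
# Sketch: turn raw speech into a sequence of discrete units that a language
# model could be trained on, with no transcripts involved.
# Assumes torch, torchaudio, and scikit-learn are installed;
# "utterance.wav" is a placeholder path to an audio file.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE      # pretrained speech encoder
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("utterance.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono: (1, time)
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer,
    # each shaped (batch, frames, feature_dim)
    features, _ = model.extract_features(waveform)
frames = features[-1].squeeze(0).numpy()       # last layer: (frames, dim)

# Quantize the frames into discrete units; 50 clusters is an arbitrary choice.
units = KMeans(n_clusters=50, n_init=10).fit_predict(frames)
print(units[:20])  # e.g. [12 12 37 37 4 ...] -- a "pseudo-text" for the audio
```

Because the units are learned from the audio itself, information about how something was said—pauses, emphasis, rhythm—can survive into downstream models instead of being discarded at a transcription step.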
AI Powers Comprehension
Not only are computers getting better at communicating meaning, but AI is also being used to power improvements in speech recognition.

Computer scientists have been working on computer speech recognition since at least 1952, when three Bell Labs researchers created a system that could recognize single spoken digits, Ryan Monsurate, chief technology officer of AI Dynamics, told Lifewire in an email. By the 1990s, speech recognition systems were commercially available, but their error rates were still high enough to discourage use outside of narrow domains such as healthcare.

“Now that deep learning models have enabled ensemble models (like those from Microsoft) to attain superhuman performance at speech recognition, we have the technology to enable speaker-independent verbal communication with computers at scale,” Monsurate said. “The next stage will include lowering the cost so that everyone who uses Siri or Google’s AI assistants will have access to this level of speech recognition.”

AI is useful for speech recognition because it can improve over time through learning, Ariel Utnik, chief revenue officer and general manager at AI voice company Verbit.ai, told Lifewire in an email interview. For example, Verbit claims its in-house AI technology detects and filters out background noise and echoes and transcribes speakers regardless of accent, generating detailed, professional transcripts and captions from live and recorded video and audio.

Still, Utnik said most current speech recognition platforms are only 75–80% accurate, a figure usually expressed through word error rate (see the sketch at the end of this section). “AI will never fully replace humans as the personal review by transcribers, proofreaders, and editors is necessary to ensure a high quality and top accuracy final transcript,” he added.

Better voice recognition could also be used to thwart hackers, said Sanjay Gupta, vice president and global head of product and corporate development at voice recognition company Mitek Systems, in an email. Research indicates that within two years, 20 percent of all successful account takeover attacks will use synthetic voice augmentation, he added.

“This means as deep fake technology becomes more sophisticated, we need to simultaneously create advanced security that can combat these tactics alongside image and video deep fakes,” Gupta said. “Combatting voice spoofing requires liveness detection technology, capable of distinguishing between a live voice and a recorded, synthetic or computer-generated version of a voice.”
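To make a figure like “75–80% accurate” concrete: the standard metric in speech recognition is word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn a system’s output into the reference transcript, divided by the length of the reference. The minimal Python implementation below is illustrative only; the sample sentences are invented, vendors may define “accuracy” differently than 1 − WER, and real evaluations normalize casing and punctuation before scoring.

```python
# Minimal word error rate (WER) calculation using Levenshtein distance over
# words: WER = (substitutions + insertions + deletions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # match, no cost
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# Invented example: one substitution ("jumps" -> "jumped") and one deletion
# ("sleeping") against a 10-word reference gives a WER of 20%.
reference = "the quick brown fox jumps over the lazy sleeping dog"
hypothesis = "the quick brown fox jumped over the lazy dog"
print(f"WER: {wer(reference, hypothesis):.0%}")  # WER: 20%
```

At 75–80% accuracy in these terms, a system is getting roughly one word in four or five wrong, which is why the human review step Utnik describes remains part of professional transcription workflows.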