Speech synthesis technologies will drive the next wave of innovative voice applications

Deep learning is revolutionizing text-to-speech and speech synthesis technologies.

By Yishay Carmiel and Ben Lorica.


Recent progress in natural language processing (NLP) and speech models has made voice applications accessible to companies across industries. From smartphone applications and personal assistants to sales and customer support to smart home speakers and appliances, voice applications have become part of daily life. Advancements in deep learning and artificial intelligence technologies—natural language understanding (NLU) in particular—have brought voice technologies to a broader developer audience and greatly expanded the range of practical use cases.

While much of the recent media coverage and attention-grabbing applications have focused on automatic speech recognition (ASR), in this post we’ll focus on text-to-speech (TTS) and speech synthesis—technologies that create artificial human speech.

As consumers integrate speech technology into their daily lives through products such as smart speakers and smart home systems, they will increasingly demand higher quality and more realistic speech synthesis. We expect TTS and speech synthesis to drive the next wave of innovative voice applications. In 2018, Google Duplex previewed capabilities that speech synthesis and NLU technologies might be able to achieve: the system demonstrated personal assistants capable of booking appointments by interacting with humans through phone conversations. Assistive technology use cases in healthcare will also drive innovation in speech synthesis, bringing maturity to communication aids for the impaired and novel solutions for a wide range of disabilities. We expect to see highly advanced applications that can, for example, serve as the voice of a patient with damaged vocal cords, using their own voice.

At a high level, voice applications have three main components: speech recognition, speech profiling, and speech synthesis.

Figure 1: There are three main components of voice applications.
  • Speech recognition is the translation of spoken language into text. It is also called automatic speech recognition, computer speech recognition, and speech to text. A major application of ASR is transcribing conversations. Other examples include “Hey, Siri” commands, such as “call home,” and voice interfaces requiring simple data entry (e.g., entering credit card numbers or call routing—“press or say ‘1’ for customer service”).

  • Speech profiling is the process of mining information from speech audio beyond recognizing the words themselves, including age, gender, emotion, the language spoken, speaker verification, etc. Applications include biometrics, sentiment analysis, and metadata extraction to improve customer intelligence initiatives.

  • Speech synthesis is the artificial creation of human speech. In this post we’ll occasionally use the term “speech synthesis” to refer to technologies that cut across TTS and speech synthesis. The more familiar term is “text-to-speech” but we’re opting for “speech synthesis” because we expect input sources in the future to include a range of formats including text and audio. An example of a system that can take audio input is Tencent’s PitchNet: a model that takes audio of one singing voice and converts it into audio of another voice singing the same content.

An Overview of Speech Synthesis Systems

Voice applications are complex. A close look at voice-enabled AI conversation applications, for instance, reveals the complexity of components on which such applications rely. Each of the components in Figure 1, for example, might require multiple machine learning (ML) models.

Figure 2: Components of voice-enabled AI conversation applications might involve multiple machine learning models.

The fact that speech technologies mainly interact with humans increases their complexity. Most machine learning systems use quantitative metrics to measure the effectiveness or accuracy of their outputs. If an ML system is designed to catalogue species of birds, for example, it’s straightforward to score the accuracy. The output of voice applications is meant for human listening and requires qualitative metrics evaluated by a panel of humans. A common metric is the Mean Opinion Score (MOS), where a panel of listeners rates the output of a speech synthesis system using a range of 1 (lowest perceived quality) to 5 (highest perceived quality).
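Because MOS is just the average of panel ratings, it is simple to compute. The sketch below, using only the standard library, averages a hypothetical panel's 1–5 scores and attaches a rough confidence interval; the ratings are invented for illustration.

```python
# A minimal sketch of computing a Mean Opinion Score (MOS) from panel
# ratings on the standard 1 (lowest) to 5 (highest) scale.
from math import sqrt
from statistics import mean, stdev

def mos(ratings):
    """Return the MOS and a rough 95% confidence interval for 1-5 ratings."""
    m = mean(ratings)
    # Normal-approximation interval; reasonable for larger panels.
    ci = 1.96 * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, ci

panel = [4, 5, 4, 3, 5, 4, 4, 5]  # hypothetical listener scores
score, ci = mos(panel)
print(f"MOS = {score:.2f} ± {ci:.2f}")
```

In practice, panels follow controlled listening-test protocols, but the arithmetic reduces to this averaging step.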

Classical Text-to-Speech and Speech Synthesis Systems

Until the mid-2010s, when voice applications began integrating deep learning technologies, systems for producing artificial voices had two main components: natural language understanding and signal generation.

Natural language understanding

Natural language understanding is the process by which a machine learns the peculiarities of language and interprets language components in context. In English, consider “1939.” The proper rendering of 1939 into speech depends on the context. NLU’s job is to understand the context and properly parse “1939.”

  • My father was born in 1939 (“nineteen thirty-nine”)

  • Please press 1939 (“one-nine-three-nine”)

  • This computer costs $1939 (“one thousand nine hundred thirty-nine dollars”)

Classical NLU systems first pass text through a normalization system which converts text into a canonical, standard form that can more easily be rendered into speech. Next, the system passes the converted text through additional systems that turn it into linguistic units of sound called phonemes. These units get assembled together using a prosody model that ensures words and sentences sound correct.
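The “1939” examples above can be sketched as a toy normalizer. The contextual rules and helper names below are invented for illustration; production normalization systems rely on far richer grammars and models.

```python
# A toy illustration of context-dependent text normalization, the first
# stage of a classical TTS front end.
import re

ONES = ["", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

def normalize_1939(text):
    """Expand the token '1939' according to simple contextual cues."""
    if re.search(r"\$1939", text):        # currency: read as a cardinal
        return "one thousand nine hundred thirty-nine dollars"
    if re.search(r"press\s+1939", text):  # keypad entry: digit by digit
        return " ".join(ONES[int(d)] for d in "1939")
    # Default: treat it as a year, read in two-digit pairs.
    return two_digits(19) + " " + two_digits(39)
```

A real system would of course handle arbitrary numbers, dates, abbreviations, and much more, but the shape of the task is the same: map written forms to canonical spoken forms before any sound is produced.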

Signal generation

Signal generation is the process of taking the text output from the NLU component and turning it into a realistic-sounding voice. Early signal generation systems used a voice concatenation approach: they recorded uttered parts of speech, such as phonemes or combinations of phonemes, and concatenated them while maintaining prosody and intonation. In principle, such a system could generate speech from any text presented to it. Concatenation systems are complex, however, and produce outputs that aren’t as realistic as those of modern deep learning-based systems.
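The core of the concatenation idea can be sketched in a few lines: pre-recorded unit waveforms are joined, with a short crossfade at each seam to smooth the transitions. The tiny unit "inventory" below is a stand-in; real systems store large databases of recorded units and apply much more sophisticated smoothing and prosody adjustment.

```python
# A minimal sketch of concatenative signal generation: recorded unit
# waveforms are joined with a linear crossfade to smooth the seams.

def crossfade_concat(units, overlap=4):
    """Concatenate waveform units (lists of samples), blending
    `overlap` samples at each joint with a linear crossfade."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)
            # Fade out the tail of the output, fade in the next unit.
            out[-n + i] = out[-n + i] * (1 - w) + unit[i] * w
        out.extend(unit[n:])
    return out

# Hypothetical phoneme-sized units from a recorded inventory.
inventory = {"k": [0.1] * 8, "ae": [0.5] * 8, "t": [0.2] * 8}
speech = crossfade_concat([inventory[p] for p in ("k", "ae", "t")])
```

The hard part in practice is not the splicing itself but selecting units whose pitch and duration fit the target prosody, which is one reason concatenative systems grew so complex.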

Figure 3: Speech synthesis systems are composed of several components that are built and trained separately.

The rise of deep learning for speech synthesis

Deep learning has revolutionized systems for producing artificial voices. Two main components, in particular, have seen significant innovation: the process of converting text to a high-level audio representation, and the process of converting those representations into speech. Two deep learning models, WaveNet in 2016 and Tacotron in 2017, provided such significant improvements over existing systems that deep learning has now become central to modern speech platforms.

  • WaveNet, a neural vocoder system from DeepMind, is a deep learning model trained on speech recordings. It is used to produce human-like voices. You can hear it in action on the Google Cloud Text-to-Speech service.

  • Tacotron is a deep learning-based text-to-speech model that synthesizes speech directly from text characters.

Since the introduction of WaveNet and Tacotron, major components of speech synthesis systems have come to rely on deep learning. Neither system, however, is an end-to-end (e2e) solution. Speech synthesis systems are composed of several components that are built and trained separately (see Figure 3), which means each component has to be trained individually, with its own training data, and the full system depends on chaining these separately trained components together. WaveNet and Tacotron each address one component of the end-to-end pipeline.
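The staged pipeline of Figure 3 can be made concrete with a schematic sketch in which each trained component is reduced to a toy function. Every function body here is a deliberately fake stand-in; in a real system the acoustic model stage is played by something like Tacotron (text or phonemes to spectrogram frames) and the vocoder stage by something like WaveNet (frames to waveform samples), each trained on its own data.

```python
# A schematic sketch of a multi-stage speech synthesis pipeline.
# Each stage is a toy stand-in for a separately trained component.

def normalize(text):
    """Text normalization front end (toy: lowercase and tokenize)."""
    return text.lower().split()

def to_phonemes(tokens):
    """Grapheme-to-phoneme conversion via a tiny toy lexicon."""
    lexicon = {"hello": ["HH", "AH", "L", "OW"],
               "world": ["W", "ER", "L", "D"]}
    return [p for t in tokens for p in lexicon.get(t, [])]

def acoustic_model(phonemes):
    """Tacotron's role in a real system: phonemes -> spectrogram frames.
    Here: three fake feature values per phoneme."""
    return [[hash(p) % 7 / 10.0] * 3 for p in phonemes]

def vocoder(frames):
    """WaveNet's role in a real system: frames -> waveform samples.
    Here: just flatten the frames."""
    return [s for frame in frames for s in frame]

def synthesize(text):
    # Sequential composition of separately built components.
    return vocoder(acoustic_model(to_phonemes(normalize(text))))

samples = synthesize("Hello world")
```

The point of the sketch is the composition in `synthesize`: because each stage is trained on its own objective, errors compound across the chain, which is precisely the limitation that end-to-end systems aim to remove.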

A 2020 paper from DeepMind introduced an end-to-end (e2e) speech synthesis model that takes an input (text, phonemes) and outputs raw speech waveforms. e2e systems are also being investigated in automatic speech recognition. While end-to-end speech synthesis systems have attracted the attention of many researchers, it’s important to note that we are in the earliest stages of realizing an e2e system that can take an input and produce reasonably human-like output in real time.

Looking Ahead

There are several active areas of research in speech synthesis. Each offers a myriad of applications and use cases, along with risk/reward trade-offs.

Adaptable speech synthesis

In today’s applications, it takes hours of recording and tuning for a speech synthesis system to output a specific person’s voice. What if we could take an existing, realistic-sounding speech synthesis system and imbue it with a specific voice using only minutes, or even seconds, of a person’s recorded speech? A highly personalized, realistic speech synthesis system could revolutionize countless applications. The reality of such a system isn’t as far off as it sounds: open source voice cloning systems that require only a few minutes of recorded audio of a speaker’s voice are already available, and are bound to improve over time. Microsoft recently announced an offering that allows companies to create custom synthetic voices.

Looking further ahead, we expect to see mainstream speech synthesis systems advanced enough to generate multilingual audio using the same voice. This would drive interesting applications for travel, media and movies, and public service initiatives, to name just a few use cases.

There are obvious risks of adaptable speech synthesis systems. Fraudulently impersonating someone—using audio deepfakes—is becoming trivial and opens up many potential nefarious applications. A Wall Street Journal article described how fraudsters were able to impersonate the voice of an executive and demand a wire transfer. As the technology becomes more sophisticated, blackhat users likely will become more sophisticated as well.

Real-time speech synthesis

These are systems that enable realistic conversations without pauses and that can respond to a human in real time, effectively masking the fact that the human is talking to a computer. Combine real-time capabilities with advances in natural language understanding and we start to see the level of realistic assistants hinted at in the Google Duplex demo. The risk here, again, is fraudulent impersonation.


Realistic conversations require more than just real-time responses. A conversation between humans involves banter, interruptions, imperfect phrasing, use of multilingual expressions or words, and more. A realistic speech synthesis system would reproduce those conversational peculiarities with the proper emotional tone in the context of the conversation. This level of innovation would revolutionize personal assistants. It would, however, also revolutionize fraudulent impersonation and disinformation campaigns. Initial work in this area includes MelNet from Facebook Research, an end-to-end model that can be used to produce audio that mimics voices and speech patterns.


In this post, we provided an overview of speech synthesis technologies. The advent of deep learning has led to vastly more realistic systems that are already being used in many settings. We close this post with a few observations and predictions about technologies for producing artificial speech:

  • Speech synthesis and TTS will be central to the next wave of innovative voice applications. These technologies are rapidly getting democratized and are much more accessible to developers who don’t have backgrounds in machine learning or speech systems.

  • Real-time, adaptive, and realistic speech synthesis technologies can lead to disruptive applications in many domains, including customer service, healthcare (assistive technologies), and the public sector (public service announcements), just to name a few.

  • Smart assistants will benefit from continued progress in both speech synthesis and natural language understanding. The 2018 Google Duplex demonstration gave a sneak peek into the kinds of applications developers will be able to build in the near future.

  • While end-to-end systems are still very much in their infancy, we expect them to steadily improve over the next few years.

  • As with any rapidly improving technology, there are many potential negative applications of speech synthesis, including the use of audio deepfakes to impersonate someone. To better manage downside risks, developers of voice applications will need to work closely with other teams to put Responsible AI principles into practice.

Yishay Carmiel is an AI Leader at Avaya, and has years of experience in speech technologies and conversational applications.

Ben Lorica is co-chair of the Ray Summit, chair of the NLP Summit, and principal at Gradient Flow.
