April 5, 2020

WaveNet and Tacotron aren't TTS systems

Summary: Deep learning models for speech synthesis, such as Google's WaveNet and Tacotron, are not complete text-to-speech systems. They are each just one component of a larger pipeline of models and heuristics that together form a text-to-speech engine. WaveNet is not a text-to-speech engine. Tacotron isn't either.

In the past few years, researchers have designed many neural network architectures for synthesizing audio. The most commonly referenced ones are Google's WaveNet and Tacotron, but there are plenty of others, such as Google's WaveRNN, Baidu's Deep Voice series, NVIDIA's WaveGlow, SampleRNN, and Microsoft's FastSpeech.

These papers are announced to great fanfare on company websites and tech news sites, complete with impressive audio samples and interesting demos. This gives the impression that each paper describes a complete text-to-speech system. Google even brands its neural-vocoder voices in Google Cloud as WaveNet voices, which obscures the fact that a significant part of the TTS pipeline is shared between those voices and the standard ones. As a result, a lot of people online end up (understandably!) confused and refer to a "WaveNet TTS system" or "Tacotron TTS system", or assume that a GitHub repo with a re-implementation of one of these models can be used to build a complete speech synthesis engine.

This blog post is an attempt to rectify this (slight) misconception.

What is a text-to-speech engine?

A text-to-speech engine is a piece of software which converts text into speech (audio). This process is typically separated into a pipeline, where each step in the pipeline is its own model or set of models. An example pipeline might include:

  • Normalization: Converting non-spoken tokens (numbers, dates, etc.) to spoken words, such as "1901" to "nineteen oh one" or "5/12" to "May twelfth".
  • Part-of-Speech Tagging: Labeling words by their part of speech.
  • Phoneme Conversion: Converting words to a phonetic representation, such as IPA.
  • High-Level Audio Synthesis: Converting the phonemes into a high-level representation of audio, such as mel spectrograms, F0, spectral envelope, LSP or LPC coefficients, etc.
  • Waveform Synthesis: Converting the high-level representation into a final audio waveform.

This list does not include miscellaneous things such as networking, request parsing, audio encoding, etc.

A complete TTS engine has to do (more or less) all of these things and connect them all together.
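
To make this concrete, here is a minimal sketch of how the stages might be wired together. Every function below is a hypothetical placeholder of my own naming, not a reference to any real library; in a real engine each stage would be its own model or rule-based component.

```python
# A minimal sketch of how the stages of a TTS pipeline connect.
# Every stage below is a trivial placeholder; in a real engine each
# one would be its own model or rule-based frontend component.

def normalize(text):
    # Placeholder: a real normalizer expands "1901" -> "nineteen oh one", etc.
    return text.lower().split()

def to_phonemes(words):
    # Placeholder: a real frontend maps words to a phonetic alphabet (e.g. IPA).
    return [ch for word in words for ch in word]

def acoustic_model(phonemes):
    # Placeholder: a real model (a Tacotron-style network, say) predicts
    # mel-spectrogram frames from the phoneme sequence.
    return [[0.0] * 80 for _ in phonemes]

def vocoder(spectrogram):
    # Placeholder: a real vocoder (a WaveNet-style network, say) turns
    # spectrogram frames into raw audio samples.
    return [0.0] * (len(spectrogram) * 256)

def text_to_speech(text):
    """Chain the stages together: text in, waveform samples out."""
    return vocoder(acoustic_model(to_phonemes(normalize(text))))

samples = text_to_speech("Hello world")
```

The point of the sketch is only the shape of the chain: the waveform model sits at the very end, consuming what every earlier stage produces.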

What are WaveNet and Tacotron?

WaveNet and Tacotron are neural network models that each address a single step of the above pipeline. Specifically, WaveNet is a neural vocoder, and is responsible for the "waveform synthesis" step. Tacotron is a sequence-to-sequence model for spectrogram synthesis, and addresses the "high-level audio synthesis" step.
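
For a flavor of what the waveform-synthesis step involves, here is a minimal PyTorch sketch of the dilated causal convolutions that WaveNet-style vocoders are built from. It is only an illustration of the building block, not a full WaveNet: it omits the gated activations, residual and skip connections, and the conditioning on the upstream spectrogram features.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """A 1-D convolution that only sees past samples (causal)."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.trim = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=self.trim)

    def forward(self, x):
        # Drop the extra samples that symmetric padding adds on the right,
        # so output[t] depends only on inputs at positions <= t.
        return self.conv(x)[..., :x.size(-1)]

# Stacking layers with increasing dilation gives a large receptive field
# over past audio samples without needing huge convolution kernels.
stack = nn.Sequential(*[CausalConv1d(32, kernel_size=2, dilation=2 ** i)
                        for i in range(8)])

audio = torch.randn(1, 32, 16000)   # (batch, channels, samples)
out = stack(audio)                   # same length as the input
```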

Now that we have these distinctions, we can ask more specifically: What models have been developed for each stage of this pipeline?

Disclaimer

I personally know the authors of many of the papers I listed above and worked on several of them myself. I'm not listing them in any particular order or trying to promote my own work. These are just the systems that came to mind. Don't judge me for my choices here.

Additionally, I'm one of the founders of Voicery, where we build custom text-to-speech engines based on models similar to the ones above. If you have any questions or are looking to deploy your own models like these, check out our demos and get in touch.