
Oct 4, 2024 · 6 min read

What is Text to Speech

Exploring the origins, uses, and inner workings of text-to-speech systems.

cover image

Kaushal Choudhary

Senior Developer Advocate


What is Text-to-Speech

Text-to-speech is a form of speech synthesis that converts any string of characters (text) into spoken output (audio). A simple Text-to-Speech (TTS) system contains two components:

  1. A Text Analysis System, which encodes the input text into a hidden representation that captures its meaning and expressive intensity
  2. A Speech Synthesis System, which decodes this representation into speech (a toy sketch of this two-part structure follows below)

In this article, we will take a more in-depth look at these two components and walk through the code to run a text-to-speech system end-to-end.
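To make the two-component view above concrete, here is a minimal, purely illustrative PyTorch sketch (not a real production model; every module, size, and name here is an assumption): an encoder maps character IDs to hidden states, and a decoder maps those hidden states to mel-spectrogram frames that a vocoder would later turn into audio.

import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy text-analysis + speech-synthesis pair, for illustration only."""

    def __init__(self, vocab_size=40, hidden=128, n_mels=80):
        super().__init__()
        # "Text analysis": characters -> hidden states encoding meaning/expression
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # "Speech synthesis": hidden states -> mel-spectrogram frames
        self.decoder = nn.Linear(hidden, n_mels)

    def forward(self, char_ids):
        hidden_states, _ = self.encoder(self.embed(char_ids))
        return self.decoder(hidden_states)  # (batch, time, n_mels); a vocoder turns this into audio

# One mel frame per input character here; real models predict many frames per character.
mel = TinyTTS()(torch.randint(0, 40, (1, 16)))
print(mel.shape)  # torch.Size([1, 16, 80])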

History

In the 1930s, Bell Labs developed the Vocoder, which analyzed speech into fundamental tones. Homer Dudley created the Voder, the first keyboard-operated voice synthesizer. The late 1950s saw the rise of computer-based speech synthesis, culminating in 1968 when Noriko Umeda and colleagues at the Electrotechnical Laboratory introduced the first general English text-to-speech system.

Fast forward to 2016, when DeepMind (part of Google) introduced WaveNet, a deep learning model that generates audio waveforms sample by sample. Its autoregressive design predicts each audio sample from the samples before it, using a fully convolutional network whose dilation factors grow layer by layer, expanding the receptive field exponentially and producing realistic, high-fidelity audio.

Google WaveNet.
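The key architectural trick in WaveNet is a stack of dilated 1-D convolutions. The snippet below is a rough sketch of that pattern under assumed channel counts and depth, not DeepMind's actual model: each layer doubles the dilation, so the receptive field grows exponentially with the number of layers.

import torch
import torch.nn as nn

# Illustrative stack of dilated 1-D convolutions (WaveNet-style pattern, not the real model)
layers = []
receptive_field = 1
for i in range(8):                     # dilations 1, 2, 4, ..., 128
    dilation = 2 ** i
    layers.append(nn.Conv1d(32, 32, kernel_size=2, dilation=dilation))
    receptive_field += dilation        # each layer adds `dilation` samples of context

stack = nn.Sequential(*layers)
print("receptive field:", receptive_field, "samples")  # 256 with 8 layers

x = torch.randn(1, 32, 1024)           # (batch, channels, time)
print(stack(x).shape)                  # time shrinks by receptive_field - 1 without causal padding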

Steps Involved

To understand the working of a typical Text-to-Speech (TTS) system, let's break it down step by step.

  1. First, the input is a sequence of ASCII characters, which can vary in length. To simplify processing, we split the text into sentences using a sentence splitting algorithm. Even if there's only one sentence, we still identify sentence boundaries to ensure robustness.

  2. Next, we break each sentence into tokens, which are usually words but can also include numbers, dates, or punctuation. This step is called tokenization and helps structure the text for further processing.

  3. Each token is assigned a semiotic class, which helps decode non-language symbols like numbers or abbreviations. This process, known as text normalization, converts symbols into readable words.

  4. Following this, a basic prosodic analysis is performed: each word is assigned a phonetic transcription, and the text is divided into prosodic units such as phrases and sentences. This is called text-to-phoneme (or grapheme-to-phoneme) conversion. Together, the phonetic transcriptions and prosody form the linguistic representation used in speech synthesis (a rough sketch of these front-end steps appears at the end of this section).

  5. In the synthesis phase, words are encoded as phonemes for compactness. Unit selection synthesis, a type of Concatenative Synthesis, uses a database of pre-recorded speech to match the input phonemes. Signal processing stitches together the selected speech fragments, forming a continuous speech waveform.

This concludes the TTS process, where text is transformed into natural-sounding speech. If you did not catch every detail, don't worry; we will explain them in future articles.
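To make steps 1-4 a little more concrete, here is a rough, hypothetical sketch of a TTS front end in Python. The regex-based sentence splitter, the tiny normalization table, and the toy_g2p lookup are all simplifications invented for illustration; real systems use much richer rules, lexicons, and learned models.

import re

# 1. Sentence splitting (naive: split after ., ! or ? followed by whitespace)
def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# 2. Tokenization (naive split into words and punctuation)
def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

# 3. Text normalization: expand non-language tokens into readable words (toy table)
NORMALIZE = {"2024": "twenty twenty four", "dr": "doctor"}

def normalize(token):
    token = token.lower()
    return NORMALIZE.get(token, token)

# 4. Grapheme-to-phoneme conversion (toy lexicon; real systems use a G2P model)
TOY_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def toy_g2p(word):
    return TOY_LEXICON.get(word, list(word))  # fall back to spelling the word out

for sentence in split_sentences("Hello world! This demo was written in 2024."):
    tokens = [normalize(t) for t in tokenize(sentence)]
    print(tokens, [toy_g2p(t) for t in tokens if t.isalpha()])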

Hands On Exercise

Tacotron 2 is an end-to-end text-to-speech (TTS) model designed to generate high-quality, natural-sounding speech directly from text without additional prosody information. It is known for producing high-fidelity audio with minimal latency, making it a popular choice for speech synthesis applications.

Tacotron-2-Model

Today, we are going to generate speech using Tacotron 2 on Google Colab. The complete code can be found here.

Note

This implementation requires a GPU runtime; you can use the "Connect" button in the top-right corner of Colab to switch the runtime to "T4 GPU".

Google-Colab

Step 1 - Install necessary libraries

%%bash
pip3 install deep_phonemizer torch torchaudio soundfile

Step 2 - Check if your GPU is connected

You can check if the GPU runtime is working by using this command.

import torch
import torchaudio

torch.random.manual_seed(0)
device = "cuda"  if torch.cuda.is_available() else  "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)

The output should be something like:

2.4.1+cu121
2.4.1+cu121
cuda

If it is not, your torch installation has an issue or your runtime is not connected to a GPU.

Step 3 - Text pre-processing

The pre-trained Tacotron2 model requires a specific set of symbol tables, which are also provided by torchaudio. However, we will first implement the encoding manually for better understanding.

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in  enumerate(symbols)}
symbols = set(symbols)

def text_to_sequence(text):
	"""
		Args:
		text: the text input for the model

		Returns: the mapping of input text to symbols
	"""
	return [look_up[s] for s in text.lower() if s in symbols]

	text = "Hello world! From the team of Smallest dot ai.
	print(text_to_sequence(text))

We can use two different types of encoding: character-based and phoneme-based. You can choose to run either 3.1 or 3.2.

Step 3.1 - Character Based-Encoding

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()
text = "Hello World, from the smallest dot ai team"
processed, lengths = processor(text)

print([processor.tokens[i] for i in processed[0, : lengths[0]]])

The output will look like this: ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', ',', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', ' ', 's', 'm', 'a', 'l', 'l', 'e', 's', 't', ' ', 'd', 'o', 't', ' ', 'a', 'i', ' ', 't', 'e', 'a', 'm']

As you can see, this breaks the text down into characters, i.e. every letter or symbol becomes a separate token.

Step 3.2 - Phoneme-based encoding

Phoneme-based encoding is similar to character-based encoding, but it uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme) model. Similar to the case of character-based encoding, the encoding process is expected to match what a pretrained Tacotron2 model is trained on.

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
text = "Hello World, from the smallest dot ai team"

with torch.inference_mode():
    processed, lengths = processor(text)

print([processor.tokens[i] for i in processed[0, : lengths[0]]])

Output: ['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', ',', ' ', 'F', 'R', 'AH', 'M', ' ', 'DH', 'AH', ' ', 'S', 'M', 'AO', 'L', 'AH', 'S', 'T', ' ', 'D', 'AA', 'T', ' ', 'AY', ' ', 'T', 'IY', 'M']

Compared to character-based encoding, this output provides the phonetic versions of the text instead of simply splitting the text based on different characters. You can read more about phonemes here.

Step 4 - Spectrogram generation

We will now generate spectrograms: a spectrogram is a representation of the frequency spectrum of a speech signal over time. You can read more about spectrograms here.
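If spectrograms are new to you, the short snippet below is an illustrative aside that is not part of the Tacotron 2 tutorial (the sweep signal and transform parameters are our own choices): it computes a mel spectrogram of a synthetic rising tone with torchaudio, just to show what the representation looks like before we let Tacotron 2 predict one from text.

import math
import torch
import torchaudio
import matplotlib.pyplot as plt

sample_rate = 22050
t = torch.linspace(0, 1.0, sample_rate)                  # 1 second of samples
waveform = torch.sin(2 * math.pi * (220 + 440 * t) * t)  # a rising tone
waveform = waveform.unsqueeze(0)                         # (channel, time)

# 80 mel bands, a resolution similar to what Tacotron 2 predicts
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
)
mel = mel_transform(waveform)                            # (channel, n_mels, frames)

plt.imshow((mel[0] + 1e-6).log2().numpy(), origin="lower", aspect="auto")
plt.xlabel("frames")
plt.ylabel("mel bins")
plt.show()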

import matplotlib.pyplot as plt

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
text = "Hello World, from the smallest dot ai team"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)

_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

Step 5 - Plot the Spectrogram

def plot():
    fig, ax = plt.subplots(3, 1)
    for i in range(3):
        with torch.inference_mode():
            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        # Plot each run on its own axis
        ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

plot()

Spectrogram

Do you see something interesting? Each run produces a slightly different spectrogram: Tacotron 2's inference involves sampling, so the generated output (and its length) varies from run to run.

Step 6 - Using Vocoders for Audio Generation

Vocoders convert the spectrogram into a time-domain waveform that can be played back by audio devices and heard by human ears. In this section, we will convert the spectrogram into waveforms using two different vocoders.
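Before using the bundled vocoders, here is a tiny, self-contained illustration of what classical vocoding does, using torchaudio's Griffin-Lim transform on a synthetic tone (this toy example, including the chosen n_fft and sample rate, is our own assumption and is separate from the Tacotron pipeline): we compute a magnitude-only spectrogram, discarding the phase, and Griffin-Lim iteratively estimates a phase to reconstruct a waveform.

import math
import torch
import torchaudio

sample_rate = 22050
t = torch.linspace(0, 1.0, sample_rate)
waveform = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)  # 1 second of a 440 Hz tone

n_fft = 1024
spec_transform = torchaudio.transforms.Spectrogram(n_fft=n_fft, power=2)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, power=2)

spectrogram = spec_transform(waveform)    # magnitude-only: the phase is discarded
reconstructed = griffin_lim(spectrogram)  # Griffin-Lim iteratively estimates the phase

print(waveform.shape, reconstructed.shape)  # lengths match up to framing effects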

Step 6.1 - Running the GriffinLim Vocoder

First, we will use the Griffin-Lim vocoder, the same algorithm used in the original Tacotron model.

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
processor = bundle.get_text_processor()

tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

Step 6.2 - Running the WaveGlow Vocoder

Optionally, we can also run the WaveGlow vocoder.

waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)

checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)

state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}
waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
	waveforms = waveglow.infer(spec)

Step 7 - Plotting the waveforms with corresponding spectrograms

import IPython.display

def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()
    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)

plot(waveforms, spec, vocoder.sample_rate)

Step 8 - Save the audio files

import soundfile as sf
# Detach the waveforms from the torch Tensor and convert to numpy
waveforms = waveforms.cpu().numpy()

# Save the waveform to a file (assuming mono channel and sample rate of 22050 Hz)
sf.write('output_audio_wg.wav', waveforms[0], samplerate=22050)

print("Audio saved as 'output_audio_wg.wav'")

Generated Speech from different Vocoders.

  1. GriffinLim Vocoder -

    Audio from GriffinLim Vocoder

  2. WaveGlow Vocoder -

    Audio from WaveGlow Vocoder

Now we understand how a typical text-to-speech pipeline works, the different ways in which text can be encoded, and how different vocoders produce speech of varying fidelity.

Conclusion

Text-to-speech (TTS) systems have advanced from rule-based methods to deep learning models like WaveNet and Tacotron. Traditional methods struggled with naturalness and required large recorded databases, while deep learning improves flexibility and audio quality, approaching human-level naturalness.

For high-fidelity, natural-sounding voice synthesis with efficient, low-cost inference, Waves by the smallest.ai team is an excellent solution. Here’s the same text generated using the Waves platform:

Audio from Waves TTS

Generate speech for your text - Try Waves now!