Oct 4, 2024 • 6 min Read
What is Text to Speech
Exploring the discovery, uses and working of Text to Speech Systems.
Kaushal Choudhary
Senior Developer Advocate
What is Text-to-Speech
Text-to-speech is a form of speech synthesis that converts any string of characters (text) into spoken output (audio). A simple Text-to-Speech (TTS) system contains two components:
- A Text Analysis System, which encodes the input text into a hidden representation that captures its meaning and intensity of expression
- A Speech Synthesis System, which decodes this representation into speech
In this article, we take a closer look at the two components and walk through the code to run a text-to-speech system end to end.
History
In the 1930s, Bell Labs developed the Vocoder, which analyzed speech into fundamental tones. Homer Dudley created the Voder, the first keyboard-operated voice synthesizer. The late 1950s saw the rise of computer-based speech synthesis, culminating in 1968 when Noriko Umeda and colleagues at the Electrotechnical Laboratory introduced the first general English text-to-speech system.
Fast forward to 2016, when DeepMind (a Google subsidiary) introduced WaveNet, a deep learning model that generates audio waveforms sample by sample. Its autoregressive design predicts each audio sample from the samples before it, using a fully convolutional neural network whose dilation factors grow from layer to layer so that the receptive field expands exponentially with depth, resulting in realistic, high-fidelity audio.
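To make the "exponentially expanding receptive field" concrete, here is a small arithmetic sketch. It assumes causal convolutions with kernel size 2 and a dilation that doubles at every layer (1, 2, 4, ..., 512); the function name `receptive_field` is ours, for illustration only, and is not part of any WaveNet codebase.

```python
def receptive_field(dilations, kernel_size=2):
    """Input samples visible to one output of a stack of dilated causal convolutions."""
    # Each layer widens the view by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed configuration: ten layers, dilation doubling each time.
doubling = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
# Same depth without dilation, for comparison.
constant = [1] * 10

print(receptive_field(doubling))  # 1024
print(receptive_field(constant))  # 11
```

With the same ten layers, doubling the dilation lets each output sample depend on 1024 past samples instead of 11, which is why dilation is attractive for raw audio.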
Steps Involved
To understand the working of a typical Text-to-Speech (TTS) system, let's break it down step by step.
- First, the input is a sequence of ASCII characters, which can vary in length. To simplify processing, we split the text into sentences using a sentence-splitting algorithm. Even if there's only one sentence, we still identify sentence boundaries to ensure robustness.
- Next, we break each sentence into tokens, which are usually words but can also include numbers, dates, or punctuation. This step is called tokenization and helps structure the text for further processing.
- Each token is assigned a semiotic class (for example number, date, or abbreviation), which determines how non-language symbols should be read aloud. This process, known as text normalization, converts such symbols into readable words.
- Following this, each word is assigned a phonetic transcription, a step called text-to-phoneme (or grapheme-to-phoneme) conversion, and a basic prosodic analysis divides the text into prosodic units like phrases and sentences. Together, phonetic transcriptions and prosody form the linguistic representation used in speech synthesis.
- In the synthesis phase, words are encoded as phonemes for compactness. Unit selection synthesis, a type of concatenative synthesis, uses a database of pre-recorded speech to match the input phonemes. Signal processing stitches together the selected speech fragments, forming a continuous speech waveform.
This concludes the TTS process, where text is transformed into natural-sounding speech. If you did not follow all the details, don't worry; we will explain them in future articles.
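The first three steps (sentence splitting, tokenization, and text normalization) can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not how a production front end works: the regular expressions, the digit-by-digit number reading, and the helper names `split_sentences`, `tokenize`, and `normalize` are all assumptions made for this example.

```python
import re

# Hypothetical lookup used to read numbers digit by digit.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split a sentence into words, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z']+|\d+|[^\w\s]", sentence)


def normalize(tokens):
    """Lowercase words and spell out numbers one digit at a time."""
    out = []
    for tok in tokens:
        if tok.isdigit():
            out.extend(DIGIT_WORDS[d] for d in tok)
        else:
            out.append(tok.lower())
    return out


text = "I have 2 cats. They are 10 years old!"
for sentence in split_sentences(text):
    print(normalize(tokenize(sentence)))
# ['i', 'have', 'two', 'cats', '.']
# ['they', 'are', 'one', 'zero', 'years', 'old', '!']
```

A real normalizer would read "10" as "ten" and would handle dates, currencies, and abbreviations; the sketch only shows where each stage sits in the pipeline.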
Hands On Exercise
Tacotron 2 is an end-to-end text-to-speech (TTS) model designed to generate high-quality, natural-sounding speech directly from text without additional prosody information. It is known for producing high-fidelity audio with minimal latency, making it a popular choice for speech synthesis applications.
Today, we are going to generate speech using Tacotron 2 on Google Colab. Complete code can be found here.
Note
This implementation requires a GPU runtime. Use the "Connect" button in the top right corner to change the runtime to "T4 GPU".
Step 1 - Install necessary libraries
%%bash
pip3 install deep_phonemizer torch torchaudio soundfile
Step 2 - Check if your GPU is connected
You can check whether the GPU runtime is working by using this command.
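import IPython  # used later to play audio in the notebook
import matplotlib.pyplot as plt  # used later to plot spectrograms
import torch
import torchaudio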
torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.__version__)
print(torchaudio.__version__)
print(device)
The output should be something like:
2.4.1+cu121
2.4.1+cu121
cuda
If it is not, either your torch installation has an issue or your runtime is not connected to a GPU.
Step 3 - Text pre-processing
The pre-trained Tacotron2 model requires a specific set of symbol tables, which are also provided by torchaudio. However, we will first implement the encoding manually for better understanding.
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
# Map each symbol to its index in the table
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)

def text_to_sequence(text):
    """
    Args:
        text: the text input for the model
    Returns: the mapping of input text to symbols
    """
    # Lowercase the text and skip characters that are not in the symbol table
    return [look_up[s] for s in text.lower() if s in symbols]

text = "Hello world! From the team of Smallest dot ai."
print(text_to_sequence(text))
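One way to see what this encoding keeps and what it drops is to map the IDs back to characters. The sketch below repeats the symbol table so that it runs on its own; `sequence_to_text` is a hypothetical helper added for illustration and is not part of torchaudio.

```python
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}


def text_to_sequence(text):
    # Lowercase the text and skip characters that are not in the symbol table.
    return [look_up[s] for s in text.lower() if s in look_up]


def sequence_to_text(sequence):
    # Hypothetical inverse: turn each ID back into its symbol.
    return "".join(symbols[i] for i in sequence)


ids = text_to_sequence("Hi 5!")
print(ids)                    # [19, 20, 11, 2]
print(sequence_to_text(ids))  # hi !
```

The uppercase "H" is folded to lowercase and the digit "5" disappears because it is not in the table, which is one reason text normalization has to run before this step.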
We can use two different types of encoding, character-based and phoneme-based. You can choose to run either one of 3.1 or 3.2.
Step 3.1 - Character-based encoding
processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()
text = "Hello World, from the smallest dot ai team"
processed, lengths = processor(text)
print([processor.tokens[i] for i in processed[0, : lengths[0]]])
The output will look like this: ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', ',', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', ' ', 's', 'm', 'a', 'l', 'l', 'e', 's', 't', ' ', 'd', 'o', 't', ' ', 'a', 'i', ' ', 't', 'e', 'a', 'm']
As you can see, this breaks the text down into individual characters, separating every letter and symbol.
Step 3.2 - Phoneme-based encoding
Phoneme-based encoding is similar to character-based encoding, but it uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme) model. As with character-based encoding, the encoding process is expected to match what the pretrained Tacotron2 model was trained on.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
text = "Hello World, from the smallest dot ai team"
with torch.inference_mode():
    processed, lengths = processor(text)
print([processor.tokens[i] for i in processed[0, : lengths[0]]])
Output: ['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', ',', ' ', 'F', 'R', 'AH', 'M', ' ', 'DH', 'AH', ' ', 'S', 'M', 'AO', 'L', 'AH', 'S', 'T', ' ', 'D', 'AA', 'T', ' ', 'AY', ' ', 'T', 'IY', 'M']
Compared to character-based encoding, this output provides the phonetic versions of the text instead of simply splitting the text based on different characters. You can read more about phonemes here.
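To illustrate what grapheme-to-phoneme conversion does conceptually, here is a toy dictionary lookup. It is only a sketch: the `LEXICON` entries are copied from the phoneme output shown above, the `to_phonemes` helper is hypothetical, and the torchaudio pipeline uses a trained G2P model rather than a fixed table, so it can also handle words it has never seen.

```python
import re

# Hypothetical mini-lexicon; entries taken from the phoneme output shown above.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "from": ["F", "R", "AH", "M"],
    "the": ["DH", "AH"],
    "team": ["T", "IY", "M"],
}


def to_phonemes(text):
    """Replace each known word with its phonemes; keep spaces and punctuation."""
    phonemes = []
    for piece in re.findall(r"[a-z]+|[^a-z]", text.lower()):
        if piece in LEXICON:
            phonemes.extend(LEXICON[piece])
        elif piece.isalpha():
            # A lookup table cannot pronounce unseen words; a G2P model can.
            raise KeyError(f"word not in toy lexicon: {piece}")
        else:
            phonemes.append(piece)
    return phonemes


print(to_phonemes("Hello World, from the team"))
```

Unlike the character encoding, the number of output symbols no longer matches the number of letters: "the" becomes two phonemes and "hello" becomes four.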
Step 4 - Spectrogram generation
We will now generate spectrograms, a representation of the frequency spectrum of a speech signal over time. You can read more about spectrograms here.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
text = "Hello World, from the smallest dot ai team"
with torch.inference_mode():
    processed, lengths = processor(text)
    # Move the encoded text to the same device as the model
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)
_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
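For intuition about what a spectrogram contains, the following standard-library sketch computes one by hand for a pure tone. It is a plain linear-frequency magnitude spectrogram built from a direct DFT with no window function, not the mel spectrogram that Tacotron 2 predicts; the function name `stft_magnitude` and all parameter values are assumptions chosen to keep the arithmetic easy to check.

```python
import cmath
import math


def stft_magnitude(signal, frame_size, hop):
    """Return one list of magnitudes (bins 0..frame_size//2) per frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        mags = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT of this frame at frequency bin k.
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


sample_rate = 8000
frame_size = 64    # each bin spans 8000 / 64 = 125 Hz
tone_hz = 1000     # so the tone lands exactly on bin 1000 / 125 = 8
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(512)]

spec_toy = stft_magnitude(signal, frame_size, hop=32)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec_toy]
print(len(spec_toy), len(spec_toy[0]))  # 15 33
print(set(peak_bins))                   # {8}
```

Each row of `spec_toy` is one time step and each column one frequency bin, so a steady tone shows up as a horizontal line in an image like the one plotted above; speech produces bands that move over time.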
Step 5 - Plot the Spectrogram
def plot():
    fig, ax = plt.subplots(3, 1)
    # Run inference three times and plot each result
    for i in range(3):
        with torch.inference_mode():
            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

plot()
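Do you see something interesting? Notice any patterns? The three spectrograms may not be identical: Tacotron2 inference involves random sampling, so repeated runs on the same text can produce slightly different results.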
Step 6 - Using Vocoders for Audio Generation
Vocoders convert the spectrogram into time-domain waveforms so that they can be played by audio devices and heard by human ears. In this section, we will convert the spectrogram into waveforms using two different vocoders.
Step 6.1 - Running the GriffinLim Vocoder
First, we will use the Griffin-Lim vocoder, the same approach used by the original Tacotron model.
bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)
with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    # Convert the spectrogram into a waveform
    waveforms, lengths = vocoder(spec, spec_lengths)
Step 6.2 - Running the WaveGlow Vocoder
Optionally, we can also run the WaveGlow vocoder.
# Build the WaveGlow architecture without downloading weights through torch.hub
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)
# Download the pretrained checkpoint directly
checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)
# Strip the "module." prefix from the checkpoint keys
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}
waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()
with torch.no_grad():
    waveforms = waveglow.infer(spec)
Step 7 - Plotting the waveforms with corresponding spectrograms
def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()
    fig, [ax1, ax2] = plt.subplots(2, 1)
    # Top: waveform; bottom: the spectrogram it was generated from
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)

plot(waveforms, spec, vocoder.sample_rate)
Step 8 - Save the audio files
import soundfile as sf
# Detach the waveforms from the torch Tensor and convert to numpy
waveforms = waveforms.cpu().detach().numpy()
# Save the waveform to a file (assuming mono channel and sample rate of 22050 Hz)
sf.write('output_audio_wg.wav', waveforms[0], samplerate=22050)
print("Audio saved as 'output_audio_wg.wav'")
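If `soundfile` is not available, a WAV file can also be written with Python's standard `wave` module. The sketch below is an alternative, not what the notebook uses: `save_wav` is a hypothetical helper, it assumes mono float samples in the range [-1, 1], and it writes 16-bit PCM at the same 22050 Hz sample rate assumed above. A generated sine tone stands in for the model output so that the example runs on its own.

```python
import math
import struct
import wave


def save_wav(path, samples, sample_rate=22050):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm)


# Stand-in signal: one second of a 440 Hz tone at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
save_wav("tone_example.wav", tone)
```

To save the synthesized speech instead of the tone, something like `save_wav("output_audio.wav", waveforms[0].tolist())` should work, assuming `waveforms[0]` holds one channel of samples.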
Generated speech from the different vocoders:
- GriffinLim vocoder: [Audio from GriffinLim Vocoder]
- WaveGlow vocoder: [Audio from WaveGlow Vocoder]
We now understand how a typical Text-to-Speech pipeline works, the different ways in which text can be encoded, and how different vocoders produce speech of varying fidelity.
Conclusion
Text-to-speech (TTS) systems have advanced from rule-based methods to deep learning models like WaveNet and Tacotron. Traditional methods struggled with naturalness and required large databases of recorded speech, while deep learning improves flexibility and audio quality, approaching human-level naturalness.
For high-fidelity, natural-sounding voice synthesis with efficient, low-cost inference, Waves by the smallest.ai team is an excellent solution. Here’s the same text generated using the Waves platform:
Audio from Waves TTS
Generate speech for your text - Try Waves now!