Sept 10, 2024 • 8 min Read
StyleTTS 2: TTS Model Using Style Vectors and Diffusion Modeling
Learn how to run StyleTTS 2 on Google Colab with this step-by-step guide.
Kaushal Choudhary
Senior Developer Advocate
What is StyleTTS 2?
StyleTTS 2 is a non-autoregressive TTS framework that uses a style encoder on reference audio, enabling natural and expressive speech that sounds close to human. It leverages large speech language models (SLMs) as discriminators, together with a novel differentiable duration modeling approach, to capture accents, emotions, and tone across different voices while preserving the naturalness of the synthesized speech. It is also faster than comparable diffusion-based models, because only the style vector, rather than the entire speech signal, is sampled through diffusion. On the LJSpeech benchmark, it even surpasses human recordings.
Concept
StyleTTS 2 builds on StyleTTS, a non-autoregressive model that derives a style vector from reference audio to enable natural and expressive speech generation. Its goal was to synthesize speech that captures the full para-linguistic information of the input text. StyleTTS used AdaIN (adaptive instance normalization) to encode the style of the reference audio into a style vector, along with a monotonic aligner to provide the alignment that non-autoregressive models otherwise lack.
However, StyleTTS had drawbacks: a two-stage training process, dependence on reference audio, and limited expressiveness due to its deterministic generation.
StyleTTS 2 improves on this by introducing end-to-end training with direct waveform synthesis and, most importantly, by sampling the style vector from a diffusion model, removing the need for reference audio.
In StyleTTS 2, speech x is modeled as a conditional distribution p(x|t) = ∫ p(x|t, s) p(s|t) ds, where t is the phonetic content and s is a latent variable that follows the distribution p(s|t). This variable is referred to as the generalized speech style, representing any characteristic of speech beyond the phonetic content, including but not limited to prosody, lexical stress, formant transitions, and speaking rate. In the paper's evaluations, StyleTTS 2 scores higher than VITS on Comparative Mean Opinion Score (CMOS) for naturalness and similarity, with statistical significance assessed via the Wilcoxon test.
Notebook Walkthrough
Let's walk through a simple demo of StyleTTS 2 to better understand how this state-of-the-art text-to-speech model works.
Step 1 : Setting Up CUDA for GPU Acceleration
For the GPU to be used in this notebook, change the Colab runtime to a T4 GPU. After selecting the runtime, run:
!nvidia-smi
This displays output like the table below. The GPU name, memory usage, and process list are the important fields, and we can see that "No running processes found" is displayed for now.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 34C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
To verify that PyTorch can see CUDA, run:
import torch
torch.cuda.is_available()
This should print True, meaning the GPU is active and CUDA is available.
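Later cells in this walkthrough move tensors and models to a device variable, so it is convenient to define it here; a standard PyTorch pattern:
import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)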
Step 2 : Cloning the StyleTTS 2 Repository
Clone the repository. Make sure to clone the master branch, as bug fixes and updates are regularly merged there.
%%shell
git clone https://github.com/yl4579/StyleTTS2.git
Step 3 : Installing Dependencies
!pip install SoundFile torchaudio munch torch pydub pyyaml librosa nltk matplotlib accelerate transformers phonemizer einops einops-exts tqdm typing-extensions git+https://github.com/resemble-ai/monotonic_align.git
Next, install espeak-ng and download the LJSpeech pre-trained model from Hugging Face. If you are working in a fresh Colab session, change into the cloned repository first (for example with %cd StyleTTS2) so that the Models folder lands inside the repo and the later imports resolve correctly.
%%shell
sudo apt-get install espeak-ng
git-lfs clone https://huggingface.co/yl4579/StyleTTS2-LJSpeech
mv StyleTTS2-LJSpeech/Models .
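The inference code further below tokenizes text with NLTK, so it is worth grabbing the tokenizer data now; a small setup cell along these lines (mirroring the official demo notebook) avoids a LookupError later:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource as well
from nltk.tokenize import word_tokenize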
Step 4 : Setup Utility Functions
The functions length_to_mask, preprocess, and compute_style help mask padded text tokens, preprocess audio into mel spectrograms, and compute the style vector from a reference recording.
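These helpers rely on a few globals that are set up earlier in the official demo notebook: a mel-spectrogram transform (to_mel), the normalization constants mean and std, and a text cleaner used later during inference. A minimal sketch of that setup, assuming the values used in the StyleTTS2 repo's demo (verify them against Models/LJSpeech/config.yml):
import torch
import torchaudio
import librosa
from text_utils import TextCleaner  # module from the cloned StyleTTS2 repo

textcleaner = TextCleaner()

# Mel-spectrogram transform and normalization constants used by preprocess();
# these parameter values mirror the repo's demo notebook.
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300
)
mean, std = -4, 4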
def length_to_mask(lengths):
    # Build a boolean padding mask (True marks padded positions) from sequence lengths.
    mask = (
        torch.arange(lengths.max())
        .unsqueeze(0)
        .expand(lengths.shape[0], -1)
        .type_as(lengths)
    )
    mask = torch.gt(mask + 1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    # Convert a waveform (numpy array) into a normalized log-mel spectrogram tensor.
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(ref_dicts):
    # Encode each reference recording into a style embedding using the style encoder.
    reference_embeddings = {}
    for key, path in ref_dicts.items():
        wave, sr = librosa.load(path, sr=24000)
        audio, index = librosa.effects.trim(wave, top_db=30)
        if sr != 24000:
            audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
        mel_tensor = preprocess(audio).to(device)
        with torch.no_grad():
            ref = model.style_encoder(mel_tensor.unsqueeze(1))
        reference_embeddings[key] = (ref.squeeze(1), audio)
    return reference_embeddings
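As a quick illustration of how compute_style would be used once the models from Step 5 below are loaded: it maps names to reference recordings and returns a style embedding per name (the .wav path here is only a placeholder). Note that the LJSpeech demo in this walkthrough does not use reference audio at all; the style vector is instead sampled from the diffusion model.
# Hypothetical usage; "my_ref.wav" is a placeholder path, not part of this demo.
ref_embeddings = compute_style({"speaker_a": "my_ref.wav"})
style, trimmed_audio = ref_embeddings["speaker_a"]
print(style.shape)  # style embedding produced by the style encoder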
Step 5 : Load the Models
We will load the phonemizer, the pretrained Automatic Speech Recognition (ASR) model, the pitch (F0) extractor, and the PL-BERT model, then build the StyleTTS 2 model and its diffusion sampler.
# load phonemizer
import phonemizer
import yaml

# build_model, load_ASR_models, load_F0_models, and recursive_munch come from
# the repo's models.py and utils.py modules.
from models import *
from utils import *

global_phonemizer = phonemizer.backend.EspeakBackend(
    language="en-us",
    preserve_punctuation=True,
    with_stress=True,
    words_mismatch="ignore",
)

config = yaml.safe_load(open("Models/LJSpeech/config.yml"))
# load pretrained ASR model
ASR_config = config.get("ASR_config", False)
ASR_path = config.get("ASR_path", False)
text_aligner = load_ASR_models(ASR_path, ASR_config)
# load pretrained F0 model
F0_path = config.get("F0_path", False)
pitch_extractor = load_F0_models(F0_path)
# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get("PLBERT_dir", False)
plbert = load_plbert(BERT_path)
model = build_model(
recursive_munch(config["model_params"]), text_aligner, pitch_extractor, plbert
)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]
params_whole = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location="cpu")
params = params_whole["net"]
for key in model:
if key in params:
print("%s loaded" % key)
try:
model[key].load_state_dict(params[key])
except:
from collections import OrderedDict
state_dict = params[key]
new_state_dict = OrderedDict()
for k, v in state_dict.items():
name = k[7:] # remove `module.`
new_state_dict[name] = v
# load params
model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]
from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule
sampler = DiffusionSampler(
model.diffusion.diffusion,
sampler=ADPM2Sampler(),
sigma_schedule=KarrasSchedule(
sigma_min=0.0001, sigma_max=3.0, rho=9.0
), # empirical parameters
clamp=False,
)
Step 6 : Inference
Let's set up inference. This function takes the input text and generates speech from the model.
def inference(text, noise, diffusion_steps=5, embedding_scale=1):
text = text.strip()
text = text.replace('"', "")
    # Phonemize and tokenize the input text before converting it to token IDs.
    ps = global_phonemizer.phonemize([text])
ps = word_tokenize(ps[0])
ps = " ".join(ps)
tokens = textcleaner(ps)
tokens.insert(0, 0)
tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)
with torch.no_grad():
input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
text_mask = length_to_mask(input_lengths).to(tokens.device)
t_en = model.text_encoder(tokens, input_lengths, text_mask)
bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
        # Sample the style vector from Gaussian noise with the diffusion sampler,
        # conditioned on the BERT features of the input text.
        s_pred = sampler(
            noise,
            embedding=bert_dur[0].unsqueeze(0),
            num_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        ).squeeze(0)
        # Split the sampled vector into the prosody-predictor style (s)
        # and the decoder's acoustic style (ref).
        s = s_pred[:, 128:]
        ref = s_pred[:, :128]
d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)
x, _ = model.predictor.lstm(d)
duration = model.predictor.duration_proj(x)
duration = torch.sigmoid(duration).sum(axis=-1)
pred_dur = torch.round(duration.squeeze()).clamp(min=1)
pred_dur[-1] += 5
        # Build a hard monotonic alignment matrix (tokens x frames) from the predicted durations.
        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
c_frame = 0
for i in range(pred_aln_trg.size(0)):
pred_aln_trg[i, c_frame : c_frame + int(pred_dur[i].data)] = 1
c_frame += int(pred_dur[i].data)
# encode prosody
en = d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device)
F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
out = model.decoder(
(t_en @ pred_aln_trg.unsqueeze(0).to(device)),
F0_pred,
N_pred,
ref.squeeze().unsqueeze(0),
)
return out.squeeze().cpu().numpy()
Step 7 : Synthesize Speech
To synthesize speech, we first need text input. Enter any text you like.
# @title Input Text {display-mode: "form"}
# synthesize text
text = "StyleTTS 2 is a non-autoregressive TTS framework using a style encoder to reference audio, enabling natural and expressive speech similar to human." # @param {type: "string"}
Step 7.1
To show how different inference settings change the output, we will step through a few hyperparameter choices. Here, we use 5 diffusion steps: diffusion_steps=5 means the model takes 5 iterative sampling steps to transform the noise input into a more structured output (in this case, the style vector that conditions speech generation).
from pydub import AudioSegment
import time
import numpy as np

start = time.time()
noise = torch.randn(1, 1, 256).to(device)
wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")  # real-time factor: generation time divided by audio duration

# Convert the float32 output to 16-bit PCM and wrap it in an AudioSegment
wav_int16 = (wav * 32767).astype(np.int16)
audio_segment = AudioSegment(
    wav_int16.tobytes(),
    frame_rate=24000,
    sample_width=2,  # 2 bytes for int16
    channels=1,
)

# Save the audio segment as a WAV file
audio_segment.export("output.wav", format="wav")

import IPython.display as ipd
display(ipd.Audio(wav, rate=24000))
StyleTTS2 with `diffusion_steps=5`
Step 7.2
Here, we will use diffusion_steps=10.
from pydub import AudioSegment

start = time.time()
noise = torch.randn(1, 1, 256).to(device)
wav = inference(text, noise, diffusion_steps=10, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")  # real-time factor: generation time divided by audio duration

# Convert the float32 output to 16-bit PCM and wrap it in an AudioSegment
wav_int16 = (wav * 32767).astype(np.int16)
audio_segment = AudioSegment(
    wav_int16.tobytes(),
    frame_rate=24000,
    sample_width=2,  # 2 bytes for int16
    channels=1,
)

# Save the audio segment as a WAV file
audio_segment.export("output.wav", format="wav")

import IPython.display as ipd
display(ipd.Audio(wav, rate=24000))
StyleTTS2 with diffusion_steps=10
In diffusion models, more steps typically lead to higher-quality output but take longer, while fewer steps may result in faster inference but potentially lower quality.
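To see this trade-off on your own runtime, a small sweep over step counts can be run with the same inference function and text from above (the step values below are arbitrary choices, and the cell assumes the earlier imports are still in scope):
import time

# Compare speed (real-time factor) across different numbers of diffusion steps.
for steps in [3, 5, 10, 20]:
    noise = torch.randn(1, 1, 256).to(device)
    start = time.time()
    wav = inference(text, noise, diffusion_steps=steps, embedding_scale=1)
    rtf = (time.time() - start) / (len(wav) / 24000)
    print(f"diffusion_steps={steps}: RTF = {rtf:.3f}")
    display(ipd.Audio(wav, rate=24000))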
Step 8
To increase expressiveness, that is, the range of emotion the model conveys, we can adjust the embedding_scale hyperparameter.
Step 8.1
Setting embedding_scale=1 applies the text embedding as is, without scaling. This gives a balanced contribution of the text embedding to the diffusion process.
texts = {}
texts['Happy'] = "We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands."
texts['Sad'] = "I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence."
texts['Angry'] = "The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations!"
texts['Surprised'] = "I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?"
for k, v in texts.items():
noise = torch.randn(1,1,256).to(device)
wav = inference(v, noise, diffusion_steps=10, embedding_scale=1)
print(k + ": ")
display(ipd.Audio(wav, rate=24000, normalize=False))
Audio samples: Happy, Sad, Angry, Surprised (with embedding_scale=1).
Step 8.2
Using embedding_scale=2
increases the influence of the text embedding on the generated speech. This higher scaling factor
means that the emotional or stylistic features derived from the text are amplified during the diffusion process, making the emotions (e.g., "Happy," "Sad," "Angry," etc.) more pronounced in the resulting audio.
texts = {}
texts['Happy'] = "We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands."
texts['Sad'] = "I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence."
texts['Angry'] = "The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations!"
texts['Surprised'] = "I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?"
for k,v in texts.items():
noise = torch.randn(1,1,256).to(device)
wav = inference(v, noise, diffusion_steps=10, embedding_scale=2) # embedding_scale=2 for more pronounced emotion
print(k + ": ")
display(ipd.Audio(wav, rate=24000, normalize=False))
Audio samples: Happy, Sad, Angry, Surprised (with embedding_scale=2).
Listen to more audio samples from the author here.
You can find the Colab notebook here.
Conclusion
Unlike traditional autoregressive models, StyleTTS 2 takes a new approach that deepens our understanding of how diffusion models and adversarial training can enhance text-to-speech systems. In this walkthrough, we learned how to set up StyleTTS 2 and generate customizable speech through hyperparameter tuning.
For a real-time, cost-efficient, and easy-to-use TTS, you can try Waves to convert text into speech; it also offers a free tier. Try it out now!