
Sept 10, 2024 · 8 min read

StyleTTS 2 : TTS Model Using Style Vectors and Diffusion Modelling

Learn how to run StyleTTS 2 on Google Colab with this step-by-step guide.


Kaushal Choudhary

Senior Developer Advocate


What is StyleTTS 2?

StyleTTS 2 is a non-autoregressive TTS framework that uses a style encoder on reference audio, enabling natural and expressive speech close to human quality. It leverages large speech language models (SLMs) to effectively capture accents, emotions, and tone across different voices. It is also faster than typical diffusion-based TTS models, because diffusion is only used to sample a compact style vector rather than the entire speech signal. It uses pre-trained SLMs as discriminators together with a novel differentiable duration modelling approach that preserves the naturalness of synthesized speech, and it surpasses human recordings on the LJSpeech benchmark.

Concept

StyleTTS 2 builds on StyleTTS, a non-autoregressive model that derives a style vector from reference audio, enabling natural and expressive speech generation. Its goal was to synthesize speech that captures the full paralinguistic information of the input text. It used AdaIN (adaptive instance normalization) to encode the style of the reference audio into a style vector, along with a monotonic aligner, since non-autoregressive models lack the implicit alignment that autoregressive models learn.


However, StyleTTS had a two-stage training process, a dependence on reference audio, and limited expressiveness due to its deterministic generation.

StyleTTS 2 improves on this by introducing an end-to-end training process with direct waveform synthesis; most importantly, the style vector is sampled from a diffusion model, removing the need for reference audio.


In StyleTTS 2, the speech x is modeled as a conditional distribution p(x|t) = ∫ p(x|t, s) p(s|t) ds through a latent variable s that follows the distribution p(s|t). We refer to this variable as the generalized speech style, representing any characteristic of speech beyond the phonetic content t. These characteristics include, but are not limited to, prosody, lexical stress, formant transitions, and speaking rate. After training, the Comparative Mean Opinion Score (CMOS) of the model for naturalness and similarity was +1.07 (p << 0.01); to put this in context, the VITS model has a CMOS of +0.45 (p = 0.009), where p is the value from the Wilcoxon test.
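Written out as display math, the factorization is

p(x \mid t) = \int p(x \mid t, s)\, p(s \mid t)\, \mathrm{d}s

where p(s \mid t) is the distribution of the generalized style s given the text t, and p(x \mid t, s) generates speech conditioned on both the text and the sampled style.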

 Notebook Walkthrough

Let's walk through a simple demo of StyleTTS 2 to better understand how this state-of-the-art text-to-speech model works.

Step 1 : Setting Up CUDA for GPU Acceleration

To use the GPU in this notebook, change the Colab runtime to a T4 GPU. After selecting the runtime, run this:

!nvidia-smi

This should display something like the output below. The GPU name, memory usage, and process name are the important fields, and we can see that "No running processes found" is shown for now.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

To check that CUDA is visible to PyTorch, you can use

import torch
torch.cuda.is_available()

This should return True, meaning the GPU is active and CUDA is available.

Step 2 : Cloning the StyleTTS 2 Repository

Clone the repo. Make sure to clone the master branch, as bug fixes and updates are regularly merged.

%%shell
git clone https://github.com/yl4579/StyleTTS2.git
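
If you're following along in Colab, you'll also want to work from inside the cloned repo before installing its dependencies and importing its modules; a cell like this (assuming the clone above succeeded) takes care of it:

%cd StyleTTS2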

Step 3 : Installing Dependencies

!pip install SoundFile torchaudio munch torch pydub pyyaml librosa nltk matplotlib accelerate transformers phonemizer einops einops-exts tqdm typing-extensions git+https://github.com/resemble-ai/monotonic_align.git

Next, install espeak-ng (required by the phonemizer) and download the LJSpeech pre-trained model from Hugging Face.

%%shell
sudo apt-get install espeak-ng
git-lfs clone https://huggingface.co/yl4579/StyleTTS2-LJSpeech
mv StyleTTS2-LJSpeech/Models .

Step 4 : Set Up Utility Functions

The functions length_to_mask, preprocess, and compute_style will help mask the text tokens, preprocess the audio, and compute the style vector.
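
These helpers rely on a few globals defined earlier in the notebook: the target device, a mel-spectrogram transform to_mel, and the log-mel normalization constants mean and std. A minimal setup sketch is shown below; the mel parameters and constants are assumed to match the StyleTTS 2 repo's preprocessing code, so double-check them against your clone.

import torch
import torchaudio
import librosa

device = "cuda" if torch.cuda.is_available() else "cpu"

# Mel-spectrogram transform and log-mel normalization constants
# (assumed values, taken from the StyleTTS 2 repo's preprocessing)
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300
)
mean, std = -4, 4

With those in place, the helper functions are: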

def length_to_mask(lengths):
    # Build a boolean padding mask: True for positions beyond each sequence length
    mask = (
        torch.arange(lengths.max())
        .unsqueeze(0)
        .expand(lengths.shape[0], -1)
        .type_as(lengths)
    )
    mask = torch.gt(mask + 1, lengths.unsqueeze(1))
    return mask


def preprocess(wave):
    # Convert a waveform to a normalized log-mel spectrogram
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor


def compute_style(ref_dicts):
    # Compute a style embedding for each reference audio file
    reference_embeddings = {}
    for key, path in ref_dicts.items():
        wave, sr = librosa.load(path, sr=24000)
        audio, index = librosa.effects.trim(wave, top_db=30)
        if sr != 24000:
            # keyword arguments are required by recent librosa versions
            audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
        mel_tensor = preprocess(audio).to(device)
        with torch.no_grad():
            ref = model.style_encoder(mel_tensor.unsqueeze(1))
        reference_embeddings[key] = (ref.squeeze(1), audio)
    return reference_embeddings

Step 5 : Load the Models

We will load the phonemizer, the pretrained Automatic Speech Recognition (ASR) model, the F0 (pitch) model, the BERT model, and the diffusion sampler.
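
The loader and builder helpers used below (load_ASR_models, load_F0_models, build_model, recursive_munch) come from the cloned repo, so they need to be imported first, along with yaml for the config file. A sketch of the imports this cell assumes, based on the repo's module layout (verify the module names against your clone):

import yaml

# Helpers from the cloned StyleTTS2 repo (assumed locations)
from models import build_model, load_ASR_models, load_F0_models
from utils import recursive_munch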

# load phonemizer
import phonemizer

global_phonemizer = phonemizer.backend.EspeakBackend(
    language="en-us",
    preserve_punctuation=True,
    with_stress=True,
    words_mismatch="ignore",
)
config = yaml.safe_load(open("Models/LJSpeech/config.yml"))

# load pretrained ASR model
ASR_config = config.get("ASR_config", False)
ASR_path = config.get("ASR_path", False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

# load pretrained F0 model
F0_path = config.get("F0_path", False)
pitch_extractor = load_F0_models(F0_path)

# load BERT model
from Utils.PLBERT.util import load_plbert

BERT_path = config.get("PLBERT_dir", False)
plbert = load_plbert(BERT_path)
model = build_model(
    recursive_munch(config["model_params"]), text_aligner, pitch_extractor, plbert
)
_ = [model[key].eval() for key in model]

_ = [model[key].to(device) for key in model]
params_whole = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location="cpu")
params = params_whole["net"]
for key in model:
    if key in params:
        print("%s loaded" % key)
        try:
            model[key].load_state_dict(params[key])
        except RuntimeError:
            # checkpoint keys were saved from a DataParallel model;
            # strip the `module.` prefix before loading
            from collections import OrderedDict

            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:]  # remove `module.`
                new_state_dict[name] = v

            # load the remapped params once, outside the loop
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]

from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(
        sigma_min=0.0001, sigma_max=3.0, rho=9.0
    ),  # empirical parameters
    clamp=False,
)

Step 6 : Inference

Let's set up inference. This function takes the input text and generates speech from the model.
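
The inference function below also relies on two text-processing helpers set up earlier in the notebook: NLTK's word_tokenize and the repo's TextCleaner, used here as textcleaner. A minimal sketch of that setup, assuming the repo exposes TextCleaner in its text_utils module:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer data needed by word_tokenize

# Token cleaner from the cloned StyleTTS2 repo (assumed module name)
from text_utils import TextCleaner

textcleaner = TextCleaner()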

def inference(text, noise, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    text = text.replace('"', "")
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = " ".join(ps)

    tokens = textcleaner(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(tokens.device)
        text_mask = length_to_mask(input_lengths).to(tokens.device)
        t_en = model.text_encoder(tokens, input_lengths, text_mask)

        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(
            noise,
            embedding=bert_dur[0].unsqueeze(0),
            num_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        ).squeeze(0)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)
        x, _ = model.predictor.lstm(d)

        duration = model.predictor.duration_proj(x)
        duration = torch.sigmoid(duration).sum(axis=-1)

        pred_dur = torch.round(duration.squeeze()).clamp(min=1)
        pred_dur[-1] += 5

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame : c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device)
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        out = model.decoder(
            (t_en @ pred_aln_trg.unsqueeze(0).to(device)),
            F0_pred,
            N_pred,
            ref.squeeze().unsqueeze(0),
        )
    return out.squeeze().cpu().numpy()

Step 7 : Synthesize Speech

To synthesize speech, we need the text input first. Enter any text you like.

# @title Input Text {display-mode: "form"}
# synthesize text
text = "StyleTTS 2 is a non-autoregressive TTS framework using a style encoder to reference audio, enabling natural and expressive speech similar to human." # @param {type: "string"}

Step 7.1

To show how different inference settings affect the output, we will run the model several times while tuning its hyperparameters. Here, we use 5 diffusion steps: diffusion_steps=5 means the model takes 5 iterative sampling steps to transform the noise input into a structured output (in this case, the style vector that conditions speech generation).

import time
import numpy as np
import IPython.display as ipd
from pydub import AudioSegment

start = time.time()
noise = torch.randn(1, 1, 256).to(device)
wav = inference(text, noise, diffusion_steps=5, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")

# Convert the float waveform to 16-bit PCM samples for pydub
wav_int16 = (wav * 32767).astype(np.int16)

# Create an AudioSegment from the numpy array
audio_segment = AudioSegment(
    wav_int16.tobytes(),
    frame_rate=24000,
    sample_width=wav_int16.dtype.itemsize,  # 2 bytes for int16
    channels=1,
)

# Save the audio segment as a WAV file
audio_segment.export("output.wav", format="wav")

display(ipd.Audio(wav, rate=24000))

StyleTTS2 with `diffusion_steps=5`

Step 7.2

Here, we will use diffusion_steps=10.

import time
import numpy as np
import IPython.display as ipd
from pydub import AudioSegment

start = time.time()
noise = torch.randn(1, 1, 256).to(device)
wav = inference(text, noise, diffusion_steps=10, embedding_scale=1)
rtf = (time.time() - start) / (len(wav) / 24000)
print(f"RTF = {rtf:5f}")

# Convert the float waveform to 16-bit PCM samples for pydub
wav_int16 = (wav * 32767).astype(np.int16)

# Create an AudioSegment from the numpy array
audio_segment = AudioSegment(
    wav_int16.tobytes(),
    frame_rate=24000,
    sample_width=wav_int16.dtype.itemsize,  # 2 bytes for int16
    channels=1,
)

# Save the audio segment as a WAV file
audio_segment.export("output.wav", format="wav")

display(ipd.Audio(wav, rate=24000))

StyleTTS2 with diffusion_steps=10

In diffusion models, more steps typically lead to higher-quality output but take longer, while fewer steps may result in faster inference but potentially lower quality.
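
If you want to see this trade-off directly on your own text, a quick sweep over the step count (a sketch reusing the inference call and RTF timing from above, not a cell from the original notebook) could look like this:

import time

for steps in [3, 5, 10, 20]:
    start = time.time()
    noise = torch.randn(1, 1, 256).to(device)
    wav = inference(text, noise, diffusion_steps=steps, embedding_scale=1)
    rtf = (time.time() - start) / (len(wav) / 24000)
    print(f"diffusion_steps={steps}: RTF = {rtf:.5f}")
    display(ipd.Audio(wav, rate=24000))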

Step 8

To increase expressiveness and the range of emotions displayed, we can adjust the embedding_scale hyperparameter.

Step 8.1

The embedding_scale=1 indicates that the text embedding is applied as is, without scaling. This would provide a balanced contribution of the text embedding to the diffusion process.

texts = {}
texts['Happy'] = "We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands."
texts['Sad'] = "I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence."
texts['Angry'] = "The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations!"
texts['Surprised'] = "I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?"

for k, v in texts.items():
	noise = torch.randn(1,1,256).to(device)
	wav = inference(v, noise, diffusion_steps=10, embedding_scale=1)
	print(k + ": ")
	display(ipd.Audio(wav, rate=24000, normalize=False))

Happy

Sad

Angry

Surprised

Step 8.2

Using embedding_scale=2 increases the influence of the text embedding on the generated speech. This higher scaling factor means that the emotional or stylistic features derived from the text are amplified during the diffusion process, making the emotions (e.g., "Happy," "Sad," "Angry," etc.) more pronounced in the resulting audio.

texts = {}
texts['Happy'] = "We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands."
texts['Sad'] = "I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence."
texts['Angry'] = "The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations!"
texts['Surprised'] = "I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?"
 
for k,v in texts.items():
	noise = torch.randn(1,1,256).to(device)
	wav = inference(v, noise, diffusion_steps=10, embedding_scale=2) # embedding_scale=2 for more pronounced emotion
	print(k + ": ")

	display(ipd.Audio(wav, rate=24000, normalize=False))

Happy

Sad

Angry

Surprised

Listen to more audio samples from the author here.

You can find the Colab Notebook here.

Conclusion

Unlike traditional autoregressive models, StyleTTS 2 introduces a new approach that expands our understanding of how diffusion models and adversarial training can enhance text-to-speech systems. In this post, we learned how to set up the StyleTTS 2 model and generate customizable speech through hyperparameter tuning.

For a real-time, cost-efficient, and easy-to-use TTS, you can try Waves to convert text into speech; it also has a free tier. Try it out now!