
Oct 17, 2024 · 4 min read

How to create a custom voice using XTTS

Explore XTTS and how to use it.


Kaushal Choudhary

Senior Developer Advocate


What is XTTS?

XTTS is a multilingual text-to-speech and voice-cloning model by Coqui.ai. It supports speech generation in 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko). XTTS also supports cross-language voice cloning and streaming inference with < 200 ms latency, and it can be fine-tuned. You can try it out here.

Concept

Zero-shot multi-speaker TTS systems are mostly single-language, and the models that do support multiple languages are limited to high- and medium-resource languages. XTTS builds on the Tortoise model and is trained on 16 languages, achieving SOTA results in all of them.

(Figure: XTTS training architecture)

The internal architecture consists of three components:

(i) VQ-VAE: A Vector Quantised-Variational AutoEncoder with 13M parameters that receives a mel-spectrogram as input and encodes each frame with a single codebook of 8192 codes at a 21.53 Hz frame rate (see the quick calculation after this list). In experiments, it was found that filtering out the less frequent codes improved the model's expressiveness.

(ii) Encoder: The GPT-2 encoder is a decoder-only transformer composed of 443M parameters. It receives as input text tokens obtained via a 6681-token custom Byte-Pair Encoding (BPE) tokenizer and predicts the VQ-VAE audio codes as output.

(iii) Decoder: The decoder is based on the HiFi-GAN vocoder with 26M parameters. It receives the latent vectors output by the GPT-2 encoder and is conditioned on a speaker embedding from the H/ASP model.
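To get a sense of how compact this discrete representation is, here is a quick back-of-the-envelope calculation from the codebook size and frame rate quoted above (the arithmetic is ours, not a figure from the paper):

import math

codebook_size = 8192   # number of codes in the VQ-VAE codebook
frame_rate_hz = 21.53  # code frame rate reported above

bits_per_code = math.log2(codebook_size)   # 13 bits per discrete code
bitrate = bits_per_code * frame_rate_hz    # ~280 bits/s of audio codes

print(f"{bits_per_code:.0f} bits/code -> ~{bitrate:.0f} bits/s")

In other words, one second of speech is compressed to roughly 21.5 discrete codes, which is what makes it practical for the GPT-2 encoder to predict audio autoregressively.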

It was trained on 541 hours from LibriTTS-R and 1,812.7 hours from LibriLight; for the other languages, most of the data comes from the Common Voice dataset.

Notebook Walk-through

We are going to use the XTTS-v2 model for our demo. It is a state-of-the-art zero-shot multi-speaker model capable of generating production-grade, realistic audio from a single reference clip of ~6 s. Find the full notebook here.

Note: Connect to a GPU runtime for faster inference.

Step 1: Installing Dependencies

Step 1.1: Change the locale in the notebook

This is an optional step, but Colab sometimes throws an encoding error while installing dependencies. This is a fix for that.

import locale

# Force UTF-8 as the preferred encoding to avoid Colab encoding errors
locale.getpreferredencoding = lambda: "UTF-8"

Step 1.2: Install the PortAudio library

!sudo apt install portaudio19-dev

Step 1.3: Install Python libraries

!pip install --quiet TTS scipy sounddevice wavio PyAudio ffmpeg-python

Step 2: Cloning the Repository

!git clone https://huggingface.co/coqui/XTTS-v2

Step 3: Import Libraries

We will define all the imports up front to keep the code clean and maintainable.

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

from IPython.display import Audio
from scipy.io import wavfile

import sounddevice as sd
import wavio as wv
import os
import matplotlib.pyplot as plt

Step 4: Record your voice

Let's record a sample of our own voice to clone. You can experiment with a different sampling_freq and a longer duration for the recording.

sampling_freq = 44100  # sampling rate in Hz
duration = 5           # recording length in seconds

For brevity, the utility function that records the audio is not shown here; you can find it in the notebook. A hypothetical sketch of it follows the call below.

audio, sr = get_audio()
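For reference, this is a minimal sketch of what such a get_audio helper could look like, built on the sounddevice and scipy imports from Step 3; the actual implementation in the notebook may differ.

# Hypothetical get_audio() sketch -- the notebook's actual helper may differ.
def get_audio(filename="my_voice_sample.wav"):
    print("Recording...")
    # Record `duration` seconds of mono audio from the default microphone
    recording = sd.rec(int(duration * sampling_freq),
                       samplerate=sampling_freq, channels=1)
    sd.wait()  # Block until the recording finishes
    wavfile.write(filename, sampling_freq, recording)  # Save the reference clip
    return recording, sampling_freq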

Step 5: Inference

As XTTS is a multilingual model, we are going to try it on text and reference audio in more than one language.

Step 5.1: On English Text

Put your choice of text in text_to_speak and point reference_audios at your custom [your_voice_sample].wav file.

# Load the model configuration and checkpoint from the cloned repository
config = XttsConfig()
config.load_json("./XTTS-v2/config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./XTTS-v2/")
model.cuda()  # Move the model to the GPU

text_to_speak = "How are you!"
reference_audios = ["./XTTS-v2/samples/[your_voice_sample].wav"]

outputs_1 = model.synthesize(
	text_to_speak,
	config,
	speaker_wav=reference_audios,
	gpt_cond_len=3,  # seconds of the reference clip used for conditioning
	language="en",
)

Step 5.2: On French Text

Put the French text in text_to_speak and use the French sample for reference_audios.

# The model and config are already loaded from Step 5.1, so we can reuse them.
text_to_speak = "comment vas-tu"  # "how are you" in French
reference_audios = ["./XTTS-v2/samples/fr_sample.wav"]

outputs_2 = model.synthesize(
	text_to_speak,
	config,
	speaker_wav=reference_audios,
	gpt_cond_len=3,
	language="fr",
)
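As noted in the introduction, XTTS also supports streaming inference with sub-200 ms latency. The block below is a minimal sketch of the streaming API exposed by the TTS library (inference_stream, with conditioning latents computed up front); the exact arguments may vary across TTS versions, so treat it as a starting point rather than part of the notebook.

import torch

# Compute the speaker conditioning once; it can be reused across requests.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=reference_audios
)

# inference_stream yields audio chunks as they are generated,
# instead of waiting for the full utterance.
chunks = model.inference_stream(
    "This sentence is synthesized chunk by chunk.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = [chunk for chunk in chunks]   # consume (or play) chunks as they arrive
wav = torch.cat(wav_chunks, dim=0)         # concatenate into the full waveform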

Step 6: Audio Output

Let's listen to the generated audio.

English audio

Audio(data=outputs_1['wav'], rate=24000)

(Embedded audio: XTTS-v2 English output)

French audio

Audio(data=outputs_2['wav'], rate=24000)

(Embedded audio: XTTS-v2 French output)

Step 7: Save the Audio

Let's save the audio; you can then download the files manually to your local computer.

out_dir = "./outputs"
out_file_path_1 = os.path.join(out_dir, "output1.wav")
out_file_path_2 = os.path.join(out_dir, "output2.wav")

# Create the output directory if it doesn't exist
os.makedirs(out_dir, exist_ok=True)

# Save the first (English) wav file
wavfile.write(out_file_path_1, 24000, outputs_1['wav'])

# Save the second (French) wav file
wavfile.write(out_file_path_2, 24000, outputs_2['wav'])
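If you are running in Colab, you can also trigger the downloads programmatically instead of using the file browser; a small optional snippet, assuming a Colab runtime:

# Colab-only: prompt the browser to download the generated files
from google.colab import files

files.download(out_file_path_1)
files.download(out_file_path_2)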

Conclusion

XTTS is a powerful multilingual text-to-speech and voice-cloning model designed to generate natural, realistic audio in 16 languages with minimal input data. Its ability to perform cross-language voice cloning and real-time speech synthesis with low latency makes it a versatile tool for a wide range of applications. Setting up custom voice cloning with XTTS is straightforward, and its distribution through platforms like Hugging Face makes it easy to manage models and datasets.