Oct 17, 2024 • 4 min Read
How to create custom voice using XTTS
Explore XTTS and learn how to use it.
Kaushal Choudhary
Senior Developer Advocate
What is XTTS?
XTTS is a multilingual text-to-speech and voice-cloning model by Coqui.ai. It supports multilingual speech generation in 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), and Korean (ko). XTTS supports cross-language voice cloning and streaming inference with < 200 ms latency, and it can be fine-tuned. You can try it out here.
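Before diving into the internals, here is a minimal sketch of driving XTTS through Coqui's high-level Python API (the model name comes from Coqui's model registry; the file paths are assumptions for illustration):
from TTS.api import TTS

# Download XTTS-v2 from the Coqui model registry and load it.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Clone the voice in speaker.wav and speak the given text in English.
tts.tts_to_file(
    text="Hello, this is a cloned voice.",
    speaker_wav="speaker.wav",  # ~6 s reference clip (assumed path)
    language="en",
    file_path="cloned.wav",
)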
Concept
Zero-shot multi-speaker TTS systems are mostly single-language, and the models that do support multiple languages are limited to high- or medium-resource languages. XTTS builds on the Tortoise model and is trained on 16 languages, achieving SOTA results in all of them.
The internal architecture consists of three components:
(i) VQ-VAE: a Vector Quantized-Variational AutoEncoder with 13M parameters that receives a mel-spectrogram as input and encodes each frame with a single codebook of 8192 codes at a 21.53 Hz frame rate. In experiments, it was found that filtering out the less frequent codes improved the model's expressiveness.
(ii) Encoder: the GPT-2 encoder is a decoder-only transformer composed of 443M parameters. It receives as input text tokens obtained via a 6681-token custom Byte-Pair Encoding (BPE) tokenizer, and as output it predicts the VQ-VAE audio codes.
(iii) Decoder: the decoder is based on the HiFi-GAN vocoder, with 26M parameters. It receives the latent vectors coming out of the GPT-2 encoder and is conditioned on a speaker embedding from the H/ASP model.
For English, the model was trained on 541 hours from LibriTTS-R and 1812.7 hours from LibriLight. For the other languages, most of the data comes from the Common Voice dataset.
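To make the three-stage pipeline concrete, the sketch below traces the data flow at a purely schematic level. Every function body is a stand-in with made-up shapes, not the real Coqui implementation; only the codebook size and the component roles come from the description above.
import numpy as np

CODEBOOK_SIZE = 8192  # VQ-VAE codebook size described above

def vqvae_encode(mel_frames):
    # (i) Encode each mel-spectrogram frame to one of 8192 discrete codes.
    return np.random.randint(0, CODEBOOK_SIZE, size=len(mel_frames))

def gpt2_predict_codes(text_tokens):
    # (ii) Autoregressively predict VQ-VAE audio codes from BPE text tokens.
    return np.random.randint(0, CODEBOOK_SIZE, size=4 * len(text_tokens))

def hifigan_decode(latents, speaker_embedding):
    # (iii) Decode GPT-2 latents into a waveform, conditioned on an
    # H/ASP-style speaker embedding.
    return np.random.randn(256 * len(latents))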
Notebook Walk-through
We are going to use the XTTS-v2 model for our demo. It is a state-of-the-art zero-shot multi-speaker model capable of generating production-grade, realistic audio from a single reference clip of ~6 s. Find the full notebook here.
Note: Connect to a GPU runtime for faster inference.
Step 1 : Installing Dependencies
Step 1.1 : Change the locale in the notebook
This is an optional step, but sometimes while installing dependencies, Colab gives an encoding error. This is a fix for that.
import locale
locale.getpreferredencoding = lambda: "UTF-8"
Step 1.2 : Install the portaudio library
!sudo apt install portaudio19-dev
Step 1.3 : Install python libs
!pip install --quiet TTS scipy sounddevice wavio PyAudio ffmpeg-python
Step 2 : Cloning the Repository
!git clone https://huggingface.co/coqui/XTTS-v2
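The checkpoint files in this repository are stored with Git LFS, so if the cloned .pth files come down as tiny pointer files, install Git LFS first and clone again (an assumption about the Colab image; recent images usually ship with it preinstalled):
!sudo apt install git-lfs
!git lfs install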
Step 3 : Import Libraries
We will define all the imports up front to keep the code clean and maintainable.
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from IPython.display import Audio
from scipy.io import wavfile  # wavfile is a module under scipy.io
import sounddevice as sd
import wavio as wv
import os
import matplotlib.pyplot as plt
Step 4 : Record your voice
Let's record and sample our own voice to be cloned. You can experiment with a different sampling_freq and a longer duration for the recording.
sampling_freq = 44100
duration = 5
For brevity, the utility function that records the audio is not shown here; you can find it in the notebook.
audio, sr = get_audio()
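The helper might look roughly like the sketch below, assuming it records with sounddevice using the sampling_freq and duration defined above. get_audio is the notebook's name, but this body is an illustrative assumption (in Colab the real notebook may use a JavaScript-based recorder instead, since the VM has no microphone device):
def get_audio(filename="voice_sample.wav"):
    # Record `duration` seconds of mono audio at `sampling_freq` Hz.
    recording = sd.rec(int(duration * sampling_freq), samplerate=sampling_freq, channels=1)
    sd.wait()  # block until the recording finishes
    wavfile.write(filename, sampling_freq, recording)  # save as WAV
    return recording, sampling_freq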
Step 5 : Inference
As XTTS is a multilingual model, we are going to use it for multilingual text and audio.
Step 5.1 On English Text
Put your choice of text in text_to_speak and point reference_audios at your custom [your_voice_sample].wav file.
# Load the XTTS-v2 config and weights from the cloned repository
config = XttsConfig()
config.load_json("./XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./XTTS-v2/")
model.cuda()

text_to_speak = "How are you!"
reference_audios = ["./XTTS-v2/samples/[your_voice_sample].wav"]

# Synthesize English speech in the cloned voice
outputs_1 = model.synthesize(
    text_to_speak,
    config,
    speaker_wav=reference_audios,
    gpt_cond_len=3,
    language="en",
)
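The intro mentioned streaming inference with low latency; XTTS exposes that through a separate path. Here is a brief sketch based on the model's get_conditioning_latents and inference_stream methods (the chunk handling is an assumption for illustration):
# Compute the conditioning latents once from the reference audio...
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=reference_audios)

# ...then generate the waveform chunk by chunk instead of all at once.
chunks = model.inference_stream(text_to_speak, "en", gpt_cond_latent, speaker_embedding)
wav_chunks = list(chunks)  # each chunk is a tensor of audio samples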
Step 5.2 On French Text
Put the French text in text_to_speak and use the French sample for reference_audios.
# The model is already loaded from Step 5.1; it is re-initialized here
# only so this cell can run standalone.
config = XttsConfig()
config.load_json("./XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./XTTS-v2/")
model.cuda()

text_to_speak = "comment vas-tu"  # "how are you"
reference_audios = ["./XTTS-v2/samples/fr_sample.wav"]

# Synthesize French speech in the cloned voice
outputs_2 = model.synthesize(
    text_to_speak,
    config,
    speaker_wav=reference_audios,
    gpt_cond_len=3,
    language="fr",
)
Step 6 : Audio Output
Let's hear the generated audio.
English audio:
Audio(data=outputs_1['wav'], rate=24000)
French audio:
Audio(data=outputs_2['wav'], rate=24000)
Step 7 : Save the Audio
Let's save the audio; you can then download the files manually to your local computer.
out_dir = "./outputs"
out_file_path_1 = os.path.join(out_dir, "output1.wav")
out_file_path_2 = os.path.join(out_dir, "output2.wav")
# Create the output directory if it doesn't exist
os.makedirs(out_dir, exist_ok=True)
# Save the first (English) wav file
wavfile.write(out_file_path_1, 24000, outputs_1['wav'])
# Save the second (French) wav file
wavfile.write(out_file_path_2, 24000, outputs_2['wav'])
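In Colab you can also trigger the downloads programmatically instead of using the file browser (assuming the google.colab runtime is available):
from google.colab import files

files.download(out_file_path_1)  # English clip
files.download(out_file_path_2)  # French clip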
Conclusion
XTTS is a powerful multilingual text-to-speech and voice-cloning model designed to generate natural, realistic audio in 16 languages with minimal input data. Its ability to perform cross-language voice cloning and real-time speech synthesis with low latency makes it a versatile tool for a wide range of applications. Setting up custom voice cloning with XTTS is straightforward, and its integration with platforms like Hugging Face makes it easy to manage models and datasets.