Oct 17, 2024 • 4 min Read
F5-TTS: Flow Matching Text-to-Speech Model
Learn about how flow matching is used in F5-TTS and how to run it.
Kaushal Choudhary
Senior Developer Advocate
What is F5 TTS?
F5-TTS (Fairytaler that Fakes Fluent and Faithful speech with Flow matching) is a groundbreaking non-autoregressive Text-to-Speech system. Powered by Flow Matching and the Diffusion Transformer (DiT), it eliminates the need for a duration model, text encoder, or phoneme alignment. Instead, F5-TTS pads the text input with filler tokens, matching the length of the input speech, and then performs de-noising to generate high-quality speech.
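To make the padding idea concrete, here is a minimal Python sketch. The filler symbol `<F>` and the function name are illustrative assumptions, not names taken from the F5-TTS code base; the real model operates on token indices and mel-spectrogram frames rather than raw strings.

```python
# Hypothetical filler symbol, chosen only for this illustration.
FILLER = "<F>"


def pad_text_to_speech_length(text, n_frames, filler=FILLER):
    """Pad a character sequence with filler tokens up to the speech length.

    `n_frames` stands in for the number of speech frames the model sees.
    """
    chars = list(text)
    if len(chars) > n_frames:
        raise ValueError("text is longer than the speech it should align to")
    return chars + [filler] * (n_frames - len(chars))


padded = pad_text_to_speech_length("hello", n_frames=12)
print(padded)
```

Because the padded text and the speech now have the same length, the two can be combined frame by frame without a separate duration model.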
Trained on a massive 100,000-hour multilingual dataset, F5-TTS showcases impressive zero-shot abilities, producing highly natural and expressive speech. Want to hear it in action? Check out the audio samples here.
Concept
Autoregressive models like StyleTTS-2, SunoAI Bark, and VALL-E are highly capable and enable zero-shot text-to-speech (TTS). However, they face limitations such as slower inference, issues with speech tokenization, and phoneme alignment. In contrast, non-autoregressive models improve inference speed through parallel processing. E2 TTS (Embarrassingly Easy TTS), an earlier non-autoregressive model, suffers from slow training and poor zero-shot performance due to length mismatches between the concatenated character sequences and the input speech, which deeply entangle semantic and acoustic features.
As highlighted in a recent LinkedIn post by our CEO, Sudarshan Kamath, F5-TTS is making waves in the world of Text-to-Speech technology. Sudarshan proudly mentioned how F5-TTS delivers "great conversational audio in English and Chinese, generated non-autoregressively."
F5-TTS is a non-autoregressive model that accelerates both training and inference while enhancing robustness. It also introduces Sway Sampling, an inference-time strategy that reduces the number of function evaluations (NFE) needed, speeding up generation without sacrificing quality. Sway Sampling can also be applied to other models based on conditional flow matching (CFM).
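To illustrate what a "function evaluation" means here, the following is a toy sketch of flow-matching inference with a plain Euler solver. It is not the F5-TTS sampler: the velocity field below is a hand-written stand-in that points straight at a known target, whereas the real model uses a trained network. All names are assumptions made for this example.

```python
def make_toy_velocity(target):
    """Stand-in for a trained network: velocity of a straight path to `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


def euler_sample(velocity, x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in `nfe` Euler steps."""
    x = list(x0)
    calls = 0
    dt = 1.0 / nfe
    for k in range(nfe):
        t = k / nfe
        v = velocity(x, t)  # one function evaluation per step
        calls += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, calls


target = [0.5, -1.0, 2.0]   # plays the role of the clean speech features
noise = [0.1, 0.3, -0.7]    # plays the role of the initial noise
sample, calls = euler_sample(make_toy_velocity(target), noise, nfe=16)
print(sample, calls)
```

Each step costs one call to the network, so halving the NFE roughly halves generation time; the open question is how to do that without losing quality.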
It is built on the Diffusion Transformer (DiT) architecture, using zero-initialized adaptive Layer Norm (adaLN-zero) and ConvNeXt V2 blocks for improved text-speech alignment. Instead of relying on phoneme-level alignment, it jointly learns semantic and acoustic features. The inputs (character sequences, noisy speech, and masked speech) are processed independently before being concatenated, giving text more autonomy for in-context learning.
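The adaLN-zero idea can be sketched in a few lines of NumPy. This is a simplified illustration based on the general DiT formulation, not the F5-TTS implementation: the class name, shapes, and the single linear map standing in for attention or the feed-forward layer are all assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Layer norm over the feature axis, without learned parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZeroBlock:
    """Residual block whose shift, scale and gate come from a conditioning vector."""

    def __init__(self, dim, cond_dim, rng):
        # Zero-initialized modulation: shift = scale = gate = 0 at the start.
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Hypothetical inner sublayer (stand-in for attention or an MLP).
        self.w_inner = rng.normal(scale=0.02, size=(dim, dim))

    def __call__(self, x, cond):
        # x: (batch, seq, dim), cond: (batch, cond_dim)
        mod = cond @ self.w_mod + self.b_mod
        shift, scale, gate = np.split(mod, 3, axis=-1)
        h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
        h = h @ self.w_inner
        return x + gate[:, None, :] * h


rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, cond_dim=4, rng=rng)
x = rng.normal(size=(2, 5, 8))
cond = rng.normal(size=(2, 4))
out = block(x, cond)
print(np.allclose(out, x))
```

With the modulation weights at zero, the gate is zero and the block starts out as the identity, which is the property the "zero" in adaLN-zero refers to.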
Unlike models such as Voicebox and E2 TTS, which depend on U-Net-style skip connections and phoneme predictors, F5-TTS uses sinusoidal and rotary position embeddings to better align extended character sequences with speech. During inference, Sway Sampling shifts the distribution of flow steps according to a single parameter, placing more of the evaluations early in the flow and ultimately enhancing the quality of the synthesized speech.
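As a rough illustration of how a single parameter can shift the flow steps, here is a small sketch of the sway function described in the F5-TTS paper, t = u + s * (cos(pi/2 * u) - 1 + u), applied to evenly spaced values of u. The function name and the choice of s = -1 are assumptions for this example.

```python
import math


def sway_timesteps(nfe, s=-1.0):
    """Map uniform steps u in [0, 1] to sway-sampled flow steps t.

    s = 0 leaves the steps uniform; a negative s moves them toward t = 0.
    """
    steps = []
    for i in range(nfe + 1):
        u = i / nfe
        steps.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return steps


uniform = sway_timesteps(8, s=0.0)
swayed = sway_timesteps(8, s=-1.0)
for u, t in zip(uniform, swayed):
    print(f"u={u:.3f} -> t={t:.3f}")
```

The endpoints stay at 0 and 1, but with a negative s the intermediate steps are pulled toward 0, so more of the NFE budget is spent on the early part of the flow.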
Notebook Walk-through
We will use F5-TTS to clone our own custom voice and see how well it matches the flow of the voice. Find the full notebook here.
Step 1 : Clone the Repository
!git clone https://github.com/SWivid/F5-TTS.git
%cd F5-TTS
Step 2 : Install Libraries
!pip install -r requirements.txt
Step 3 : Download Model Checkpoint
We can use either the .pt or the .safetensors checkpoint. Here, we are going to use the model_1200000.pt checkpoint.
!wget https://huggingface.co/SWivid/F5-TTS/resolve/main/F5TTS_Base/model_1200000.pt -P ckpts/F5TTS_Base
Step 4 : Listen to the Reference Audio
Let's first hear how the reference audio sounds.
from IPython.display import Audio
Audio(data="tests/ref_audio/test_en_1_ref_short.wav", rate=24000)
Reference Audio
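Before running inference, it can help to confirm the sample rate and length of the reference clip. The helper below is a small sketch using only Python's standard `wave` module; the function name is an assumption, and it only handles plain PCM WAV files.

```python
import wave


def wav_info(path):
    """Return sample rate, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        n_frames = wf.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "seconds": n_frames / sample_rate,
    }


# Example call, assuming the repository has been cloned as in Step 1:
# print(wav_info("tests/ref_audio/test_en_1_ref_short.wav"))
```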
Step 5 : Inference on the above Audio
You can choose any --gen_text you like.
!python inference-cli.py --model "F5-TTS" --ckpt_file "/content/F5-TTS/ckpts/F5TTS_Base/model_1200000.pt" --ref_audio "tests/ref_audio/test_en_1_ref_short.wav" --ref_text "some call me nature others call me mother nature" --gen_text "So, this is F5 TTS, and it can match the flow of a speaker. How cool is that!"
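If you want to run several prompts, it may be more convenient to build the command in Python instead of editing one long line. The sketch below only assembles the argument list from the flags used in this walkthrough; the helper name is hypothetical and nothing is executed.

```python
import shlex


def build_inference_command(ckpt_file, ref_audio, ref_text, gen_text,
                            output_dir=None, model="F5-TTS"):
    """Assemble the inference-cli.py call as an argument list."""
    cmd = [
        "python", "inference-cli.py",
        "--model", model,
        "--ckpt_file", ckpt_file,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    return cmd


cmd = build_inference_command(
    ckpt_file="/content/F5-TTS/ckpts/F5TTS_Base/model_1200000.pt",
    ref_audio="tests/ref_audio/test_en_1_ref_short.wav",
    ref_text="some call me nature others call me mother nature",
    gen_text="So, this is F5 TTS, and it can match the flow of a speaker.",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run it; keeping the arguments as a list avoids quoting problems when the text contains apostrophes or quotation marks.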
Step 6 : Listen to the Generated Audio
from IPython.display import Audio
Audio(data="tests/out.wav", rate=24000)
Generated Voice
Step 7 : Custom Voice Sampling
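Next, let the model use our own voice to generate speech. First, we record our voice. The cell below assumes a Google Colab runtime (it imports google.colab) and the ffmpeg-python package.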
from IPython.display import HTML, Audio, display
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg
AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");
my_btn.appendChild(t);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);
var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;
var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    mimeType : 'audio/webm;codecs=opus'
  };
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);
    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
    }
  };
  recorder.start();
};
recordButton.innerText = "Recording... press to stop";
navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);
function toggleRecording() {
  if (recorder && recorder.state == "recording") {
    recorder.stop();
    gumStream.getAudioTracks()[0].stop();
    recordButton.innerText = "Saving the recording... pls wait!"
  }
}
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
var data = new Promise(resolve=>{
  recordButton.onclick = ()=>{
    toggleRecording()
    sleep(2000).then(() => {
      resolve(base64data.toString())
    });
  }
});
</script>
"""
def get_audio():
    # Show the record button and wait for the base64-encoded recording
    display(HTML(AUDIO_HTML))
    data = eval_js("data")
    binary = b64decode(data.split(',')[1])
    # Convert the audio from webm to wav format using ffmpeg
    process = (
        ffmpeg.input('pipe:0')
        .output('pipe:1', format='wav')
        .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
    )
    output, err = process.communicate(input=binary)
    # Save the wav file
    with open('tests/ref_audio/my_audio.wav', 'wb') as f:
        f.write(output)
    # Optionally, read the saved audio for further processing
    sr, audio = wav_read(io.BytesIO(output))
    return audio, sr
Then run the function to start recording:
audio, sr = get_audio()
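A quick sanity check on the recording can catch an empty or very quiet clip before it is used as a reference. The helper below is a sketch that works on the `audio` array and `sr` value returned above; the function name and the idea of reporting a normalized peak are assumptions for this example.

```python
import numpy as np


def summarize_recording(audio, sr):
    """Report duration, channel count and normalized peak level of a recording."""
    samples = np.asarray(audio)
    channels = 1 if samples.ndim == 1 else samples.shape[1]
    mono = samples.astype(np.float64)
    if samples.ndim == 2:
        mono = mono.mean(axis=1)  # average the channels
    if np.issubdtype(samples.dtype, np.integer):
        # Scale integer PCM to roughly [-1, 1].
        mono = mono / np.iinfo(samples.dtype).max
    peak = float(np.max(np.abs(mono))) if mono.size else 0.0
    return {"seconds": mono.size / sr, "channels": channels, "peak": peak}


# Synthetic one-second tone at half scale, standing in for a real recording.
demo = (np.sin(np.linspace(0, 2 * np.pi * 220, 24000)) * 16383).astype(np.int16)
summary = summarize_recording(demo, 24000)
print(summary)
```

In the notebook you would call `summarize_recording(audio, sr)`; a peak close to zero suggests the microphone did not capture anything useful.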
Step 8 : Inference on the Custom Voice
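Now we use the recorded voice as the reference and generate speech for the given text. Here --ref_text is left empty, in which case the script is expected to transcribe the reference audio itself, and -o sets the output directory.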
!python inference-cli.py --model "F5-TTS" --ckpt_file "/content/F5-TTS/ckpts/F5TTS_Base/model_1200000.pt" --ref_audio "tests/ref_audio/my_audio.wav" --ref_text "" --gen_text "So, this is F5 TTS, and it can match the flow of a speaker. How cool is that!" -o "outputs"
Step 9 : Final Voice Output
from IPython.display import Audio
Audio(data="outputs/out.wav", rate=24000)
Custom Voice
Conclusion
We saw how F5-TTS leverages Flow Matching and Diffusion Transformers to deliver an efficient, non-autoregressive TTS system that excels in speed and quality. Removing the need for phoneme alignment and introducing Sway Sampling not only accelerates training and inference but also improves robustness and expressiveness in zero-shot scenarios. This approach enables more precise text-speech alignment, making F5-TTS a significant advancement in flow-matching-based speech synthesis. The notebook above provides an easy way to try it out for free.