Oct 17, 2024 • 4 min read
F5-TTS: Flow Matching Text-to-Speech Model
Learn how flow matching is used in F5-TTS and how to run it.
Kaushal Choudhary
Senior Developer Advocate
What is F5-TTS?
F5-TTS (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching) is a groundbreaking non-autoregressive text-to-speech system. Powered by flow matching and a Diffusion Transformer (DiT), it eliminates the need for a duration model, text encoder, or phoneme alignment. Instead, F5-TTS pads the text input with filler tokens to match the length of the input speech, and then performs denoising to generate high-quality speech.
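To make the padding idea concrete, here is a tiny illustrative sketch (the function and filler-token names are hypothetical, not F5-TTS internals):
def pad_text_to_speech_length(chars, num_speech_frames, filler="<F>"):
    # Extend the character sequence with filler tokens until it matches
    # the length of the speech input (e.g., mel-spectrogram frames).
    assert len(chars) <= num_speech_frames
    return chars + [filler] * (num_speech_frames - len(chars))

print(pad_text_to_speech_length(list("hello"), 8))
# ['h', 'e', 'l', 'l', 'o', '<F>', '<F>', '<F>']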
Trained on a massive 100,000-hour multilingual dataset, F5-TTS showcases impressive zero-shot abilities, producing highly natural and expressive speech. Want to hear it in action? Check out the audio samples here.
Concept
Autoregressive models like SunoAI Bark and VALL-E enable zero-shot text-to-speech (TTS), but they suffer from slow inference and from issues with speech tokenization and phoneme alignment. Non-autoregressive models improve inference speed through parallel generation. However, E2 TTS, an end-to-end non-autoregressive model that simply pads the character sequence to the length of the input speech, suffers from slow training convergence and poor robustness, because the length mismatch deeply entangles semantic and acoustic features.
As highlighted in a recent LinkedIn post by our CEO, Sudarshan Kamath, F5-TTS is making waves in the world of Text-to-Speech technology. Sudarshan proudly mentioned how F5-TTS delivers "great conversational audio in English and Chinese, generated non-autoregressively."
F5-TTS is a non-autoregressive model that accelerates both training and inference while improving robustness. It also introduces Sway Sampling, a novel inference-time strategy for distributing flow steps that preserves quality at a lower number of function evaluations (NFE), speeding up generation; it can also be applied to other flow-matching (CFM) models without retraining.
It is built on the Diffusion Transformer (DiT) architecture, utilizing zero-initialized adaptive LayerNorm (adaLN-zero) and ConvNeXt V2 blocks for improved text-speech alignment. Instead of relying on phoneme-level alignment, it jointly learns semantic and acoustic features. The inputs, comprising the character sequence, noisy speech, and masked speech, are processed independently before being concatenated, giving the text more autonomy for in-context learning.
Unlike models such as Voicebox and E2 TTS, which depend on U-Net-style skip connections (and, in Voicebox's case, phoneme-level duration prediction), F5-TTS uses sinusoidal and rotary position embeddings to better align extended character sequences with speech. During inference, Sway Sampling reshapes the distribution of flow steps with a single coefficient, spending more evaluations early in the trajectory, which ultimately enhances the quality of the synthesized speech.
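As a rough sketch of the Sway Sampling schedule described in the paper, t = u + s·(cos(πu/2) − 1 + u): with a negative coefficient s, uniformly spaced u values are swayed toward small t, so more flow steps land early in the trajectory (this is our reading of the paper, not reference code):
import numpy as np

def sway_sample(num_steps, s=-1.0):
    # Sway-sampled flow steps in [0, 1]; s < 0 concentrates steps early.
    u = np.linspace(0.0, 1.0, num_steps)
    return u + s * (np.cos(np.pi / 2.0 * u) - 1.0 + u)

print(np.round(sway_sample(6), 3))
# [0.    0.049 0.191 0.412 0.691 1.   ]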
Notebook Walk-through
We will use F5-TTS to clone our own custom voice and see how well it matches the speaker's flow. Find the full notebook here.
Step 1 : Clone the Repository
!git clone https://github.com/SWivid/F5-TTS.git
%cd F5-TTS
Step 2 : Install Libraries
!pip install -r requirements.txt
Step 3 : Download Model Checkpoint
We can use either a .pt or a .safetensors checkpoint. Here, we will use model_1200000.pt.
!wget https://huggingface.co/SWivid/F5-TTS/resolve/main/F5TTS_Base/model_1200000.pt -P ckpts/F5TTS_Base
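If you prefer the Hugging Face Hub client to wget, the same checkpoint can be fetched with huggingface_hub (assuming it is installed; note it lands in the local HF cache rather than ckpts/F5TTS_Base, so pass the returned path to the CLI):
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the file path.
ckpt_path = hf_hub_download(
    repo_id="SWivid/F5-TTS",
    filename="F5TTS_Base/model_1200000.pt",
)
print(ckpt_path)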
Step 4 : Listen to the Reference Audio
Let's first hear how the reference audio sounds.
from IPython.display import Audio
Audio(data="tests/ref_audio/test_en_1_ref_short.wav", rate=24000)
Reference Audio
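Rather than trusting the hardcoded rate=24000, you can read the WAV header to confirm the file's actual sample rate (a quick sanity check; scipy is preinstalled on Colab):
from scipy.io.wavfile import read as wav_read

# Read the reference WAV to confirm its sample rate and length.
sr, wav = wav_read("tests/ref_audio/test_en_1_ref_short.wav")
print(f"sample rate: {sr} Hz, duration: {len(wav) / sr:.2f} s")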
Step 5 : Inference on the above Audio
You can choose any --gen_text you like.
!python inference-cli.py --model "F5-TTS" --ckpt_file "/content/F5-TTS/ckpts/F5TTS_Base/model_1200000.pt" --ref_audio "tests/ref_audio/test_en_1_ref_short.wav" --ref_text "some call me nature others call me mother nature" --gen_text "So, this is F5 TTS, and it can match the flow of a speaker. How cool is that!"
Step 6 : Listen to the Generated Audio
from IPython.display import Audio
Audio(data="tests/out.wav", rate=24000)
Generated Voice
Step 7 : Custom Voice Sampling
Let's have the model use our own voice to generate speech. First, we will record our voice.
from IPython.display import HTML, Audio
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg
AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");
my_btn.appendChild(t);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);
var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;
var handleSuccess = function(stream) {
gumStream = stream;
var options = {
mimeType : 'audio/webm;codecs=opus'
};
recorder = new MediaRecorder(stream);
recorder.ondataavailable = function(e) {
var url = URL.createObjectURL(e.data);
var preview = document.createElement('audio');
preview.controls = true;
preview.src = url;
document.body.appendChild(preview);
reader = new FileReader();
reader.readAsDataURL(e.data);
reader.onloadend = function() {
base64data = reader.result;
}
};
recorder.start();
};
recordButton.innerText = "Recording... press to stop";
navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);
function toggleRecording() {
if (recorder && recorder.state == "recording") {
recorder.stop();
gumStream.getAudioTracks()[0].stop();
recordButton.innerText = "Saving the recording... pls wait!"
}
}
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
var data = new Promise(resolve=>{
recordButton.onclick = ()=>{
toggleRecording()
sleep(2000).then(() => {
resolve(base64data.toString())
});
}
});
</script>
"""
def get_audio():
    display(HTML(AUDIO_HTML))
    data = eval_js("data")
    binary = b64decode(data.split(',')[1])

    # Convert the audio from webm to wav format using ffmpeg
    process = (
        ffmpeg
        .input('pipe:0')
        .output('pipe:1', format='wav')
        .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True,
                   quiet=True, overwrite_output=True)
    )
    output, err = process.communicate(input=binary)

    # Save the wav file
    with open('tests/ref_audio/my_audio.wav', 'wb') as f:
        f.write(output)

    # Optionally, read the saved audio for further processing
    sr, audio = wav_read(io.BytesIO(output))
    return audio, sr
Then run the function to record yourself:
audio, sr = get_audio()
Step 8 : Inference on the Custom Voice
Now, we will use the recorded voice as the reference and generate speech from the given text. Leaving --ref_text empty lets the script transcribe the reference audio automatically.
!python inference-cli.py --model "F5-TTS" --ckpt_file "/content/F5-TTS/ckpts/F5TTS_Base/model_1200000.pt" --ref_audio "tests/ref_audio/my_audio.wav" --ref_text "" --gen_text "So, this is F5 TTS, and it can match the flow of a speaker. How cool is that!" -o "outputs"
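To synthesize several lines against the same recorded reference, a small loop over the CLI works as well (a sketch reusing the exact flags from above; the texts and output folders are placeholders):
import subprocess

texts = [
    "First take: testing my cloned voice.",
    "Second take: flow matching is fast.",
]
for i, text in enumerate(texts):
    # Invoke the same inference-cli.py shown above, once per text.
    subprocess.run([
        "python", "inference-cli.py",
        "--model", "F5-TTS",
        "--ckpt_file", "/content/F5-TTS/ckpts/F5TTS_Base/model_1200000.pt",
        "--ref_audio", "tests/ref_audio/my_audio.wav",
        "--ref_text", "",
        "--gen_text", text,
        "-o", f"outputs/take_{i}",
    ], check=True)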
Step 9 : Final Voice Output
from IPython.display import Audio
Audio(data="outputs/out.wav", rate=24000)
Custom Voice
Conclusion
We saw how F5-TTS leverages flow matching and Diffusion Transformers to deliver a highly efficient, non-autoregressive TTS system that excels in speed and quality. Removing the need for phoneme alignment and introducing Sway Sampling not only accelerates training and inference but also improves robustness and expressiveness in zero-shot scenarios. This approach enables more precise text-speech alignment, making F5-TTS a significant advancement in flow-matching models for speech synthesis. The notebook above provides an easy way to test this novel approach for free.