Sept 14, 2024 • 5 min Read
Meta's Seamless M4T
SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
Kaushal Choudhary
Senior Developer Advocate
What is SeamlessM4T?
Meta is pushing boundaries with SeamlessM4T, a cutting-edge, Massively Multilingual & Multimodal Machine Translation model inspired by the Babel Fish from The Hitchhiker's Guide to the Galaxy. This unified model handles speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, along with automatic speech recognition (ASR), supporting over 100 languages.
SeamlessM4T addresses key challenges in translation by focusing on low-resource languages, translating both to and from English (X-eng, eng-X), and integrating multiple translation tasks into one system. Built using 1 million hours of speech data and a corpus of 470,000 hours of aligned translations, it improves upon existing models by offering multitask capabilities across various languages. It also builds on Meta's previous projects, including No Language Left Behind (NLLB), the Universal Speech Translator, and Massively Multilingual Speech, enhancing translation and speech recognition for over 1,100 languages.
Concept
Today's systems, often relying on cascaded models, separate automatic speech recognition (ASR), text translation (T2TT), and text-to-speech (TTS) into distinct stages. While widely used, they suffer from error propagation, compounding mistakes at each stage, and struggle with low-resource languages. Direct speech-to-text translation (S2TT) models have made strides in addressing these issues, particularly with end-to-end approaches showing better results, especially in English translations.
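To get a feel for why error propagation hurts cascaded systems, a back-of-the-envelope calculation helps. The sketch below assumes, purely for illustration, that each stage is correct independently of the others; the per-stage accuracies are made-up numbers, not measurements from the paper.

```python
from math import prod


def cascade_accuracy(stage_accuracies):
    """Chance that every stage in a cascade is correct,
    under the simplifying assumption that stage errors are independent."""
    return prod(stage_accuracies)


# Hypothetical per-stage accuracies for an ASR -> T2TT -> TTS cascade.
stages = {"asr": 0.95, "t2tt": 0.90, "tts": 0.98}
overall = cascade_accuracy(stages.values())
print(f"End-to-end accuracy: {overall:.3f}")
```

Even with each stage at 90% or better, the end-to-end figure drops below the weakest stage, which is the compounding effect described above.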
Enter SeamlessM4T—a cutting-edge solution designed to close the gap between direct and cascaded S2TT models. It pushes the boundaries of multilingual and multimodal translation by combining a robust speech representation model with a powerful multilingual T2TT engine. This innovative system builds a stronger direct X2T model (speech and text translation into text), delivering impressive results across diverse languages and modalities.
SeamlessM4T, as outlined in the research paper, brings together four core components:
- SeamlessM4T-NLLB: A massively multilingual text-to-text translation model.
- w2v-BERT 2.0: A speech representation model leveraging unlabeled audio data.
- T2U: A text-to-unit model converting text to unit sequences.
- Multilingual HiFi-GAN: A vocoder synthesizing speech from unit sequences.
SeamlessM4T enables speech-to-speech translation through UnitY, a two-pass framework that first translates speech into text, then converts it to acoustic units. Unlike traditional cascaded models, UnitY optimizes its components jointly, minimizing errors that typically propagate across stages and addressing domain mismatches. By relying on intermediate semantic representations, it resolves the challenges of mapping multimodal inputs to diverse target languages.
At the heart of this multitask framework lies the X2T (Into-Text Translation and Transcription) model. This dual-encoder, sequence-to-sequence system leverages a Conformer-based encoder for speech input and a Transformer-based encoder for text, both tied to a unified text decoder. The X2T model is rigorously trained on S2TT datasets, pairing source speech with target text translations.
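The data flow described above can be pictured with a toy sketch. Everything in it is a hypothetical stand-in: the function names are invented for illustration, nothing is learned, and the "encoders" and "decoder" only pass simple values along so that the routing is visible. It shows which component feeds which, not how the real model computes anything.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Speech:
    samples: List[float]


@dataclass
class Text:
    content: str


def speech_encoder(speech: Speech) -> List[float]:
    # Stand-in for the Conformer-based speech encoder.
    return [abs(s) for s in speech.samples]


def text_encoder(text: Text) -> List[float]:
    # Stand-in for the Transformer-based text encoder.
    return [float(ord(c)) for c in text.content]


def text_decoder(hidden: List[float], tgt_lang: str) -> Text:
    # Stand-in for the single text decoder shared by both encoders.
    return Text(f"<{tgt_lang}> text decoded from {len(hidden)} encoder states")


def x2t(source: Union[Speech, Text], tgt_lang: str) -> Text:
    """First pass: either modality goes through its own encoder, then one decoder."""
    if isinstance(source, Speech):
        hidden = speech_encoder(source)
    else:
        hidden = text_encoder(source)
    return text_decoder(hidden, tgt_lang)


def text_to_units(text: Text) -> List[int]:
    # Stand-in for T2U: text in, discrete unit sequence out.
    return [ord(c) % 100 for c in text.content]


def vocoder(units: List[int]) -> Speech:
    # Stand-in for the vocoder: unit sequence in, waveform out.
    return Speech([u / 100 for u in units])


def translate_to_speech(source: Union[Speech, Text], tgt_lang: str) -> Speech:
    """Two-pass flow: source -> text (X2T) -> units (T2U) -> waveform (vocoder)."""
    return vocoder(text_to_units(x2t(source, tgt_lang)))


print(x2t(Text("hi"), "fra").content)
```

The same `x2t` entry point serves speech and text sources, which mirrors the dual-encoder, single-decoder design; the speech output path simply adds the second pass on top.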
In the paper's trials, the spoken language identification model used to prepare the training data was initially trained from scratch on VoxLingua107 data, achieving a classification error rate of 5.25% at epoch 30 and surpassing open-source baselines such as the VL107 model on HuggingFace, which clocked in at 7%.
SeamlessM4T: Notebook Walk-through
We are going to use the SeamlessM4T-v2 model for text-to-speech, speech-to-speech, speech-to-text, and text-to-text translation tasks.
Step 1 : Setting Up CUDA and GPU Acceleration
To use the GPU, change the runtime to T4 GPU. After selecting the runtime, run this:
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device
This should display cuda:0, meaning the GPU is active and CUDA is available.
Step 2 : Installing Dependencies
!pip install --quiet git+https://github.com/huggingface/transformers sentencepiece datasets
Step 3 : Preprocessing
Let's load the processor from HuggingFace, which is required for preprocessing the inputs.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
Step 4 : Prepare the Audio
Here, we will take an audio sample from a Hindi speech corpus.
Note - We don't need to specify the source language for audio input, as the model recognizes it from the speech itself.
# let's load an audio sample from a Hindi speech corpus
from datasets import load_dataset
dataset = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]
print(f"Sampling rate: {audio_sample['sampling_rate']}")
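Before handing the clip to the processor, it can be worth confirming that the sampling rate is what the model expects. The helper below is a minimal sketch: `check_audio` is a hypothetical name, and the 16 kHz default is an assumption taken from the rate quoted later in this walk-through.

```python
def check_audio(array, sampling_rate, expected_rate=16_000):
    """Return the clip duration in seconds, or raise if the rate is unexpected."""
    if sampling_rate != expected_rate:
        raise ValueError(
            f"Expected {expected_rate} Hz audio but got {sampling_rate} Hz; "
            "resample before passing it to the processor."
        )
    return len(array) / sampling_rate


# Synthetic example: one second of silence at 16 kHz.
one_second = [0.0] * 16_000
print(f"Duration: {check_audio(one_second, 16_000):.2f} s")
```

In the notebook, the call would be `check_audio(audio_sample["array"], audio_sample["sampling_rate"])`.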
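Step 5 : Use the Processor to Process Audio Inputs
Convert the audio array into model-ready tensors, passing the sampling rate so the processor can check it.
audio_inputs = processor(audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt").to(device)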
Step 6 : Preparing Text
We will also provide input text, to be translated into speech or into text in another language.
text_inputs = processor(text="Hey, This is seamless m4t from Meta.", src_lang="eng", return_tensors="pt").to(device)
Step 7 : Load the Model
We use the seamless-m4t-v2-large checkpoint, a 2.3B parameter model.
from transformers import SeamlessM4Tv2Model
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
Step 7.1 : Move the Model to the GPU
model = model.to(device)
Step 7.2 : Generate Audio from Text and Audio
Now, here's the tricky part. We pass the input text and the input audio to the model in two separate calls, performing the text-to-speech (T2ST) and speech-to-speech (S2ST) translation tasks, here with Russian as the target language.
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
Step 8 : Listen to the Generated Audio
To listen to the audio, we play it back at the model's output sampling rate (16 kHz).
From Text
from IPython.display import Audio
sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)
From Audio
Audio(audio_array_from_audio, rate=sample_rate)
Step 9 : Save the Audio File
We can save the audio files as well.
from scipy.io import wavfile
# Save the speech generated from text; pass audio_array_from_audio instead to save the other clip
wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_text)
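When the array holds floating-point samples, scipy stores them as a float WAV file, which not every player handles. As an alternative, here is a minimal standard-library sketch that writes 16-bit PCM instead. `save_pcm16_wav` is a hypothetical helper, and it assumes a mono clip whose samples lie in [-1.0, 1.0]; anything outside that range is clipped.

```python
import sys
import wave
from array import array


def save_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    # Clip each sample to [-1, 1], then scale it to the signed 16-bit range.
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)  # number of samples written
```

In the notebook, this could be called as `save_pcm16_wav("seamless_m4t_out_pcm16.wav", audio_array_from_text, sample_rate)`; the output file name is just an example.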
Step 10 : Translate from Audio and Text
Now, let's translate the original audio and text into text, which corresponds to the speech-to-text (S2TT) and text-to-text (T2TT) tasks. You can specify the tgt_lang of your choice.
# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")
# from text
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")
The output would be:
Translation from audio: Politicians said they found enough ambiguity in the Afghan constitution to unnecessarily determine the decisive vote.
Translation from text: C'est le M4T sans couture de Meta.
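Both calls in Step 10 follow the same pattern, so they can be folded into one small helper. This is only a sketch: `translate_to_text` is a hypothetical name, not part of transformers, and it relies solely on the `generate` and `decode` calls already used above.

```python
def translate_to_text(model, processor, inputs, tgt_lang):
    """Run S2TT or T2TT, depending on whether `inputs` came from audio or text."""
    # generate_speech=False asks the model for text tokens only.
    output_tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    # Take the first sequence of the batch and decode it to a string.
    return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)


# With the objects from the earlier steps (not defined in this block):
# translate_to_text(model, processor, audio_inputs, "eng")  # speech -> English text
# translate_to_text(model, processor, text_inputs, "fra")   # English text -> French text
```

Passing the spoken language itself as `tgt_lang` (for example `"hin"` for the Hindi clip) should give a transcription rather than a translation, which is how the ASR task can be reached; treat that as an assumption to verify against the model card.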
This concludes our walk-through of the four translation tasks (S2TT, T2ST, S2ST, T2TT) that SeamlessM4T supports alongside Automatic Speech Recognition (ASR).
Find the full Notebook here.
Conclusion
In conclusion, Meta's SeamlessM4T represents a significant leap in multilingual and multimodal translation, enabling effortless communication across speech and text for over 100 languages. By integrating multiple tasks into a unified system, SeamlessM4T effectively addresses the challenges of low-resource languages and improves upon traditional models with its end-to-end approach. This advancement showcases Meta's commitment to bridging linguistic gaps and enhancing global communication through cutting-edge AI technology.