
Sept 14, 2024 | 5 min read

Meta's Seamless M4T

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.


Kaushal Choudhary

Senior Developer Advocate


What is SeamlessM4T?

Meta is pushing boundaries with SeamlessM4T, a cutting-edge, Massively Multilingual & Multimodal Machine Translation model inspired by the Babel Fish from The Hitchhiker's Guide to the Galaxy. This unified model handles speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, along with automatic speech recognition (ASR), supporting over 100 languages.

SeamlessM4T addresses key challenges in translation by focusing on low-resource languages, translating both to and from English (X-eng, eng-X), and integrating multiple translation tasks into one system. Built using 1 million hours of speech data and a corpus of 470,000 hours of aligned translations, it improves upon existing models by offering multitask capabilities across various languages. It also builds on Meta's previous projects, including No Language Left Behind (NLLB), the Universal Speech Translator, and Massively Multilingual Speech, which extended speech technology to over 1,100 languages.

Concept

Today's systems, often relying on cascaded models, separate automatic speech recognition (ASR), text translation (T2TT), and text-to-speech (TTS) into distinct stages. While widely used, they suffer from error propagation, compounding mistakes at each stage, and struggle with low-resource languages. Direct speech-to-text translation (S2TT) models have made strides in addressing these issues, particularly with end-to-end approaches showing better results, especially in English translations.
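To make the error-propagation point concrete, here is a toy calculation. It assumes each stage of a cascade is correct independently with a fixed probability, which is a simplification, and the per-stage accuracies are made-up illustrative numbers, not measurements from the paper.

```python
from math import prod
from typing import Sequence


def cascade_accuracy(stage_accuracies: Sequence[float]) -> float:
    """End-to-end accuracy of a pipeline whose stages fail independently."""
    for acc in stage_accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError("each accuracy must lie in [0, 1]")
    # Under the independence assumption, the accuracies simply multiply.
    return prod(stage_accuracies)


# Hypothetical per-stage accuracies for ASR -> T2TT -> TTS.
stages = [0.90, 0.90, 0.90]
print(f"End-to-end accuracy: {cascade_accuracy(stages):.3f}")
```

Under these assumptions, three stages that are each 90% accurate give roughly 73% end-to-end accuracy, which is why removing stages (or training them jointly) is attractive.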

Enter SeamlessM4T—a cutting-edge solution designed to close the gap between direct and cascaded S2TT models. It pushes the boundaries of multilingual and multimodal translation by combining a robust speech representation model with a powerful multilingual T2TT engine. This innovative system builds a stronger direct X2T model (speech and text translation into text), delivering impressive results across diverse languages and modalities.

m4t-pre-trained

SeamlessM4T, as outlined in the research paper, brings together four core components:

  1. SeamlessM4T-NLLB: A massively multilingual text-to-text translation model.
  2. w2v-BERT 2.0: A speech representation model leveraging unlabeled audio data.
  3. T2U: A text-to-unit model converting text to unit sequences.
  4. Multilingual HiFi-GAN: A vocoder synthesizing speech from unit sequences.

m4t-unity

SeamlessM4T enables speech-to-speech translation through UnitY, a two-pass framework that first translates speech into text, then converts it to acoustic units. Unlike traditional cascaded models, UnitY optimizes its components jointly, minimizing errors that typically propagate across stages and addressing domain mismatches. By relying on intermediate semantic representations, it resolves the challenges of mapping multimodal inputs to diverse target languages.

At the heart of this multitask framework lies the X2T (Into-Text Translation and Transcription) model. This dual-encoder, sequence-to-sequence system leverages a Conformer-based encoder for speech input and a Transformer-based encoder for text, both tied to a unified text decoder. The X2T model is rigorously trained on S2TT datasets, pairing source speech with target text translations.
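The routing idea behind the dual-encoder design can be sketched in a few lines. This is a toy illustration only: the class name X2TSketch, the stub encoders, and the stub decoder are hypothetical stand-ins and do not reflect the actual SeamlessM4T implementation. The point it shows is that the input modality selects the encoder, while both paths share one text decoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Union

Hidden = List[str]  # stand-in for encoder output states


@dataclass
class X2TSketch:
    """Toy dual-encoder model: two encoders feeding one shared text decoder."""

    speech_encoder: Callable[[Sequence[float]], Hidden]
    text_encoder: Callable[[str], Hidden]
    text_decoder: Callable[[Hidden, str], str]

    def to_text(self, source: Union[str, Sequence[float]], tgt_lang: str) -> str:
        # Route by modality; both paths end in the same decoder.
        if isinstance(source, str):
            hidden = self.text_encoder(source)
        else:
            hidden = self.speech_encoder(source)
        return self.text_decoder(hidden, tgt_lang)


# Stub components, purely illustrative (no real encoding or decoding happens).
model_sketch = X2TSketch(
    speech_encoder=lambda wav: [f"speech_frame_{i}" for i in range(len(wav))],
    text_encoder=lambda text: text.split(),
    text_decoder=lambda hidden, lang: f"<{lang}> decoded from {len(hidden)} states",
)

print(model_sketch.to_text("hello world", "fra"))    # text input path
print(model_sketch.to_text([0.1, 0.2, 0.3], "fra"))  # speech input path
```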

X2T Model Architecture

In the paper's preliminary experiments, the spoken language identification (LID) model used in the data pipeline was trained from scratch on VoxLingua107 data, reaching a classification error rate of 5.25% at epoch 30, lower than the 7% of the open-source VL107 model on HuggingFace.

SeamlessM4T: Notebook Walk-through

We are going to use the SeamlessM4T-v2 model for text-to-speech, speech-to-speech, speech-to-text, and text-to-text translation tasks.

Step 1 : Setting Up CUDA and GPU Acceleration

To use the GPU, switch the notebook runtime to T4 GPU. After selecting the runtime, run the following:

import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

This should display 'cuda:0', meaning the GPU is active and CUDA is available. If it displays 'cpu', the code still runs, but much more slowly.

Step 2 : Installing Dependencies

!pip install --quiet git+https://github.com/huggingface/transformers sentencepiece datasets

Step 3 : Preprocessing

Let's load the processor from HuggingFace, which is required for preprocessing the audio and text inputs.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")

Step 4 : Prepare the Audio

Here, we take an audio sample in Hindi.

Note - We don't need to specify the source language for audio input; the model works directly from the speech signal.

# let's load an audio sample from a Hindi speech corpus
from datasets import load_dataset
dataset = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]

print(f"Sampling rate: {audio_sample['sampling_rate']}")

Step 5 : Use the Processor to process audio inputs

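Pass the raw waveform to the processor, together with its sampling rate, to get model-ready tensors.

audio_inputs = processor(audios=audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt").to(device)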
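If your own audio is not already at the rate the model expects, it has to be resampled before it reaches the processor. The sketch below assumes a 16 kHz target (the constant EXPECTED_RATE is an assumption to verify against your checkpoint) and uses naive linear interpolation with no anti-aliasing filter, so it is only meant to show the idea; a dedicated audio library is the better choice for real use. The function name resample_linear is hypothetical.

```python
from typing import List, Sequence

EXPECTED_RATE = 16_000  # assumption: the rate the model's feature extractor expects


def resample_linear(
    samples: Sequence[float], orig_rate: int, target_rate: int = EXPECTED_RATE
) -> List[float]:
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if orig_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_rate == target_rate or len(samples) < 2:
        return list(samples)

    n_out = max(1, round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out: List[float] = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# Upsample a tiny ramp from 2 Hz to 4 Hz, then downsample from 4 Hz to 2 Hz.
print(resample_linear([0.0, 1.0, 2.0, 3.0], orig_rate=2, target_rate=4))
print(resample_linear([0.0, 1.0, 2.0, 3.0], orig_rate=4, target_rate=2))
```

In this walk-through the sample's rate is printed in Step 4, so you can check it against the expected rate before deciding whether resampling is needed.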

Step 6 : Preparing Text

Next, we prepare the input text, which will be translated into speech or into text in another language. For text input, the source language must be given via src_lang.

text_inputs = processor(text="Hey, This is seamless m4t from Meta.", src_lang="eng", return_tensors="pt").to(device)

Step 7 : Load the Model

We use the seamless-m4t-v2-large checkpoint, a 2.3B-parameter model.

from transformers import SeamlessM4Tv2Model

model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

Step 7.1 : Move the Model to the GPU

model = model.to(device)

Step 7.2 : Generate Audio from Text and Audio

Now for the key part. We take the text and audio inputs prepared above and translate both into Russian speech, which covers the text-to-speech (T2ST) and speech-to-speech (S2ST) tasks. In each call, the generated waveform is the first element of the output.

audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

Step 8 : Listen to the Generated Audio

To listen to the generated audio, play it back at the model's output sampling rate (16 kHz), which is available as model.config.sampling_rate.

From Text

from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)


From Audio

Audio(audio_array_from_audio, rate=sample_rate)
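The sampling rate matters because it fixes how sample counts map to time. As a small worked example (the helper names below are hypothetical), the clip duration is the number of samples divided by the rate, and playing audio at the wrong rate changes its speed by the ratio of the two rates.

```python
def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Length of a clip in seconds, given its sample count and sampling rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def playback_speed(true_rate: int, playback_rate: int) -> float:
    """Speed-up factor heard when audio recorded at true_rate is played at playback_rate."""
    if true_rate <= 0 or playback_rate <= 0:
        raise ValueError("rates must be positive")
    return playback_rate / true_rate


print(duration_seconds(48_000, 16_000))  # 48,000 samples at 16 kHz -> 3.0 seconds
print(playback_speed(16_000, 32_000))    # played at double the rate -> 2.0x speed
```

In the notebook, duration_seconds(len(audio_array_from_text), sample_rate) gives the length of the generated clip.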


Step 9 : Save the Audio File

We can also save the generated audio to a WAV file.

from scipy.io import wavfile

# pass data=audio_array_from_audio instead to save the speech-to-speech output
wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_text)
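SciPy stores a float array as a floating-point WAV file, which some players do not handle well. As an alternative, here is a minimal sketch that writes 16-bit PCM using only Python's standard library. The function name save_pcm16_wav is hypothetical, and the sketch assumes mono audio with samples in the range [-1, 1]; values outside that range are clipped.

```python
import sys
import wave
from array import array
from typing import Iterable


def save_pcm16_wav(path: str, samples: Iterable[float], sample_rate: int = 16_000) -> int:
    """Write mono float samples in [-1, 1] as 16-bit PCM; returns the frame count."""
    # Clip to [-1, 1], then scale to the signed 16-bit integer range.
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

In the notebook this could be called as save_pcm16_wav("seamless_m4t_out_pcm16.wav", audio_array_from_text, sample_rate), where the output file name is just an example.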

Step 10 : Translate from Audio and Text

Now, let's translate the original text and audio into text, which covers the speech-to-text (S2TT) and text-to-text (T2TT) tasks. Passing generate_speech=False makes the model return text tokens instead of a waveform. You can set tgt_lang to the language of your choice.

# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

# from text
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")

The output should look like this:

Translation from audio: Politicians said they found enough ambiguity in the Afghan constitution to unnecessarily determine the decisive vote.
Translation from text: C'est le M4T sans couture de Meta.

This concludes the walk-through: we performed all four translation tasks (S2TT, T2ST, S2ST, T2TT) with SeamlessM4T. The same model also supports automatic speech recognition (ASR).
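Across the steps above, the only thing that changes between tasks is the kind of input passed to the processor and whether generate_speech is switched off. A small helper can make that explicit. The function name generation_kwargs and the task labels are hypothetical conveniences, not part of the transformers API, and treating ASR as text output with tgt_lang set to the spoken language is an assumption to check against the model documentation.

```python
from typing import Dict, Union

TEXT_OUTPUT_TASKS = {"S2TT", "T2TT", "ASR"}   # tasks that return text tokens
SPEECH_OUTPUT_TASKS = {"S2ST", "T2ST"}        # tasks that return a waveform


def generation_kwargs(task: str, tgt_lang: str) -> Dict[str, Union[str, bool]]:
    """Keyword arguments to pass to model.generate for a given task label."""
    task = task.upper()
    if task in TEXT_OUTPUT_TASKS:
        return {"tgt_lang": tgt_lang, "generate_speech": False}
    if task in SPEECH_OUTPUT_TASKS:
        return {"tgt_lang": tgt_lang, "generate_speech": True}
    raise ValueError(f"unknown task: {task!r}")


print(generation_kwargs("S2TT", "eng"))
print(generation_kwargs("T2ST", "rus"))
```

With the notebook's variables, the speech-to-text call would then read model.generate(**audio_inputs, **generation_kwargs("S2TT", "eng")).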

Find the full Notebook here.

Conclusion

In conclusion, Meta's SeamlessM4T represents a significant leap in multilingual and multimodal translation, enabling effortless communication across speech and text for over 100 languages. By integrating multiple tasks into a unified system, SeamlessM4T effectively addresses the challenges of low-resource languages and improves upon traditional models with its end-to-end approach. This advancement showcases Meta's commitment to bridging linguistic gaps and enhancing global communication through cutting-edge AI technology.