Wed Jan 22 2025 · 13 min read

TTS Benchmark 2025: Smallest.ai vs ElevenLabs Report

Smallest.ai vs. ElevenLabs: a TTS benchmark evaluating latency and speech quality to help users choose the best fit for their real-time voice synthesis needs.


Pooja Porwal

Head - Growth

Check out the summary on our podcast!

Abstract

TTS (Text-to-Speech) is a technology that converts written text into natural-sounding speech. It is widely used in applications like virtual assistants (e.g., Siri, Alexa), customer service automation, audiobooks, accessibility tools for the visually impaired, and in-car navigation systems.

This benchmark report presents a comprehensive open-source comparison between Eleven Labs and Smallest.ai based on two critical performance metrics for real-time Text-to-Speech (TTS) applications: Latency and Quality. These parameters directly impact the effectiveness of TTS solutions in real-time systems such as customer support bots, voice assistants, and interactive media.

The results show Smallest.ai's unequivocal edge in latency and a significant advantage in speech quality (MOS) in a majority of real-world use cases.

Finally, we give references to the open-source code used to benchmark the latency and quality, ensuring complete transparency to anyone who wishes to replicate these benchmarks.

About the Companies

Smallest.ai is an emerging AI startup focused on low-latency, high-quality foundational multi-modal AI models that challenge the status quo. With an optimized API designed for real-time applications, it balances performance and efficiency.

Eleven Labs is a widely adopted speech AI platform known for high-fidelity voice synthesis, offering versatile models that appeal to developers seeking quality and flexibility.

Both platforms are strong contenders in the high-performance TTS space.

Metrics for Benchmarking & Results

To effectively evaluate the TTS capabilities of both platforms, we focus on two key metrics: Latency and MOS (Mean Opinion Score). These metrics are essential for understanding a TTS system’s performance and user experience, especially in real-time applications.

1. Latency

Latency refers to the time it takes for the TTS system to convert input text into speech.

For auto-regressive models, this refers to the time to first byte (TTFB); for non-autoregressive single-shot models, it refers to the total inference time.

The biggest nuance in latency is that even when a model's own inference latency is low, network latency can dominate the end-user experience. As a general rule of thumb, the farther the data has to travel between user and server, the longer a request takes.

At Smallest.ai, we measure latency by tracking the time it takes for the generated voice to reach the end user, ensuring accurate real-time performance.
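For streaming (auto-regressive) endpoints, time to first byte and total time can be measured separately. The sketch below is illustrative only, not the exact harness used for the numbers in this report; it assumes a generic POST endpoint that streams audio bytes, and the url, payload, and headers are placeholders:

import time
import requests

def measure_streaming_latency(url, payload, headers):
    """Measure time-to-first-byte (TTFB) and total time for a streaming TTS request.

    Illustrative sketch: url, payload, and headers are placeholders to be
    adapted to the provider under test.
    """
    start = time.time()
    ttfb = None
    with requests.post(url, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=4096):
            if chunk and ttfb is None:
                ttfb = time.time() - start  # first audio bytes arrived
    total = time.time() - start  # all audio bytes received
    return ttfb, total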

We compare latencies for the Lightning TTS model by Smallest.ai and Flash v2.5 by ElevenLabs, both of which claim to be among the fastest in the world.


Latency Results

Here, we measure end-to-end inference time, since both models generate the entire audio in a single API request. Below are the code and the test sentence used to generate the latency benchmarks.

1. Smallest.ai:
Average Latency: 340 ms (India)
Average Latency: 336 ms (US)

Code to get average latencies for Smallest.ai:

import time
import requests
from datetime import datetime

## Enter the URL
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
SAMPLE_RATE = 24000 ## Can be changed to 8000, 16000, 48000
VOICE_ID = "emily"  ## List of supported voices can be found here: https://waves-docs.smallest.ai/waves-api

## Edit the payload
payload = {
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "voice_id": VOICE_ID,
    "sample_rate": SAMPLE_RATE,
    "speed": 1.0,
    "add_wav_header": True
}

## Edit the header - enter your API token
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json"
}

## Send the requests and save the audio
print(f"Sending the test {datetime.now()}")
latencies_smallest = []
for i in range(10):
    start_time = time.time()
    response = requests.post(url, json=payload, headers=headers)
    latency = time.time() - start_time  # measure before writing to disk

    if response.status_code == 200:
        # Save the audio bytes to a .wav file
        with open('waves_lightning_sample_audio_en.wav', 'wb') as wav_file:
            wav_file.write(response.content)

        print("Audio file saved as waves_lightning_sample_audio_en.wav. Latency: ", latency)
        latencies_smallest.append(latency)
    else:
        print(f"Error occurred with status code {response.status_code}")

print("Average Latency for Smallest: ", sum(latencies_smallest) / len(latencies_smallest))

2. ElevenLabs:
Average Latency: 527 ms (India)
Average Latency: 350 ms (US)

Code to get average latencies for Elevenlabs:

import time

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

# Initialize the ElevenLabs client with the provided API key
client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)

def text_to_speech_file(text: str) -> str:
    """
    Converts text to speech using the ElevenLabs API and saves the audio file.

    Parameters:
    text (str): The text to be converted to speech.

    Returns:
    str: The path of the saved audio file.
    """
    # Calling the text-to-speech conversion API with detailed parameters
    response = client.text_to_speech.convert(
        voice_id="pNInz6obpgDQGcFmaJgB",  # Adam pre-made voice
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",  # Use the flash model for low latency
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )

    # Output MP3 file path
    save_file_path = "elevenlabs.mp3"

    # Writing the audio to a file
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)

    print(f"{save_file_path}: A new audio file was saved successfully!")

    # Return the path of the saved audio file
    return save_file_path

# Initialize a list to store latencies
latencies_eleven_flash_v2_5 = []

# Measure the latency for 10 API requests
for i in range(10):
    start_time = time.time()
    text_to_speech_file("Hello, my name is Emily. I am a text-to-speech voice.")
    latencies_eleven_flash_v2_5.append(time.time() - start_time)

# Print the average latency
print("Average Latency for Flash V2_5 Model: ", sum(latencies_eleven_flash_v2_5) / 10)

These results show that Smallest.ai is faster in terms of latency across both India and US geographies.

However, faster latency with degraded audio quality does not help the end user. Hence, we also compare the two models on quality using widely accepted open-source quality benchmarks.

2. MOS (Mean Opinion Score)

The Mean Opinion Score (MOS) is a widely used metric for evaluating the quality of synthetic speech. It uses a 5-point scale, where 5 is the highest quality and 1 is the lowest.

MOS assesses key aspects such as naturalness, intelligibility, and expressiveness of the synthesized voice. While subjective evaluation remains common in the industry, it is not the most precise approach.
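For intuition, a subjective MOS is simply the arithmetic mean of listener ratings on the 5-point scale. A toy sketch with hypothetical ratings (for illustration only):

# Hypothetical ratings (1-5) from five listeners for one synthesized utterance
ratings = [5, 4, 4, 5, 3]
mos = sum(ratings) / len(ratings)
print(mos)  # 4.2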

To make this process more objective, we refined it by incorporating 20 distinct categories for Hindi and English (two of the most widely spoken languages in the world), chosen based on high-demand requirements from enterprise customers. This ensures a more thorough and reliable measure of voice quality.

We used two commonly accepted open-source libraries, WV-MOS and UTMOSv2, averaged the MOS scores from both, and report the results in the table below:

Category | Mean MOS Smallest | Mean MOS Elevenlabs | Example sentence
--- | --- | --- | ---
Small sentences | 4.551 | 4.152 | You are very talented.
Medium sentences | 4.308 | 4.068 | The beauty of this garden is unique and indescribable.
Long sentences | 3.917 | 3.374 | Your writing style is very impactful and thoughtful. Your words have the power to not only inspire the reader to read but also compel them to think deeply. This is a true testament to your writing skill.
Hard sentences | 4.393 | 3.935 | Curious koalas curiously climbed curious curious climbers
Mahabharat stories | 4.286 | 3.941 | The Mahabharata is a unique epic of Indian culture, a saga of religion, justice, and duty.
Time sentences | 4.505 | 3.854 | The meeting is scheduled for October 15th at 11 AM.
Number sentences | 4.629 | 4.012 | -4.56E-02
Sentences with medicine | 4.345 | 3.773 | Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment.
Mix languages | 4.678 | 4.116 | तुम्हें नहीं लगता? This isn't right!
Punctuation sentences | 4.597 | 4.096 | Don't you think? This isn't right!
Sentences with places | 4.251 | 3.766 | अलवर के पुराने किले में discovered chambers that naturally amplify spiritual consciousness through sacred geometry.
Places | 4.685 | 4.216 | Thiruvananthapuram
English in Hindi sentences | 4.527 | 4.088 | इस साल की revenue target 10 million dollars है।
Sentences with names | 4.614 | 4.216 | मोहन discovered the quantum nature of karma.
Hindi in English sentences | 4.59 | 4.255 | The meeting went well, lekin kuch points abhi bhi unclear hain.
Phonetic Hindi in English sentences | 4.59 | 4.303 | Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi!
Sentence with numbers | 4.472 | 4.216 | I got 5.34% interest
Acronyms | 4.033 | 4.047 | POTUS
Names | 4.408 | 4.536 | Indira
Sentence with date and time | 3.915 | 4.062 | ६ जून २०२६ ०७:०० को पहला मानव-मशीन विवाह हुआ।

Below is the code used to generate audio for these sentences using Smallest and Elevenlabs:

import pandas as pd
import os
import pydub
import requests
from elevenlabs.client import ElevenLabs
from elevenlabs import VoiceSettings
from tqdm import tqdm

# Create directories for storing audio samples
os.makedirs('audio_samples/smallest', exist_ok=True)
os.makedirs('audio_samples/eleven', exist_ok=True)


elevenlabs_client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)

# Read the test CSV, which has the following structure
# (column names are lowercase: 'sentence' and 'category'):
#--|---------------------|-|-----------------------|--
#--|      sentence       |-|       category        |--
#--|---------------------|-|-----------------------|--
#--|     Sentence text.  |-| Category of sentences |--
#--|---------------------|-|-----------------------|--

tts_test_df = pd.read_csv('tts_test.csv')

# Function to generate audio using Smallest API
def generate_audio_smallest(text, filename):
    url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
    ## Edit the header - enter Token
    headers = {
        "Authorization": f"Bearer {SMALLEST_API}",
        "Content-Type": "application/json"
    }
    payload = {
        "text": text,
        "voice_id": "arnav",
        "sample_rate": 24000,
        "speed": 1.0,
        "add_wav_header": True
    }
    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        with open(filename, 'wb') as wav_file:
            wav_file.write(response.content)
        print(f"Audio file saved as {filename}")
    else:
        print(f"Error Occured with status code {response.text}")

# Function to generate audio using Elevenlabs API
def generate_audio_eleven(text, filename):
    response = elevenlabs_client.text_to_speech.convert(
        voice_id="zT03pEAEi0VHKciJODfn",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5", # use the flash model for low latency
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )
    with open(filename, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    print(f"Audio file saved as {filename}")

    # Convert mp3 to wav - for mos models to work properly
    sound = pydub.AudioSegment.from_mp3(filename)
    wav_filename = filename.replace('.mp3', '.wav')
    sound.export(wav_filename, format="wav")
    print(f"Audio file converted to {wav_filename}")

    # Delete the mp3 file
    os.remove(filename)
    print(f"Deleted the mp3 file {filename}")



# Iterate over each sentence in the dataframe and generate audio
for index, row in tts_test_df.iterrows():
    text = row['sentence']
    category = row['category']
    os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
    os.makedirs(f'audio_samples/eleven/{category}', exist_ok=True)
    smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
    eleven_filename = f"audio_samples/eleven/{category}/sentence_{index}.mp3"

    generate_audio_smallest(text, smallest_filename)
    generate_audio_eleven(text, eleven_filename)


WV-MOS is an open-source library that uses a pretrained Wav2Vec2 model to extract features and predict MOS scores. Below is the code used to generate MOS scores using the WV-MOS library:

import torch
print("Is GPU available: ", torch.cuda.is_available())  # Please make sure CUDA is available

from wvmos import get_wvmos
from pathlib import Path
from tqdm import tqdm
import pandas as pd
import os

def evaluate_directory(directory, mos_model, extension):
    """Evaluate all audio files in a directory."""
    results = []
    dir_path = Path(directory)
    audio_files = sorted(dir_path.glob(f"*.{extension}"))

    # For WAV files, calculate directly
    for audio_path in tqdm(audio_files, desc=f"Evaluating {directory}"):
        try:
            score = float(mos_model.calculate_one(str(audio_path)))
            results.append({
                'file': audio_path.name,
                'provider': directory.split('/')[-2],
                'category': directory.split('/')[-1],
                'mos_score': score
            })
        except Exception as e:
            print(f"Error calculating MOS for {audio_path.name}: {str(e)}")
            results.append({
                'file': audio_path.name,
                'provider': directory.split('/')[-2],
                'category': directory.split('/')[-1],
                'mos_score': None
            })

    return results

def initialize_mos_model(model_name='wv-mos'):
    """Initialize a single MOS model with automatic CUDA detection."""
    print("Initializing MOS model...")
    cuda_available = torch.cuda.is_available()
    if cuda_available:
        print("CUDA is available. Using GPU for MOS calculation.")
    else:
        print("CUDA is not available. Using CPU for MOS calculation.")
    if model_name == 'wv-mos':
        return get_wvmos(cuda=cuda_available)
    elif model_name == 'ut-mos':
        import utmosv2  # https://github.com/sarulab-speech/UTMOSv2
        return utmosv2.create_model(pretrained=True)
    else:
        return None

mos_model = initialize_mos_model(model_name='wv-mos')

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()


# Evaluate both providers' WAV files, one category at a time
for category in categories:
    if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
        all_results = []
        if os.path.exists(f'audio_samples/smallest/{category}'):
            results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
            all_results.extend(results)


        # Evaluate ElevenLabs files (converted from MP3 to WAV earlier)
        if os.path.exists(f'audio_samples/eleven/{category}'):
            results = evaluate_directory(f'audio_samples/eleven/{category}', mos_model, 'wav')
            all_results.extend(results)

        if not all_results:
            print("No audio files found in generated directories!")
        else:
            # Create results DataFrame
            results_df = pd.DataFrame(all_results)

            # Calculate summary statistics
            summary = results_df.groupby('provider')['mos_score'].agg([
                'count', 'mean', 'std', 'min', 'max'
            ]).round(3)

            # Save results
            os.makedirs('results', exist_ok=True)
            os.makedirs('results/wvmos', exist_ok=True)
            results_df.to_csv(f'results/wvmos/detailed_mos_scores_{category}.csv', index=False)
            summary.to_csv(f'results/wvmos/mos_summary_{category}.csv')

            # Print summary
            print("\nMOS Score Summary:")
            print(summary)

            # Print top/bottom samples (only include text if available)
            print("\nTop 1 Best Samples:")
            columns_to_show = ['provider', 'file', 'mos_score']
            if 'text' in results_df.columns:
                columns_to_show.insert(2, 'text')
            print(results_df.nlargest(1, 'mos_score')[columns_to_show])

            print("\nBottom 1 Samples:")
            print(results_df.nsmallest(1, 'mos_score')[columns_to_show])


UTMOSv2 is a model that leverages audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest and Elevenlabs using UTMOSv2:

#!GIT_LFS_SKIP_SMUDGE=1 pip install git+https://github.com/sarulab-speech/UTMOSv2.git -> Command to install UTMOSv2

import os
import csv
import pandas as pd
import utmosv2

# Load the pretrained UTMOSv2 model
model = utmosv2.create_model(pretrained=True)

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

# Evaluate both providers' WAV files
for category in categories:
    for provider in ['smallest', 'eleven']:
        os.makedirs('results/utmos', exist_ok=True)
        if not os.path.exists(f'results/utmos/{provider}_{category}.csv'):
            mos = model.predict(input_dir=f"./audio_samples/{provider}/{category}")
            with open(f"results/utmos/{provider}_{category}.csv", mode='w', newline='') as file:
                writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
                writer.writeheader()
                writer.writerows(mos)

            print(f"Data has been written to results/utmos/{provider}_{category}.csv")


Below is the code used to generate a combined report using both MOS scores for each category:

import pandas as pd
import os

# Re-read categories so this snippet is self-contained
tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

# Function to read UTMOS results
def read_utmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['provider'] = file_path.split('/')[-1].split('_')[0]
    df['category'] = category
    return df

# Function to read WVMOS results
def read_wvmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['category'] = category
    return df

# Read UTMOS results
utmos_results = []
for category in categories:
    for provider in ['smallest', 'eleven']:
        file_path = f'results/utmos/{provider}_{category}.csv'
        if os.path.exists(file_path):
            utmos_results.append(read_utmos_results(file_path, category))

utmos_df = pd.concat(utmos_results, ignore_index=True)

# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()
print(utmos_mean_scores)

# Read WVMOS results (the 'mean' column holds the per-category WVMOS mean)
wvmos_results = []
for category in categories:
    file_path = f'results/wvmos/mos_summary_{category}.csv'
    if os.path.exists(file_path):
        wvmos_results.append(read_wvmos_results(file_path, category))

wvmos_df = pd.concat(wvmos_results, ignore_index=True)
print(wvmos_df)

# Merge UTMOS and WVMOS results and average the two libraries' scores,
# as described above
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'])
comparison_df['combined_mos'] = (comparison_df['predicted_mos'] + comparison_df['mean']) / 2
print(comparison_df)

# Create a CSV with mean MOS for Eleven and Smallest for each category
# (pivot columns are sorted alphabetically: 'eleven' before 'smallest')
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='combined_mos')
mean_mos_comparison.columns = ['mean_mos_eleven', 'mean_mos_smallest']
mean_mos_comparison.to_csv('mean_mos_comparison.csv')

# Print the mean MOS comparison
print(mean_mos_comparison)

As seen above, Smallest.ai surpasses Elevenlabs in MOS scores in 17 of 20 categories, while being only marginally worse in the remaining three.

Overall, averaging the category means above, Smallest.ai has an average MOS score of 4.41 whereas Elevenlabs has an average MOS score of 4.05. Moreover, the worst-performing category for Smallest.ai is sentences with date and time, with a MOS score of 3.915, whereas the worst-performing category for Elevenlabs is long sentences, with a MOS score of just 3.374. Elevenlabs clearly performs significantly worse on certain critical categories.
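As a quick sanity check, these overall figures can be reproduced by averaging the per-category means copied verbatim from the table above:

# Per-category MOS means from the table above, in table order
smallest_means = [4.551, 4.308, 3.917, 4.393, 4.286, 4.505, 4.629, 4.345, 4.678, 4.597,
                  4.251, 4.685, 4.527, 4.614, 4.59, 4.59, 4.472, 4.033, 4.408, 3.915]
eleven_means = [4.152, 4.068, 3.374, 3.935, 3.941, 3.854, 4.012, 3.773, 4.116, 4.096,
                3.766, 4.216, 4.088, 4.216, 4.255, 4.303, 4.216, 4.047, 4.536, 4.062]
print(round(sum(smallest_means) / len(smallest_means), 2))  # 4.41
print(round(sum(eleven_means) / len(eleven_means), 2))      # 4.05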

Conclusion

Our benchmark tests demonstrate that Smallest.ai not only surpasses Eleven Labs in latency, providing the faster response times essential for real-time applications, but also consistently achieves higher MOS scores, particularly in categories involving complex, multilingual, and culturally nuanced content. This establishes it as the superior choice for high-quality, diverse TTS solutions.

Even in the categories where Elevenlabs comes out ahead, Smallest.ai lags only marginally; in the categories where Smallest.ai is better, Elevenlabs falls far behind.

This establishes Smallest.ai as state-of-the-art in Text-to-Speech (Mic drop).

We hope this benchmark helps potential buyers make an informed decision when comparing the two platforms, and we plan to publish more detailed reports in the near future.