Wed Jan 22 2025 • 13 min Read
TTS Benchmark 2025: Smallest.ai vs ElevenLabs Report
Smallest.ai vs ElevenLabs: a TTS benchmark evaluating latency and speech quality to help users choose the best fit for their real-time voice synthesis needs.
Pooja Porwal
Head - Growth
Abstract
TTS (Text-to-Speech) is a technology that converts written text into natural-sounding speech. It is widely used in applications like virtual assistants (e.g., Siri, Alexa), customer service automation, audiobooks, accessibility tools for the visually impaired, and in-car navigation systems.
This benchmark report presents a comprehensive open-source comparison between ElevenLabs and Smallest.ai based on two critical performance metrics for real-time Text-to-Speech (TTS) applications: latency and quality. These parameters directly impact the effectiveness of TTS solutions in real-time systems such as customer support bots, voice assistants, and interactive media.
The results show a consistent edge for Smallest.ai in latency and a significant advantage in speech quality (MOS) in the majority of real-world use cases.
Finally, we give references to the open-source code used to benchmark the latency and quality, ensuring complete transparency to anyone who wishes to replicate these benchmarks.
About the Companies
Smallest.ai is an emerging AI startup focused on low-latency, high-quality foundational multi-modal AI models that challenge the status quo. With an optimized API designed for real-time applications, it balances performance and efficiency.
ElevenLabs is a widely adopted speech AI platform known for high-fidelity voice synthesis, offering versatile models that appeal to developers seeking quality and flexibility.
Both platforms are strong contenders in the high-performance TTS space.
Metrics for Benchmarking & Results
To effectively evaluate the TTS capabilities of both platforms, we focus on two key metrics: Latency and MOS (Mean Opinion Score). These metrics are essential for understanding a TTS system’s performance and user experience, especially in real-time applications.
1. Latency
Latency refers to the time it takes for the TTS system to convert input text into speech.
For auto-regressive models that stream audio, latency means the time to first byte (TTFB); for non-autoregressive, single-shot models it means the total inference time.
The biggest nuance is that even when the model's own inference latency is low, network latency can dominate the end-user experience.
A general rule of thumb: the farther data has to travel between user and server, the longer the request takes.
At Smallest.ai, we measure latency by tracking the time it takes for the generated voice to reach the end user, ensuring accurate real-time performance.
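To make the distinction concrete, here is a minimal sketch of how time-to-first-byte and total latency can be measured against a streaming HTTP TTS endpoint. The URL, payload fields, and API key below are placeholders for illustration, not the exact setup used in the benchmarks that follow.
import time
import requests

# Placeholder endpoint and credentials, for illustration only
URL = "https://api.example.com/tts"
payload = {"text": "Hello, world.", "voice_id": "demo"}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

start = time.time()
first_chunk_at = None
audio = b""
# stream=True lets us observe when the first audio bytes arrive,
# instead of waiting for the full response body
with requests.post(URL, json=payload, headers=headers, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=4096):
        if first_chunk_at is None:
            first_chunk_at = time.time()  # time to first byte (streaming latency)
        audio += chunk
end = time.time()

print(f"Time to first byte: {first_chunk_at - start:.3f} s")
print(f"Total latency:      {end - start:.3f} s")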
We compare latencies for the Lightning TTS model by Smallest.ai and Flash v2.5 by ElevenLabs, both of which claim to be among the fastest in the world.
Latency Results
Here, we measure end-to-end inference time, as both models can generate the entire audio in a single API request. Below are the code and test sentence used to generate the latency benchmarks.
1. Smallest.ai:
Average Latency: 340 ms (India)
Average Latency: 336 ms (US)
Code to get average latencies for Smallest.ai:
import time
import requests

## Enter the URL
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

SAMPLE_RATE = 24000  ## Can be changed to 8000, 16000, 48000
VOICE_ID = "emily"  ## List of supported voices can be found here: https://waves-docs.smallest.ai/waves-api
SMALLEST_API = "YOUR_SMALLEST_API_KEY"  ## Enter your API token here

## Edit the payload
payload = {
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "voice_id": VOICE_ID,
    "sample_rate": SAMPLE_RATE,
    "speed": 1.0,
    "add_wav_header": True
}

## Edit the header - enter the token
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json"
}

## Send the request ten times and record the end-to-end latency of each call
latencies_smallest = []
for i in range(10):
    start_time = time.time()
    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        latency = time.time() - start_time
        # Save the returned audio bytes to a .wav file
        with open('waves_lightning_sample_audio_en.wav', 'wb') as wav_file:
            wav_file.write(response.content)
        print("Audio file saved as waves_lightning_sample_audio_en.wav:", latency)
        latencies_smallest.append(latency)
    else:
        print(f"Error occurred with status code {response.status_code}")

print("Average Latency for Smallest:", sum(latencies_smallest) / len(latencies_smallest))
2. ElevenLabs:
Average Latency: 527 ms (India)
Average Latency: 350 ms (US)
Code to get average latencies for ElevenLabs:
import time

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"  # Enter your API key here

# Initialize the ElevenLabs client with the provided API key
client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)

def text_to_speech_file(text: str) -> str:
    """
    Converts text to speech using the ElevenLabs API and saves the audio file.

    Parameters:
        text (str): The text to be converted to speech.

    Returns:
        str: The path of the saved audio file.
    """
    # Call the text-to-speech conversion API with detailed parameters
    response = client.text_to_speech.convert(
        voice_id="pNInz6obpgDQGcFmaJgB",  # Adam pre-made voice
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",  # Use the flash model for low latency
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )

    # Write the streamed audio chunks to a file
    save_file_path = "elevenlabs.mp3"
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    print(f"{save_file_path}: A new audio file was saved successfully!")

    # Return the path of the saved audio file
    return save_file_path

# Initialize a list to store latencies
latencies_eleven_flash_v2_5 = []

# Measure the latency for 10 API requests
for i in range(10):
    start_time = time.time()
    text_to_speech_file("Hello, my name is Emily. I am a text-to-speech voice.")
    latencies_eleven_flash_v2_5.append(time.time() - start_time)

# Print the average latency
print("Average Latency for Flash V2_5 Model:", sum(latencies_eleven_flash_v2_5) / len(latencies_eleven_flash_v2_5))
These results show that Smallest.ai is faster in both geographies: substantially so in India (340 ms vs. 527 ms) and narrowly in the US (336 ms vs. 350 ms).
However, faster latency with degraded audio quality does not help the end user, so we also compare the two models on quality using widely accepted open-source benchmarks.
2. MOS (Mean Opinion Score)
The Mean Opinion Score (MOS) is a widely used metric for evaluating the quality of synthetic speech. It uses a 5-point scale, where 5 is the highest quality and 1 the lowest; for example, listener ratings of 4, 5, 4, and 3 average to a MOS of 4.0.
MOS assesses key aspects such as the naturalness, intelligibility, and expressiveness of the synthesized voice. While subjective evaluation remains common in the industry, it is not the most precise approach.
To make this process more objective, we refined it by incorporating 20 distinct categories for Hindi and English (two of the most widely spoken languages in the world), chosen based on high-demand requirements from enterprise customers. This ensures a more thorough and reliable measure of voice quality.
We used two commonly accepted open-source libraries, WVMOS and UTMOS, averaged the MOS scores from both, and report the results in the table below:
| Category | Mean MOS Smallest | Mean MOS ElevenLabs | Examples |
|---|---|---|---|
| Small sentences | 4.551 | 4.152 | You are very talented. |
| Medium sentences | 4.308 | 4.068 | The beauty of this garden is unique and indescribable. |
| Long sentences | 3.917 | 3.374 | Your writing style is very impactful and thoughtful. Your words have the power to not only inspire the reader to read but also compel them to think deeply. This is a true testament to your writing skill. |
| Hard sentences | 4.393 | 3.935 | Curious koalas curiously climbed curious curious climbers |
| Mahabharat stories | 4.286 | 3.941 | The Mahabharata is a unique epic of Indian culture, a saga of religion, justice, and duty. |
| Time sentences | 4.505 | 3.854 | The meeting is scheduled for October 15th at 11 AM. |
| Number sentences | 4.629 | 4.012 | -4.56E-02 |
| Sentences with medicine | 4.345 | 3.773 | Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment. |
| Mixed languages | 4.678 | 4.116 | तुम्हें नहीं लगता? This isn't right! |
| Punctuation sentences | 4.597 | 4.096 | Don't you think? This isn't right! |
| Sentences with places | 4.251 | 3.766 | अलवर के पुराने किले में discovered chambers that naturally amplify spiritual consciousness through sacred geometry. |
| Places | 4.685 | 4.216 | Thiruvananthapuram |
| English in Hindi sentences | 4.527 | 4.088 | इस साल की revenue target 10 million dollars है। |
| Sentences with names | 4.614 | 4.216 | मोहन discovered the quantum nature of karma. |
| Hindi in English sentences | 4.59 | 4.255 | The meeting went well, lekin kuch points abhi bhi unclear hain. |
| Phonetic Hindi in English sentences | 4.59 | 4.303 | Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi! |
| Sentences with numbers | 4.472 | 4.216 | I got 5.34% interest |
| Acronyms | 4.033 | 4.047 | POTUS |
| Names | 4.408 | 4.536 | Indira |
| Sentences with date and time | 3.915 | 4.062 | ६ जून २०२६ ०७:०० को पहला मानव-मशीन विवाह हुआ। |
Below is the code used to generate audio for these sentences using Smallest.ai and ElevenLabs:
import os

import pandas as pd
import pydub
import requests
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

SMALLEST_API = "YOUR_SMALLEST_API_KEY"          # Smallest.ai API token
ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"  # ElevenLabs API key

# Create directories for storing audio samples
os.makedirs('audio_samples/smallest', exist_ok=True)
os.makedirs('audio_samples/eleven', exist_ok=True)

elevenlabs_client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)

# Read the test CSV, which has two columns:
# 'sentence' (the text to synthesize) and 'category' (its category)
tts_test_df = pd.read_csv('tts_test.csv')

# Function to generate audio using the Smallest.ai API
def generate_audio_smallest(text, filename):
    url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
    headers = {
        "Authorization": f"Bearer {SMALLEST_API}",
        "Content-Type": "application/json"
    }
    payload = {
        "text": text,
        "voice_id": "arnav",
        "sample_rate": 24000,
        "speed": 1.0,
        "add_wav_header": True
    }
    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        with open(filename, 'wb') as wav_file:
            wav_file.write(response.content)
        print(f"Audio file saved as {filename}")
    else:
        print(f"Error occurred with status code {response.status_code}: {response.text}")

# Function to generate audio using the ElevenLabs API
def generate_audio_eleven(text, filename):
    response = elevenlabs_client.text_to_speech.convert(
        voice_id="zT03pEAEi0VHKciJODfn",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",  # use the flash model for low latency
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )
    with open(filename, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    print(f"Audio file saved as {filename}")

    # Convert mp3 to wav - required for the MOS models to work properly
    sound = pydub.AudioSegment.from_mp3(filename)
    wav_filename = filename.replace('.mp3', '.wav')
    sound.export(wav_filename, format="wav")
    print(f"Audio file converted to {wav_filename}")

    # Delete the mp3 file
    os.remove(filename)
    print(f"Deleted the mp3 file {filename}")

# Iterate over each sentence in the dataframe and generate audio
for index, row in tts_test_df.iterrows():
    text = row['sentence']
    category = row['category']
    os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
    os.makedirs(f'audio_samples/eleven/{category}', exist_ok=True)
    smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
    eleven_filename = f"audio_samples/eleven/{category}/sentence_{index}.mp3"
    generate_audio_smallest(text, smallest_filename)
    generate_audio_eleven(text, eleven_filename)
WVMOS is an open-source library that uses a pretrained Wav2Vec2 model to extract features and predict MOS scores. Below is the code used to generate MOS scores using the WVMOS library:
import os
from pathlib import Path

import pandas as pd
import torch
from tqdm import tqdm
from wvmos import get_wvmos

print("Is gpu available: ", torch.cuda.is_available())  # Please make sure CUDA is available

def evaluate_directory(directory, mos_model, extension):
    """Evaluate all audio files in a directory."""
    results = []
    dir_path = Path(directory)
    audio_files = sorted(dir_path.glob(f"*.{extension}"))
    for audio_path in tqdm(audio_files, desc=f"Evaluating {directory}"):
        try:
            score = float(mos_model.calculate_one(str(audio_path)))
        except Exception as e:
            print(f"Error calculating MOS for {audio_path.name}: {str(e)}")
            score = None
        results.append({
            'file': audio_path.name,
            'provider': directory.split('/')[-2],
            'category': directory.split('/')[-1],
            'mos_score': score
        })
    return results

def initialize_mos_model():
    """Initialize the WVMOS model with automatic CUDA detection."""
    print("Initializing MOS model...")
    cuda_available = torch.cuda.is_available()
    if cuda_available:
        print("CUDA is available. Using GPU for MOS calculation.")
    else:
        print("CUDA is not available. Using CPU for MOS calculation.")
    return get_wvmos(cuda=cuda_available)

mos_model = initialize_mos_model()

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

for category in categories:
    if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
        all_results = []
        # Evaluate Smallest.ai WAV files
        if os.path.exists(f'audio_samples/smallest/{category}'):
            results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
            all_results.extend(results)
        # Evaluate ElevenLabs WAV files (converted from mp3 during generation)
        if os.path.exists(f'audio_samples/eleven/{category}'):
            results = evaluate_directory(f'audio_samples/eleven/{category}', mos_model, 'wav')
            all_results.extend(results)
        if not all_results:
            print("No audio files found in generated directories!")
        else:
            # Create results DataFrame
            results_df = pd.DataFrame(all_results)
            # Calculate summary statistics per provider
            summary = results_df.groupby('provider')['mos_score'].agg([
                'count', 'mean', 'std', 'min', 'max'
            ]).round(3)
            # Save results
            os.makedirs('results/wvmos', exist_ok=True)
            results_df.to_csv(f'results/wvmos/detailed_mos_scores_{category}.csv', index=False)
            summary.to_csv(f'results/wvmos/mos_summary_{category}.csv')
            # Print summary
            print("\nMOS Score Summary:")
            print(summary)
            # Print the best and worst samples
            columns_to_show = ['provider', 'file', 'mos_score']
            print("\nTop 1 Best Samples:")
            print(results_df.nlargest(1, 'mos_score')[columns_to_show])
            print("\nBottom 1 Samples:")
            print(results_df.nsmallest(1, 'mos_score')[columns_to_show])
UTMOSv2 is a model that leverages audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest.ai and ElevenLabs using UTMOSv2:
# Install UTMOSv2 first:
#   GIT_LFS_SKIP_SMUDGE=1 pip install git+https://github.com/sarulab-speech/UTMOSv2.git
import csv
import os

import pandas as pd
import utmosv2

# Load the pretrained UTMOSv2 model (the loader name follows the UTMOSv2
# repository README; check the repo if the API has changed)
model = utmosv2.create_model(pretrained=True)

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

os.makedirs('results/utmos', exist_ok=True)

# Evaluate the Smallest.ai and ElevenLabs audio files for each category
for category in categories:
    for provider in ['smallest', 'eleven']:
        out_path = f'results/utmos/{provider}_{category}.csv'
        if not os.path.exists(out_path):
            # Predict MOS for every audio file in the category directory
            mos = model.predict(input_dir=f"./audio_samples/{provider}/{category}")
            with open(out_path, mode='w', newline='') as file:
                writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
                writer.writeheader()
                writer.writerows(mos)
            print(f"Data has been written to {out_path}")
Below is the code used to generate a combined report using both MOS scores for each category:
import os

import pandas as pd

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

# Function to read UTMOS results
def read_utmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['provider'] = file_path.split('/')[-1].split('_')[0]
    df['category'] = category
    return df

# Function to read WVMOS results
def read_wvmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['category'] = category
    return df

# Read UTMOS results
utmos_results = []
for category in categories:
    for provider in ['smallest', 'eleven']:
        file_path = f'results/utmos/{provider}_{category}.csv'
        if os.path.exists(file_path):
            utmos_results.append(read_utmos_results(file_path, category))
utmos_df = pd.concat(utmos_results, ignore_index=True)

# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()
print(utmos_mean_scores)

# Read WVMOS results
wvmos_results = []
for category in categories:
    file_path = f'results/wvmos/mos_summary_{category}.csv'
    if os.path.exists(file_path):
        wvmos_results.append(read_wvmos_results(file_path, category))
wvmos_df = pd.concat(wvmos_results, ignore_index=True)
print(wvmos_df)

# Merge UTMOS and WVMOS results
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'], suffixes=('_utmos', '_wvmos'))
print(comparison_df)

# Create a CSV with the mean MOS for ElevenLabs and Smallest.ai per category
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='mean')
mean_mos_comparison.columns = ['mean_mos_eleven', 'mean_mos_smallest']
mean_mos_comparison.to_csv('mean_mos_comparison.csv')

# Print the mean MOS comparison
print(mean_mos_comparison)
As seen above, Smallest.ai surpasses ElevenLabs in MOS score in 17 of 20 categories, while being only marginally behind in the remaining three.
Overall, averaging the per-category scores in the table above, Smallest.ai has a mean MOS of 4.41, whereas ElevenLabs has a mean MOS of 4.05. The worst-performing category for Smallest.ai is sentences with date and time, with a MOS of 3.915; the worst-performing category for ElevenLabs is long sentences, with a MOS of just 3.374. In other words, ElevenLabs degrades significantly on certain critical categories.
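As a sanity check, these overall figures can be recomputed from the mean_mos_comparison.csv file written by the combined-report script; here is a minimal sketch, assuming that file holds the final per-category averages reported in the table:
import pandas as pd

# Load the per-category mean MOS comparison produced above
df = pd.read_csv('mean_mos_comparison.csv')

# Overall average MOS across categories for each provider
print("Smallest.ai overall MOS:", round(df['mean_mos_smallest'].mean(), 2))
print("ElevenLabs overall MOS: ", round(df['mean_mos_eleven'].mean(), 2))

# Worst-performing category for each provider
print("Smallest.ai worst category:", df.loc[df['mean_mos_smallest'].idxmin(), 'category'])
print("ElevenLabs worst category: ", df.loc[df['mean_mos_eleven'].idxmin(), 'category'])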
Conclusion
Our benchmark tests demonstrate that Smallest.ai not only surpasses ElevenLabs in latency, providing the faster response times essential for real-time applications, but also consistently achieves higher MOS scores, particularly in categories involving complex, multilingual, and culturally nuanced content, establishing itself as a superior choice for high-quality, diverse TTS solutions.
Even in the categories where ElevenLabs comes out ahead, Smallest.ai lags only marginally; in the categories where Smallest.ai is better, ElevenLabs often trails far behind.
This establishes Smallest.ai as state-of-the-art in Text-to-Speech (Mic drop).
We hope this benchmark helps potential buyers make an informed decision when comparing the two companies, and we plan to publish more detailed reports in the near future.