TTS Benchmark 2025: Smallest.ai vs ElevenLabs Report
Smallest.ai vs. ElevenLabs: a TTS benchmark evaluating latency and speech quality to help users choose the best fit for their real-time voice synthesis needs.

Akshat Mandloi
Updated on December 26, 2025 at 11:37 AM
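Code to get average latencies for Smallest.ai:

```python
import time
from datetime import datetime

import requests

SMALLEST_API = "YOUR_SMALLEST_API_KEY"  # placeholder: replace with your Smallest.ai API key

# Enter the URL
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
SAMPLE_RATE = 24000  # Can be changed to 8000, 16000, 48000
VOICE_ID = "emily"  # List of supported voices: https://waves-docs.smallest.ai/waves-api

# Edit the payload
payload = {
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "voice_id": VOICE_ID,
    "sample_rate": SAMPLE_RATE,
    "speed": 1.0,
    "add_wav_header": True,
}

# Edit the header - enter your token
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json",
}

# Send the request 10 times and record each round-trip latency
print(f"Sending the test {datetime.now()}")
latencies_smallest = []
for i in range(10):
    start_time = time.perf_counter()
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    latencies_smallest.append(time.perf_counter() - start_time)

# Save the audio returned by the last request (hypothetical output path)
with open("smallest_output.wav", "wb") as f:
    f.write(response.content)

# Print the average latency in milliseconds
print("Average Latency for Smallest (ms): ", 1000 * sum(latencies_smallest) / len(latencies_smallest))
```

2. ElevenLabs:
Average Latency: 527 ms (India)
Average Latency: 350 ms (US)

Code to get average latencies for ElevenLabs:

```python
import time

from elevenlabs.client import ElevenLabs

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder: replace with your key

# Initialize the ElevenLabs client with the provided API key
client = ElevenLabs(api_key=ELEVENLABS_API_KEY)


def text_to_speech_file(text: str, save_audio: bool = False) -> str:
    """Converts text to speech with the ElevenLabs API and optionally saves the audio file."""
    # Assumption: same voice and output format as the quality benchmark below
    response = client.text_to_speech.convert(
        voice_id="zT03pEAEi0VHKciJODfn",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",
    )
    # convert() yields audio chunks lazily; consume them all so the whole synthesis is timed
    audio = b"".join(chunk for chunk in response if chunk)
    file_path = "elevenlabs_output.mp3" if save_audio else ""  # hypothetical output path
    if save_audio:
        with open(file_path, "wb") as f:
            f.write(audio)
    return file_path


# Measure the latency for 10 API requests
latencies_eleven_flash_v2_5 = []
for i in range(10):
    start_time = time.perf_counter()
    text_to_speech_file("Hello, my name is Emily. I am a text-to-speech voice.")
    latencies_eleven_flash_v2_5.append(time.perf_counter() - start_time)

# Print the average latency in milliseconds
print("Average Latency for Flash V2_5 Model (ms): ", 1000 * sum(latencies_eleven_flash_v2_5) / len(latencies_eleven_flash_v2_5))
```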
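Both latency scripts above follow the same pattern: send ten sequential requests, time each full round trip, and average the results. The sketch below factors that pattern into a provider-agnostic helper. `measure_latency` is a hypothetical name, not part of either SDK, and the demo times `time.sleep` instead of a real API call so it runs offline. Besides the mean, it reports the median and maximum, since a single slow request can noticeably shift a ten-sample mean.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 10) -> dict[str, float]:
    """Time `runs` sequential calls and summarize round-trip latency in milliseconds.

    `call` is any zero-argument function that performs one complete TTS request,
    e.g. a lambda wrapping requests.post(...) or text_to_speech_file(...).
    """
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    # Stand-in for a real API call so the sketch runs without network access
    summary = measure_latency(lambda: time.sleep(0.05), runs=5)
    print({name: round(value, 1) for name, value in summary.items()})
```

With a real provider, the call would be wrapped in a lambda, for example `measure_latency(lambda: text_to_speech_file("Hello"))`.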
These measurements indicate that Smallest.ai has lower latency than ElevenLabs in both the India and US regions. However, lower latency is of little value to the end user if it comes at the cost of audio quality, so we also compare both models on speech quality using widely accepted open-source quality metrics.

2. MOS (Mean Opinion Score) Model

The Mean Opinion Score (MOS) is a widely used metric for evaluating the quality of synthetic speech. It uses a 5-point scale, where 5 is the highest quality and 1 the lowest. MOS assesses key aspects of the synthesized voice such as naturalness, intelligibility, and expressiveness. While subjective (human-rated) evaluation remains common in the industry, it is not the most precise approach.

To make the evaluation more objective and thorough, we test 20 distinct categories of Hindi and English sentences (two widely spoken languages), chosen based on demand from enterprise customers. We score the generated audio with two widely adopted open-source MOS predictors, WVMOS and UTMOS, average the scores from both, and report them in the table below:

| Category | Mean MOS (Smallest) | Mean MOS (ElevenLabs) | Examples (full list available here) |
|---|---|---|---|
| Small sentences | 4.551 | 4.152 | You are very talented. |
| Medium sentences | 4.308 | 4.068 | The beauty of this garden is unique and indescribable. |
| Long sentences | 3.917 | 3.374 | Your writing style is very impactful and thoughtful. Your words have the power to not only inspire the reader to read but also compel them to think deeply. This is a true testament to your writing skill. |
| Hard sentences | 4.393 | 3.935 | Curious koalas curiously climbed curious curious climbers |
| Mahabharat stories | 4.286 | 3.941 | The Mahabharata is a unique epic of Indian culture, a saga of religion, justice, and duty. |
| Time sentences | 4.505 | 3.854 | The meeting is scheduled for October 15th at 11 AM. |
| Number sentences | 4.629 | 4.012 | -4.56E-02 |
| Sentences with medicine | 4.345 | 3.773 | Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment. (English: "The effectiveness of Amoxicillin increased tenfold in combination with tulsi, creating a new paradigm in antibiotic treatment.") |
| Mix Languages | 4.678 | 4.116 | तुम्हें नहीं लगता? This isn't right! (English: "Don't you think? This isn't right!") |
| Punctuation sentences | 4.597 | 4.096 | Don't you think? This isn't right! |
| Sentences with places | 4.251 | 3.766 | अलवर के पुराने किले में discovered chambers that naturally amplify spiritual consciousness through sacred geometry. (English: "In the old fort of Alwar, discovered chambers that naturally amplify spiritual consciousness through sacred geometry.") |
| Places | 4.685 | 4.216 | Thiruvananthapuram |
| English in Hindi sentences | 4.527 | 4.088 | इस साल की revenue target 10 million dollars है। (English: "This year's revenue target is 10 million dollars.") |
| Sentences with names | 4.614 | 4.216 | मोहन discovered the quantum nature of karma. (English: "Mohan discovered the quantum nature of karma.") |
| Hindi in English sentences | 4.59 | 4.255 | The meeting went well, lekin kuch points abhi bhi unclear hain. (English: "The meeting went well, but some points are still unclear.") |
| Phonetic Hindi in English sentences | 4.59 | 4.303 | Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi! (English: "You know, I just found a funny video, take a look!") |
| Sentence with numbers | 4.472 | 4.216 | I got 5.34% interest |
| Acronyms | 4.033 | 4.047 | POTUS |
| Names | 4.408 | 4.536 | Indira |
| Sentence with date and time | 3.915 | 4.062 | ६ जून २०२६ ०७:०० को पहला मानव-मशीन विवाह हुआ। (English: "The first human-machine marriage took place on 6 June 2026 at 07:00.") |

Below is the code used to generate audio for these sentences with Smallest and ElevenLabs:

```python
import os

import pandas as pd
import requests
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs
from tqdm import tqdm

SMALLEST_API = "YOUR_SMALLEST_API_KEY"  # placeholder: replace with your key
ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder: replace with your key

# Create directories for storing audio samples
os.makedirs('audio_samples/smallest', exist_ok=True)
os.makedirs('audio_samples/eleven', exist_ok=True)

elevenlabs_client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)

# Read the test CSV, which has the following structure:
# --|---------------------|-|-----------------------|--
# --| sentence            |-| category              |--
# --|---------------------|-|-----------------------|--
# --| Sentence text.      |-| Category of sentences |--
# --|---------------------|-|-----------------------|--
tts_test_df = pd.read_csv('tts_test.csv')


# Function to generate audio using the Smallest API
def generate_audio_smallest(text, filename):
    url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
    # Edit the header - enter your token
    headers = {
        "Authorization": f"Bearer {SMALLEST_API}",
        "Content-Type": "application/json",
    }
    payload = {
        "text": text,
        "voice_id": "arnav",
        "sample_rate": 24000,
        "speed": 1.0,
        "add_wav_header": True,
    }
    response = requests.post(url, json=payload, headers=headers, timeout=60)
    if response.status_code == 200:
        with open(filename, 'wb') as wav_file:
            wav_file.write(response.content)
        print(f"Audio file saved as {filename}")
    else:
        print(f"Error occurred with status code {response.status_code}: {response.text}")


# Function to generate audio using the ElevenLabs API
def generate_audio_eleven(text, filename):
    response = elevenlabs_client.text_to_speech.convert(
        voice_id="zT03pEAEi0VHKciJODfn",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",  # Flash v2.5, ElevenLabs' low-latency model
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )
    # The response is an iterator of audio chunks; write them out as they arrive
    with open(filename, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    print(f"Audio file saved as {filename}")


# Iterate over each sentence in the dataframe and generate audio with both providers
for index, row in tqdm(tts_test_df.iterrows(), total=len(tts_test_df)):
    text = row['sentence']
    category = row['category']
    os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
    os.makedirs(f'audio_samples/eleven/{category}', exist_ok=True)
    smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
    eleven_filename = f"audio_samples/eleven/{category}/sentence_{index}.mp3"
    generate_audio_smallest(text, smallest_filename)
    generate_audio_eleven(text, eleven_filename)
```
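Because the Smallest requests set `add_wav_header` to `True` and `sample_rate` to 24000, each saved `.wav` file should be a standard WAV container at 24 kHz. A quick sanity check before scoring can be sketched with Python's standard-library `wave` module. `check_wav` is a hypothetical helper, and the demo writes its own one-second silent file instead of calling the API.

```python
from __future__ import annotations

import tempfile
import wave
from pathlib import Path


def check_wav(path: str | Path, expected_rate: int = 24000) -> float:
    """Open a WAV file, confirm its sample rate, and return its duration in seconds.

    Raises wave.Error (or EOFError for an empty file) if the file is not a readable
    WAV file, and ValueError if its sample rate differs from `expected_rate`.
    """
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        if rate != expected_rate:
            raise ValueError(f"{path}: expected {expected_rate} Hz, got {rate} Hz")
        return wav.getnframes() / rate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Write one second of 16-bit mono silence as a stand-in for a downloaded file
        demo = Path(tmp) / "demo.wav"
        with wave.open(str(demo), "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(24000)
            out.writeframes(b"\x00\x00" * 24000)
        print(f"duration: {check_wav(demo):.2f} s")  # prints "duration: 1.00 s"
```

Running `check_wav` over `Path("audio_samples/smallest").rglob("*.wav")` would flag any response that came back without a WAV header or at a different sample rate.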
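Code used to compute the MOS scores:

```python
import logging
import os
from pathlib import Path

import pandas as pd
import torch
from tqdm import tqdm
from wvmos import get_wvmos

print("Is gpu available: ", torch.cuda.is_available())  # Please make sure CUDA is available


def evaluate_directory(directory, mos_model, extension):
    """Evaluate all audio files in a directory and return one score per file."""
    results = []
    dir_path = Path(directory)
    audio_files = sorted(dir_path.glob(f"*.{extension}"))
    for audio_file in tqdm(audio_files, desc=str(dir_path)):
        try:
            if hasattr(mos_model, "calculate_one"):  # WVMOS model
                score = mos_model.calculate_one(str(audio_file))
            else:  # UTMOSv2 model created with utmosv2.create_model
                score = mos_model.predict(input_path=str(audio_file))
            results.append({"file": audio_file.name, "mos": float(score)})
        except Exception as exc:
            logging.warning("Failed to score %s: %s", audio_file, exc)
    return results


def initialize_mos_model(model_name='wv-mos'):
    """Initialize a single MOS model with automatic CUDA detection."""
    print("Initializing MOS model...")
    cuda_available = torch.cuda.is_available()
    if cuda_available:
        print("CUDA is available. Using GPU for MOS calculation.")
    else:
        print("CUDA is not available. Using CPU for MOS calculation.")
    if model_name == 'wv-mos':
        return get_wvmos(cuda=cuda_available)
    elif model_name == 'ut-mos':
        import utmosv2  # imported lazily; only needed for UTMOS
        return utmosv2.create_model(pretrained=True)
    else:
        return None


mos_model = initialize_mos_model(model_name='wv-mos')
tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()
os.makedirs('results/wvmos', exist_ok=True)

# Evaluate Smallest.ai WAV files, skipping categories that already have a summary
for category in categories:
    summary_path = f'results/wvmos/mos_summary_{category}.csv'
    if not os.path.exists(summary_path):
        all_results = []
        if os.path.exists(f'audio_samples/smallest/{category}'):
            results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
            all_results.extend(results)
        # Assumption: per-file scores are saved to the summary path checked above
        pd.DataFrame(all_results).to_csv(summary_path, index=False)
```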
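The table reports a single score per category, obtained by averaging the WVMOS and UTMOS predictions. Assuming each predictor gives one score per audio file, that aggregation can be sketched as follows: average the two scores for each file, then average within each category (when both predictors score the same set of files, this equals averaging each predictor's category mean). The record layout with `category`, `file`, `wvmos`, and `utmos` keys is a hypothetical format, not the one the original scripts write, and the demo values are illustrative, not benchmark results.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean


def combined_mos_by_category(rows: list[dict]) -> dict[str, float]:
    """Average WVMOS and UTMOS per file, then average those values within each category.

    Hypothetical row format: {"category": str, "file": str, "wvmos": float, "utmos": float}.
    """
    per_category: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        per_category[row["category"]].append((row["wvmos"] + row["utmos"]) / 2)
    return {category: mean(scores) for category, scores in per_category.items()}


if __name__ == "__main__":
    # Illustrative numbers only, not results from this benchmark
    demo = [
        {"category": "Places", "file": "sentence_0.wav", "wvmos": 4.6, "utmos": 4.2},
        {"category": "Places", "file": "sentence_1.wav", "wvmos": 4.4, "utmos": 4.0},
        {"category": "Acronyms", "file": "sentence_2.wav", "wvmos": 3.8, "utmos": 4.0},
    ]
    print({name: round(score, 3) for name, score in combined_mos_by_category(demo).items()})
```

Averaging per file first keeps each sentence's two predictions paired, which makes it easy to inspect files where WVMOS and UTMOS disagree.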