TTS Benchmark 2025: Smallest.ai vs ElevenLabs Report

Smallest.ai vs ElevenLabs: a TTS benchmark evaluating latency and speech quality to help users choose the best fit for their real-time voice synthesis needs.

Akshat Mandloi

Updated on

December 18, 2025 at 12:50 PM

import time
from datetime import datetime

import requests

# Enter the URL
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
SAMPLE_RATE = 24000  # Can be changed to 8000, 16000, 48000
VOICE_ID = "emily"  # List of supported voices can be found here: https://waves-docs.smallest.ai/waves-api

# Edit the payload
payload = {
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "voice_id": VOICE_ID,
    "sample_rate": SAMPLE_RATE,
    "speed": 1.0,
    "add_wav_header": True,
}

# Edit the header - enter your token (SMALLEST_API must hold your Smallest.ai API token)
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json",
}

# Send the request and save the audio
print(f"Sending the test {datetime.now()}")
latencies_smallest = []
for i in range(10):
    start_time = time.time()
    response = requests.request("POST", url, json=payload, headers=headers)

    if response.status_code == 200:
        # Save the audio bytes to a .wav file
        with open('waves_lightning_sample_audio_en.wav', 'wb') as wav_file:
            wav_file.write(response.content)

        # Measure once so the printed and stored latencies agree
        latency = time.time() - start_time
        print("Audio file saved as waves_lightning_sample_audio_en.wav: ", latency)
        latencies_smallest.append(latency)
    else:
        print(f"Error occurred with status code {response.status_code}")
# Average over the requests that succeeded
if latencies_smallest:
    print("Average Latency for Smallest: ", sum(latencies_smallest) / len(latencies_smallest))

2. ElevenLabs:

Average Latency: 527 ms (India)
Average Latency: 350 ms (US)

Code to get average latencies for ElevenLabs:

import time

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

# Initialize the ElevenLabs client with the provided API key
client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)


def text_to_speech_file(text: str, save_audio: bool = False) -> str:
    """
    Converts text to speech using the ElevenLabs API and saves the audio file.

    Parameters:
        text (str): The text to be converted to speech.
        save_audio (bool): Flag to save the audio file. Default is False.
            Currently unused: the audio is always written to disk.

    Returns:
        str: The path of the saved audio file.
    """
    # Call the text-to-speech conversion API with detailed parameters
    response = client.text_to_speech.convert(
        voice_id="pNInz6obpgDQGcFmaJgB",  # Adam pre-made voice
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",  # Use the Flash v2.5 model for low latency
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )

    # Fixed name for the output MP3 file (overwritten on every call)
    save_file_path = "elevenlabs.mp3"

    # Write the streamed audio chunks to the file
    with open(save_file_path, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)

    print(f"{save_file_path}: A new audio file was saved successfully!")

    # Return the path of the saved audio file
    return save_file_path


# Initialize a list to store latencies
latencies_eleven_flash_v2_5 = []

# Measure the latency for 10 API requests
for i in range(10):
    start_time = time.time()
    text_to_speech_file("Hello, my name is Emily. I am a text-to-speech voice.")
    latencies_eleven_flash_v2_5.append(time.time() - start_time)

# Print the average latency
print("Average Latency for Flash V2_5 Model: ", sum(latencies_eleven_flash_v2_5) / len(latencies_eleven_flash_v2_5))
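A mean over ten requests can hide outliers, for example a slow first request that pays for connection setup. As a minimal sketch (not part of the original benchmark), the helper below times any callable with `time.perf_counter` and reports the mean, median, and maximum in milliseconds. The names `measure_latency_ms` and `summarize` and the stand-in workload are hypothetical; in practice the callable would wrap a single TTS request like the ones above.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(call: Callable[[], object], runs: int = 10) -> List[float]:
    """Time `call` `runs` times and return each wall-clock latency in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        call()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the mean, median, and maximum of a non-empty list of latencies."""
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a function that sends one TTS request.
    samples = measure_latency_ms(lambda: time.sleep(0.01), runs=5)
    print(summarize(samples))
```

Reporting the median alongside the mean makes it easier to tell whether one slow request is skewing the average.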
This shows that Smallest.ai is faster in terms of latency in both the India and US geographies. However, lower latency does not help the end user if it comes with degraded audio quality. Hence, we also compare both models on quality, using widely accepted open-source quality benchmarks.

2. MOS (Mean Opinion Score) Model

The Mean Opinion Score (MOS) is a widely used metric to evaluate the quality of synthetic speech. It uses a 5-point scale, where 5 is the highest quality and 1 is the lowest.

MOS assesses key aspects such as naturalness, intelligibility, and expressiveness of the synthesized voice. While subjective evaluation remains common in the industry, it is not the most precise approach.

To make this process more objective, we’ve refined it by incorporating 20 distinct categories for Hindi and English (two widely spoken languages), chosen based on demand from enterprise customers. This ensures a more thorough and reliable measure of voice quality.

We used two commonly accepted open-source libraries, WVMOS and UTMOS, averaged the MOS scores from both, and report the results in the table below:

| Category | Mean MOS Smallest | Mean MOS ElevenLabs | Example (full list available here) |
|---|---|---|---|
| Small sentences | 4.551 | 4.152 | You are very talented. |
| Medium sentences | 4.308 | 4.068 | The beauty of this garden is unique and indescribable. |
| Long sentences | 3.917 | 3.374 | Your writing style is very impactful and thoughtful. Your words have the power to not only inspire the reader to read but also compel them to think deeply. This is a true testament to your writing skill. |
| Hard sentences | 4.393 | 3.935 | Curious koalas curiously climbed curious curious climbers |
| Mahabharat stories | 4.286 | 3.941 | The Mahabharata is a unique epic of Indian culture, a saga of religion, justice, and duty. |
| Time sentences | 4.505 | 3.854 | The meeting is scheduled for October 15th at 11 AM. |
| Number sentences | 4.629 | 4.012 | -4.56E-02 |
| Sentences with medicine | 4.345 | 3.773 | Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment. |
| Mix Languages | 4.678 | 4.116 | तुम्हें नहीं लगता? This isn't right! |
| Punctuation sentences | 4.597 | 4.096 | Don't you think? This isn't right! |
| Sentences with places | 4.251 | 3.766 | अलवर के पुराने किले में discovered chambers that naturally amplify spiritual consciousness through sacred geometry. |
| Places | 4.685 | 4.216 | Thiruvananthapuram |
| English in Hindi sentences | 4.527 | 4.088 | इस साल की revenue target 10 million dollars है। |
| Sentences with names | 4.614 | 4.216 | मोहन discovered the quantum nature of karma. |
| Hindi in English sentences | 4.59 | 4.255 | The meeting went well, lekin kuch points abhi bhi unclear hain. |
| Phonetic Hindi in English sentences | 4.59 | 4.303 | Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi! |
| Sentence with numbers | 4.472 | 4.216 | I got 5.34% interest |
| Acronyms | 4.033 | 4.047 | POTUS |
| Names | 4.408 | 4.536 | Indira |
| Sentence with date and time | 3.915 | 4.062 | ६ जून २०२६ ०७:०० को पहला मानव-मशीन विवाह हुआ। |

Below is the code used to generate audio for these sentences using Smallest and ElevenLabs:

import pandas as pd
import os
import pydub
import requests
from elevenlabs.client import ElevenLabs
from elevenlabs import VoiceSettings
from tqdm import tqdm

# Create directories for storing audio samples
os.makedirs('audio_samples/smallest', exist_ok=True)
os.makedirs('audio_samples/eleven', exist_ok=True)

elevenlabs_client = ElevenLabs(
    api_key=ELEVENLABS_API_KEY,
)

# Read the test CSV, which has the following structure:
# | sentence       | category              |
# |----------------|-----------------------|
# | Sentence text. | Category of sentences |
tts_test_df = pd.read_csv('tts_test.csv')


# Function to generate audio using the Smallest API
def generate_audio_smallest(text, filename):
    url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
    # Edit the header - enter your token
    headers = {
        "Authorization": f"Bearer {SMALLEST_API}",
        "Content-Type": "application/json",
    }
    payload = {
        "text": text,
        "voice_id": "arnav",
        "sample_rate": 24000,
        "speed": 1.0,
        "add_wav_header": True,
    }
    response = requests.request("POST", url, json=payload, headers=headers)
    if response.status_code == 200:
        with open(filename, 'wb') as wav_file:
            wav_file.write(response.content)
        print(f"Audio file saved as {filename}")
    else:
        print(f"Error occurred with status code {response.status_code}: {response.text}")


# Function to generate audio using the ElevenLabs API
def generate_audio_eleven(text, filename):
    response = elevenlabs_client.text_to_speech.convert(
        voice_id="zT03pEAEi0VHKciJODfn",
        output_format="mp3_22050_32",
        text=text,
        model_id="eleven_flash_v2_5",  # Use the Flash v2.5 model for low latency
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
        ),
    )
    with open(filename, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)
    print(f"Audio file saved as {filename}")

    # Convert mp3 to wav so that the MOS models can read the file
    sound = pydub.AudioSegment.from_mp3(filename)
    wav_filename = filename.replace('.mp3', '.wav')
    sound.export(wav_filename, format="wav")
    print(f"Audio file converted to {wav_filename}")

    # Delete the mp3 file
    os.remove(filename)
    print(f"Deleted the mp3 file {filename}")


# Iterate over each sentence in the dataframe and generate audio
for index, row in tts_test_df.iterrows():
    print(row)
    text = row['sentence']
    category = row['category']
    os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
    os.makedirs(f'audio_samples/eleven/{category}', exist_ok=True)
    smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
    eleven_filename = f"audio_samples/eleven/{category}/sentence_{index}.mp3"

    generate_audio_smallest(text, smallest_filename)
    generate_audio_eleven(text, eleven_filename)

WV MOS (https://github.com/AndreevP/wvmos) is an open-source library that uses a pretrained Wav2Vec2 model to extract features and predict MOS scores. Below is the code used to generate MOS scores using the WV MOS library:

import os
from pathlib import Path

import pandas as pd
import torch
from tqdm import tqdm
from wvmos import get_wvmos

print("Is gpu available: ", torch.cuda.is_available())  # Please make sure CUDA is available


def evaluate_directory(directory, mos_model, extension):
    """Evaluate all audio files in a directory."""
    results = []
    dir_path = Path(directory)
    audio_files = sorted(dir_path.glob(f"*.{extension}"))

    # For WAV files, calculate directly
    for audio_path in tqdm(audio_files, desc=f"Evaluating {directory}"):
        try:
            score = float(mos_model.calculate_one(str(audio_path)))
        except Exception as e:
            print(f"Error calculating MOS for {audio_path.name}: {str(e)}")
            score = None  # keep the file in the results, without a score
        results.append({
            'file': audio_path.name,
            'provider': directory.split('/')[-2],
            'category': directory.split('/')[-1],
            'mos_score': score,
        })

    return results


def initialize_mos_model(model_name='wv-mos'):
    """Initialize a single MOS model with automatic CUDA detection."""
    print("Initializing MOS model...")
    cuda_available = torch.cuda.is_available()
    if cuda_available:
        print("CUDA is available. Using GPU for MOS calculation.")
    else:
        print("CUDA is not available. Using CPU for MOS calculation.")
    if model_name == 'wv-mos':
        return get_wvmos(cuda=cuda_available)
    elif model_name == 'ut-mos':
        import utmosv2  # imported lazily; only needed for UTMOSv2
        return utmosv2.create_model(pretrained=True)
    else:
        return None


mos_model = initialize_mos_model(model_name='wv-mos')

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

for category in categories:
    # Skip categories that already have a summary file
    if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
        all_results = []

        # Evaluate Smallest.ai WAV files
        if os.path.exists(f'audio_samples/smallest/{category}'):
            results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
            all_results.extend(results)

        # Evaluate ElevenLabs files (already converted from MP3 to WAV)
        if os.path.exists(f'audio_samples/eleven/{category}'):
            results = evaluate_directory(f'audio_samples/eleven/{category}', mos_model, 'wav')
            all_results.extend(results)

        if not all_results:
            print("No audio files found in generated directories!")
        else:
            # Create results DataFrame
            results_df = pd.DataFrame(all_results)

            # Calculate summary statistics per provider
            summary = results_df.groupby('provider')['mos_score'].agg([
                'count', 'mean', 'std', 'min', 'max'
            ]).round(3)

            # Save results
            os.makedirs('results/wvmos', exist_ok=True)
            results_df.to_csv(f'results/wvmos/detailed_mos_scores_{category}.csv', index=False)
            summary.to_csv(f'results/wvmos/mos_summary_{category}.csv')

            # Print summary
            print("\nMOS Score Summary:")
            print(summary)

            # Print top/bottom samples (only include text if available)
            print("\nTop 1 Best Samples:")
            columns_to_show = ['provider', 'file', 'mos_score']
            if 'text' in results_df.columns:
                columns_to_show.insert(2, 'text')
            print(results_df.nlargest(1, 'mos_score')[columns_to_show])

            print("\nBottom 1 Samples:")
            print(results_df.nsmallest(1, 'mos_score')[columns_to_show])

UTMOSv2 (https://github.com/sarulab-speech/UTMOSv2) is a model that leverages audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest and ElevenLabs using UTMOSv2:

import os
import csv

import pandas as pd
import utmosv2

# Load the pretrained UTMOSv2 model
model = utmosv2.create_model(pretrained=True)

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

os.makedirs('results/utmos', exist_ok=True)

# Evaluate the Smallest.ai and ElevenLabs WAV files for each category
for category in categories:
    for provider in ['smallest', 'eleven']:
        output_path = f'results/utmos/{provider}_{category}.csv'
        # Skip provider/category pairs that were already scored
        if not os.path.exists(output_path):
            mos = model.predict(input_dir=f"./audio_samples/{provider}/{category}")
            with open(output_path, mode='w', newline='') as file:
                writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
                writer.writeheader()
                writer.writerows(mos)

            print(f"Data has been written to {output_path}")

Below is the code used to generate a combined report using both MOS scores for each category:

import os

import pandas as pd


# Function to read UTMOS results
def read_utmos_results(file_path, category):
    df = pd.read_csv(file_path)
    # File names look like "<provider>_<category>.csv"
    df['provider'] = file_path.split('/')[-1].split('_')[0]
    df['category'] = category
    return df


# Function to read WVMOS results
def read_wvmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['category'] = category
    return df


# Read UTMOS results
utmos_results = []
for category in categories:
    for provider in ['smallest', 'eleven']:
        file_path = f'results/utmos/{provider}_{category}.csv'
        if os.path.exists(file_path):
            utmos_results.append(read_utmos_results(file_path, category))

utmos_df = pd.concat(utmos_results, ignore_index=True)

# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()
print(utmos_mean_scores)

# Read WVMOS results
wvmos_results = []
for category in categories:
    file_path = f'results/wvmos/mos_summary_{category}.csv'
    if os.path.exists(file_path):
        wvmos_results.append(read_wvmos_results(file_path, category))

wvmos_df = pd.concat(wvmos_results, ignore_index=True)
print(wvmos_df)

# Merge UTMOS and WVMOS results
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'], suffixes=('_utmos', '_wvmos'))

# Average the UTMOS mean ('predicted_mos') and the WVMOS mean ('mean') for each row
comparison_df['mean_mos'] = comparison_df[['predicted_mos', 'mean']].mean(axis=1)

# Print the comparison
print(comparison_df)

# Create a CSV with mean MOS for Eleven and Smallest for each category
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='mean_mos')
# Columns become mean_mos_eleven and mean_mos_smallest
mean_mos_comparison = mean_mos_comparison.add_prefix('mean_mos_')
mean_mos_comparison.to_csv('mean_mos_comparison.csv')

# Print the mean MOS comparison
print(mean_mos_comparison)

As seen above, Smallest.ai surpasses ElevenLabs in MOS scores in 17 of 20 categories, while being only marginally worse in the remaining three (acronyms, names, and sentences with date and time). Overall, across all categories, Smallest.ai has an average MOS score of 4.14 whereas ElevenLabs has an average MOS score of 3.83. The worst-performing category for Smallest.ai is sentences with date and time, with a MOS score of 3.915, whereas the worst-performing category for ElevenLabs is long sentences, with a MOS score of just 3.374. This shows that ElevenLabs performs significantly worse on certain critical categories.

Conclusion

Our benchmark tests demonstrate that Smallest.ai not only surpasses ElevenLabs in latency, providing the faster response times essential for real-time applications, but also achieves higher MOS scores in most categories, particularly those that involve complex, multilingual, and culturally nuanced content, establishing itself as the superior choice for high-quality, diverse TTS solutions. Even in the categories where ElevenLabs comes out ahead, Smallest.ai is only marginally behind. However, in the categories where Smallest.ai is better, ElevenLabs lags far behind. This establishes Smallest.ai as state-of-the-art in Text-to-Speech (Mic drop).

We hope that this benchmark helps potential buyers make an informed decision when considering the two companies, and we hope to publish more detailed reports in the very near future.
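The category-level claims in the analysis can be checked directly against the results table. The sketch below hard-codes the per-category mean MOS values reported in this post, counts the categories in which each provider scores higher, and finds each provider's weakest category. The names `MOS_TABLE`, `count_wins`, and `worst_category` are illustrative and not part of the benchmark code.

```python
from typing import Dict, Tuple

# (Smallest, ElevenLabs) mean MOS per category, copied from the table in this post.
MOS_TABLE: Dict[str, Tuple[float, float]] = {
    "Small sentences": (4.551, 4.152),
    "Medium sentences": (4.308, 4.068),
    "Long sentences": (3.917, 3.374),
    "Hard sentences": (4.393, 3.935),
    "Mahabharat stories": (4.286, 3.941),
    "Time sentences": (4.505, 3.854),
    "Number sentences": (4.629, 4.012),
    "Sentences with medicine": (4.345, 3.773),
    "Mix Languages": (4.678, 4.116),
    "Punctuation sentences": (4.597, 4.096),
    "Sentences with places": (4.251, 3.766),
    "Places": (4.685, 4.216),
    "English in Hindi sentences": (4.527, 4.088),
    "Sentences with names": (4.614, 4.216),
    "Hindi in English sentences": (4.59, 4.255),
    "Phonetic Hindi in English sentences": (4.59, 4.303),
    "Sentence with numbers": (4.472, 4.216),
    "Acronyms": (4.033, 4.047),
    "Names": (4.408, 4.536),
    "Sentence with date and time": (3.915, 4.062),
}


def count_wins(table: Dict[str, Tuple[float, float]]) -> Tuple[int, int]:
    """Return (categories where Smallest is higher, categories where ElevenLabs is higher)."""
    smallest_wins = sum(1 for s, e in table.values() if s > e)
    eleven_wins = sum(1 for s, e in table.values() if e > s)
    return smallest_wins, eleven_wins


def worst_category(table: Dict[str, Tuple[float, float]], provider_index: int) -> Tuple[str, float]:
    """Return the lowest-scoring category for a provider (0 = Smallest, 1 = ElevenLabs)."""
    name = min(table, key=lambda category: table[category][provider_index])
    return name, table[name][provider_index]


if __name__ == "__main__":
    print(count_wins(MOS_TABLE))         # (17, 3)
    print(worst_category(MOS_TABLE, 0))  # ('Sentence with date and time', 3.915)
    print(worst_category(MOS_TABLE, 1))  # ('Long sentences', 3.374)
```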

Enter the URL

url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
SAMPLE_RATE = 24000 ## Can be changed to 8000, 16000, 48000
VOICE_ID = "emily" ## List of supported voices can be found here: https://waves-docs.smallest.ai/waves-api

Edit the payload

payload = {
"text": "Hello, my name is Emily. I am a text-to-speech voice.",
"voice_id": VOICE_ID,
"sample_rate": SAMPLE_RATE,
"speed": 1.0,
"add_wav_header": True
}

Edit the header - enter Token

headers = {
"Authorization": f"Bearer {SMALLEST_API}",
"Content-Type": "application/json"
}

Send the reponse and save the audio

print(f"Sending the test {datetime.now()}")
latencies_smallest = []
for i in range(10):
start_time = time.time()
response = requests.request("POST", url, json=payload, headers=headers)

if response.status_code == 200:
    # print(f&quot;Saving the audio!! {response.status_code} Latency: {(datetime.now() - start_time).total_seconds()}&quot;)

    # Save the audio in bytes to a .wav file
    with open(&#039;waves_lightning_sample_audio_en.wav&#039;, &#039;wb&#039;) as wav_file:
        wav_file.write(response.content)

    print(&quot;Audio file saved as waves_lightning_sample_audio.wav: &quot;, time.time() - start_time)
    latencies_smallest.append(time.time() - start_time)
else:
    print(f&quot;Error Occured with status code {response.status_code}&quot;)

print("Average Latency for Smallest: ", sum(latencies_smallest) / 10)2. ElevenLabs:Average Latency: 527 ms (India)Average Latency: 350 ms (US)Code to get average latencies for Elevenlabs:import time

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

Initialize the ElevenLabs client with the provided API key

client = ElevenLabs(
api_key=ELEVENLABS_API_KEY,
)

def text_to_speech_file(text: str, save_audio: bool = False) -> str:
"""
Converts text to speech using ElevenLabs API and saves the audio file.

Parameters:
text (str): The text to be converted to speech.
save_audio (bool): Flag to save the audio file. Default is False.

Returns:
str: The path of the saved audio file.
&quot;&quot;&quot;
# Calling the text-to-speech conversion API with detailed parameters
response = client.text_to_speech.convert(
    voice_id=&quot;pNInz6obpgDQGcFmaJgB&quot;,  # Adam pre-made voice
    output_format=&quot;mp3_22050_32&quot;,
    text=text,
    model_id=&quot;eleven_flash_v2_5&quot;,  # Use the turbo model for low latency
    voice_settings=VoiceSettings(
        stability=0.0,
        similarity_boost=1.0,
        style=0.0,
        use_speaker_boost=True,
    ),
)

# Generating a unique file name for the output MP3 file
save_file_path = &quot;elevenlabs.mp3&quot;

# Writing the audio to a file
with open(save_file_path, &quot;wb&quot;) as f:
    for chunk in response:
        if chunk:
            f.write(chunk)

print(f&quot;{save_file_path}: A new audio file was saved successfully!&quot;)

# Return the path of the saved audio file
return save_file_path

Initialize a list to store latencies

latencies_eleven_flash_v2_5 = []

Measure the latency for 10 API requests

for i in range(10):
start_time = time.time()
text_to_speech_file("Hello, my name is Emily. I am a text-to-speech voice.")
latencies_eleven_flash_v2_5.append(time.time() - start_time)

Print the average latency

print("Average Latency for Flash V2_5 Model: ", sum(latencies_eleven_flash_v2_5) / 10)
This proves that Smallest.ai is faster across both India and US geographies in terms of latency. However, simply having faster latencies, whilst having degraded audio quality, does not help the end user. Hence, we also compare both models for quality based on a widely accepted open-source quality benchmark.2. MOS (Mean Opinion Score) ModelThe Mean Opinion Score (MOS) is a widely used metric to evaluate the quality of synthetic speech. It uses a 5-point scale, where 5 being the highest quality and 1 being the lowest.MOS assesses key aspects such as naturalness, intelligibility, and expressiveness of the synthesized voice. While subjective evaluation remains common in the industry, it is not the most precise approach.To make this process more objective, we’ve refined it by incorporating 20 distinct categories for Hindi and English (2 commonly spoken languages across the world), chosen based on high demand requirements by enterprise customers. This ensures a more thorough and reliable measure of voice quality.We have used 2 commonly accepted open source libraries WVMOS and UTMOS and average the MOS scores from both the libraries and report them in the table below:CategoryMean MOS SmallestMean MOS ElevenlabsExamples(Full list available here)Small sentences4.5514.152You are very talented.Medium sentences4.3084.068The beauty of this garden is unique and indescribable.Long sentences3.9173.374Your writing style is very impactful and thoughtful. Your words have the power to not only inspire the reader to read but also compel them to think deeply. 
This is a true testament to your writing skill.Hard sentences4.3933.935Curious koalas curiously climbed curious curious climbersMahabharat stories4.2863.941The Mahabharata is a unique epic of Indian culture, a saga of religion, justice, and duty.Time sentences4.5053.854The meeting is scheduled for October 15th at 11 AM.Number sentences4.6294.012-4.56E-02Sentences with medicine4.3453.773Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment.Mix Languages4.6784.116तुम्हें नहीं लगता? This isn't right!Punctuation sentences4.5974.096Don't you think? This isn't right!Sentences with places4.2513.766अलवर के पुराने किले में discovered chambers that naturally amplify spiritual consciousness through sacred geometry.Places4.6854.216ThiruvananthapuramEnglish in Hindi sentences4.5274.088इस साल की revenue target 10 million dollars है।Sentences with names4.6144.216मोहन discovered the quantum nature of karma.Hindi in English sentences4.594.255The meeting went well, lekin kuch points abhi bhi unclear hain.Phonetic Hindi in English sentences4.594.303Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi!Sentence with numbers4.4724.216I got 5.34% interestAcronyms4.0334.047POTUSnames4.4084.536IndiraSentence with date and time3.9154.062६ जून २०२६ ०७:०० को पहला मानव-मशीन विवाह हुआ।Below is the code used to generate audios for these sentences using Smallest and Elevenlabs:import pandas as pd
import os
import pydub
import requests
from elevenlabs.client import ElevenLabs
from elevenlabs import VoiceSettings
from tqdm import tqdm

Create directories for storing audio samples

os.makedirs('audio_samples/smallest', exist_ok=True)
os.makedirs('audio_samples/eleven', exist_ok=True)

elevenlabs_client = ElevenLabs(
api_key=ELEVENLABS_API_KEY,
)

Read the Test CSV which has the following structure

#--|---------------------|-|-----------------------|--
#--| Sentence |-| Category |--
#--|---------------------|-|-----------------------|--
#--| Sentence text. |-| Category of sentences |--
#--|---------------------|-|-----------------------|--

tts_test_df = pd.read_csv('tts_test.csv')

Function to generate audio using Smallest API

def generate_audio_smallest(text, filename):
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
## Edit the header - enter Token
headers = {
"Authorization": f"Bearer {SMALLEST_API}",
"Content-Type": "application/json"
}
payload = {
"text": text,
"voice_id": "arnav",
"sample_rate": 24000,
"speed": 1.0,
"add_wav_header": True
}
response = requests.request("POST", url, json=payload, headers=headers)
if response.status_code == 200:
with open(filename, 'wb') as wav_file:
wav_file.write(response.content)
print(f"Audio file saved as {filename}")
else:
print(f"Error Occured with status code {response.text}")

Function to generate audio using Elevenlabs API

def generate_audio_eleven(text, filename):
response = elevenlabs_client.text_to_speech.convert(
voice_id="zT03pEAEi0VHKciJODfn",
output_format="mp3_22050_32",
text=text,
model_id="eleven_flash_v2_5", # use the turbo model for low latency
voice_settings=VoiceSettings(
stability=0.0,
similarity_boost=1.0,
style=0.0,
use_speaker_boost=True,
),
)
with open(filename, "wb") as f:
for chunk in response:
if chunk:
f.write(chunk)
print(f"Audio file saved as {filename}")

# Convert mp3 to wav - for mos models to work properly
sound = pydub.AudioSegment.from_mp3(filename)
wav_filename = filename.replace(&#039;.mp3&#039;, &#039;.wav&#039;)
sound.export(wav_filename, format=&quot;wav&quot;)
print(f&quot;Audio file converted to {wav_filename}&quot;)

# Delete the mp3 file
os.remove(filename)
print(f&quot;Deleted the mp3 file {filename}&quot;)

Iterate over each sentence in the dataframe and generate audio

for index, row in tts_test_df.iterrows():
print(row)
text = row['sentence']
category = row['category']
os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
os.makedirs(f'audio_samples/eleven/{category}', exist_ok=True)
smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
eleven_filename = f"audio_samples/eleven/{category}/sentence_{index}.mp3"

generate_audio_smallest(text, smallest_filename)
generate_audio_eleven(text, eleven_filename)</code></pre><p><br /><a href="https://github.com/AndreevP/wvmos">WV MOS</a>  is an open-source library which uses pretrained Wav2Vec2 to extract features and predict the MOS scores. Below is the code used to generate MOS scores using WV MOS library:</p><pre><code class="language-python">

print("Is gpu available: ", torch.cuda.is_available()) # Please make sure cuda is available

from wvmos import get_wvmos
from pathlib import Path
from tqdm import tqdm
import logging
import torch
import os

def evaluate_directory(directory, mos_model, extension):
"""Evaluate all audio files in a directory."""
results = []
dir_path = Path(directory)
audio_files = sorted(dir_path.glob(f"*.{extension}"))

# For WAV files, calculate directly
for audio_path in tqdm(audio_files, desc=f&quot;Evaluating {directory}&quot;):
    try:
        score = float(mos_model.calculate_one(str(audio_path)))
        results.append({
            &#039;file&#039;: audio_path.name,
            &#039;provider&#039;: directory.split(&#039;/&#039;)[-2],
            &#039;category&#039;: directory.split(&#039;/&#039;)[-1],
            &#039;mos_score&#039;: score
        })
    except Exception as e:
        print(f&quot;Error calculating MOS for {audio_path.name}: {str(e)}&quot;)
        results.append({
            &#039;file&#039;: audio_path.name,
            &#039;provider&#039;: directory.split(&#039;/&#039;)[-2],
            &#039;category&#039;: directory.split(&#039;/&#039;)[-1],
            &#039;mos_score&#039;: None
        })

return results

def initialize_mos_model(model_name='wv-mos'):
"""Initialize a single MOS model with automatic CUDA detection."""
print("Initializing MOS model...")
cuda_available = torch.cuda.is_available()
if cuda_available:
print("CUDA is available. Using GPU for MOS calculation.")
else:
print("CUDA is not available. Using CPU for MOS calculation.")
if model_name == 'wv-mos':
return get_wvmos(cuda=cuda_available)
elif model_name == 'ut-mos':
return utmosv2.get_utmos(pretrained=True)
else:
return None

mos_model = initialize_mos_model(model_name='wv-mos')

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

Evaluate Smallest.ai WAV files

for category in categories:
if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
all_results = []
if os.path.exists(f'audio_samples/smallest/{category}'):
results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
all_results.extend(results)

    # Evaluate ElevenLabs MP3 files
    if os.path.exists(f&#039;audio_samples/eleven/{category}&#039;):
        results = evaluate_directory(f&#039;audio_samples/eleven/{category}&#039;, mos_model, &#039;wav&#039;)
        all_results.extend(results)

    if not all_results:
        print(&quot;No audio files found in generated directories!&quot;)
    else:
        # Create results DataFrame
        results_df = pd.DataFrame(all_results)

        # Calculate summary statistics
        summary = results_df.groupby(&#039;provider&#039;)[&#039;mos_score&#039;].agg([
            &#039;count&#039;, &#039;mean&#039;, &#039;std&#039;, &#039;min&#039;, &#039;max&#039;
        ]).round(3)

        # Save results
        os.makedirs(&#039;results&#039;, exist_ok=True)
        os.makedirs(&#039;results/wvmos&#039;, exist_ok=True)
        results_df.to_csv(f&#039;results/wvmos/detailed_mos_scores_{category}.csv&#039;, index=False)
        summary.to_csv(f&#039;results/wvmos/mos_summary_{category}.csv&#039;)

        # Print summary
        print(&quot;\nMOS Score Summary:&quot;)
        print(summary)

        # Print top/bottom samples (only include text if available)
        print(&quot;\nTop 1 Best Samples:&quot;)
        columns_to_show = [&#039;provider&#039;, &#039;file&#039;, &#039;mos_score&#039;]
        if &#039;text&#039; in results_df.columns:
            columns_to_show.insert(2, &#039;text&#039;)
        print(results_df.nlargest(1, &#039;mos_score&#039;)[columns_to_show])

        print(&quot;\nBottom 1 Samples:&quot;)
        print(results_df.nsmallest(1, &#039;mos_score&#039;)[columns_to_show])</code></pre><p><br /><a href="https://github.com/sarulab-speech/UTMOSv2">UtMOSV2</a> is a model which leverages audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest and Elevenlabs using UtMOSV2:</p><pre><code class="language-python">

import os
import csv

import pandas as pd
import utmosv2  # install as described in the UTMOSv2 README

# `model` was not defined in the original snippet; this assumes the pretrained
# UTMOSv2 model, created as shown in the UTMOSv2 README.
model = utmosv2.create_model(pretrained=True)

tts_test_df = pd.read_csv('tts_test.csv')
categories = tts_test_df['category'].unique()

# Evaluate the WAV files of both providers, one category at a time.
# Paths assume a Colab session whose working directory is /content.
os.makedirs('results/utmos', exist_ok=True)

for category in categories:
    for provider in ['smallest', 'eleven']:
        # Skip provider/category pairs that have already been scored
        if not os.path.exists(f'results/utmos/{provider}_{category}.csv'):
            # One dict per audio file, with 'file_path' and 'predicted_mos' keys
            mos = model.predict(input_dir=f"./audio_samples/{provider}/{category}")
            with open(f"/content/results/utmos/{provider}_{category}.csv", mode='w', newline='') as file:
                writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
                writer.writeheader()
                writer.writerows(mos)

        print(f"Data has been written to /content/results/utmos/{provider}_{category}.csv")</code></pre><p><br />Below is the code used to generate a combined report using both MOS scores for each category:</p><pre><code class="language-python">

import os

import pandas as pd

# `categories` is the array of category names loaded from tts_test.csv above.

# Function to read UTMOS results
def read_utmos_results(file_path, category):
    df = pd.read_csv(file_path)
    # File names follow '{provider}_{category}.csv', so the provider is the first token
    df['provider'] = file_path.split('/')[-1].split('_')[0]
    df['category'] = category
    return df

# Function to read WVMOS results
def read_wvmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['category'] = category
    return df

# Read UTMOS results
utmos_results = []
for category in categories:
    for provider in ['smallest', 'eleven']:
        file_path = f'results/utmos/{provider}_{category}.csv'
        if os.path.exists(file_path):
            utmos_results.append(read_utmos_results(file_path, category))

utmos_df = pd.concat(utmos_results, ignore_index=True)

# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()
print(utmos_mean_scores)

# Read WVMOS results
wvmos_results = []
for category in categories:
    file_path = f'results/wvmos/mos_summary_{category}.csv'
    if os.path.exists(file_path):
        wvmos_results.append(read_wvmos_results(file_path, category))

wvmos_df = pd.concat(wvmos_results, ignore_index=True)
print(wvmos_df)

# Merge UTMOS and WVMOS results
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'], suffixes=('_utmos', '_wvmos'))

# Print the comparison
print(comparison_df)

# Create a CSV with mean MOS for Eleven and Smallest for each category.
# 'mean' is the WVMOS summary column; pivot sorts the provider columns
# alphabetically, so 'eleven' comes before 'smallest'.
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='mean')
mean_mos_comparison.columns = ['mean_mos_eleven', 'mean_mos_smallest']
mean_mos_comparison.to_csv('mean_mos_comparison.csv')

# Print the mean MOS comparison
print(mean_mos_comparison)</code></pre>
<p><br />As seen above, Smallest.ai surpasses ElevenLabs in MOS score in 17 of 20 categories and is only marginally worse in the remaining three. Averaged across all categories, Smallest.ai scores 4.14 and ElevenLabs scores 3.83. The worst-performing category for Smallest.ai is sentences with date-time, with a MOS score of 3.915, whereas the worst-performing category for ElevenLabs is long sentences, with a MOS score of just 3.374. ElevenLabs therefore scores considerably lower in certain critical categories.</p>
<p>Conclusion</p>
<p>Our benchmark tests show that Smallest.ai surpasses ElevenLabs on latency, providing the faster response times that real-time applications need. It also consistently achieves higher MOS scores, particularly in categories that involve complex, multilingual, and culturally nuanced content, establishing itself as the superior choice for high-quality, diverse TTS solutions. Even in the categories where ElevenLabs comes out ahead, Smallest.ai trails only marginally, whereas in the categories where Smallest.ai is better, ElevenLabs lags far behind. This establishes Smallest.ai as state-of-the-art in text-to-speech (mic drop).</p>
<p>We hope this benchmark helps potential buyers make an informed decision when comparing the two companies, and we hope to publish more detailed reports in the very near future.</p>
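For readers who want to re-derive the headline figures quoted in the results (categories won, overall averages, worst category per provider), the sketch below shows one way to compute them from the rows of `mean_mos_comparison.csv`. It uses only the standard library. The helper names `summarize_comparison` and `load_rows` are hypothetical, and the three example rows are made-up illustrative values, not benchmark results.

```python
import csv
from statistics import mean


def summarize_comparison(rows):
    """Summarize per-category mean MOS rows.

    Each row is a dict with the keys 'category', 'mean_mos_eleven' and
    'mean_mos_smallest', i.e. the columns written to mean_mos_comparison.csv.
    """
    scores = [
        (r["category"], float(r["mean_mos_eleven"]), float(r["mean_mos_smallest"]))
        for r in rows
    ]
    # A category counts as a win for Smallest.ai when its mean MOS is strictly higher
    smallest_wins = sum(1 for _, eleven, smallest in scores if smallest > eleven)
    worst_eleven = min(scores, key=lambda s: s[1])
    worst_smallest = min(scores, key=lambda s: s[2])
    return {
        "categories": len(scores),
        "smallest_wins": smallest_wins,
        # Unweighted average of the per-category means
        "overall_eleven": round(mean(s[1] for s in scores), 3),
        "overall_smallest": round(mean(s[2] for s in scores), 3),
        "worst_eleven": (worst_eleven[0], worst_eleven[1]),
        "worst_smallest": (worst_smallest[0], worst_smallest[2]),
    }


def load_rows(path="mean_mos_comparison.csv"):
    """Read the comparison CSV into a list of dicts (hypothetical helper)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# Illustrative rows only: these numbers are invented to exercise the function.
example_rows = [
    {"category": "long", "mean_mos_eleven": "3.4", "mean_mos_smallest": "4.2"},
    {"category": "datetime", "mean_mos_eleven": "4.0", "mean_mos_smallest": "3.9"},
    {"category": "numbers", "mean_mos_eleven": "3.8", "mean_mos_smallest": "4.1"},
]
summary = summarize_comparison(example_rows)
print(summary)
```

To run it on the real data, call `summarize_comparison(load_rows())` after the report script has written `mean_mos_comparison.csv`. Note that the overall figure here is an unweighted average of category means, which matches a per-sample average only if every category has the same number of samples.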
