Mon Feb 10 2025 • 13 min Read
TTS Benchmark 2025: Smallest.ai vs Cartesia Report
Smallest.ai Vs Cartesia - TTS benchmark, evaluating latency and speech quality to help users choose the best fit for their real-time voice synthesis needs.
Pooja Porwal
Head - Growth
Abstract
Text-to-Speech (TTS) technology is revolutionizing human-machine interaction, enabling natural, real-time communication across diverse industries. From virtual assistants to accessibility tools, TTS is pivotal in creating seamless and engaging user experiences. The performance of TTS platforms is typically evaluated through key metrics such as latency and Mean Opinion Scores (MOS). Latency measures the speed of audio generation, while MOS assesses the naturalness and intelligibility of synthesized speech.
In this 2025 benchmark report, we compare Smallest.ai and Cartesia, two leading TTS platforms, analyzing their performance across latency and audio quality. Whether your priority is real-time responsiveness or speech clarity, this report provides actionable insights to help you choose the best TTS solution for your needs.
Introduction
This benchmarking report compares the Text-to-Speech (TTS) capabilities of Smallest.ai and Cartesia. It follows our previous report, which compared Smallest.ai with ElevenLabs.
We delve into critical TTS performance metrics, including latency and MOS scores, across multiple categories. The results highlight Smallest.ai’s leadership in speed, clarity, and multilingual support, making it a superior choice for businesses and developers seeking scalable and reliable voice synthesis solutions.
About the Companies
Smallest.ai is an emerging AI startup focused on low-latency, high-quality foundational multi-modal AI models that challenge the status quo. Their TTS model, Lightning, is one of the world's fastest text-to-speech models and, to our knowledge, the only one with an RTF (Real-Time Factor) of 0.01.
Cartesia is an innovative speech AI platform specializing in realistic voice generation, designed for developers and creators who demand precision, scalability, and seamless integration for cutting-edge applications. Their TTS model, Sonic, also claims to be extremely fast, leveraging sub-quadratic architectures to reduce latency.
Both platforms are strong contenders in the high-performance TTS space.
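For readers unfamiliar with the real-time factor mentioned above: RTF is commonly defined as the time taken to generate audio divided by the duration of the audio produced, so lower is better. The sketch below only illustrates that definition; the function name is ours and is not part of either platform's API.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# An RTF of 0.01 means 10 seconds of audio take about 0.1 seconds to generate.
rtf = real_time_factor(generation_seconds=0.1, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
```

Note that RTF describes model throughput; the latencies reported in the next section also include network time.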
Key Metrics
1. Latency Comparison
Latency is a critical factor in real-time applications, where lower latency ensures smoother interactions. This is especially important for virtual assistants, customer service bots, and interactive content.
For this comparison, we used the lightning model for Smallest.ai and the sonic-2024-12-12 model for Cartesia. The latency results below are broken down by region and protocol:
1. Smallest.ai
a. Latencies for full audio generation (HTTP):
- India: 340 ms
- US: 336 ms
b. Time to generate full audio (WebSocket):
- India: 340 ms
- US: 336 ms
c. Time to First Byte (TTFB):
- India: 187 ms
- US: 187 ms
*For Smallest.ai, WebSockets are available for paid accounts.
Code to get average latencies for Smallest.ai:
# Smallest.ai latency measurement
import time

import requests

# API key
SMALLEST_API = ""

# URL for the Smallest.ai Lightning API
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

# Payload for the API request
payload = {
    "voice_id": "emily",
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "speed": 1,
    "sample_rate": 24000,
    "add_wav_header": True
}

# Headers for the API request
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json"
}

# List to store the latency of each request (in seconds)
latencies_smallest = []

# Make 10 API requests to measure latency
for i in range(10):
    start_time = time.time()
    response = requests.post(url, json=payload, headers=headers)
    elapsed = time.time() - start_time
    response.raise_for_status()  # fail loudly rather than time an error response
    latencies_smallest.append(elapsed)

# Calculate and print the average latency
average_latency = sum(latencies_smallest) / len(latencies_smallest)
print("Average Latency Smallest.ai: ", average_latency)
2. Cartesia
a. Latencies for full audio generation (HTTP):
- India: 1737 ms
- US: 2320 ms
b. Time to generate full audio (WebSocket):
- India: 1289 ms
- US: 1056 ms
c. Time to First Byte (TTFB):
- India: 156 ms
- US: 163 ms
Code to measure latency using HTTP:
import time

from cartesia import Cartesia

CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

# List to store the latency of each request (in seconds)
latencies_cartesia = []

for i in range(10):
    start_time = time.time()
    data = client.tts.bytes(
        model_id="sonic",
        transcript="Hello, my name is Emily. I am a text-to-speech voice.",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # Barbershop Man
        # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/tts/bytes
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 24000,
        },
    )
    # The "raw" container returns headerless PCM, so it is saved as .pcm.
    # Note that this file write is inside the timed region.
    with open("output.pcm", "wb") as f:
        f.write(data)
    latencies_cartesia.append(time.time() - start_time)

print("Average latency for Cartesia: ", sum(latencies_cartesia) / len(latencies_cartesia))
Code to measure latency using WebSocket:
import time

from cartesia import Cartesia

CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

voice_id = "a0e99841-438c-4a64-b679-ae501e7d6091"
voice = client.voices.get(id=voice_id)
transcript = "Hello, my name is Emily. I am a text-to-speech voice."

# You can check out our models at https://docs.cartesia.ai/getting-started/available-models
model_id = "sonic-english"

# You can find the supported `output_format`s at https://docs.cartesia.ai/reference/api-reference/rest/stream-speech-server-sent-events
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 22050,
}

# Set up the websocket connection
ws = client.tts.websocket()

cartesia_latency_ttfb = []  # time until the first audio chunk arrives
cartesia_latencies = []     # time until the last audio chunk arrives

for i in range(10):
    start_time = time.time()
    first_chunk_seen = False
    # Generate and stream audio using the websocket
    for output in ws.send(
        model_id=model_id,
        transcript=transcript,
        voice_embedding=voice["embedding"],
        stream=True,
        output_format=output_format,
    ):
        buffer = output["audio"]
        if not first_chunk_seen:
            cartesia_latency_ttfb.append(time.time() - start_time)
            first_chunk_seen = True
    cartesia_latencies.append(time.time() - start_time)

print("Average time to first byte for Cartesia: ", sum(cartesia_latency_ttfb) / len(cartesia_latency_ttfb))
print("Average latency for Cartesia: ", sum(cartesia_latencies) / len(cartesia_latencies))
Smallest.ai uses a non-autoregressive model, and hence its TTFB is the same as its latency.
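The two timings reported in this section, time to first byte and time to full audio, can be separated with one small helper that works on any iterable of audio chunks. This is a minimal sketch under our own assumptions: `time_stream` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in for a provider stream, not a real API call.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[bytes]) -> Tuple[float, float]:
    """Return (ttfb_seconds, total_seconds) for an iterable of audio chunks."""
    start = time.perf_counter()
    ttfb = None
    for chunk in chunks:
        # Record the arrival time of the first non-empty chunk only.
        if ttfb is None and chunk:
            ttfb = time.perf_counter() - start
    total = time.perf_counter() - start
    if ttfb is None:
        raise ValueError("stream produced no audio")
    return ttfb, total


def fake_stream(n_chunks: int = 3, delay: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield b"\x00" * 4


ttfb, total = time_stream(fake_stream())
print(f"TTFB: {ttfb * 1000:.0f} ms, full audio: {total * 1000:.0f} ms")
```

For a response that arrives as a single chunk, the two values are nearly identical; for a chunked stream, TTFB is the smaller of the two.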
Latency Analysis:
Smallest.ai consistently outperforms Cartesia in full (non-streaming) audio generation, delivering results at least 3x faster in every region and protocol we tested. This makes Smallest.ai the stronger choice for long-form voice synthesis such as audiobooks, podcasts, and social-media voiceovers, and a good fit for content creators.
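The 3x figure can be checked directly from the full-generation latencies reported earlier in this section. The snippet below is plain arithmetic on those published numbers; the variable names are ours.

```python
# Full-audio latencies in milliseconds, copied from the results listed above.
smallest_ms = {
    ("HTTP", "India"): 340,
    ("HTTP", "US"): 336,
    ("WebSocket", "India"): 340,
    ("WebSocket", "US"): 336,
}
cartesia_ms = {
    ("HTTP", "India"): 1737,
    ("HTTP", "US"): 2320,
    ("WebSocket", "India"): 1289,
    ("WebSocket", "US"): 1056,
}

# Ratio of Cartesia's latency to Smallest.ai's for each protocol and region.
speedups = {key: cartesia_ms[key] / smallest_ms[key] for key in smallest_ms}

for (protocol, region), ratio in speedups.items():
    print(f"{protocol:9} {region:5} {ratio:.1f}x")
```

The ratios range from roughly 3.1x (WebSocket, US) to roughly 6.9x (HTTP, US).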
For real-time use cases over WebSockets, Smallest.ai and Cartesia offer similar time-to-first-byte latencies, making both TTS APIs well suited to building voice agents.
Lower latency only improves the user experience if it does not come at the cost of audio quality. We therefore also assess the quality of both models using widely recognized open-source MOS predictors.
2. MOS (Mean Opinion Score) Analysis
MOS is a widely recognized metric for evaluating the quality of synthesized speech, with higher scores indicating better naturalness and clarity. The following table lists the MOS scores for both platforms across several test categories.
We used two commonly accepted open-source libraries, WVMOS and UTMOS, and report the average of their MOS scores in the table below:
Category | Smallest.ai | Cartesia | Examples |
---|---|---|---|
Punctuation Sentences | 4.646 | 4.101 | Don't you think? This isn't right! |
Small Sentences | 4.622 | 4.593 | You are very talented. |
Mixed Languages | 4.543 | 3.827 | तुम्हें नहीं लगता? This isn't right! |
Sentences with Names | 4.541 | 3.984 | मोहन discovered the quantum nature of karma. |
Acronyms | 4.517 | 3.873 | POTUS |
Places | 4.608 | 4.573 | Thiruvananthapuram |
Sentences with Numbers (2nd Category) | 4.413 | 3.445 | बैंक के लाभ में -90.93% की कमी देखी गई। |
Time Sentences | 4.282 | 4.413 | The meeting is scheduled for October 15th at 11 AM. |
Sentences with Numbers | 4.158 | 3.595 | तुम्हें presentation 30 मिनट में ख़त्म करनी होगी। |
Hindi in English Sentences | 4.332 | 4.699 | He was supposed to call, but ab tak koi khabar nahi hai. |
Sentences with Places | 4.159 | 4.022 | The waters of Sukkur began showing properties of conscious thought and memory retention. |
Sentences with Medicine | 4.182 | 3.586 | Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment. |
Sentences with Date and Time | 4.132 | 3.577 | २०२१-१२-२५T००:००:००+००:०० marked the birth of universal consciousness. |
Phonetic Hindi in English Sentences | 4.386 | 4.622 | Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi! |
Names | 4.388 | 4.516 | Indira |
Mahabharat Story | 4.208 | 4.466 | Duryodhana was jealous of the Pandavas' abilities and popularity and made several attempts to eliminate them. |
Hard Sentences | 4.213 | 4.447 | Joyful jaguars joyfully jumped joyful joyful jumps |
Long Sentences | 3.801 | 3.273 | कृपया ध्यान दें कि आपके कार्यों का न केवल आपके व्यक्तिगत जीवन पर, बल्कि समाज पर भी गहरा प्रभाव पड़ता है। इसलिए, हमेशा सही और नैतिक कार्य करने का प्रयास करें, जिससे सभी को लाभ हो। |
English in Hindi Sentences | 4.361 | 3.708 | मुझे तुम्हारी बात समझ आई, but अभी भी कुछ doubts हैं। |
Hindi in English Sentences | 4.332 | 4.699 | You should call him, warna phir he will be mad. |
Code to generate audio for Smallest.ai and Cartesia:
import os
import re

import pandas as pd
import requests
from cartesia import Cartesia

SMALLEST_API = ""
CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

# Read the test CSV, which has the following structure
#--|---------------------|-|-----------------------|--
#--| sentence            |-| category              |--
#--|---------------------|-|-----------------------|--
#--| Sentence text.      |-| Category of sentences |--
#--|---------------------|-|-----------------------|--
df = pd.read_csv("tts_test_new.csv")

def generate_audio_cartesia(text, filename, language=True):
    # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/tts/bytes
    output_format = {
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    }
    if language:
        # Text contains Devanagari: use the Hindi voice and language code
        data = client.tts.bytes(
            model_id="sonic",
            transcript=text,
            voice_id="faf0731e-dfb9-4cfc-8119-259a79b27e12",  # Apoorva
            output_format=output_format,
            language="hi",
        )
    else:
        data = client.tts.bytes(
            model_id="sonic",
            transcript=text,
            voice_id="729651dc-c6c3-4ee5-97fa-350da1f88600",
            output_format=output_format,
        )
    with open(filename, "wb") as f:
        f.write(data)

def generate_audio_smallest(text, filename):
    url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
    # Edit the header: enter your token
    headers = {
        "Authorization": f"Bearer {SMALLEST_API}",
        "Content-Type": "application/json"
    }
    payload = {
        "text": text,
        "voice_id": "arnav",
        "sample_rate": 24000,
        "speed": 1.0,
        "add_wav_header": True
    }
    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        with open(filename, 'wb') as wav_file:
            wav_file.write(response.content)
        print(f"Audio file saved as {filename}")
    else:
        print(f"Error occurred with status code {response.status_code}: {response.text}")

def is_hindi(text):
    # True if the text contains at least one Devanagari character
    return bool(re.search(r'[\u0900-\u097F]', text))

os.makedirs("audio_samples", exist_ok=True)

# Iterate over each sentence in the dataframe and generate audio
for index, row in df.iterrows():
    text = row['sentence']
    category = row['category']
    os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
    os.makedirs(f'audio_samples/cartesia/{category}', exist_ok=True)
    smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
    cartesia_filename = f"audio_samples/cartesia/{category}/sentence_{index}.wav"
    generate_audio_cartesia(text, cartesia_filename, language=is_hindi(text))
    generate_audio_smallest(text, smallest_filename)
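The generation script above chooses between the Hindi and English Cartesia voices with a Devanagari check. The short example below reuses that same regular expression on three sentences from the benchmark table to show how each would be routed.

```python
import re


def is_hindi(text: str) -> bool:
    """True if the text contains at least one Devanagari character (U+0900 to U+097F)."""
    return bool(re.search(r'[\u0900-\u097F]', text))


samples = [
    "You are very talented.",
    "मोहन discovered the quantum nature of karma.",
    "He was supposed to call, but ab tak koi khabar nahi hai.",
]

for sentence in samples:
    print(is_hindi(sentence), "|", sentence)
```

Only the second sentence is detected as Hindi. Romanized Hindi, as in the third sentence, contains no Devanagari characters, so under this check it is sent through the English voice path.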
UTMOS is an open-source MOS predictor that uses audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest.ai and Cartesia using UTMOSv2:
'''
Installation:
sudo apt install git-lfs
GIT_LFS_SKIP_SMUDGE=1 pip install git+https://github.com/sarulab-speech/UTMOSv2.git
'''
import csv
import os

import pandas as pd
import utmosv2

model = utmosv2.create_model(pretrained=True)
tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()
providers = ["cartesia", "smallest"]

os.makedirs('results/utmos', exist_ok=True)

# Evaluate the WAV files of every provider, one category at a time
for category in categories:
    for provider in providers:
        input_dir = f"audio_samples/{provider}/{category}"
        output_csv = f"results/utmos/{provider}_{category}.csv"
        print(output_csv)
        print(input_dir)
        # Skip categories that have already been scored
        if not os.path.exists(output_csv):
            mos = model.predict(input_dir=input_dir)
            with open(output_csv, mode='w', newline='') as file:
                writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
                writer.writeheader()
                writer.writerows(mos)
            print(f"Data has been written to {output_csv}")
WVMOS is an open-source MOS predictor that uses a pretrained Wav2Vec2 model to extract features and predict MOS scores. Below is the code used to generate MOS scores with the WVMOS library:
'''
!pip3 install git+https://github.com/AndreevP/wvmos
'''
import os
from pathlib import Path

import pandas as pd
import torch
from tqdm import tqdm
from wvmos import get_wvmos

# Use the GPU when one is available
mos_model = get_wvmos(cuda=torch.cuda.is_available())
tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()

def evaluate_directory(directory, mos_model, extension):
    """Evaluate all audio files in a directory."""
    results = []
    dir_path = Path(directory)
    audio_files = sorted(dir_path.glob(f"*.{extension}"))
    # The directory layout is audio_samples/<provider>/<category>
    provider = directory.split('/')[-2]
    category = directory.split('/')[-1]
    # For WAV files, calculate directly
    for audio_path in tqdm(audio_files, desc=f"Evaluating {directory}"):
        try:
            score = float(mos_model.calculate_one(str(audio_path)))
        except Exception as e:
            print(f"Error calculating MOS for {audio_path.name}: {str(e)}")
            score = None
        results.append({
            'file': audio_path.name,
            'provider': provider,
            'category': category,
            'mos_score': score
        })
    return results

# Evaluate the WAV files of both providers, one category at a time
for category in categories:
    if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
        all_results = []
        if os.path.exists(f'audio_samples/smallest/{category}'):
            results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
            all_results.extend(results)
        if os.path.exists(f'audio_samples/cartesia/{category}'):
            results = evaluate_directory(f'audio_samples/cartesia/{category}', mos_model, 'wav')
            all_results.extend(results)
        if not all_results:
            print("No audio files found in generated directories!")
        else:
            # Create results DataFrame
            results_df = pd.DataFrame(all_results)
            # Calculate summary statistics per provider
            summary = results_df.groupby('provider')['mos_score'].agg([
                'count', 'mean', 'std', 'min', 'max'
            ]).round(3)
            # Save results
            os.makedirs('results/wvmos', exist_ok=True)
            results_df.to_csv(f'results/wvmos/detailed_mos_scores_{category}.csv', index=False)
            summary.to_csv(f'results/wvmos/mos_summary_{category}.csv')
            # Print summary
            print("\nMOS Score Summary:")
            print(summary)
            # Print top/bottom samples (only include text if available)
            print("\nTop 1 Best Samples:")
            columns_to_show = ['provider', 'file', 'mos_score']
            if 'text' in results_df.columns:
                columns_to_show.insert(2, 'text')
            print(results_df.nlargest(1, 'mos_score')[columns_to_show])
            print("\nBottom 1 Samples:")
            print(results_df.nsmallest(1, 'mos_score')[columns_to_show])
Below is the code used to generate a combined report using both MOS scores for each category
import os

import pandas as pd

# Same categories and providers as in the scoring scripts above
tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()
providers = ["cartesia", "smallest"]

# Function to read UTMOS results (one CSV per provider and category)
def read_utmos_results(file_path, category):
    df = pd.read_csv(file_path)
    # File names follow the pattern <provider>_<category>.csv
    df['provider'] = file_path.split('/')[-1].split('_')[0]
    df['category'] = category
    return df

# Function to read WVMOS results (one summary CSV per category)
def read_wvmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['category'] = category
    return df

# Read UTMOS results
utmos_results = []
for category in categories:
    for provider in providers:
        file_path = f'results/utmos/{provider}_{category}.csv'
        if os.path.exists(file_path):
            utmos_results.append(read_utmos_results(file_path, category))
utmos_df = pd.concat(utmos_results, ignore_index=True)

# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()

# Read WVMOS results
wvmos_results = []
for category in categories:
    file_path = f'results/wvmos/mos_summary_{category}.csv'
    if os.path.exists(file_path):
        wvmos_results.append(read_wvmos_results(file_path, category))
wvmos_df = pd.concat(wvmos_results, ignore_index=True)

# Merge UTMOS and WVMOS results
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'])

# Average the UTMOS mean ('predicted_mos') and the WVMOS mean ('mean')
comparison_df['combined_mos'] = comparison_df[['predicted_mos', 'mean']].mean(axis=1)

# Print the comparison
print(comparison_df)

# Create a CSV with the mean MOS for Cartesia and Smallest.ai for each category
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='combined_mos')
mean_mos_comparison.to_csv('mean_mos_comparison.csv')
Key Insights:
- Acronyms: Smallest.ai scores significantly higher (4.517 vs 3.873).
- Mixed Language Handling: Smallest.ai leads with a score of 4.543, compared to Cartesia’s 3.827.
- Complex Sentences & Numbers: Smallest.ai delivers superior quality, especially with numbers (4.413 vs 3.445) and date/time sentences (4.132 vs 3.577).
- Punctuation & Clarity: Smallest.ai consistently performs better in categories requiring clarity, scoring 4.646 in punctuation sentences.
As seen above, Smallest.ai outperforms Cartesia in 14/20 categories, with an average MOS score of 4.33 compared to Cartesia's 4.06.
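The per-category gaps behind the insights above can be read off the table with a few lines of arithmetic. The snippet uses only the four score pairs quoted in the key insights; the variable names are ours.

```python
# (Smallest.ai, Cartesia) MOS pairs copied from the table above.
scores = {
    "Acronyms": (4.517, 3.873),
    "Mixed Languages": (4.543, 3.827),
    "Sentences with Numbers (2nd Category)": (4.413, 3.445),
    "Sentences with Date and Time": (4.132, 3.577),
}

# Positive gap means Smallest.ai scored higher in that category.
gaps = {
    category: round(smallest - cartesia, 3)
    for category, (smallest, cartesia) in scores.items()
}

# Print the categories from the largest gap to the smallest.
for category, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: +{gap:.3f} MOS for Smallest.ai")
```

Of these four categories, the widest gap is in sentences with numbers (0.968 MOS points) and the narrowest is in date and time sentences (0.555).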
Conclusion: Smallest.ai vs Cartesia
The benchmark results confirm that Smallest.ai outperforms Cartesia in both overall latency and audio quality.
While Cartesia and Smallest.ai have similar latencies for real-time APIs, Cartesia is at least 3x slower for full audio generation.
In addition, Smallest.ai outperforms Cartesia in most MOS buckets, with an average MOS score that is 0.27 points higher than Cartesia's. While Cartesia is slightly better at pronouncing proper nouns, very hard sentences, and Hindi in English sentences, it is significantly worse in buckets such as sentences with numbers, date-time sentences, and long sentences.
If you're looking for a TTS solution that does not compromise on quality while maintaining extremely low latencies, Smallest.ai is your undisputed champion in the voice AI arena. Speed, quality, and linguistic finesse: we're firing on all fronts!
Log in and try the Smallest TTS platform now!