Smallest.ai vs Cartesia: a TTS benchmark evaluating latency and speech quality to help users choose the best fit for their real-time voice synthesis needs.

Akshat Mandloi
Updated on
January 19, 2026 at 11:16 AM
import time
import requests

# API Key
SMALLEST_API = ""

# Define the URL for the Smallest API
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

# Define the payload for the API request
payload = {
    "voice_id": "emily",
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "speed": 1,
    "sample_rate": 24000,
    "add_wav_header": True
}

# Define the headers for the API request
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json"
}

# Initialize a list to store latencies
latencies_smallest = []

# Make 10 API requests to measure latency
for i in range(10):
    start_time = time.time()
    response = requests.request("POST", url, json=payload, headers=headers)
    latencies_smallest.append(time.time() - start_time)

# Calculate and print the average latency
average_latency = sum(latencies_smallest) / len(latencies_smallest)
print("Average Latency Smallest.ai: ", average_latency)
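An average over ten requests can hide a single slow call, so it may help to look at the median and the maximum as well. The following is a minimal sketch, not part of the original benchmark: `summarize_latencies` is a hypothetical helper, and the sample values are made up for illustration.

```python
import statistics

def summarize_latencies(latencies):
    """Return mean, median, and max (in seconds) for a list of latency samples."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }

# Hypothetical sample values, not measurements from this benchmark
sample = [0.30, 0.32, 0.31, 0.95, 0.29]
summary = summarize_latencies(sample)
print(summary)
```

In the script above, the same helper could be called on `latencies_smallest` once the loop has finished.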
2. Cartesia
a. Latencies for full audio generation (HTTP): India: 1737 ms, US: 2320 ms
b. Time to generate full audio (WebSocket): India: 1289 ms, US: 1056 ms
c. Time to First Byte (TTFB): India: 156 ms, US: 163 ms

Code to generate latency using HTTP

from cartesia import Cartesia
import os
import time

CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

latencies_cartesia = []

for i in range(10):
    start_time = time.time()
    data = client.tts.bytes(
        model_id="sonic",
        transcript="Hello, my name is Emily. I am a text-to-speech voice.",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # Barbershop Man
        # You can find the supported output_formats at https://docs.cartesia.ai/api-reference/tts/bytes
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 24000,
        },
    )
    # The "raw" container returns headerless PCM, which is written to disk as-is
    with open("output.wav", "wb") as f:
        f.write(data)
    latencies_cartesia.append(time.time() - start_time)

print("Average Latencies for Cartesia: ", sum(latencies_cartesia) / len(latencies_cartesia))
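The HTTP example requests the `raw` container with `pcm_f32le` encoding and writes the bytes to `output.wav`, so the file holds headerless float samples rather than a playable WAV. The following is a minimal sketch of one way to wrap such bytes in a WAV header using only the standard library. It assumes mono audio, and `pcm_f32le_to_wav_bytes` is a hypothetical helper, not part of either SDK. Because Python's `wave` module writes integer PCM, the sketch converts the samples to 16-bit.

```python
import io
import struct
import wave

def pcm_f32le_to_wav_bytes(pcm_bytes, sample_rate, channels=1):
    """Convert headerless little-endian float32 PCM into 16-bit WAV bytes."""
    n_samples = len(pcm_bytes) // 4
    floats = struct.unpack(f"<{n_samples}f", pcm_bytes[: n_samples * 4])
    # Clamp to [-1.0, 1.0] and scale to the signed 16-bit range
    ints = [int(max(-1.0, min(1.0, x)) * 32767) for x in floats]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(channels)  # assumption: mono output
        wav_file.setsampwidth(2)         # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return buf.getvalue()

# Hypothetical input: four float32 samples instead of a real API response
raw = struct.pack("<4f", 0.0, 0.5, -0.5, 1.0)
wav_bytes = pcm_f32le_to_wav_bytes(raw, sample_rate=24000)
print(len(wav_bytes), wav_bytes[:4])
```

Alternatively, requesting the `wav` container, as the audio-generation script later in this post does, avoids the conversion altogether.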
Code to generate latency using WebSocket

from cartesia import Cartesia
import os
import time

CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

voice_id = "a0e99841-438c-4a64-b679-ae501e7d6091"
voice = client.voices.get(id=voice_id)
transcript = "Hello, my name is Emily. I am a text-to-speech voice."

# You can check out the available models at https://docs.cartesia.ai/getting-started/available-models
model_id = "sonic-english"

# You can find the supported output_formats at https://docs.cartesia.ai/reference/api-reference/rest/stream-speech-server-sent-events
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 22050,
}
rate = 22050

# Set up the websocket connection
ws = client.tts.websocket()

cartesia_latency_ttfb = []
cartesia_latencies = []

for i in range(10):
    start_time = time.time()
    first_chunk = True
    # Generate and stream audio using the websocket
    for output in ws.send(
        model_id=model_id,
        transcript=transcript,
        voice_embedding=voice["embedding"],
        stream=True,
        output_format=output_format,
    ):
        buffer = output["audio"]
        # Record the time to first byte when the first audio chunk arrives
        if first_chunk:
            cartesia_latency_ttfb.append(time.time() - start_time)
            first_chunk = False
    # Record the time taken to receive the full audio
    cartesia_latencies.append(time.time() - start_time)

print("Average time to first byte for Cartesia: ", sum(cartesia_latency_ttfb) / len(cartesia_latency_ttfb))
print("Average Latencies for Cartesia: ", sum(cartesia_latencies) / len(cartesia_latencies))
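Time to first byte and total generation time can be measured the same way for any streaming response: note when the first chunk arrives, then note when the stream is exhausted. The following is a minimal sketch under stated assumptions: `measure_stream` is a hypothetical helper, and `fake_stream` stands in for a real streaming TTS response so the example runs without an API key.

```python
import time

def measure_stream(chunks):
    """Consume an iterator of audio chunks; return (ttfb, total_time, total_bytes)."""
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            # First chunk received: this is the time to first byte
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    return ttfb, total, total_bytes

def fake_stream(n_chunks=3, delay=0.01):
    """Hypothetical stand-in for a streaming TTS response."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield b"\x00" * 1024

ttfb, total, total_bytes = measure_stream(fake_stream())
print(f"TTFB: {ttfb:.3f}s, total: {total:.3f}s, bytes: {total_bytes}")
```

For a non-streaming response that arrives in one piece, the iterator yields a single chunk and the two timings are nearly identical.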
Smallest uses a non-autoregressive model, so its TTFB is the same as its full-generation latency.

Latency Analysis:

Smallest.ai consistently outperforms Cartesia in non-real-time audio generation, delivering results at least 3x faster. This reduction in latency makes Smallest.ai the stronger choice for long-form voice synthesis such as audiobooks, podcasts, and social-media voiceovers, which suits content creators.

For real-time use cases over WebSockets, Smallest.ai and Cartesia offer similar latencies, making both TTS APIs well suited to building voice agents.

Lower latency does not improve the user experience if it comes with a drop in audio quality. We therefore also assess the quality of both models using widely recognized open-source MOS benchmarks.

2. MOS (Mean Opinion Score) Analysis

MOS is a widely recognized metric for evaluating the quality of synthesized speech, with higher scores indicating better naturalness and clarity. We used two commonly accepted open-source libraries, WVMOS and UTMOS, averaged the MOS scores from both, and report the results for each test category in the table below:

| Category | Smallest.ai | Cartesia | Examples (full list available here) |
|---|---|---|---|
| Punctuation Sentences | 4.646 | 4.101 | Don't you think? This isn't right! |
| Small Sentences | 4.622 | 4.593 | You are very talented. |
| Mixed Languages | 4.543 | 3.827 | तुम्हें नहीं लगता? This isn't right! |
| Sentences with Names | 4.541 | 3.984 | मोहन discovered the quantum nature of karma. |
| Acronyms | 4.517 | 3.873 | POTUS |
| Places | 4.608 | 4.573 | Thiruvananthapuram |
| Sentences with Numbers (2nd Category) | 4.413 | 3.445 | बैंक के लाभ में -90.93% की कमी देखी गई। |
| Time Sentences | 4.282 | 4.413 | The meeting is scheduled for October 15th at 11 AM. |
| Sentences with Numbers | 4.158 | 3.595 | तुम्हें presentation 30 मिनट में ख़त्म करनी होगी। |
| Hindi in English Sentences | 4.332 | 4.699 | He was supposed to call, but ab tak koi khabar nahi hai. |
| Sentences with Places | 4.159 | 4.022 | The waters of Sukkur began showing properties of conscious thought and memory retention. |
| Sentences with Medicine | 4.182 | 3.586 | Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment. |
| Sentences with Date and Time | 4.132 | 3.577 | २०२१-१२-२५T००:००:००+००:०० marked the birth of universal consciousness. |
| Phonetic Hindi in English Sentences | 4.386 | 4.622 | Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi! |
| Names | 4.388 | 4.516 | Indira |
| Mahabharat Story | 4.208 | 4.466 | Duryodhana was jealous of the Pandavas' abilities and popularity and made several attempts to eliminate them. |
| Hard Sentences | 4.213 | 4.447 | Joyful jaguars joyfully jumped joyful joyful jumps |
| Long Sentences | 3.801 | 3.273 | कृपया ध्यान दें कि आपके कार्यों का न केवल आपके व्यक्तिगत जीवन पर, बल्कि समाज पर भी गहरा प्रभाव पड़ता है। इसलिए, हमेशा सही और नैतिक कार्य करने का प्रयास करें, जिससे सभी को लाभ हो। |
| English in Hindi Sentences | 4.361 | 3.708 | मुझे तुम्हारी बात समझ आई, but अभी भी कुछ doubts हैं। |
| Hindi in English Sentences | 4.332 | 4.699 | You should call him, warna phir he will be mad. |

Generate Audios for Smallest.ai and Cartesia

import os
import time
from cartesia import Cartesia
import pandas as pd
import re
import requests

SMALLEST_API = ""
CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

# Read the test CSV, which has the following structure
#--|---------------------|-|-----------------------|--
#--| sentence            |-| category              |--
#--|---------------------|-|-----------------------|--
#--| Sentence text.      |-| Category of sentences |--
#--|---------------------|-|-----------------------|--
df = pd.read_csv("tts_test_new.csv")

def generate_audio_cartesia(text, filename, language=True):
    # language=True selects the Hindi voice and passes language="hi"
    if language:
        data = client.tts.bytes(
            model_id="sonic",
            transcript=text,
            voice_id="faf0731e-dfb9-4cfc-8119-259a79b27e12",  # Apoorva
            # You can find the supported output_formats at https://docs.cartesia.ai/api-reference/tts/bytes
            output_format={
                "container": "wav",
                "encoding": "pcm_f32le",
                "sample_rate": 44100,
            },
            language="hi"
        )
    else:
        data = client.tts.bytes(
            model_id="sonic",
            transcript=text,
            voice_id="729651dc-c6c3-4ee5-97fa-350da1f88600",
            # You can find the supported output_formats at https://docs.cartesia.ai/api-reference/tts/bytes
            output_format={
                "container": "wav",
                "encoding": "pcm_f32le",
                "sample_rate": 44100,
            },
        )
    with open(filename, "wb") as f:
        f.write(data)

def generate_audio_smallest(text, filename):
    url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
    # Edit the header - enter Token
    headers = {
        "Authorization": f"Bearer {SMALLEST_API}",
        "Content-Type": "application/json"
    }
    payload = {
        "text": text,
        "voice_id": "arnav",
        "sample_rate": 24000,
        "speed": 1.0,
        "add_wav_header": True
    }
    response = requests.request("POST", url, json=payload, headers=headers)
    if response.status_code == 200:
        with open(filename, 'wb') as wav_file:
            wav_file.write(response.content)
        print(f"Audio file saved as {filename}")
    else:
        print(f"Error occurred with status code {response.status_code}: {response.text}")

def is_hindi(text):
    # True if the text contains at least one Devanagari character
    return bool(re.search(r'[\u0900-\u097F]', text))

os.makedirs("audio_samples", exist_ok=True)

# Iterate over each sentence in the dataframe and generate audio
for index, row in df.iterrows():
    text = row['sentence']
    category = row['category']
    os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
    os.makedirs(f'audio_samples/cartesia/{category}', exist_ok=True)
    smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
    cartesia_filename = f"audio_samples/cartesia/{category}/sentence_{index}.wav"
    generate_audio_smallest(text, smallest_filename)
    generate_audio_cartesia(text, cartesia_filename, language=is_hindi(text))
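The script picks the Cartesia voice and language flag by checking for Devanagari characters with `is_hindi`. One consequence is that romanized Hindi, which contains no Devanagari, goes through the non-Hindi branch. The following is a small sketch that reuses `is_hindi` on three example sentences from the table; the `routing` dictionary and the `"default"` label are illustrative only.

```python
import re

def is_hindi(text):
    """Return True if the text contains at least one Devanagari character."""
    return bool(re.search(r'[\u0900-\u097F]', text))

examples = [
    "You are very talented.",
    "मोहन discovered the quantum nature of karma.",
    "Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi!",
]

# "hi" means the Hindi branch would be used; "default" means the other branch
routing = {text: ("hi" if is_hindi(text) else "default") for text in examples}
for text, lang in routing.items():
    print(f"{lang:8} {text}")
```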
UTMOS is an open-source benchmark that leverages audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest.ai and Cartesia using UTMOSv2.

'''
Installation:
sudo apt install git-lfs
GIT_LFS_SKIP_SMUDGE=1 pip install git+https://github.com/sarulab-speech/UTMOSv2.git
'''
import utmosv2
import os
import csv
import pandas as pd
import pydub
import requests

model = utmosv2.create_model(pretrained=True)
tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()
providers = ["cartesia", "smallest"]

os.makedirs('results/utmos', exist_ok=True)

# Evaluate the WAV files of each provider, one category at a time
for category in categories:
    for provider in providers:
        results_path = f"results/utmos/{provider}_{category}.csv"
        input_dir = f"audio_samples/{provider}/{category}"
        print(results_path)
        print(input_dir)
        # Skip categories that have already been scored
        if not os.path.exists(results_path):
            mos = model.predict(input_dir=input_dir)
            with open(results_path, mode='w', newline='') as file:
                writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
                writer.writeheader()
                writer.writerows(mos)
'''
Installation:
pip3 install git+https://github.com/AndreevP/wvmos
'''
from wvmos import get_wvmos
import os
import csv
import pandas as pd
import pydub
import requests
from pathlib import Path
from tqdm import tqdm
import logging
import torch

mos_model = get_wvmos(cuda=True)
tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()

def evaluate_directory(directory, mos_model, extension):
    """Evaluate all audio files in a directory."""
    results = []
    dir_path = Path(directory)
    audio_files = sorted(dir_path.glob(f"*.{extension}"))
    # For WAV files, calculate directly
    for audio_path in tqdm(audio_files, desc=f"Evaluating {directory}"):
        try:
            score = float(mos_model.calculate_one(str(audio_path)))
            results.append({
                'file': audio_path.name,
                'provider': directory.split('/')[-2],
                'category': directory.split('/')[-1],
                'mos_score': score
            })
        except Exception as e:
            print(f"Error calculating MOS for {audio_path.name}: {str(e)}")
            results.append({
                'file': audio_path.name,
                'provider': directory.split('/')[-2],
                'category': directory.split('/')[-1],
                'mos_score': None
            })
    return results

# Evaluate Smallest.ai WAV files
for category in categories:
    if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
        all_results = []
        if os.path.exists(f'audio_samples/smallest/{category}'):
            results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
            all_results.extend(results)
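The loop above stops after collecting per-file results, while the aggregation code that follows reads `results/wvmos/mos_summary_{category}.csv` and expects `provider`, `category`, and `mean` columns. The following is a minimal sketch of how the per-file results could be reduced to that shape. It is an assumption about the missing step, not the author's code: `summarize_results` and `write_summary` are hypothetical helpers, and the scores are made up.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def summarize_results(all_results):
    """Group per-file results by (provider, category) and average the valid scores."""
    grouped = defaultdict(list)
    for row in all_results:
        if row["mos_score"] is not None:  # skip files that failed to score
            grouped[(row["provider"], row["category"])].append(row["mos_score"])
    return [
        {"provider": provider, "category": category, "mean": mean(scores)}
        for (provider, category), scores in sorted(grouped.items())
    ]

def write_summary(rows, file_obj):
    """Write the summary rows as CSV with the assumed column names."""
    writer = csv.DictWriter(file_obj, fieldnames=["provider", "category", "mean"])
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical per-file results in the shape returned by evaluate_directory
all_results = [
    {"file": "sentence_0.wav", "provider": "smallest", "category": "Names", "mos_score": 4.0},
    {"file": "sentence_1.wav", "provider": "smallest", "category": "Names", "mos_score": 4.5},
    {"file": "sentence_2.wav", "provider": "smallest", "category": "Names", "mos_score": None},
    {"file": "sentence_0.wav", "provider": "cartesia", "category": "Names", "mos_score": 4.2},
]
summary = summarize_results(all_results)
buffer = io.StringIO()
write_summary(summary, buffer)
print(buffer.getvalue())
```

In a real run, `write_summary` would receive a file opened at the `mos_summary_{category}.csv` path instead of an in-memory buffer.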
import os
import pandas as pd

# Function to read UTMOS results
def read_utmos_results(file_path, category):
    df = pd.read_csv(file_path)
    # The provider name is the part of the file name before the first underscore
    df['provider'] = file_path.split('/')[-1].split('_')[0]
    df['category'] = category
    return df

# Function to read WVMOS results
def read_wvmos_results(file_path, category):
    df = pd.read_csv(file_path)
    df['category'] = category
    return df

# Read UTMOS results
utmos_results = []
for category in categories:
    for provider in providers:
        file_path = f'results/utmos/{provider}_{category}.csv'
        if os.path.exists(file_path):
            utmos_results.append(read_utmos_results(file_path, category))
utmos_df = pd.concat(utmos_results, ignore_index=True)

# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()
utmos_mean_scores

# Read WVMOS results
wvmos_results = []
for category in categories:
    file_path = f'results/wvmos/mos_summary_{category}.csv'
    if os.path.exists(file_path):
        wvmos_results.append(read_wvmos_results(file_path, category))
wvmos_df = pd.concat(wvmos_results, ignore_index=True)
wvmos_df

# Merge UTMOS and WVMOS results
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'], suffixes=('_utmos', '_wvmos'))

# Print the comparison
comparison_df

# Create a CSV with mean MOS for Cartesia and Smallest for each category
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='mean')
mean_mos_comparison.columns = providers
mean_mos_comparison.to_csv('mean_mos_comparison.csv')

Key Insights:
- Acronyms: Smallest.ai scores significantly higher (4.517 vs 3.873).
- Mixed Language Handling: Smallest.ai leads with a score of 4.543, compared to Cartesia's 3.827.
- Complex Sentences & Numbers: Smallest.ai delivers superior quality, especially with numbers (4.413 vs 3.445) and date/time sentences (4.132 vs 3.577).
- Punctuation & Clarity: Smallest.ai performs better on punctuation sentences, scoring 4.646 against Cartesia's 4.101.

As seen above, Smallest.ai outperforms Cartesia in 14/20 categories, with an average MOS score of 4.33 against Cartesia's 4.06.

Conclusion: Smallest.ai vs Cartesia

The benchmark results confirm that Smallest.ai outperforms Cartesia in both overall latency and audio quality. Cartesia and Smallest.ai have similar latencies for real-time APIs, but for full audio generation Cartesia is 3x slower.

In addition, Smallest.ai outperforms Cartesia in most MOS buckets, with an average MOS score 0.27 points higher. Cartesia is slightly better at pronouncing proper nouns, very hard sentences, and Hindi in English sentences, but in buckets such as sentences with numbers, date-time sentences, and long sentences it is significantly worse.

If you're looking for a TTS solution that does not compromise on quality while maintaining extremely low latencies, Smallest.ai is your undisputed champion in the voice AI arena. Speed, quality, and linguistic finesse: we're firing on all fronts!

Login and try the Smallest TTS platform now!



