TTS Benchmark 2025: Smallest.ai vs Cartesia Report

Abstract

Text-to-Speech (TTS) technology is revolutionizing human-machine interaction, enabling natural, real-time communication across diverse industries. From virtual assistants to accessibility tools, TTS is pivotal in creating seamless and engaging user experiences. The performance of TTS platforms is typically evaluated through key metrics such as latency and Mean Opinion Scores (MOS). Latency measures the speed of audio generation, while MOS assesses the naturalness and intelligibility of synthesized speech.

In this 2025 benchmark report, we compare Smallest.ai and Cartesia, two leading TTS platforms, analyzing their performance across latency and audio quality. Whether your priority is real-time responsiveness or speech clarity, this report provides actionable insights to help you choose the best TTS solution for your needs.

Introduction

This benchmarking report compares Smallest.ai and Cartesia and evaluates their Text-to-Speech (TTS) capabilities. Our previous report, which compared Smallest.ai with ElevenLabs, is also available.

We delve into critical TTS performance metrics, including latency and MOS scores, across multiple categories. The results highlight Smallest.ai’s leadership in speed, clarity, and multilingual support, making it a superior choice for businesses and developers seeking scalable and reliable voice synthesis solutions.

About the Companies

Smallest.ai is an emerging AI startup focused on low-latency, high-quality foundational multi-modal AI models that challenge the status quo. Its TTS model, Lightning, is one of the world's fastest text-to-speech models; Smallest.ai states that it is the only TTS model with an RTF (Real Time Factor) of 0.01.
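For readers unfamiliar with the metric, RTF is conventionally the time spent synthesizing divided by the duration of the audio produced, so an RTF of 0.01 corresponds to roughly 10 ms of compute per second of speech. The sketch below only illustrates that arithmetic; the function name and the sample numbers are made up for illustration and are not taken from either vendor's SDK or from this benchmark.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 10 seconds of audio synthesized in 0.1 seconds.
rtf = real_time_factor(synthesis_seconds=0.1, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
```

An RTF below 1 means audio is generated faster than it plays back.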

Cartesia is an innovative speech AI platform specializing in realistic voice generation, designed for developers and creators who demand precision, scalability, and seamless integration for cutting-edge applications. Their TTS model, Sonic, also claims to be extremely fast, leveraging sub-quadratic architectures to reduce latency.

Both platforms are strong contenders in the high-performance TTS space.

Key Metrics

1. Latency Comparison

Latency is a critical factor in real-time applications, where lower latency ensures smoother interactions. This is especially important for virtual assistants, customer service bots, and interactive content.

For this comparison, we used the lightning model for Smallest.ai and the sonic-2024-12-12 model for Cartesia. The latency results below are broken down by region and protocol:

1. Smallest.ai

a. Latencies for full audio generation (HTTP):

India: 340 ms
US: 336 ms

b. Time to generate full audio (WebSocket):

India: 340 ms
US: 336 ms

c. Time to First Byte (TTFB):

India: 187 ms
US: 187 ms

*For Smallest.ai, WebSockets are available for paid accounts.
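The two quantities reported above, TTFB and time to full audio, can both be read off any chunked response. The following is a minimal sketch of that measurement, not the exact script used for this report: `time_stream` and `fake_stream` are hypothetical names, and the fake stream merely stands in for a real HTTP or WebSocket audio stream.

```python
import time
from typing import Callable, Iterable, Tuple


def time_stream(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float, int]:
    """Return (ttfb_seconds, total_seconds, total_bytes) for one streamed response.

    `open_stream` is a hypothetical callable that starts the request and
    returns an iterable of audio chunks.
    """
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    for chunk in open_stream():
        if not chunk:
            continue  # ignore empty chunks
        if ttfb is None:
            ttfb = time.perf_counter() - start  # first audio bytes arrived
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if ttfb is None:
        raise RuntimeError("stream produced no audio")
    return ttfb, total, total_bytes


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real TTS stream: three chunks with small delays."""
    for _ in range(3):
        time.sleep(0.01)
        yield b"\x00" * 480


ttfb, total, size = time_stream(fake_stream)
print(f"TTFB: {ttfb * 1000:.0f} ms, full audio: {total * 1000:.0f} ms, {size} bytes")
```

With the `requests` library, `open_stream` could be something like `lambda: requests.post(url, json=payload, headers=headers, stream=True).iter_content(chunk_size=4096)`, assuming the endpoint returns its audio in chunks.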

Code to get average latencies for Smallest.ai:


# Smallest.ai: average latency for full audio generation over HTTP
import time

import requests

# API key
SMALLEST_API = ""

# URL for the Smallest.ai Lightning endpoint
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

# Payload for the API request
payload = {
    "voice_id": "emily",
    "text": "Hello, my name is Emily. I am a text-to-speech voice.",
    "speed": 1,
    "sample_rate": 24000,
    "add_wav_header": True
}

# Headers for the API request
headers = {
    "Authorization": f"Bearer {SMALLEST_API}",
    "Content-Type": "application/json"
}

# Per-request latencies, in seconds
latencies_smallest = []

# Make 10 API requests to measure latency
for i in range(10):
    start_time = time.time()
    response = requests.post(url, json=payload, headers=headers)
    elapsed = time.time() - start_time
    response.raise_for_status()  # do not count failed requests as samples
    latencies_smallest.append(elapsed)

# Calculate and print the average latency (in seconds)
average_latency = sum(latencies_smallest) / len(latencies_smallest)
print("Average latency for Smallest.ai (seconds):", average_latency)
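The script above records latencies in seconds, while the results in this report are quoted in milliseconds. A small helper along the following lines can do the conversion and show the spread of the samples alongside the mean. This is a sketch only: `summarize_latencies_ms` is a hypothetical name and the sample values are made up.

```python
import statistics
from typing import Dict, Sequence


def summarize_latencies_ms(latencies_seconds: Sequence[float]) -> Dict[str, float]:
    """Convert per-request latencies from seconds to ms and summarize them."""
    if not latencies_seconds:
        raise ValueError("need at least one latency sample")
    ms = [value * 1000.0 for value in latencies_seconds]
    return {
        "mean_ms": statistics.mean(ms),
        "median_ms": statistics.median(ms),
        "min_ms": min(ms),
        "max_ms": max(ms),
    }


# Made-up samples for illustration; pass `latencies_smallest` in practice.
sample = [0.30, 0.35, 0.40]
print(summarize_latencies_ms(sample))
```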

2. Cartesia

a. Latencies for full audio generation (HTTP):

India: 1737 ms
US: 2320 ms

b. Time to generate full audio (WebSocket):

India: 1289 ms
US: 1056 ms

c. Time to First Byte (TTFB):

India: 156 ms
US: 163 ms

Code to measure latency using HTTP:

import time

from cartesia import Cartesia

CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)

# Per-request latencies, in seconds
latencies_cartesia = []
for i in range(10):
    start_time = time.time()
    data = client.tts.bytes(
        model_id="sonic",
        transcript="Hello, my name is Emily. I am a text-to-speech voice.",
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # Barbershop Man
        # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/tts/bytes
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 24000,
        },
    )

    # Note: the container is "raw", so this file holds headerless PCM data.
    # The write happens before the timer stops and is part of the measurement.
    with open("output.wav", "wb") as f:
        f.write(data)

    latencies_cartesia.append(time.time() - start_time)

print("Average latency for Cartesia (seconds):", sum(latencies_cartesia) / len(latencies_cartesia))

Code to measure latency and TTFB using WebSocket:

import time

from cartesia import Cartesia

CARTESIA_API_KEY = ""
client = Cartesia(api_key=CARTESIA_API_KEY)
voice_id = "a0e99841-438c-4a64-b679-ae501e7d6091"
voice = client.voices.get(id=voice_id)
transcript = "Hello, my name is Emily. I am a text-to-speech voice."

# You can check out our models at https://docs.cartesia.ai/getting-started/available-models
model_id = "sonic-english"

# Sample rate of the requested audio, in Hz
rate = 22050

# You can find the supported `output_format`s at https://docs.cartesia.ai/reference/api-reference/rest/stream-speech-server-sent-events
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": rate,
}

# Set up the websocket connection
ws = client.tts.websocket()

# Per-request time to first byte and time to full audio, in seconds
cartesia_latency_ttfb = []
cartesia_latencies = []

for i in range(10):
    start_time = time.time()
    first_chunk_seen = False
    # Generate and stream audio using the websocket
    for output in ws.send(
        model_id=model_id,
        transcript=transcript,
        voice_embedding=voice["embedding"],
        stream=True,
        output_format=output_format,
    ):
        buffer = output["audio"]

        # Record TTFB when the first audio chunk arrives
        if not first_chunk_seen:
            cartesia_latency_ttfb.append(time.time() - start_time)
            first_chunk_seen = True

    # Time until the last chunk has been received
    cartesia_latencies.append(time.time() - start_time)

print("Average time to first byte for Cartesia (seconds):", sum(cartesia_latency_ttfb) / len(cartesia_latency_ttfb))
print("Average latency for Cartesia (seconds):", sum(cartesia_latencies) / len(cartesia_latencies))

Smallest.ai uses a non-autoregressive model, so its TTFB is the same as its latency.

Latency Analysis:

Smallest.ai consistently outperforms Cartesia in full (non-streaming) audio generation, delivering results at least 3x faster. This lower latency makes Smallest.ai the superior choice for long-form voice synthesis such as audiobooks, podcasts, and social-media voiceovers, and therefore a strong fit for content creators.
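The "at least 3x" figure can be checked directly against the full-audio latencies reported earlier. The snippet below is a small arithmetic sketch that uses only those reported numbers; the variable names are ours.

```python
# Latencies in milliseconds, copied from the results listed earlier in this report.
full_audio_ms = {
    ("HTTP", "India"): {"smallest": 340, "cartesia": 1737},
    ("HTTP", "US"): {"smallest": 336, "cartesia": 2320},
    ("WebSocket", "India"): {"smallest": 340, "cartesia": 1289},
    ("WebSocket", "US"): {"smallest": 336, "cartesia": 1056},
}

# Ratio of Cartesia latency to Smallest.ai latency for each protocol and region
speedups = {
    key: values["cartesia"] / values["smallest"]
    for key, values in full_audio_ms.items()
}

for (protocol, region), ratio in speedups.items():
    print(f"{protocol:9s} {region:5s} Cartesia / Smallest.ai = {ratio:.1f}x")
```

The ratios range from about 3.1x (WebSocket, US) to about 6.9x (HTTP, US).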

For real-time use cases over WebSockets, Smallest.ai and Cartesia offer similar latencies, making both TTS APIs incredibly powerful for building voice agents.

Lower latency only improves the user experience if audio quality does not drop along with it. Therefore, we also assess the quality of both models using widely recognized open-source benchmarks for MOS scores.

2. MOS (Mean Opinion Score) Analysis

MOS is a widely recognized metric used to evaluate the quality of synthesized speech, with higher scores indicating better naturalness and clarity. The following table outlines the MOS scores for both platforms across several test categories.

We used two commonly accepted open-source libraries, WVMOS and UTMOS, and report the average of their MOS scores in the table below:
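As a rough illustration of how two predictors can be combined into one number per category, the sketch below takes an equal-weight average of the per-category means. The equal weighting, the function name `combined_mos`, and the sample scores are assumptions made for illustration; they are not results from this benchmark.

```python
from statistics import mean
from typing import List


def combined_mos(utmos_scores: List[float], wvmos_scores: List[float]) -> float:
    """Average the per-category means of the two predictors (equal weights)."""
    if not utmos_scores or not wvmos_scores:
        raise ValueError("both score lists must be non-empty")
    return (mean(utmos_scores) + mean(wvmos_scores)) / 2


# Hypothetical per-file scores for one provider and one category.
utmos = [4.2, 4.4]
wvmos = [4.6, 4.8]
print(round(combined_mos(utmos, wvmos), 3))
```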

Category	Smallest.ai	Cartesia	Examples (Full list available here)
Punctuation Sentences	4.646	4.101	Don't you think? This isn't right!
Small Sentences	4.622	4.593	You are very talented.
Mixed Languages	4.543	3.827	तुम्हें नहीं लगता? This isn't right!
Sentences with Names	4.541	3.984	मोहन discovered the quantum nature of karma.
Acronyms	4.517	3.873	POTUS
Places	4.608	4.573	Thiruvananthapuram
Sentences with Numbers (2nd Category)	4.413	3.445	बैंक के लाभ में -90.93% की कमी देखी गई।
Time Sentences	4.282	4.413	The meeting is scheduled for October 15th at 11 AM.
Sentences with Numbers	4.158	3.595	तुम्हें presentation 30 मिनट में ख़त्म करनी होगी।
Hindi in English Sentences	4.332	4.699	He was supposed to call, but ab tak koi khabar nahi hai.
Sentences with Places	4.159	4.022	The waters of Sukkur began showing properties of conscious thought and memory retention.
Sentences with Medicine	4.182	3.586	Amoxicillin की प्रभावशीलता तुलसी के साथ मिलकर दस गुना बढ़ गई, creating a new paradigm in antibiotic treatment.
Sentences with Date and Time	4.132	3.577	२०२१-१२-२५T००:००:००+००:०० marked the birth of universal consciousness.
Phonetic Hindi in English Sentences	4.386	4.622	Tumhe pata hai, mujhe abhi abhi ek funny video mila, dekho toh sahi!
Names	4.388	4.516	Indira
Mahabharat Story	4.208	4.466	Duryodhana was jealous of the Pandavas' abilities and popularity and made several attempts to eliminate them.
Hard Sentences	4.213	4.447	Joyful jaguars joyfully jumped joyful joyful jumps
Long Sentences	3.801	3.273	कृपया ध्यान दें कि आपके कार्यों का न केवल आपके व्यक्तिगत जीवन पर, बल्कि समाज पर भी गहरा प्रभाव पड़ता है। इसलिए, हमेशा सही और नैतिक कार्य करने का प्रयास करें, जिससे सभी को लाभ हो।
English in Hindi Sentences	4.361	3.708	मुझे तुम्हारी बात समझ आई, but अभी भी कुछ doubts हैं।
Hindi in English Sentences	4.332	4.699	You should call him, warna phir he will be mad.

Generate Audios for Smallest.ai and Cartesia

import os
import time
from cartesia import Cartesia
import pandas as pd
import re
import requests


SMALLEST_API = ""
CARTESIA_API_KEY  = ""


client = Cartesia(api_key=CARTESIA_API_KEY)


# Read the test CSV, which has the following structure
# (column names are lowercase, as used in the code below)
#--|---------------------|-|-----------------------|--
#--|        sentence     |-|      category         |--
#--|---------------------|-|-----------------------|--
#--|      Sentence text. |-| Category of sentences |--
#--|---------------------|-|-----------------------|--


df = pd.read_csv("tts_test_new.csv")


def generate_audio_cartesia(text, filename, language=True):
   """Generate audio with Cartesia; language=True selects the Hindi voice."""
   # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/tts/bytes
   output_format = {
       "container": "wav",
       "encoding": "pcm_f32le",
       "sample_rate": 44100,
   }

   if language:
       data = client.tts.bytes(
           model_id="sonic",
           transcript=text,
           voice_id="faf0731e-dfb9-4cfc-8119-259a79b27e12",  # Apoorva
           output_format=output_format,
           language="hi"
       )
   else:
       data = client.tts.bytes(
           model_id="sonic",
           transcript=text,
           voice_id="729651dc-c6c3-4ee5-97fa-350da1f88600",
           output_format=output_format,
       )

   # Save the returned audio bytes to disk
   with open(filename, "wb") as f:
       f.write(data)


def generate_audio_smallest(text, filename):
   url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
   ## Edit the header - enter Token
   headers = {
       "Authorization": f"Bearer {SMALLEST_API}",
       "Content-Type": "application/json"
   }
   payload = {
       "text": text,
       "voice_id": "arnav",
       "sample_rate": 24000,
       "speed": 1.0,
       "add_wav_header": True
   }
   response = requests.post(url, json=payload, headers=headers)
   if response.status_code == 200:
       with open(filename, 'wb') as wav_file:
           wav_file.write(response.content)
       print(f"Audio file saved as {filename}")
   else:
       print(f"Error occurred with status code {response.status_code}: {response.text}")


def is_hindi(text):
   return bool(re.search(r'[\u0900-\u097F]', text))


os.makedirs("audio_samples", exist_ok=True)


# Iterate over each sentence in the dataframe and generate audio with both providers
for index, row in df.iterrows():
   text = row['sentence']
   category = row['category']

   os.makedirs(f'audio_samples/smallest/{category}', exist_ok=True)
   os.makedirs(f'audio_samples/cartesia/{category}', exist_ok=True)

   smallest_filename = f"audio_samples/smallest/{category}/sentence_{index}.wav"
   cartesia_filename = f"audio_samples/cartesia/{category}/sentence_{index}.wav"

   # Use the Hindi voice for Cartesia when the text contains Devanagari characters
   generate_audio_cartesia(text, cartesia_filename, language=is_hindi(text))
   generate_audio_smallest(text, smallest_filename)

UTMOS is an open-source benchmark that leverages audio features to predict MOS based on the audio quality and naturalness of speech. Below is the code used to generate MOS scores for Smallest.ai and Cartesia using UTMOSv2:

'''
Installation:


sudo apt install git-lfs


GIT_LFS_SKIP_SMUDGE=1 pip install git+https://github.com/sarulab-speech/UTMOSv2.git


'''


import csv
import os

import pandas as pd
import utmosv2


model = utmosv2.create_model(pretrained=True)


tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()
providers = ["cartesia", 'smallest']

os.makedirs('results/utmos', exist_ok=True)


# Evaluate the WAV files of both providers, one category at a time
for category in categories:
   for provider in providers:
       # Directory written by the audio generation script above
       input_dir = f"audio_samples/{provider}/{category}"
       output_csv = f"results/utmos/{provider}_{category}.csv"
       print(output_csv)
       print(input_dir)
       # Skip provider/category pairs that were already scored
       if not os.path.exists(output_csv):
           mos = model.predict(input_dir=input_dir)
           with open(output_csv, mode='w', newline='') as file:
               writer = csv.DictWriter(file, fieldnames=['file_path', 'predicted_mos'])
               writer.writeheader()
               writer.writerows(mos)

           print(f"Data has been written to {output_csv}")

WVMOS is an open-source benchmark that uses a pretrained Wav2Vec2 model to extract features and predict MOS scores. Below is the code used to generate MOS scores using the WVMOS library:

'''
!pip3 install git+https://github.com/AndreevP/wvmos
'''


import os
from pathlib import Path

import pandas as pd
from tqdm import tqdm
from wvmos import get_wvmos


# Set cuda=False to run on CPU
mos_model = get_wvmos(cuda=True)


tts_test_df = pd.read_csv('tts_test_new.csv')
categories = tts_test_df['category'].unique()


def evaluate_directory(directory, mos_model, extension):
   """Evaluate all audio files in a directory."""
   results = []
   dir_path = Path(directory)
   audio_files = sorted(dir_path.glob(f"*.{extension}"))


   # For WAV files, calculate directly
   for audio_path in tqdm(audio_files, desc=f"Evaluating {directory}"):
       try:
           score = float(mos_model.calculate_one(str(audio_path)))
           results.append({
               'file': audio_path.name,
               'provider': directory.split('/')[-2],
               'category': directory.split('/')[-1],
               'mos_score': score
           })
       except Exception as e:
           print(f"Error calculating MOS for {audio_path.name}: {str(e)}")
           results.append({
               'file': audio_path.name,
               'provider': directory.split('/')[-2],
               'category': directory.split('/')[-1],
               'mos_score': None
           })


   return results




# Evaluate the WAV files of both providers for each category
for category in categories:
   if not os.path.exists(f'results/wvmos/mos_summary_{category}.csv'):
       all_results = []
       if os.path.exists(f'audio_samples/smallest/{category}'):
           results = evaluate_directory(f'audio_samples/smallest/{category}', mos_model, 'wav')
           all_results.extend(results)


       if os.path.exists(f'audio_samples/cartesia/{category}'):
           results = evaluate_directory(f'audio_samples/cartesia/{category}', mos_model, 'wav')
           all_results.extend(results)


       if not all_results:
           print("No audio files found in generated directories!")
       else:
           # Create results DataFrame
           results_df = pd.DataFrame(all_results)


           # Calculate summary statistics
           summary = results_df.groupby('provider')['mos_score'].agg([
               'count', 'mean', 'std', 'min', 'max'
           ]).round(3)


           # Save results
           os.makedirs('results', exist_ok=True)
           os.makedirs('results/wvmos', exist_ok=True)
           results_df.to_csv(f'results/wvmos/detailed_mos_scores_{category}.csv', index=False)
           summary.to_csv(f'results/wvmos/mos_summary_{category}.csv')


           # Print summary
           print("\nMOS Score Summary:")
           print(summary)


           # Print top/bottom samples (only include text if available)
           print("\nTop 1 Best Samples:")
           columns_to_show = ['provider', 'file', 'mos_score']
           if 'text' in results_df.columns:
               columns_to_show.insert(2, 'text')
           print(results_df.nlargest(1, 'mos_score')[columns_to_show])


           print("\nBottom 1 Samples:")
           print(results_df.nsmallest(1, 'mos_score')[columns_to_show])

Below is the code used to generate a combined report using both MOS scores for each category

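import os

import pandas as pd


# Same categories and providers as in the scoring scripts above
categories = pd.read_csv('tts_test_new.csv')['category'].unique()
providers = ["cartesia", "smallest"]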


# Function to read UTMOS results
def read_utmos_results(file_path, category):
   df = pd.read_csv(file_path)
   df['provider'] = file_path.split('/')[-1].split('_')[0]
   df['category'] = category
   return df


# Function to read WVMOS results
def read_wvmos_results(file_path, category):
   df = pd.read_csv(file_path)
   df['category'] = category
   return df


# Read UTMOS results
utmos_results = []
for category in categories:
   for provider in providers:
       file_path = f'results/utmos/{provider}_{category}.csv'
       if os.path.exists(file_path):
           utmos_results.append(read_utmos_results(file_path, category))


utmos_df = pd.concat(utmos_results, ignore_index=True)
# Calculate mean MOS scores for UTMOS
utmos_mean_scores = utmos_df.groupby(['provider', 'category'])['predicted_mos'].mean().reset_index()
utmos_mean_scores


# Read WVMOS results
wvmos_results = []
for category in categories:
   file_path = f'results/wvmos/mos_summary_{category}.csv'
   if os.path.exists(file_path):
       wvmos_results.append(read_wvmos_results(file_path, category))


wvmos_df = pd.concat(wvmos_results, ignore_index=True)
wvmos_df
# Merge UTMOS and WVMOS results on provider and category
comparison_df = pd.merge(utmos_mean_scores, wvmos_df, on=['provider', 'category'])


# Average the UTMOS mean ('predicted_mos') and the WVMOS mean ('mean') for each row
comparison_df['mean_mos'] = (comparison_df['predicted_mos'] + comparison_df['mean']) / 2


# Print the comparison
print(comparison_df)


# Create a CSV with the mean MOS for Cartesia and Smallest.ai for each category
mean_mos_comparison = comparison_df.pivot(index='category', columns='provider', values='mean_mos')
mean_mos_comparison.to_csv('mean_mos_comparison.csv')

Key Insights:

- Acronyms: Smallest.ai scores significantly higher (4.517 vs 3.873).

- Mixed Language Handling: Smallest.ai leads with a score of 4.543, compared to Cartesia’s 3.827.

- Complex Sentences & Numbers: Smallest.ai delivers superior quality, especially with numbers (4.413 vs 3.445) and date/time sentences (4.132 vs 3.577).

- Punctuation & Clarity: Smallest.ai consistently performs better in categories requiring clarity, scoring 4.646 in punctuation sentences.

As seen above, Smallest.ai outperforms Cartesia in 14 of 20 categories, with an average MOS score of 4.33 versus 4.06 for Cartesia.

Conclusion: Smallest.ai vs Cartesia

The benchmark results show that Smallest.ai outperforms Cartesia in both full-audio latency and overall audio quality.

While Cartesia and Smallest.ai have similar latencies for real-time APIs, Cartesia is at least 3x slower for full audio generation.

In addition, Smallest.ai outperforms Cartesia in most MOS categories, with an average MOS score 0.27 points higher. Cartesia is slightly better at pronouncing proper nouns, very hard sentences, and Hindi in English sentences, but it is significantly worse in categories such as sentences with numbers, date-time sentences, and long sentences.

If you're looking for a TTS solution that does not compromise on quality while maintaining extremely low latencies, Smallest.ai is your undisputed champion in the voice AI arena. Speed, quality, and linguistic finesse: we're firing on all fronts!

Mon Feb 10 2025 • 13 min Read

Akshat Mandloi