As frontier voice AI systems converge on baseline quality, the next generation of evals must move beyond isolated clips to measure what users actually experience: turn-taking, interruption handling, role fit, and perceived presence in conversation.

Smallest Text to Speech Team

Text-to-Speech evaluations are fundamentally flawed.
For years, the speech community has obsessed over improving TTS models across a familiar set of parameters: better prosody, richer emotion, cleaner audio, more natural pacing. Architectures evolved, datasets scaled, and neural vocoders reached astonishing realism. But while the field has made a huge leap forward, evaluation has not.
The uncomfortable truth is that we don't actually know how to reliably measure how natural synthetic speech sounds. Which means most current TTS benchmarks are, at best, weak proxies and, at worst, misleading.
Today, frontier voice systems have converged to a level of realism where legacy benchmarks can no longer distinguish between them in any meaningful way. The industry keeps searching for a single score to rank TTS systems, but the assumption that one number can capture quality is fundamentally flawed. Natural speech is not one-dimensional. It is a layered interaction of timing, pitch, micro-prosody, emotion, breath, phrasing, and contextual fit. Two systems can score identically on overall naturalness while sounding completely different where it actually matters. And once speech enters real interaction, where listening, reacting, and adapting to the current context define how convincing a voice is, a single naturalness score becomes almost meaningless.
Until evaluation reflects that complexity, the field will keep over-optimizing on the same parameters, and benchmarks will keep producing confident numbers that do little to improve the end user experience. So what does better evaluation actually look like?
In this article, we make three contributions toward answering that question. First, we show that MOS, LLM-as-a-judge, and win rate, the three dominant metrics in the field today, do not correlate strongly with customer preferences for conversational voices in voice agents. Second, we show how defining an extremely specific persona for judging voices dramatically improves an evaluation's ability to measure what humans actually consider natural. Third, we introduce Lightning V3, a text-to-speech model designed for conversational voice agents, with state-of-the-art naturalness across inbound and outbound contact center use cases.
Issues with MOS - Naturalness in voices is multi-dimensional. So why do we measure it like it isn’t?
Here's the issue in one example. Take two voices: one scores 3.5 out of 5 on MOS (Mean Opinion Score), the industry's standard for TTS quality. The other scores 3.17. The gap looks decisive on paper, yet the single number tells you nothing about where the two voices actually differ: pacing, pitch contour, emotion, or phrasing.
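Part of the problem is purely statistical. A MOS is just an average of 1-to-5 ratings from a panel of listeners, and with realistic panel sizes the uncertainty around that average can swallow a 0.3-point gap. Below is a minimal sketch; the rating panels are invented for illustration, not real study data.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean Opinion Score with a t-based confidence interval.

    `ratings` is a list of 1-5 scores from individual listeners.
    """
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    sem = stats.sem(ratings)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, (mean - half_width, mean + half_width)

# Illustrative ratings only -- not real study data.
voice_a = [4, 4, 3, 5, 3, 4, 3, 2, 4, 3]   # averages to 3.5
voice_b = [3, 4, 2, 4, 3, 3, 2, 4, 3, 4]   # averages to 3.2

for name, ratings in [("voice_a", voice_a), ("voice_b", voice_b)]:
    mean, ci = mos_with_ci(ratings)
    print(f"{name}: MOS={mean:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")

# With small panels the intervals overlap heavily, so a gap of a few
# tenths says little -- and nothing about *which* dimension differs.
```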
Issues with LLM as a Judge - Are we evaluating emotional display as a proxy for naturalness?
To reduce the subjectivity with human evaluation, the field has turned to LLMs as judges. That substitution, it turns out, introduces its own problems.
When we use LLMs as judges, we’re implicitly assuming three things:
The model can infer naturalness from text or metadata.
The model has stable internal criteria.
The model’s judgments correlate with human perception.
In practice, none of these assumptions hold.
We evaluated multiple TTS systems using Gemini 2.5 Pro as an LLM judge, scoring models across conversation quality and expressiveness. On paper, this looked rigorous: structured, quantitative, and, most importantly, comparable.
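To make concrete what such a pipeline usually amounts to, here is a minimal sketch of the judging loop. The rubric text and the call_llm_judge placeholder are assumptions for illustration, not the actual prompt or client we used.

```python
import json

# Hypothetical rubric -- the real prompt is not reproduced here.
JUDGE_PROMPT = """You are rating a text-to-speech sample for a voice agent.
Transcript: {transcript}
Metadata: {metadata}

Score each dimension from 1 (poor) to 5 (excellent) and reply as JSON:
{{"conversation_quality": <int>, "expressiveness": <int>}}"""

def call_llm_judge(prompt: str) -> str:
    """Placeholder for a call to the judging model (e.g. Gemini 2.5 Pro).
    Wire this up to whatever client or SDK you actually use."""
    raise NotImplementedError

def judge_sample(transcript: str, metadata: dict) -> dict:
    raw = call_llm_judge(JUDGE_PROMPT.format(transcript=transcript,
                                             metadata=json.dumps(metadata)))
    return json.loads(raw)

# Averaging these per-dimension scores across samples yields a tidy,
# comparable number per system -- and says nothing about how the audio
# actually sounds.
```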
ElevenLabs v3 averaged 4.26 out of 5 on emotion and conversation quality. Lightning v3 averaged 3.88. By the numbers, ElevenLabs wins cleanly.
However, in a blind listening test, Lightning v3 sounded noticeably more natural. ElevenLabs showed stronger measurable traits: more expressive emotion, tighter prosody, and cleaner pronunciation. And yet it did not feel natural. There is a gap between the way humans talk while feeling an emotion and the way a TTS model tries to emulate that emotion today.
That is a serious evaluation failure.
The problem does not lie in how the evaluation categories were defined. Even with well-defined dimensions like prosody, intonation, and emotional behavior, we were still asking a language model to evaluate something fundamentally perceptual. As seen in the samples above, a consistently excited tone can feel unnatural. Excitement is most effective when used selectively, at the right moment, rather than throughout.
Naturalness is not captured by asking whether a voice has emotion or pronounces correctly. It lives in the micro-details: subtle timing variations, micro-pauses, breath placement, spectral smoothness, transitions between phonemes, and the small imperfections that the brain reads as human. These are things you notice instantly when you hear them.
Issues with Win Rate - Comparative evaluation and what it can’t tell you
Pairwise preference testing takes a different approach: instead of rating a sample in isolation, human listeners directly compare two samples and choose which one sounds better. This sidesteps a lot of the calibration noise that plagues scalar scoring.
The limitation is that it tells you who won, not why. You learn that one system is preferred, but you gain no diagnostic insight into what drove that preference or how to close the gap.
Listen to these pairs of audio samples. For each pair, ask yourself which one you prefer, and by how much. Chances are the difference is smaller than you expected. You are not measuring quality anymore. You are measuring margin.
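To make the margin point concrete, here is a minimal sketch of how a win rate and its uncertainty might be computed; the preference counts are invented for illustration.

```python
from math import sqrt

def win_rate_with_wilson(wins: int, total: int, z: float = 1.96):
    """Win rate plus a 95% Wilson score interval for a pairwise test."""
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return p, (center - half, center + half)

# Illustrative numbers: 56 of 100 listeners preferred system A.
rate, ci = win_rate_with_wilson(wins=56, total=100)
print(f"win rate = {rate:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# The interval straddles 0.5: "A wins" hides how thin the margin is,
# and says nothing about *why* listeners preferred it.
```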
Every evaluation method runs into the same wall. The better our models get, the more precise and diagnostic our evaluation needs to become.
The methodologies discussed above also fail to predict what a voice sounds like in a live interaction.
There is a difference between how a voice performs on a static, scripted sample and how it holds up in a real, unfolding conversation.
Conversational generation is the hardest thing a TTS model has to do. Unlike narration or dictation, there's no complete script to work from. The audio is synthesized in real time, emotions shift mid-sentence, and the model is always operating on incomplete information, committing to each chunk of audio before it knows what comes next.
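As a rough illustration of that constraint, here is a minimal sketch of a streaming loop, with a hypothetical synthesize_chunk standing in for the actual model call; each emitted chunk is final the moment it is played.

```python
from typing import Iterator

def synthesize_chunk(text_so_far: str, audio_so_far: bytes) -> bytes:
    """Placeholder for one step of a streaming TTS model: it can only
    condition on the text received *so far* and the audio already emitted."""
    raise NotImplementedError

def stream_tts(token_stream: Iterator[str]) -> Iterator[bytes]:
    text, audio = "", b""
    for token in token_stream:       # e.g. tokens arriving from an LLM
        text += token
        chunk = synthesize_chunk(text, audio)
        audio += chunk
        yield chunk                  # committed: cannot be revised once played
```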
In an independent win-rate benchmark, Magnus significantly outperformed Daniel. However, in informal listening tests of real conversational deployments, Daniel sounded considerably more natural.
Persona Fit: The Dimension Nobody Is Measuring
A voice isn't just a delivery mechanism. In live interaction, it is the agent's identity. Listeners don't hear "a pleasant voice reading text." They hear a character, and the preference for one voice over another depends heavily on the context of what is being said.
When a voice and persona are misaligned, it creates a specific kind of dissonance that listeners feel but struggle to name.
The second voice sample here is uncomfortable. Not because it is unnatural, but because it conflicts with our mental model of what a hypothetical Tesla salesman should sound like. The first voice, Kyle, aligns more closely with that expectation.
Consider what a persona actually encodes:
Authority vs. approachability: A clinical support agent needs a voice that signals competence. A wellness bot needs one that signals warmth. The same voice cannot do both with equal success.
Pacing as personality: A slow, deliberate voice on a fast transactional assistant doesn't just feel inefficient; it feels like the agent has the wrong temperament for the job.
Emotional register: A voice with uplift and brightness on a bereavement support line doesn't sound natural; it sounds indifferent to the weight of the conversation.
The critical insight is that naturalness is not a property of the voice; it is a property of the voice-in-role.
In a live exchange, the interaction is dynamic. A voice that tests well on static sentences may fracture when it has to carry emotional transitions. A persona-coherent voice can navigate those shifts because it has an implied character that listeners have already accepted.
The voice, in other words, has to be believable, not just legible.
The concept worth anchoring here is communicative adequacy: not whether a voice sounds human, but whether it sounds like the right human for a specific job. A calm healthcare agent and a measured financial advisor are not doing the same communicative work. The emotional register each needs to hold, the conversational pressure each will face, and the trust each needs to build differ because the goals differ. A benchmark that rewards proximity to a population-average naturalness score will systematically penalize the deviations that make a voice appropriate for its context.
There is already evidence that this distinction is real. When listeners are asked "how natural is this?" versus "how appropriate is this here?", they give materially different scores for the same sample.
A voice has to be judged in context, for whether it carries the right social signal, fits the persona it inhabits, and feels believable in the moment it was built for.
Evaluating voices for conversation with knowledge-based systems
The other contention with existing evals is that LLMs have no internal model of what makes a voice sound natural. What they do instead is pattern matching: recognizing that certain descriptions of speech tend to go with certain scores and replicating that. The result is that an LLM can confidently score a flawed voice, invent emotional qualities that aren't there, or shift its judgment based on how the question is phrased. That is not evaluation. It is sophisticated guessing.
The first step is to push existing benchmarks further. MOS, pairwise preference, and LLM scoring do not need to be discarded. However, instead of just asking how natural a voice sounds overall, we need to probe where exactly it breaks down. Is it the pacing? The pitch contour? The breath placement? Evaluation has to become more specific about what it is measuring before it can become useful for improvement.
That specificity requires a different foundation. Edward Feigenbaum argued for knowledge-based systems: programs built on explicit rules defined by human experts rather than pattern matching. Applied to TTS, this means building evaluation from speech science directly. Pitch contour continuity, pause distribution, phoneme transition smoothness. A score that reads "pitch variability within human normative band: 92%" means something actionable. A score that reads "naturalness: 4.2", not so much.
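As a sketch of what a knowledge-based check could look like in practice, the snippet below estimates two such quantities with standard signal-processing tools (librosa); the pitch band, f0 search range, and silence threshold are illustrative assumptions, not the criteria behind any published score.

```python
import numpy as np
import librosa

# Illustrative f0 search range for conversational speech -- actual
# thresholds should come from speech-science references.
F0_MIN_HZ, F0_MAX_HZ = 65.0, 400.0

def pitch_in_band_ratio(wav_path: str,
                        band_hz: tuple[float, float] = (85.0, 255.0)) -> float:
    """Fraction of voiced frames whose f0 falls inside a normative band."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=F0_MIN_HZ, fmax=F0_MAX_HZ, sr=sr)
    voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
    if voiced_f0.size == 0:
        return 0.0
    lo, hi = band_hz
    return float(np.mean((voiced_f0 >= lo) & (voiced_f0 <= hi)))

def pause_ratio(wav_path: str, top_db: float = 30.0) -> float:
    """Fraction of the clip spent in silence, via energy-based splitting."""
    y, sr = librosa.load(wav_path, sr=None)
    voiced_intervals = librosa.effects.split(y, top_db=top_db)
    voiced_samples = sum(end - start for start, end in voiced_intervals)
    return 1.0 - voiced_samples / len(y)

# report = {"pitch_in_band": pitch_in_band_ratio("sample.wav"),
#           "pause_ratio": pause_ratio("sample.wav")}
```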
Lightning V3 is built around a simple conviction: a voice isn't good or bad in the abstract. It's right or wrong for the role it's playing. That means designing for persona fit, for conversational texture, for the moments where a voice has to carry emotional weight mid-sentence without falling apart. That's what we've been building toward.
The final leap in voice quality won't come from better architectures alone. It'll come from measuring the right things. Getting benchmarks right isn't a research footnote; it's the whole game. Until evaluation catches up to what we're actually trying to build, the field will keep chasing scores that don't mean what we think they mean.
That is the problem we are solving at smallest.ai.

Lightning V3 is our most natural-sounding TTS model
Now available on a pay-as-you-go plan.


