Learn how to evaluate voice agent accuracy using WER, latency, task success, TTS quality, and conversational intelligence metrics.

Prithvi Bharadwaj

When a voice agent misunderstands a customer, it’s more than a technical error; it’s a frustrating experience that can damage trust. As voice becomes a primary interface for users, the performance of your agent defines their perception of your brand. A comprehensive assessment requires a multi-layered approach. You must measure not only the accuracy of the transcription but also the agent's ability to understand intent, handle conversational turns, and successfully complete the user's requested task. This detailed analysis helps pinpoint specific weaknesses in the system, from acoustic model errors to flaws in dialogue management logic.
At Smallest.ai, we've powered millions of voice interactions and learned that a simple pass/fail test isn't enough. To properly evaluate voice agents, you have to break down the entire conversation, from the first word spoken to the final result, and check every component.
This framework is built for developers, product managers, and anyone responsible for deploying conversational AI that works. We will cover the technical metrics that engineers need to track alongside the business-focused KPIs that matter to stakeholders.
Here’s the step-by-step process we’ll follow:
Define Your Evaluation Framework and Goals.
Assess Core Speech-to-Text (STT) Performance.
Measure Text-to-Speech (TTS) and Latency.
Analyze Conversational Intelligence and Task Completion.
Evaluate Advanced Voice-Specific Capabilities.
Synthesize Metrics and Iterate on Performance.
Step 1: Define Your Evaluation Framework and Goals
Before you start measuring, you need a clear idea of what you’re measuring and why. A solid evaluation starts with a framework that fits your specific use case. For instance, an agent handling quick banking transactions has different accuracy needs than one built for in-depth technical support. Your first job is to define what “accurate” means for both your users and your business goals.
Start by mapping out the most important user journeys. What are the top 3-5 tasks your users must be able to do without a hitch? For an e-commerce bot, that might be tracking an order, starting a return, or checking if a product is in stock. For a healthcare agent, it could be scheduling an appointment or refilling a prescription. These critical paths become the basis for your test cases.
With your key journeys defined, you can set your key performance indicators (KPIs). These should be a mix of technical stats and business outcomes. On the technical side, you’ll look at things like transcription errors and response speed. For the business side, you’ll focus on results like task success rate and customer satisfaction. This combination is essential because a technically flawless agent that can’t solve user problems is, for all practical purposes, a failure. Understanding how voice agents differ from chatbots is important here, as the success metrics are not the same due to the nature of spoken language.
Your initial framework should document:
Primary Use Cases: The most critical tasks the agent is expected to handle.
Success Criteria: A clear definition of what a successful interaction looks like for each use case (e.g., the order is tracked, the appointment is booked).
Key Metrics: A specific list of the metrics you plan to track (which we'll detail below).
Data Collection Plan: Your strategy for gathering test data. Will you use existing call logs, create synthetic data, or bring in live testers?
Step 2: Assess Core Speech-to-Text (STT) Performance
The Speech-to-Text (STT) engine, often called Automatic Speech Recognition (ASR), acts as the ears of your voice agent. If it fails to accurately transcribe what a user says, every other part of the conversation rests on a weak foundation. Checking STT accuracy is a fundamental requirement.
Word Error Rate (WER)
The industry standard for this is the Word Error Rate (WER). As Gladia noted in 2024, WER measures the percentage of errors in a transcript. It works by adding up the substitutions (S), deletions (D), and insertions (I) of words, then dividing that sum by the total number of words in the correct reference transcript (N).
The formula is: WER = (S + D + I) / N
For instance, if a user said, “I want to track my package,” and the STT heard, “I want track my package,” that’s one deletion. With 6 words in the original phrase, the WER is 1/6, or 16.7%. Most enterprise applications should aim for a WER below 5% in production environments (Hamming AI, 2026). This metric is heavily influenced by your choice from the best speech-to-text APIs.
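To make the formula concrete, here is a minimal word-level WER implementation using dynamic-programming edit distance. It is a sketch for illustration; in practice you would likely reach for a library such as jiwer, which computes the same metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + sub_cost)   # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# The example from the article: one deletion out of six reference words
wer = word_error_rate("I want to track my package", "I want track my package")
print(f"WER: {wer:.1%}")  # → WER: 16.7%
```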
Character Error Rate (CER)
But WER isn't always the full story. For languages with different structures than English, or in cases with many proper nouns, Character Error Rate (CER) can provide a better signal. CER uses the same logic but counts errors at the character level. A 2023 paper in the ACL Anthology argues for using CER in morphologically rich languages, such as Turkish (agglutinative) or German (with its long compound words), where a single word can carry a lot of information. In those languages, one word error can inflate the WER score, while CER gives a more balanced view of performance. When you're evaluating multilingual capabilities, relying only on WER can be misleading.
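The same edit-distance machinery works at the character level. The sketch below uses a hypothetical German sentence (your actual test data will differ) to show how a single error inside a long compound word inflates WER while CER stays low:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over any two sequences (words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

ref = "Bitte die Krankenversicherungskarte bereithalten"
hyp = "Bitte die Krankenversicherungskarten bereithalten"  # one extra character

wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
cer = edit_distance(list(ref), list(hyp)) / len(ref)
print(f"WER: {wer:.1%}  CER: {cer:.1%}")  # one whole word "wrong" vs. one character
```

Here a single stray character produces a 25% WER but a CER of roughly 2%, which is a far fairer reflection of the transcription quality.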
Creating a Golden Dataset
To effectively measure STT accuracy:
Create a Golden Dataset: Gather a representative collection of audio recordings (your 'test set') and have them transcribed by humans. This perfect transcript becomes your ground truth.
Run the Audio Through Your STT: Process the entire test set with the STT engine you're evaluating.
Calculate WER/CER: Use an automated script or tool to compare the STT's output against your golden transcripts and calculate the error rates.
Segment Your Analysis: Don't stop at the overall WER. Break down the performance by different accents, noise levels, and acoustic environments. Does the agent struggle with calls from a noisy coffee shop? Does accuracy dip for non-native speakers? These details point you toward real improvements.
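One way to run the segmented analysis, assuming you have already computed per-utterance error counts against your golden transcripts (the numbers below are hypothetical), is to pool errors within each segment rather than averaging per-call WERs, which would over-weight short calls:

```python
from collections import defaultdict

# Hypothetical per-utterance results: (segment label, word errors, reference words)
results = [
    ("us_english",       2, 120),
    ("us_english",       1,  95),
    ("indian_english",   9, 110),
    ("indian_english",   7,  88),
    ("noisy_background", 14, 105),
]

# Pool errors per segment: total errors / total reference words
totals = defaultdict(lambda: [0, 0])
for segment, errors, n_words in results:
    totals[segment][0] += errors
    totals[segment][1] += n_words

for segment, (errors, n_words) in sorted(totals.items()):
    print(f"{segment:18s} WER: {errors / n_words:.1%}")
```

A breakdown like this is what turns an overall "3% WER" into an actionable finding such as "accuracy drops sharply on noisy calls."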
Step 3: Measure Text-to-Speech (TTS) and Latency
After your agent understands the user, it has to respond. The quality of its Text-to-Speech (TTS) voice and the speed of its reply are just as vital as its listening accuracy. A robotic voice or a long, awkward pause can make for a jarring experience and erode user trust.
Text-to-Speech (TTS) Quality
The main way to evaluate TTS quality is with the Mean Opinion Score (MOS). According to FutureBeeAI (2026), MOS is a subjective rating where human listeners score the naturalness, clarity, and overall quality of synthesized speech, usually on a 1-to-5 scale.
| Score | Quality | Description |
|---|---|---|
| 5 | Excellent | Indistinguishable from human speech. |
| 4 | Good | Natural and clear, with very minor imperfections. |
| 3 | Fair | Mostly understandable, but with noticeable robotic artifacts. |
| 2 | Poor | Difficult to understand; significant unnaturalness. |
| 1 | Bad | Completely unintelligible or extremely unpleasant. |
To run a MOS test, you need a set of sample sentences covering a variety of phonetic sounds, lengths, and emotional tones. A high-quality, human-like TTS engine should aim for a MOS of 4.0 or higher. You should also consider prosody (the rhythm, stress, and intonation of speech) and whether the agent can correctly pronounce brand-specific terms, which is a must for enterprise-level needs.
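Aggregating a MOS panel is straightforward; a minimal sketch (with hypothetical listener ratings) is shown below. Reporting a confidence interval alongside the mean helps you judge whether a 4.1-versus-4.0 difference between two voices is actually meaningful:

```python
import statistics

# Hypothetical panel: 12 listeners rate one TTS sample set on a 1-5 scale
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 3, 4, 4]

mos = statistics.mean(ratings)
# Normal-approximation 95% confidence interval; indicative only for a
# panel this small, but it shows the uncertainty around the mean
margin = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5

print(f"MOS: {mos:.2f} ± {margin:.2f} (target: 4.0+)")
```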
End-to-End Latency
Latency is the delay between a user finishing their sentence and the agent starting its reply. Long pauses make conversations feel unnatural and frustrating. In human conversation, we expect a response in about 300 milliseconds. Voice agents are slower, with a median of 1.4-1.7 seconds (Hamming AI, 2026), but closing that gap is essential.
Don't just look at the average response time. You need to analyze the entire latency pipeline and its distribution:
Time to First Audio (TTFA): This measures the time from the end of the user's speech to the first byte of the agent's audio response. It's the most important latency metric from a user's perspective.
Component Latency: Break down the delay by each part of the process: STT processing, NLU/LLM inference, and TTS synthesis. This helps you find the bottlenecks that need optimization.
Percentile Distribution: Measure latency at the 50th (P50), 95th (P95), and 99th (P99) percentiles. A good P50 doesn't mean much if your P95 is over 5 seconds, because that means 1 in 20 users are having a terrible experience. A good target for P95 latency is under 2.5 seconds.
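The percentile analysis can be sketched in a few lines. The samples below are synthetic, simulated to mimic a realistic heavy-tailed latency distribution:

```python
import random
import statistics

random.seed(7)
# Simulated time-to-first-audio samples (ms): mostly fast, with a heavy
# tail standing in for occasional slow LLM responses
samples = [random.gauss(1400, 250) for _ in range(950)]
samples += [random.gauss(4000, 800) for _ in range(50)]

# statistics.quantiles with n=100 returns the 99 percentile cut points
pcts = statistics.quantiles(samples, n=100)
p50, p95, p99 = pcts[49], pcts[94], pcts[98]

print(f"P50: {p50:.0f} ms | P95: {p95:.0f} ms | P99: {p99:.0f} ms")
if p95 > 2500:
    print("P95 exceeds the 2.5 s target: roughly 1 in 20 users waits too long")
```

Running this against real TTFA logs, segmented by component (STT, LLM, TTS), is what reveals which stage of the pipeline the tail latency comes from.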
Step 4: Analyze Conversational Intelligence and Task Completion
An agent can have perfect transcription and a pleasant voice but still be a complete failure if it can’t understand what the user wants, manage the conversation, and actually get things done. This part of the evaluation moves beyond the mechanics of speech and into the agent's 'brain': its Natural Language Understanding (NLU) and dialogue management.
At this stage, the metrics become less about technical precision and more about functional success. The most important KPI is the Task Success Rate (TSR), sometimes called Goal Completion Rate. This is a simple yes/no measure: did the user accomplish what they called for? If someone called to change a password and the agent helped them do it, that's a success. If they got frustrated, hung up, or were transferred to a human, that's a failure.
Another key business metric is First Call Resolution (FCR). This tracks the percentage of calls where the user's problem is solved completely by the automated system on the first try, with no human intervention needed. A high FCR is a strong sign of both agent accuracy and efficiency, and it has a direct impact on operational costs.
Other key metrics for conversational intelligence include:
Intent Recognition Accuracy: How often does the agent correctly figure out the user's goal? If a user says, “My bill seems wrong,” does the agent correctly classify the intent as ‘billing inquiry’?
Slot Filling Accuracy: For tasks that need several pieces of information, how well does the agent extract them? For a flight booking query like “I need a flight to Boston for two people next Tuesday,” the agent must pull out Destination: Boston, Passengers: 2, and Date: [date of next Tuesday]. One wrong slot can derail the whole task.
Containment Rate: What percentage of calls are handled entirely by the voice agent without needing to escalate to a person? This is closely tied to FCR and is a primary measure of ROI.
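Scoring intent and slot accuracy against a labeled test set can be sketched as follows; the intents, slot names, and values below are hypothetical placeholders for your own taxonomy:

```python
# Hypothetical evaluation records: (expected, predicted) for intents and slots
test_cases = [
    {"intent": ("billing_inquiry", "billing_inquiry"),
     "slots": {"account_type": ("checking", "checking")}},
    {"intent": ("book_flight", "book_flight"),
     "slots": {"destination": ("Boston", "Boston"),
               "passengers": ("2", "2"),
               "date": ("next Tuesday", "next Thursday")}},  # one wrong slot
    {"intent": ("cancel_order", "track_order"),              # misclassified intent
     "slots": {"order_id": ("A123", "A123")}},
]

intent_hits = sum(exp == pred for exp, pred in (c["intent"] for c in test_cases))
slot_pairs = [pair for c in test_cases for pair in c["slots"].values()]
slot_hits = sum(exp == pred for exp, pred in slot_pairs)

print(f"Intent recognition accuracy: {intent_hits}/{len(test_cases)}")
print(f"Slot filling accuracy:       {slot_hits}/{len(slot_pairs)}")
```

Note how the flight booking above counts as an intent success but a slot failure: exactly the "one wrong slot derails the task" case described in the list.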
To evaluate these metrics, you'll need to analyze conversation logs, user surveys, and business outcome data. It's a process with many facets that connects the agent's technical performance directly to its real-world business impact. The team at Softcery has a great overview of methods and tools for this kind of production-level testing.
Step 5: Evaluate Advanced Voice-Specific Capabilities
The best voice agents are masters of the unique challenges that come with spoken conversation. Checking these capabilities is what separates a good agent from a great one.
Interruption Handling (Barge-In)
Human conversations aren't always neat and tidy. People interrupt each other. An agent's ability to detect when a user is talking over it (barge-in) and politely stop to listen is crucial. You should measure both the accuracy of barge-in detection (aim for over 95%) and the response latency after an interruption (aim for under 200ms). If this fails, you'll have users shouting over the agent in frustration.
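Both barge-in metrics can be computed from an event log of interruptions. A sketch, with a hypothetical log format:

```python
# Hypothetical barge-in log: (user_started_speaking_ms, agent_stopped_ms, detected)
events = [
    (10_250, 10_390, True),
    (22_100, 22_260, True),
    (31_480, None,   False),   # missed barge-in: the agent kept talking
    (45_020, 45_170, True),
]

detected = [e for e in events if e[2]]
detection_accuracy = len(detected) / len(events)
stop_latencies = [stop - start for start, stop, _ in detected]
avg_stop_latency = sum(stop_latencies) / len(stop_latencies)

print(f"Barge-in detection accuracy: {detection_accuracy:.0%}")  # target: > 95%
print(f"Average stop latency: {avg_stop_latency:.0f} ms")        # target: < 200 ms
```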
Hallucination and Factual Accuracy
Modern agents that use Large Language Models (LLMs) can sometimes “hallucinate” or make up information. This is particularly risky in regulated fields like finance or healthcare. To check for this, create a test set of questions with known, verifiable answers. You can then measure the Hallucination and Unsafe Response (HUN) Rate by comparing the agent's answers to your ground truth. For critical facts, the acceptable HUN rate should be close to 0%.
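A simple exact-match sketch of a HUN-rate check is below. The questions and answers are hypothetical, and production systems typically use semantic matching or an LLM judge rather than string equality, but the structure of the check is the same:

```python
# Hypothetical ground-truth QA pairs for a factual-accuracy regression suite
ground_truth = {
    "What is the wire transfer cutoff time?": "5 pm Eastern",
    "Is there a fee for paper statements?": "Yes, $2 per statement",
}
agent_answers = {
    "What is the wire transfer cutoff time?": "5 pm Eastern",
    "Is there a fee for paper statements?": "No, paper statements are free",
}

failures = [q for q, expected in ground_truth.items()
            if agent_answers.get(q, "").strip().lower()
               != expected.strip().lower()]
hun_rate = len(failures) / len(ground_truth)

print(f"HUN rate: {hun_rate:.0%} ({len(failures)} of {len(ground_truth)} answers wrong)")
```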
Compliance and Data Security
For enterprise applications, agents must follow compliance rules. This isn't just a feature; it's a core performance metric. Your evaluation needs to include tests that confirm the agent correctly redacts Personally Identifiable Information (PII) from logs and transcripts, follows required disclosure scripts (like “This call is being recorded”), and complies with regulations such as HIPAA or PCI DSS when necessary.
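One piece of this evaluation can be automated: scanning stored transcripts for PII patterns that should never survive redaction. The sketch below uses a few illustrative regexes, not a complete PII taxonomy:

```python
import re

# Illustrative patterns only; real PII detection needs a much broader set
PII_PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "card":  re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b"),
}

def find_pii_leaks(transcript: str) -> list:
    """Return the names of any PII patterns found in a stored transcript."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(transcript)]

redacted = "My card number is [REDACTED] and my phone is [REDACTED]."
leaked = "My card number is 4111-1111-1111-1111 and my phone is 555-867-5309."

assert find_pii_leaks(redacted) == []      # clean transcript passes
print("leaks:", find_pii_leaks(leaked))    # unredacted transcript is flagged
```

A check like this belongs in your deployment pipeline so that a regression in redaction logic fails a build rather than leaking into production logs.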
Step 6: Synthesize Metrics and Iterate on Performance
The final step is to pull all your data together to get a complete view of your voice agent's performance. A single metric can be misleading. A low WER is great, but not if the Task Success Rate is also low. A high MOS for your TTS voice is a plus, but not if high latency makes the conversation feel slow and unnatural.
Create a dashboard or report that tracks your key metrics over time. This will help you spot trends, catch regressions after a new deployment, and see the impact of your improvements. For example, you might find that your WER is great overall but spikes for users with a certain regional accent. That insight tells you exactly where to focus your next data collection efforts.
Use this complete picture to build a prioritized backlog of improvements. Your analysis might show that the biggest cause of task failure isn't the STT, but a confusing dialogue flow where users don't understand the agent's questions. In that case, your priority should be redesigning that part of the conversation, not just tweaking the ASR model.
Evaluation isn't a one-time task; it's a continuous cycle. Whenever you deploy changes, you have to re-run your evaluations to see if they worked. This iterative loop of measuring, analyzing, and improving is the only way to build and maintain a truly accurate and effective voice agent. By regularly performing these checks, you ensure that your investment in Smallest.ai's voice agents continues to deliver an exceptional customer experience.
Common Pitfalls to Avoid
When you evaluate voice agents, it's easy to fall into a few common traps. Being aware of them can save you time and lead to more meaningful results.
Testing with Unrealistic Audio: If you only use clean, high-fidelity audio from a studio, you'll get an artificially low WER. Your test data needs to reflect the real world: background noise, bad connections, different accents, and casual language.
Focusing Only on WER: As we've covered, WER is important, but it's not everything. An agent can have a 0% WER on a sentence and still misunderstand the user's intent, causing the task to fail. Always balance component-level metrics with outcome-based KPIs like Task Success Rate.
Ignoring Latency: A slow agent feels unintelligent, regardless of how accurate its answers are. Measure the full end-to-end latency, from the moment the user stops talking to when the agent starts its reply. Long pauses kill conversations.
Using a Small or Biased Test Set: If your test data doesn't represent your actual user base, your results will be skewed. Make sure your dataset includes a diverse range of speakers, accents, and scenarios that your agent will face in the wild.
Neglecting Qualitative Feedback: Numbers on a dashboard are crucial, but they don't capture user frustration. Supplement your quantitative data with qualitative analysis. Listen to call recordings, read user feedback, and run user testing sessions to understand the 'why' behind your metrics.
Putting It All Together
A comprehensive evaluation framework is the first step toward building a high-performing voice agent. Smallest.ai provides the tools and expertise needed to measure transcription accuracy, intent recognition, and overall task success. Our platform helps you establish clear benchmarks and continuously monitor performance to deliver a superior customer experience.
Ready to improve your voice agent's performance? Contact our team for a personalized demo.
Remember, the ultimate measure of accuracy is user success and satisfaction. A truly accurate agent doesn't just get the words right; it gets the job done, creating a smooth and helpful experience that builds trust in your brand. For more insights on building and deploying advanced voice AI, be sure to explore our blog.