Pulse ASR ranked #1 on Artificial Analysis: Here’s what the numbers actually mean

Pulse ranked #1 for Speed Factor on Artificial Analysis. Learn what speech recognition metrics like AA-WER, Speed Factor, and streaming latency actually mean for real-world performance.

That's the headline. The more interesting question is what that result actually tells you.


Figure: Speed Factor


Is the fastest model also the best model? What does "fast" even mean for a speech-to-text system? Is it how quickly an API returns a transcript, how much audio it can process per second, or how responsive it feels in a real-time conversation? How should speed be weighed against accuracy, latency, and cost? And why do some models excel at batch transcription while others are better suited for real-time voice agents? The reality is that speech-to-text systems aren't optimized for a single objective.

As speech AI has expanded beyond transcription into voice agents, contact centers, meeting assistants, and real-time applications, evaluating speech models has become significantly more nuanced. A company transcribing millions of recorded calls cares about very different metrics than a team building a voice agent.

In this guide, we'll use the Artificial Analysis leaderboard as a case study to break down the major dimensions of speech recognition evaluation, from accuracy and latency to throughput and pricing, and explain what these metrics actually tell you about a model's real-world performance.

How Artificial Analysis Evaluates Speech-to-Text Models

Understanding the Benchmark Dataset

Artificial Analysis evaluates models on approximately 8 hours of audio drawn from three datasets:

Dataset                  Weight
AA-AgentTalk             50%
VoxPopuli-Cleaned-AA     25%
Earnings22-Cleaned-AA    25%


This weighting is used for both batch and streaming benchmarks.

AA-AgentTalk (50%)

Artificial Analysis' proprietary dataset focused on speech relevant to voice agent use cases. This dataset represents realistic conversational speech and modern voice-AI interactions.

VoxPopuli-Cleaned-AA (25%)

English parliamentary proceedings containing diverse accents, speaking styles, and recording conditions. This tests robustness across speakers rather than optimization for a specific domain.

Earnings22-Cleaned-AA (25%)

Corporate earnings calls containing financial terminology, company names, technical vocabulary, multiple accents, and long-form speech. This dataset is particularly valuable for enterprise workloads.

Part I: Reading the Batch Leaderboard

The batch leaderboard evaluates a straightforward workflow: an audio file is uploaded, the entire recording is available upfront, and the model returns a transcript.

This benchmark is most relevant for call recordings, meeting transcription, podcast processing, video captioning, and compliance and archival workloads.

Because the full audio is available before transcription begins, the primary tradeoffs are accuracy, throughput, consistency, and cost.

AA-WER Index

The primary accuracy metric on the leaderboard is the Artificial Analysis Word Error Rate Index (AA-WER). Word Error Rate (WER) measures the percentage of words transcribed incorrectly relative to a reference transcript and is computed as:

WER = (Substitutions + Insertions + Deletions) ÷ Words in Reference


Example:

Reference: Pulse ranked first in speed factor on artificial analysis

Transcription: Pulse ranked best speed factor on the artificial analysis

Errors: substitution (first → best), deletion (in), insertion (the)


WER = 3 ÷ 9 ≈ 33.3%


Lower WER is better.
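The calculation above can be sketched in a few lines of Python. This is a minimal illustration using a word-level edit distance, not Artificial Analysis' scoring code: the function name `word_error_rate` is hypothetical, and the only text normalization applied here (lowercasing and splitting on whitespace) is an assumption. Production benchmarks typically normalize punctuation, numbers, and casing more carefully before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "Pulse ranked first in speed factor on artificial analysis"
hypothesis = "Pulse ranked best speed factor on the artificial analysis"
print(f"WER = {word_error_rate(reference, hypothesis):.1%}")  # WER = 33.3%
```

Because insertions count as errors but do not add to the denominator, WER can exceed 100% when a transcript contains many extra words.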

At a high level, WER answers a simple question: How often does the model get words wrong?

Because it is easy to calculate and compare across systems, WER has become the industry-standard metric for evaluating transcription accuracy.

However, WER is not a perfect measure of usefulness.

Not all errors are equally important. Mishearing "Boston" as "Austin" counts as a single substitution, but completely changes the meaning of a request. Meanwhile, omitting a filler word like "uh" may increase WER without materially affecting understanding.

This is one reason why Artificial Analysis also publishes performance on individual datasets rather than relying solely on a single aggregate score.

WER by Dataset


Figure: AA-WER (Non-streaming) by Individual Dataset


While AA-WER provides an aggregate score, Artificial Analysis also exposes WER for each benchmark dataset individually.

This breakdown helps answer questions that aggregate scores cannot.

For example:

  • Is performance consistent across domains?

  • Does the model struggle with particular speech types?

  • Is overall accuracy being driven by one exceptionally strong dataset?

Dataset-level evaluation helps separate general-purpose transcription quality from domain-specific strengths. A model that performs consistently across all datasets is generally more robust than one that achieves the same aggregate score through uneven performance.
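To see how two models can share an aggregate score while behaving very differently, consider the sketch below. It assumes the aggregate is a weighted average of per-dataset WER using the 50/25/25 weights listed earlier; the exact aggregation rule is an assumption, and the per-dataset numbers for the two hypothetical models are made up purely for illustration.

```python
# Dataset weights as listed in the benchmark description.
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def weighted_wer(per_dataset_wer: dict) -> float:
    """Weighted average of per-dataset WER (assumed aggregation rule)."""
    return sum(WEIGHTS[name] * per_dataset_wer[name] for name in WEIGHTS)


# Illustrative WER values in percent for two hypothetical models.
consistent = {
    "AA-AgentTalk": 4.0,
    "VoxPopuli-Cleaned-AA": 4.0,
    "Earnings22-Cleaned-AA": 4.0,
}
uneven = {
    "AA-AgentTalk": 2.0,
    "VoxPopuli-Cleaned-AA": 3.0,
    "Earnings22-Cleaned-AA": 9.0,
}

for label, scores in [("consistent", consistent), ("uneven", uneven)]:
    spread = max(scores.values()) - min(scores.values())
    print(f"{label}: aggregate={weighted_wer(scores):.2f}%, spread={spread:.1f} points")
```

Both hypothetical models land on the same aggregate, but the second one hides a much weaker result on earnings calls, which only the dataset-level view exposes.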


Speed Factor

This is the metric where Pulse ranks #1. Speed Factor measures throughput: how much audio a model can transcribe relative to the time required to process it.

Artificial Analysis defines Speed Factor as:

Speed Factor = Audio Duration ÷ Processing Time



For example:

  • 1x = real-time transcription

  • 10x = 10 seconds of audio processed per second

  • 100x = 100 seconds of audio processed per second

  • 1000x = 1000 seconds of audio processed per second


Speed Factor measures how quickly a system can process large volumes of audio. It does not measure conversational responsiveness, and it does not tell us how quickly a user receives a transcript after speaking.
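As a rough illustration of what throughput means in practice, the sketch below computes a Speed Factor from a single measurement and then estimates how long a large archive would take at that rate. The function names and the sample numbers are hypothetical, and the estimate assumes a single sequential stream with no concurrency, queuing, or upload overhead.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def hours_to_process(archive_hours: float, factor: float) -> float:
    """Wall-clock hours for one sequential stream at a given Speed Factor."""
    return archive_hours / factor


# Hypothetical measurement: a 10-minute file transcribed in 6 seconds.
sf = speed_factor(audio_seconds=600.0, processing_seconds=6.0)
print(f"Speed Factor: {sf:.0f}x")
print(f"10,000 hours of audio: about {hours_to_process(10_000, sf):.0f} hours of processing")
```

Note that nothing in this calculation says when the first word of a transcript becomes available, which is why throughput and latency have to be read separately.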

Speed Factor Variance


Figure: Speed Factor Variance


Average throughput only tells part of the story. Two providers may report identical Speed Factors while exhibiting dramatically different levels of consistency.

Speed Factor Variance measures how much throughput fluctuates across benchmark runs.

Conceptually:

Variance = (1 ÷ N) × Σ (SFᵢ − μ)²

Where:

SFᵢ is the Speed Factor of benchmark run i

μ is the mean Speed Factor across runs

N is the number of benchmark runs

Lower is better.

It measures how predictable the performance is. A provider that consistently delivers 500x throughput can be easier to operate than one that alternates between 100x and 1000x.
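The contrast between a steady provider and a swinging one can be made concrete with a short sketch. It uses the population variance from Python's standard library as a stand-in; whether the leaderboard uses population variance, sample variance, or another dispersion measure is an assumption here, and the run values are illustrative.

```python
from statistics import mean, pvariance


def summarize(runs: list) -> tuple:
    """Return (mean Speed Factor, population variance) for a list of runs."""
    mu = mean(runs)
    return mu, pvariance(runs, mu)


# Illustrative Speed Factor measurements for two hypothetical providers.
steady = [500, 500, 500, 500]
swinging = [100, 1000, 100, 1000]

for label, runs in [("steady", steady), ("swinging", swinging)]:
    mu, var = summarize(runs)
    print(f"{label}: mean={mu:.0f}x, variance={var:.0f}, std dev={var ** 0.5:.0f}x")
```

The swinging provider has a slightly higher mean, yet its throughput on any given run can differ from that mean by hundreds of multiples of real time, which makes capacity planning harder.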


Speed Factor Over Time


Figure: Speed Factor Over Time


The Speed Factor Over Time chart tracks throughput across repeated benchmark runs over an extended period. Points shown in Speed Factor Over Time are the median of four randomized measurements taken within a 24-hour period.
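A small sketch shows why a median of four measurements is a sensible choice for each plotted point. The measurement values below are hypothetical; the only detail taken from the benchmark description is that each point is the median of four measurements from a 24-hour window.

```python
from statistics import mean, median

# Hypothetical raw Speed Factor measurements, four per 24-hour window.
measurements = {
    "day 1": [480, 510, 495, 505],
    "day 2": [470, 500, 120, 490],  # one unusually slow run
}

# One plotted point per day: the median of that day's four measurements.
daily_points = {day: median(runs) for day, runs in measurements.items()}

for day, runs in measurements.items():
    print(f"{day}: median={daily_points[day]:.0f}x, mean={mean(runs):.0f}x")
```

A single slow run drags the mean down sharply but barely moves the median, so the plotted line reflects typical behavior rather than one-off outliers.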

Unlike the previous metrics, this chart does not attempt to compute a single score. Instead, it visualizes operational behavior. It can reveal infrastructure instability, capacity bottlenecks, throughput regressions, and operational consistency.

For enterprise deployments, long-term stability can be just as important as peak benchmark performance.

Taken together, these metrics provide a comprehensive picture of batch transcription performance. AA-WER measures transcript quality. Dataset-level WER reveals where that quality comes from. Speed Factor measures throughput. Variance measures consistency.

The streaming leaderboard evaluates a different challenge altogether: how quickly and accurately a model can understand speech while a conversation is still happening.

Part II: Reading the Streaming Leaderboard

If the batch leaderboard measures how efficiently a model can transcribe completed recordings, the streaming leaderboard measures how effectively a model can understand and transcribe speech as it arrives.

Real-time applications operate under a different set of constraints than offline transcription. A company processing recorded calls can tolerate a few extra seconds of latency if accuracy is high. A voice agent cannot. Every additional millisecond between a user finishing a sentence and the system responding affects the perceived responsiveness of the conversation.

As a result, streaming speech-to-text systems must balance three competing objectives: accuracy, responsiveness, and transcript stability. The Artificial Analysis streaming benchmark is designed to measure those tradeoffs.

One of the defining characteristics of the streaming leaderboard is that accuracy and latency are evaluated together. A highly accurate model that takes several seconds to finalize a transcript may feel sluggish in a conversational setting, while a low-latency model that frequently mishears users may not be useful at all. The goal is therefore not simply to maximize accuracy or minimize latency, but to achieve both simultaneously.

A Note on How Streaming Latency Is Measured

Before looking at the streaming latency metrics, it's worth understanding where the timer actually starts.

Artificial Analysis uses an external instance of Silero VAD (Voice Activity Detection) to determine when speech ends. Streaming latency is then measured relative to that common endpoint rather than each provider's internal endpointing logic.

This is an important methodological choice. End-of-speech detection is itself a tradeoff: waiting longer can improve stability and reduce interruptions, while responding sooner can improve perceived responsiveness. Because providers implement this differently, comparing latency using provider-defined endpoints would not be a fair comparison.

Time to Final Transcript



This leaderboard isolates finalization latency: the time between the externally detected end of speech and the arrival of the final transcript.



Lower is better.

This metric is particularly important for voice agents, interactive assistants, customer support systems, and real-time conversational interfaces. Once a transcript is finalized, downstream systems can act with confidence. Reducing finalization latency improves conversational turn-taking and reduces the delay between user speech and system response.

This metric answers how quickly the system can become certain about what was said.

Time to First Partial Transcript After Speech End


Figure: Time to First Partial Transcript


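This leaderboard isolates responsiveness: the time between the detected end of speech and the first transcript-bearing event returned by the provider.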



Lower is better.

This metric often correlates more strongly with perceived speed than finalization latency. Humans tend to judge systems by when they first react, not by when processing is fully complete.


Consider a voice agent hearing:

"Book me a flight to Boston tomorrow."

If the first partial transcript predicts:

"Book me a flight to Austin tomorrow."
the agent may begin reasoning about the wrong destination before the transcript stabilizes.

For conversational AI, this is often the first metric engineers optimize because it directly affects how responsive the system feels to users. This metric answers how quickly the system begins responding.
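Both streaming latency metrics can be sketched from a list of timestamped transcript events plus an end-of-speech timestamp supplied by an external detector. Everything in this example is hypothetical: the `TranscriptEvent` structure, the `streaming_latencies` function, and the timestamps are illustrative, and real provider event formats differ. The sketch also assumes that a final transcript counts as transcript-bearing if no partial arrives first.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TranscriptEvent:
    timestamp: float  # seconds since the start of the stream
    text: str
    is_final: bool


def streaming_latencies(
    speech_end: float, events: List[TranscriptEvent]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (time to first transcript after speech end, time to final)."""
    # Only events that carry text and arrive after the detected end of speech.
    after_end = [e for e in events if e.timestamp >= speech_end and e.text.strip()]
    first = min((e.timestamp for e in after_end), default=None)
    final = min((e.timestamp for e in after_end if e.is_final), default=None)
    to_first = None if first is None else first - speech_end
    to_final = None if final is None else final - speech_end
    return to_first, to_final


# Hypothetical stream: speech ends at 2.40 s according to the external detector.
events = [
    TranscriptEvent(1.20, "book me", False),
    TranscriptEvent(2.55, "book me a flight to austin tomorrow", False),
    TranscriptEvent(2.95, "Book me a flight to Boston tomorrow.", True),
]
to_first, to_final = streaming_latencies(speech_end=2.40, events=events)
print(f"time to first partial after speech end: {to_first:.2f} s")
print(f"time to final transcript: {to_final:.2f} s")
```

In this made-up stream the first post-speech partial arrives quickly but contains the wrong city, while the slower final transcript is correct, which is exactly the tension the two metrics are meant to expose.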


WER Streaming Index vs Time to Final Transcript



This chart plots AA-WER Streaming Index on the vertical axis against Time to Final Transcript on the horizontal axis.

The ideal position is the bottom-left corner of the chart.

Models closer to the origin achieve:

  • Lower transcription error rates

  • Faster transcript finalization

Models higher on the chart make more transcription mistakes. Models further to the right take longer to produce a final transcript.

The chart therefore provides a quick way to evaluate the tradeoff between accuracy and responsiveness. A model may be highly accurate but slow to finalize transcripts, while another may be fast but sacrifice accuracy. The strongest systems improve both dimensions simultaneously and move closer to the origin.
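One way to read such a chart programmatically is to look for models that no other model beats on both axes at once. The sketch below is an illustrative reading aid, not part of the benchmark methodology: the function name `pareto_front`, the model names, and the numbers are all hypothetical.

```python
def pareto_front(models: dict) -> list:
    """Names of models that no other model matches or beats on both axes."""
    front = []
    for name, (wer, latency) in models.items():
        dominated = any(
            o_wer <= wer and o_lat <= latency and (o_wer < wer or o_lat < latency)
            for other, (o_wer, o_lat) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (WER in percent, time to final transcript in seconds).
models = {
    "model-a": (5.0, 0.30),  # fastest, mid accuracy
    "model-b": (4.0, 0.80),  # most accurate, slower
    "model-c": (6.0, 0.90),  # worse than model-a on both axes
    "model-d": (4.5, 0.50),  # a balance between a and b
}
print(pareto_front(models))  # ['model-a', 'model-b', 'model-d']
```

Models on this frontier represent genuine tradeoffs between accuracy and latency, while a model off the frontier is outperformed on both dimensions by at least one alternative.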


WER Streaming Index (First Partial) vs Time to First Partial Transcript After Speech End



This chart plots:

  • AA-WER Streaming Index (First Partial): the percentage of words transcribed incorrectly in the first partial transcript after detected end of speech

  • Time to First Partial Transcript After Speech End: the number of seconds between detected end of speech and the first transcript-bearing event returned by the provider

As with the previous chart, the ideal position is the bottom-left corner.

This chart is particularly relevant for voice agents because many systems begin downstream processing before a final transcript is available. Intent classification, retrieval, tool selection, and response planning can often start from the first partial transcript.

As a result, this visualization captures an important real-world tradeoff: not just how quickly a model responds, but how useful that first response is. The most effective streaming systems are those that can provide an accurate partial transcript with minimal delay, placing them closer to the origin of the chart.


Bringing It All Together

The batch and streaming leaderboards answer fundamentally different questions.

The batch leaderboard focuses on:

  • Accuracy

  • Throughput

  • Consistency

The streaming leaderboard focuses on:

  • Accuracy

  • Responsiveness

  • Transcript stability

Neither leaderboard produces a single "best" speech-to-text model. Instead, they reveal the tradeoffs each system makes.

A company transcribing millions of recorded calls may prioritize AA-WER, Speed Factor, and cost efficiency.

A team building a real-time voice agent may prioritize Time to First Partial Transcript, Time to Final Transcript, and streaming accuracy.

The value of benchmarks isn't that they tell you who won. It's that they make tradeoffs visible. Accuracy, throughput, latency, stability, and cost are all dimensions of performance, and different applications prioritize them differently. A leaderboard position is a result. Understanding the metrics behind it is what allows you to make an informed engineering decision.

Frequently asked questions

What is the difference between batch speech recognition and streaming speech recognition?

Is Speed Factor the same as latency?

What is Time to First Partial Transcript?

What is Time to Final Transcript?

Can one benchmark determine the best speech-to-text model?