Every Word, On the Record: How Pocket Transcribes 12M Minutes a Month

9.63% WER

in the real world against Deepgram's ~28%

~12M

minutes/month of audio transcribed

10 languages

across all 16 markets

The Problem

Pocket records in the real world. Far-field rooms, phone calls, several people talking at once. Generic speech-to-text falls apart exactly there: wrong names, missing words, no sense of who said what.

And the stakes are high. Therapy sessions and financial calls carry sensitive data that has to be redacted, not just transcribed. Users span 10 languages across 16 markets, so one model has to hold accuracy everywhere.

A transcript users can't trust breaks everything downstream: the summary, the action items, the follow-up.

The Solution

Pocket runs Smallest AI's Pulse STT as the transcription layer in its batch pipeline. Because Pocket processes complete files rather than live streams, it uses Pulse in pre-recorded mode, where accuracy is highest.

One pass returns everything: transcript, speaker labels, PII and PCI redaction, punctuation, timestamps, and noise handling. Automatic language detection covers all 10 market languages. Nothing to stitch together as volume grows.

The Results

The results:

  • 9.63% WER in the real world vs Deepgram's ~28%

  • Every transcript ships with speaker labels, redaction, punctuation, and timestamps in a single pass

Pulse holds accuracy across all of Pocket's market languages:

[Table: multilingual benchmarks]

And it stays ahead as conditions degrade. WER by noise band:

[Table: noise-band benchmarks]
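For readers less familiar with the metric: word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the machine transcript into the human reference, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not the benchmark harness behind the figures in this story. The function name and the lowercase-and-split normalization are assumptions made for the example; production evaluations typically apply richer text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: (substitutions + deletions + insertions) / reference words.

    Assumption: simple lowercase + whitespace tokenization, chosen for brevity.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a six-word reference transcribed with one wrong word and one dropped word scores 2/6, or about 33% WER.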

Batch, full-file processing scales with volume and has no per-stream limits, so quality stays consistent across every market and use case.

Building something that depends on accurate transcription? See the Pulse model card or talk to our team.

Company name

Pocket (Open Vision Engineering Inc.)

Industry

AI Hardware / Productivity Tech

Company size

SMB

Products used

Speech to text (Pulse)

About the company

Pocket, built by Open Vision Engineering, is a wearable capture device. It records conversations, calls, and meetings, then processes the full audio file in batch into a clean transcript and summary. Therapists, realtors, sales reps, and founders use it for hands-free capture across 16 Western markets. With 86,700+ customers on the platform, Pocket records first and transcribes after. That choice puts all the weight on one thing: the quality of the final transcript.