Speech to Text API: Integration Guide for Python, Node and Streaming
A high benchmark score doesn't mean reliable transcription on your users' audio. Here's how to integrate Pulse STT correctly — pre-recorded and real-time streaming, with the edge cases that trip up most integrations.
Sumit Mor
A speech to text API that scores well on clean benchmark audio and a speech to text API that works reliably on your users' actual recordings are often not the same thing. The gap between them is not usually the model. It is the audio conditions, the language, the sample rate, the mode you chose, and whether your response handler is reading the right field. This guide covers all of those: how to call the Pulse speech to text API from Python and Node, how to handle the response correctly in batch and real-time streaming modes, how accuracy varies with real-world audio, and what to check before the integration goes to production.
What the Pulse speech to text API is
Pulse is Smallest AI's speech recognition model. It converts audio into text via two modes: pre-recorded and real-time.
Pre-recorded transcription accepts audio files and returns a complete transcript in a single synchronous HTTP response. It is the right choice for batch processing, archived recordings, and any workflow where you can afford to wait for the full result.
Real-time transcription streams audio over a persistent WebSocket connection and returns partial and final transcript events as the model processes incoming audio. It is the right choice for voice agents, live captioning, and phone call analysis where the transcript needs to arrive before the speaker stops talking.
Both modes share the same base URL. Pre-recorded requests go to https://api.smallest.ai/waves/v1/pulse/get_text as HTTPS POST. Real-time requests connect to wss://api.smallest.ai/waves/v1/pulse/get_text as a WebSocket.
Every request carries the key in the Authorization header.
Pre-recorded transcription in Python
The pre-recorded endpoint takes raw audio bytes in the request body. The language and any optional features go in as query parameters. The Content-Type header tells the API what kind of audio it is receiving.
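Here is a minimal Python sketch of that request using requests. The Bearer prefix on the Authorization header, the environment variable name, and the exact Content-Type value are assumptions to confirm against the Pulse docs for your account.

```python
import os

import requests

# Assumptions: SMALLEST_API_KEY env var name, the Bearer scheme, and the
# "audio/wav" Content-Type are illustrative; verify them in the Pulse docs.
API_KEY = os.environ["SMALLEST_API_KEY"]
URL = "https://api.smallest.ai/waves/v1/pulse/get_text"

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

response = requests.post(
    URL,
    params={"language": "en"},            # language rides along as a query parameter
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "audio/wav",       # tells the API what kind of audio it is receiving
    },
    data=audio_bytes,                      # raw audio bytes in the request body
)
response.raise_for_status()
result = response.json()
print(result["transcription"])
```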
One thing worth noting immediately: the response field is transcription, not transcript. This distinction matters when you are building the response handler. A successful response looks like this.
{"status":"success","transcription":"Hello, this is a test transcription.","words":[{"start":0.48,"end":1.12,"word":"Hello,"},{"start":1.12,"end":1.28,"word":"this"},{"start":1.28,"end":1.44,"word":"is"},{"start":1.44,"end":2.16,"word":"a"},{"start":2.16,"end":2.96,"word":"test"},{"start":2.96,"end":3.76,"word":"transcription."}],"utterances":[{"start":0.48,"end":3.76,"text":"Hello, this is a test transcription."}]}
{"status":"success","transcription":"Hello, this is a test transcription.","words":[{"start":0.48,"end":1.12,"word":"Hello,"},{"start":1.12,"end":1.28,"word":"this"},{"start":1.28,"end":1.44,"word":"is"},{"start":1.44,"end":2.16,"word":"a"},{"start":2.16,"end":2.96,"word":"test"},{"start":2.96,"end":3.76,"word":"transcription."}],"utterances":[{"start":0.48,"end":3.76,"text":"Hello, this is a test transcription."}]}
{"status":"success","transcription":"Hello, this is a test transcription.","words":[{"start":0.48,"end":1.12,"word":"Hello,"},{"start":1.12,"end":1.28,"word":"this"},{"start":1.28,"end":1.44,"word":"is"},{"start":1.44,"end":2.16,"word":"a"},{"start":2.16,"end":2.96,"word":"test"},{"start":2.96,"end":3.76,"word":"transcription."}],"utterances":[{"start":0.48,"end":3.76,"text":"Hello, this is a test transcription."}]}
The words array gives you per-word timing, useful for caption generation, subtitle tracks, and confidence-based review workflows. The utterances array gives you sentence-level segments with timing, useful for structured call summaries and readable transcripts.
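As a sketch of how the words array feeds a caption workflow, here is a small helper that groups the per-word timings into short caption segments. The grouping window is an arbitrary choice, not part of the API.

```python
def words_to_captions(words, max_duration=3.0):
    """Group per-word timings from the response into caption segments."""
    captions, current = [], []
    for w in words:
        current.append(w)
        if current[-1]["end"] - current[0]["start"] >= max_duration:
            captions.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(x["word"] for x in current),
            })
            current = []
    if current:
        captions.append({
            "start": current[0]["start"],
            "end": current[-1]["end"],
            "text": " ".join(x["word"] for x in current),
        })
    return captions

# With the response above: words_to_captions(result["words"])
# -> [{"start": 0.48, "end": 3.76, "text": "Hello, this is a test transcription."}]
```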
When your audio is already hosted somewhere accessible, you can send a URL instead of raw bytes.
This is useful when audio files live in cloud storage such as S3 or Google Cloud Storage, where sending raw bytes would require downloading the file locally first.
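A sketch of the URL variant is below. The JSON body and the audio_url field name are placeholders for illustration, not confirmed parameters; check the Pulse docs for the exact request shape.

```python
import os

import requests

# Hypothetical payload shape: the "audio_url" field name is a placeholder.
API_KEY = os.environ["SMALLEST_API_KEY"]

response = requests.post(
    "https://api.smallest.ai/waves/v1/pulse/get_text",
    params={"language": "en"},
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"audio_url": "https://example-bucket.s3.amazonaws.com/recordings/call.wav"},
)
print(response.json()["transcription"])
```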
Word timestamps give you start and end times for every word. Cheap to enable and difficult to reconstruct accurately after the fact. Enable by default for any use case where transcript timing will matter.
Utterances give you sentence-level segments with timing. Useful for displaying captions, syncing playback, and storing structured conversation records.
Diarization separates the transcript into speaker turns. For any multi-speaker audio such as call recordings or meeting transcripts, this transforms a wall of text into an attributed conversation.
Emotion detection returns the emotional tone of each segment with strength indicators across five core emotion types. In customer service contexts, the gap between what a caller says and how they say it often matters as much as the words themselves.
Age and gender detection return demographic estimates per speaker. These are probabilistic signals that perform well in aggregate. Treat individual results as indicators rather than facts.
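All of these ride along as query parameters next to language. The parameter names in the sketch below are placeholders to show the pattern, not confirmed names; look them up in the Pulse documentation before relying on them.

```python
# Placeholder parameter names -- only the query-parameter pattern is the point.
params = {
    "language": "en",
    "word_timestamps": "true",   # per-word timing
    "diarize": "true",           # speaker turns
    "emotion": "true",           # emotional tone per segment
}
# Pass as params= in the pre-recorded request shown earlier.
```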
Accuracy, word error rate and what affects them
Word Error Rate is the standard metric for transcription accuracy. It counts the substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. A WER of 5% means roughly one word in twenty is wrong. On clean studio audio, modern models routinely achieve 3 to 5 percent. On a noisy phone call, the same model can produce 15 to 20 percent.
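Computing WER on your own samples is a short word-level edit-distance calculation; a minimal version for spot checks:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# word_error_rate("the cat sat on the mat", "the cat sat in the mat") -> 1/6 ≈ 0.167
```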
The factors that matter most in real-world audio are background noise level, speaker accent, audio sample rate, and domain-specific vocabulary. The last one is often underestimated. A model that handles general English well can still produce noticeably higher error rates on medical terminology, legal language, or product names that rarely appeared in its training data.
Testing with audio that represents your actual users is more informative than any published benchmark. A WER of 3% on clean English broadcast audio tells you nothing about what the model will do with call centre recordings from Mumbai or São Paulo.
On sample rate: accuracy drops noticeably below 16kHz. If your pipeline is producing 8kHz audio from older telephony infrastructure, resampling to 16kHz before sending is worth the overhead.
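A resampling sketch, assuming soundfile and scipy are available in your pipeline:

```python
from math import gcd

import soundfile as sf
from scipy.signal import resample_poly

# Upsample 8 kHz telephony audio to 16 kHz before sending it for transcription.
data, sample_rate = sf.read("call_8khz.wav")
if sample_rate < 16000:
    g = gcd(16000, sample_rate)
    data = resample_poly(data, up=16000 // g, down=sample_rate // g)
    sample_rate = 16000
sf.write("call_16khz.wav", data, sample_rate)
```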
Language support
Pass ISO 639-1 language codes in the language query parameter. For audio where speakers switch between languages mid-conversation, use language=multi for automatic language detection and switching.
Code     Language
en       English
hi       Hindi
de       German
fr       French
es       Spanish
it       Italian
pt       Portuguese
ru       Russian
pl       Polish
nl       Dutch
ta       Tamil
bn       Bengali
gu       Gujarati
kn       Kannada
ml       Malayalam
mr       Marathi
te       Telugu
pa       Punjabi
uk       Ukrainian
sv       Swedish
fi       Finnish
da       Danish
ro       Romanian
bg       Bulgarian
cs       Czech
sk       Slovak
hu       Hungarian
lv       Latvian
lt       Lithuanian
et       Estonian
mt       Maltese
or       Odia
multi    Auto-detect
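A small helper for mapping incoming locales onto the table above, falling back to multi when a language is not listed:

```python
# Codes taken from the table above.
SUPPORTED = {
    "en", "hi", "de", "fr", "es", "it", "pt", "ru", "pl", "nl",
    "ta", "bn", "gu", "kn", "ml", "mr", "te", "pa", "uk", "sv",
    "fi", "da", "ro", "bg", "cs", "sk", "hu", "lv", "lt", "et",
    "mt", "or",
}

def pulse_language(locale: str) -> str:
    """Map a locale such as "pt-BR" to a supported code, else auto-detect."""
    code = locale.split("-")[0].lower()
    return code if code in SUPPORTED else "multi"

# pulse_language("pt-BR") -> "pt"; pulse_language("ja-JP") -> "multi"
```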
High-resource languages achieve lower error rates than lower-resource ones. If your users are concentrated in a specific language community, test against audio samples from that community rather than relying on the overall WER figures.
Real-time streaming transcription
Batch transcription works by completing before returning. Real-time transcription works by returning continuously while audio is still arriving. That difference changes everything about how you build with it.
The real-time API connects over WebSocket to wss://api.smallest.ai/waves/v1/pulse/get_text. Audio goes in as raw binary frames of 4096 bytes. Transcript events come back as JSON messages with an is_final flag that tells you whether the result is provisional or committed.
Connection parameters go in as URL query parameters when you establish the WebSocket.
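A sketch of building that URL in Python; language is documented above, while any additional connection parameter names would need to be confirmed against the Pulse docs. Each transcript event then arrives as a JSON message like the one below.

```python
from urllib.parse import urlencode

params = {"language": "en"}  # add other documented connection parameters here
ws_url = f"wss://api.smallest.ai/waves/v1/pulse/get_text?{urlencode(params)}"
```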
{"session_id":"sess_12345abcde","transcript":"Hello, how are you?","full_transcript":"Hello, how are you?","is_final":true,"is_last":false,"language":"en"}
{"session_id":"sess_12345abcde","transcript":"Hello, how are you?","full_transcript":"Hello, how are you?","is_final":true,"is_last":false,"language":"en"}
{"session_id":"sess_12345abcde","transcript":"Hello, how are you?","full_transcript":"Hello, how are you?","is_final":true,"is_last":false,"language":"en"}
transcript contains the current segment. full_transcript contains the cumulative transcript for the entire session. When is_final is true, use full_transcript to maintain your running session log.
Here is a Python sketch of the streaming client, using websockets and sounddevice for microphone capture.
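Treat the authentication details as assumptions: the Bearer scheme is not confirmed here, and the header keyword argument name depends on your websockets version. The fixed capture window is just for the sketch.

```python
import asyncio
import json
import os

import sounddevice as sd
import websockets

API_KEY = os.environ["SMALLEST_API_KEY"]  # hypothetical environment variable name
WS_URL = "wss://api.smallest.ai/waves/v1/pulse/get_text?language=en"
FRAME_BYTES = 4096                        # audio goes in as raw binary frames of 4096 bytes
SAMPLE_RATE = 16000

async def stream_microphone(seconds: float = 10.0) -> str:
    audio_queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # sounddevice calls this on its own thread; hand the bytes to the event loop.
        loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))

    # "additional_headers" is the keyword on recent websockets releases;
    # older releases call it "extra_headers". The Bearer scheme is an assumption.
    async with websockets.connect(
        WS_URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        full_transcript = ""

        async def sender():
            deadline = loop.time() + seconds
            with sd.RawInputStream(
                samplerate=SAMPLE_RATE,
                channels=1,
                dtype="int16",
                blocksize=FRAME_BYTES // 2,  # 2048 int16 samples = 4096 bytes per frame
                callback=on_audio,
            ):
                while loop.time() < deadline:
                    await ws.send(await audio_queue.get())
            # Flush the model's buffer so the last segment is not lost.
            await ws.send(json.dumps({"type": "finalize"}))

        async def receiver():
            nonlocal full_transcript
            async for message in ws:
                event = json.loads(message)
                if event.get("is_final"):
                    full_transcript = event["full_transcript"]  # cumulative session transcript
                    print("final:", event["transcript"])
                else:
                    print("partial:", event["transcript"])
                if event.get("is_last"):
                    break

        await asyncio.gather(sender(), receiver())

    return full_transcript

if __name__ == "__main__":
    print(asyncio.run(stream_microphone()))
```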
When you have finished streaming, send the finalize signal before closing the connection. Without it, the model's internal buffer may not flush and the last segment of audio will be lost.
```python
await ws.send(json.dumps({"type": "finalize"}))
```
Batch versus real-time: the deciding factors
The choice between modes comes down to whether your application can wait for a complete result or needs partial results while audio is still arriving.
Use pre-recorded transcription when you are processing existing audio files, when results in under a second are not required, when you prefer simpler HTTP request-response code over WebSocket lifecycle management, or when you are running high-volume offline batch jobs.
Use real-time streaming when you are building a voice agent where response latency determines whether the interaction feels natural, when you need partial transcripts visible to the user while they are still speaking, when you need turn detection to drive downstream logic, or when you are transcribing live phone calls or microphone input.
The two modes also return different response structures, and that is worth planning around before you build. Pre-recorded returns a single JSON object with transcription, words, and utterances. Real-time returns a stream of events where each message has transcript, full_transcript, and is_final. If your response handler treats them interchangeably, you will get silent bugs.
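A sketch of keeping the two handlers separate, keyed on the field names each mode actually returns:

```python
def handle_prerecorded(payload: dict) -> str:
    # Batch mode: one object with "transcription", "words", and "utterances".
    return payload["transcription"]

def handle_stream_event(event: dict, running_transcript: str) -> str:
    # Streaming mode: per-event "transcript" / "full_transcript" plus "is_final".
    return event["full_transcript"] if event.get("is_final") else running_transcript
```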
Next Steps
Check out the Smallest AI Cookbook, a comprehensive collection of real-world examples and tutorials for building with Smallest AI's APIs, including basic transcription, voice agents, and advanced features.
Speech-to-Text Examples:
Getting Started: Basic transcription examples for Python and JavaScript
Jarvis Voice Assistant: Complete always-on assistant with wake word detection, LLM reasoning, and TTS integration
Meeting Notes Bot: Automated meeting transcription with intelligent speaker identification and structured note generation
Emotion Analyzer: Visualize speaker emotions across conversations with interactive charts