Speech to Text API: Integration Guide for Python, Node and Streaming

A high benchmark score doesn't mean reliable transcription on your users' audio. Here's how to integrate Pulse STT correctly — pre-recorded and real-time streaming, with the edge cases that trip up most integrations.

Sumit Mor

Updated on


A speech to text api that scores well on clean benchmark audio and a speech to text api that works reliably on your users' actual recordings are often not the same thing. The gap between them is not usually the model. It is the audio conditions, the language, the sample rate, the mode you chose, and whether your response handler is reading the right field.
This guide covers all of those. How to call the Pulse speech to text api from Python and Node, how to handle the response correctly in batch and real-time streaming modes, how accuracy varies with real-world audio, and what to check before the integration goes to production.

What the Pulse speech to text API is

Pulse is Smallest AI's speech recognition model. It converts audio into text via two modes: pre-recorded and real-time.

Pre-recorded transcription accepts audio files and returns a complete transcript in a single synchronous HTTP response. It is the right choice for batch processing, archived recordings, and any workflow where you can afford to wait for the full result.

Real-time transcription streams audio over a persistent WebSocket connection and returns partial and final transcript events as the model processes incoming audio. It is the right choice for voice agents, live captioning, and phone call analysis where the transcript needs to arrive before the speaker stops talking.

Both modes share the same base URL. Pre-recorded requests go to https://api.smallest.ai/waves/v1/pulse/get_text as HTTPS POST. Real-time requests connect to wss://api.smallest.ai/waves/v1/pulse/get_text as a WebSocket.

Authentication

Get an API key from the Smallest AI console.


export SMALLEST_API_KEY="your-api-key-here"

Every request carries the key in the Authorization header.


Pre-recorded transcription in Python

The pre-recorded endpoint takes raw audio bytes in the request body. The language and any optional features go in as query parameters. The Content-Type header tells the API what kind of audio it is receiving.


import os
import requests

def transcribe(file_path: str, language: str = "en") -> dict:
    with open(file_path, "rb") as f:
        audio_data = f.read()

    response = requests.post(
        "https://api.smallest.ai/waves/v1/pulse/get_text",
        params={
            "language": language,
            "word_timestamps": "true",
        },
        headers={
            "Authorization": f"Bearer {os.environ['SMALLEST_API_KEY']}",
            "Content-Type": "audio/wav",
        },
        data=audio_data,
    )

    response.raise_for_status()
    return response.json()

result = transcribe("recording.wav")
print(result["transcription"])

One thing worth noting immediately: the response field is transcription, not transcript. This distinction matters when you are building the response handler. A successful response looks like this.


{
    "status": "success",
    "transcription": "Hello, this is a test transcription.",
    "words": [
        {"start": 0.48, "end": 1.12, "word": "Hello,"},
        {"start": 1.12, "end": 1.28, "word": "this"},
        {"start": 1.28, "end": 1.44, "word": "is"},
        {"start": 1.44, "end": 2.16, "word": "a"},
        {"start": 2.16, "end": 2.96, "word": "test"},
        {"start": 2.96, "end": 3.76, "word": "transcription."}
    ],
    "utterances": [
        {"start": 0.48, "end": 3.76, "text": "Hello, this is a test transcription."}
    ]
}


The words array gives you per-word timing, useful for caption generation, subtitle tracks, and confidence-based review workflows. The utterances array gives you sentence-level segments with timing, useful for structured call summaries and readable transcripts.
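The caption use case follows directly from the words array. As a minimal sketch (not part of any SDK), this converts the per-word timings from the response above into SRT subtitle blocks; the group size of 8 words per caption is an arbitrary choice.

```python
def words_to_srt(words, max_words=8):
    """Group per-word timings from the API response into SRT caption blocks."""
    def fmt(t):
        # SRT timestamps look like 00:00:01,480
        ms = int(round(t * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        blocks.append(
            f"{i // max_words + 1}\n"
            f"{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n"
            f"{text}"
        )
    return "\n\n".join(blocks)
```

Feeding it `result["words"]` from the transcribe call gives you a subtitle track that stays in sync with the original audio.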

When your audio is already hosted somewhere accessible, you can send a URL instead of raw bytes.


import os
import requests

def transcribe_from_url(audio_url: str, language: str = "en") -> dict:
    response = requests.post(
        "https://api.smallest.ai/waves/v1/pulse/get_text",
        params={"language": language, "word_timestamps": "true"},
        headers={
            "Authorization": f"Bearer {os.environ['SMALLEST_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={"url": audio_url},
    )

    response.raise_for_status()
    return response.json()

result = transcribe_from_url("http://thepodcastexchange.ca/s/Porsche-Macan-July-5-2018-1.mp3")
print(result["transcription"])

This is useful when audio files live in cloud storage such as S3 or Google Cloud Storage, where sending raw bytes would require downloading the file locally first.

Pre-recorded transcription in Node

The same call in Node using node-fetch.


const fs = require("fs");
const fetch = require("node-fetch");

async function transcribe(filePath, language = "en") {
    const audioData = fs.readFileSync(filePath);

    const url = new URL("https://api.smallest.ai/waves/v1/pulse/get_text");
    url.searchParams.set("language", language);
    url.searchParams.set("word_timestamps", "true");

    const response = await fetch(url.toString(), {
        method: "POST",
        headers: {
            Authorization: `Bearer ${process.env.SMALLEST_API_KEY}`,
            "Content-Type": "audio/wav",
        },
        body: audioData,
    });

    if (!response.ok) {
        const error = await response.text();
        throw new Error(`Transcription failed: ${error}`);
    }

    return response.json();
}

transcribe("recording.wav")
    .then((result) => console.log(result.transcription))
    .catch(console.error);

And from a remote URL.


async function transcribeFromUrl(audioUrl, language = "en") {
    const url = new URL("https://api.smallest.ai/waves/v1/pulse/get_text");
    url.searchParams.set("language", language);
    url.searchParams.set("word_timestamps", "true");

    const response = await fetch(url.toString(), {
        method: "POST",
        headers: {
            Authorization: `Bearer ${process.env.SMALLEST_API_KEY}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({ url: audioUrl }),
    });

    if (!response.ok) {
        const error = await response.text();
        throw new Error(`Transcription failed: ${error}`);
    }

    return response.json();
}

Optional enrichment features

All enrichment features are query parameters on the same endpoint. Enable any combination without changing anything else about the request.


with open("call_recording.wav", "rb") as f:
    audio_data = f.read()

response = requests.post(
    "https://api.smallest.ai/waves/v1/pulse/get_text",
    params={
        "language": "en",
        "word_timestamps": "true",
        "emotion_detection": "true",
        "age_detection": "true",
        "gender_detection": "true",
        "diarization": "true",
    },
    headers={
        "Authorization": f"Bearer {os.environ['SMALLEST_API_KEY']}",
        "Content-Type": "audio/wav",
    },
    data=audio_data,
)

response.raise_for_status()
result = response.json()
print(result["transcription"])

A brief note on each feature worth enabling.

Word timestamps give you start and end times for every word. Cheap to enable and difficult to reconstruct accurately after the fact. Enable by default for any use case where transcript timing will matter.

Utterances give you sentence-level segments with timing. Useful for displaying captions, syncing playback, and storing structured conversation records.

Diarization separates the transcript into speaker turns. For any multi-speaker audio such as call recordings or meeting transcripts, this transforms a wall of text into an attributed conversation.

Emotion detection returns the emotional tone of each segment with strength indicators across five core emotion types. In customer service contexts, the gap between what a caller says and how they say it often matters as much as the words themselves.

Age and gender detection return demographic estimates per speaker. These are probabilistic signals that perform well in aggregate. Treat individual results as indicators rather than facts.

Accuracy, word error rate and what affects them

Word Error Rate (WER) is the standard metric for transcription accuracy: the number of substitutions, deletions, and insertions relative to a reference transcript, divided by the number of words in the reference. A WER of 5% means roughly one word in twenty is wrong. On clean studio audio, modern models routinely achieve 3 to 5 percent. On a noisy phone call, the same model can produce 15 to 20 percent.
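The metric is easy to compute yourself against a hand-checked reference. A minimal implementation using word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Run this over a few dozen transcripts from your actual audio and you have a benchmark that means something for your product.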

The factors that matter most in real-world audio are background noise level, speaker accent, audio sample rate, and domain-specific vocabulary. The last one is often underestimated. A model that handles general English well can still produce noticeably higher error rates on medical terminology, legal language, or product names that rarely appeared in its training data.

Testing with audio that represents your actual users is more informative than any published benchmark. A WER of 3% on clean English broadcast audio tells you nothing about what the model will do with call centre recordings from Mumbai or São Paulo.

On sample rate: accuracy drops noticeably below 16kHz. If your pipeline is producing 8kHz audio from older telephony infrastructure, resampling to 16kHz before sending is worth the overhead.
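To show the shape of that transform, here is a naive 2x upsampler for 16-bit mono PCM using linear interpolation. This is a sketch only: a production pipeline should use a proper resampler (soxr, scipy, or ffmpeg) with anti-imaging filtering.

```python
import array

def upsample_8k_to_16k(pcm_bytes: bytes) -> bytes:
    """Naive 2x upsample of 16-bit mono PCM: insert the midpoint between
    each pair of neighbouring samples. Sketch only; use a real resampler
    with an anti-imaging filter in production."""
    samples = array.array("h")  # signed 16-bit
    samples.frombytes(pcm_bytes)
    out = array.array("h")
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)  # interpolated sample between s and its neighbour
    return out.tobytes()
```

With ffmpeg the same step is a one-liner (`ffmpeg -i in.wav -ar 16000 out.wav`), which is usually the right tool for batch work.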


Language support

Pass ISO 639-1 language codes in the language query parameter. For audio where speakers switch between languages mid-conversation, use language=multi for automatic language detection and switching.

Code     Language
en       English
hi       Hindi
de       German
fr       French
es       Spanish
it       Italian
pt       Portuguese
ru       Russian
pl       Polish
nl       Dutch
ta       Tamil
bn       Bengali
gu       Gujarati
kn       Kannada
ml       Malayalam
mr       Marathi
te       Telugu
pa       Punjabi
uk       Ukrainian
sv       Swedish
fi       Finnish
da       Danish
ro       Romanian
bg       Bulgarian
cs       Czech
sk       Slovak
hu       Hungarian
lv       Latvian
lt       Lithuanian
et       Estonian
mt       Maltese
or       Odia
multi    Auto-detect



High-resource languages achieve lower error rates than lower-resource ones. If your users are concentrated in a specific language community, test against audio samples from that community rather than the overall WER figures.


Real-time streaming transcription

Batch transcription works by completing before returning. Real-time transcription works by returning continuously while audio is still arriving. That difference changes everything about how you build with it.

The real-time API connects over WebSocket to wss://api.smallest.ai/waves/v1/pulse/get_text. Audio goes in as raw binary frames of 4096 bytes. Transcript events come back as JSON messages with an is_final flag that tells you whether the result is provisional or committed.

Connection parameters go in as URL query parameters when you establish the WebSocket.


const WebSocket = require("ws"); // Node "ws" package: browser WebSocket cannot set headers

const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://api.smallest.ai/waves/v1/pulse/get_text");
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("word_timestamps", "true");

const ws = new WebSocket(url.toString(), {
    headers: {
        Authorization: `Bearer ${API_KEY}`,
    },
});

ws.onopen = () => {
    console.log("Connected to Pulse STT");
};

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.is_final) {
        console.log("\n[FINAL]", data.transcript);
        console.log("[FULL SESSION]", data.full_transcript);
    } else {
        process.stdout.write(`\r${data.transcript}`);
    }
};

function sendAudioChunk(audioBuffer) {
    if (ws.readyState === WebSocket.OPEN) {
        ws.send(audioBuffer);
    }
}

function endStream() {
    ws.send(JSON.stringify({ type: "finalize" }));
}

The response for each event carries these fields.


{
    "session_id": "sess_12345abcde",
    "transcript": "Hello, how are you?",
    "full_transcript": "Hello, how are you?",
    "is_final": true,
    "is_last": false,
    "language": "en"
}

transcript contains the current segment. full_transcript contains the cumulative transcript for the entire session. When is_final is true, use full_transcript to maintain your running session log.
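A small handler that applies this rule (a hypothetical helper, not part of any SDK): it keeps committed text from final events and overlays the latest provisional partial.

```python
class TranscriptSession:
    """Maintain a running transcript from streaming events: keep the text
    committed by final events, overlay the latest provisional partial."""

    def __init__(self):
        self.committed = ""  # authoritative text from the last is_final event
        self.partial = ""    # provisional text, may be revised by later events

    def handle(self, event: dict) -> str:
        if event.get("is_final"):
            # full_transcript is cumulative, so it replaces the session log
            self.committed = event.get("full_transcript", self.committed)
            self.partial = ""
        else:
            self.partial = event.get("transcript", "")
        return (self.committed + " " + self.partial).strip()
```

Call `handle` on each parsed WebSocket message and render its return value; the display updates live without duplicating segments.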

The Python equivalent using websockets and sounddevice for microphone capture.


import asyncio
import json
import os
import websockets
import sounddevice as sd

SAMPLE_RATE = 16000
CHUNK_BYTES = 4096

async def stream_microphone():
    url = (
        "wss://api.smallest.ai/waves/v1/pulse/get_text"
        f"?language=en&encoding=linear16&sample_rate={SAMPLE_RATE}&word_timestamps=true"
    )
    headers = {"Authorization": f"Bearer {os.environ['SMALLEST_API_KEY']}"}

    # note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        loop = asyncio.get_event_loop()
        audio_queue = asyncio.Queue()

        def audio_callback(indata, frames, time, status):
            loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))

        with sd.RawInputStream(
            samplerate=SAMPLE_RATE,
            channels=1,
            dtype="int16",
            blocksize=CHUNK_BYTES // 2,
            callback=audio_callback,
        ):
            async def send_audio():
                while True:
                    chunk = await audio_queue.get()
                    await ws.send(chunk)

            async def receive_events():
                async for message in ws:
                    data = json.loads(message)
                    if data.get("is_final"):
                        print(f"\n[FINAL] {data['transcript']}")
                    else:
                        print(f"\r{data['transcript']}", end="", flush=True)

            await asyncio.gather(send_audio(), receive_events())

asyncio.run(stream_microphone())

When you have finished streaming, send the finalize signal before closing the connection. Without it, the model's internal buffer may not flush and the last segment of audio will be lost.


await ws.send(json.dumps({"type": "finalize"}))
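Putting the shutdown sequence together: a sketch that sends finalize, drains remaining events until the server marks the last one with is_last, then closes. The 5-second timeout is an assumption, not an API-mandated value.

```python
import asyncio
import json

async def finalize_and_drain(ws, timeout: float = 5.0):
    """Send the finalize signal, then read events until is_last (or the
    timeout expires) so the tail of the audio is not lost. Assumes `ws`
    is an open websockets-style connection with async send/recv/close."""
    await ws.send(json.dumps({"type": "finalize"}))
    try:
        while True:
            message = await asyncio.wait_for(ws.recv(), timeout=timeout)
            data = json.loads(message)
            if data.get("is_final"):
                print("[FINAL]", data.get("transcript", ""))
            if data.get("is_last"):
                break  # server has flushed everything for this session
    except asyncio.TimeoutError:
        pass  # no more events arrived; proceed to close
    await ws.close()
```

This keeps the close path deterministic: you either see the flushed tail or time out, instead of racing the server's buffer.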


Batch versus real-time: the deciding factors

The choice between modes comes down to whether your application can wait for a complete result or needs partial results while audio is still arriving.

Use pre-recorded transcription when you are processing existing audio files, when results in under a second are not required, when you prefer simpler HTTP request-response code over WebSocket lifecycle management, or when you are running high-volume offline batch jobs.

Use real-time streaming when you are building a voice agent where response latency determines whether the interaction feels natural, when you need partial transcripts visible to the user while they are still speaking, when you need turn detection to drive downstream logic, or when you are transcribing live phone calls or microphone input.

The two modes also return different response structures, and that is worth planning around before you build. Pre-recorded returns a single JSON object with transcription, words, and utterances. Real-time returns a stream of events where each message has transcript, full_transcript, and is_final. If your response handler treats them interchangeably, you will get silent bugs.
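One way to avoid those silent bugs is a single normalizer at the boundary, so downstream code never touches raw field names. A sketch (the helper name is hypothetical):

```python
def extract_text(payload: dict) -> str:
    """Normalize the two response shapes to plain text: pre-recorded
    responses carry `transcription`; streaming events carry `transcript`
    and, on final events, the cumulative `full_transcript`."""
    if "transcription" in payload:  # pre-recorded response
        return payload["transcription"]
    if payload.get("is_final") and "full_transcript" in payload:
        return payload["full_transcript"]  # committed streaming text
    return payload.get("transcript", "")   # provisional streaming text
```

If a field name ever changes, there is exactly one place to update.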


Next Steps

Check out the Smallest AI Cookbook, a comprehensive collection of real-world examples and tutorials for building with Smallest AI's APIs, including basic transcription, voice agents, and advanced features.

Speech-to-Text Examples:

  1. Getting Started: Basic transcription examples for Python and JavaScript

  2. Jarvis Voice Assistant: Complete always-on assistant with wake word detection, LLM reasoning, and TTS integration

  3. Meeting Notes Bot: Automated meeting transcription with intelligent speaker identification and structured note generation

  4. Emotion Analyzer: Visualize speaker emotions across conversations with interactive charts

API reference: Pulse STT Quickstart, Pre-Recorded API, Real-Time WebSocket API
