Build Voice AI in Python: Complete Speech-to-Text Developer Guide (2026)

Build Voice AI in Python with this complete 2026 speech-to-text developer guide. Learn real-time transcription, APIs, models, and production best practices.

Abhishek Mishra

Updated on February 5, 2026 at 12:20 PM


TL;DR – Quick Integration Overview

API Platform: Pulse STT by Smallest AI – a state-of-the-art speech-to-text API supporting real-time streaming and batch audio transcription.

Key Features:

  • Transcribes in 32+ languages with automatic language detection

  • Ultra-low latency: ~64ms time-to-first-transcript for streaming

  • Rich metadata: word timestamps, speaker diarization, emotion detection, age/gender estimation, PII redaction

Integration Methods:

  • Pre-Recorded Audio: POST https://waves-api.smallest.ai/api/v1/pulse/get_text – upload files for batch processing

  • Real-Time Streaming: wss://waves-api.smallest.ai/api/v1/pulse/get_text – WebSocket for live transcription

Developer Experience: Use any HTTP/WebSocket client or official SDKs (Python, Node.js). Authentication via a single API key.

Why Pulse STT? Compared to other providers, Pulse offers faster response (64ms vs 200-500ms for typical cloud STT) and all-in-one features (no need for separate services for speaker ID, sentiment, or PII masking).


Introduction: Why Voice Integration Matters

Voice is becoming the next frontier for user interaction. From virtual assistants and voice bots to real-time transcription in meetings, speech interfaces are making software more accessible and user-friendly. Developers today have access to Automatic Speech Recognition (ASR) APIs that convert voice to text, opening up possibilities for hands-free control, live captions, voice search, and more.

However, integrating voice AI is more than just getting raw text from audio. Modern use cases demand speed and accuracy – a voice assistant needs to transcribe commands almost instantly, and a call center analytics tool might need not just the transcript but also who spoke when and how they said it.

Latency is critical. A delay of even a second feels laggy in conversation. Traditional cloud speech APIs often have 500–1200ms latency for live transcription, with better ones hovering around 200–250ms. This has pushed the industry toward ultra-low latency – under 300ms – to enable seamless real-time interactions.

In this guide, we'll walk through how to integrate an AI voice & speech API that meets these modern demands using Smallest AI's Pulse STT. By the end, you'll know how to:

  1. Transcribe audio files (WAV/MP3) to text using a simple HTTP API

  2. Stream live audio for instantaneous transcripts via WebSockets

  3. Leverage advanced features like timestamps, speaker diarization, and emotion detection

  4. Use both Python and Node.js to integrate voice capabilities


Understanding Pulse STT

Pulse is the speech-to-text (automatic speech recognition, or ASR) model on Smallest AI's "Waves" platform. It's designed for fast, accurate, and rich transcription with industry-leading latency – around 64 milliseconds time-to-first-transcript (TTFT) for streaming audio, an order of magnitude faster than many alternatives.


Highlight Features

  • Real-Time & Batch Modes – Stream live audio via WebSocket or upload files via HTTP POST

  • 32+ Languages – English, Spanish, Hindi, French, German, Arabic, Japanese, and more, with auto-detection

  • Word/Sentence Timestamps – Know exactly when each word was spoken (great for subtitles)

  • Speaker Diarization – Differentiate speakers: "Speaker A said X, Speaker B said Y"

  • Emotion Detection – Tag segments with emotions: happy, angry, neutral, etc.

  • Age/Gender Estimation – Infer speaker demographics for analytics

  • PII/PCI Redaction – Automatically mask credit cards, SSNs, and personal info

  • 64ms Latency – Time-to-first-transcript in streaming mode


Getting Started: Authentication

Step 1: Get Your API Key

Sign up on the Smallest AI Console and generate an API key. This key authenticates all your requests.


Step 2: Test Your Key

curl -H "Authorization: Bearer $SMALLEST_API_KEY" \
  https://waves-api.smallest.ai/api/v1/lightning-v3.1/get_voices

Authentication Header

All requests require this header:

Authorization: Bearer <YOUR_API_KEY>
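
The same check in Python, as a quick sketch using the requests library (the voice-listing endpoint is the one from the cURL test above):

import os
import requests

# Sanity-check the API key before wiring up transcription.
api_key = os.environ["SMALLEST_API_KEY"]
resp = requests.get(
    "https://waves-api.smallest.ai/api/v1/lightning-v3.1/get_voices",
    headers={"Authorization": f"Bearer {api_key}"},
    timeout=30,
)
print(resp.status_code)  # 200 means the key is accepted; 401 means it is invalid or missing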


Part 1: Transcribing Audio Files (REST API)

The Pre-Recorded API is perfect for batch processing voicemails, podcasts, meeting recordings, or any existing audio files.


Endpoint

POST https://waves-api.smallest.ai/api/v1/pulse/get_text

Query Parameters

  • model (string) – Model identifier: pulse (required)

  • language (string) – ISO code (en, es, hi) or multi for auto-detect

  • word_timestamps (boolean) – Include word-level timing data

  • diarize (boolean) – Enable speaker diarization

  • emotion_detection (boolean) – Detect speaker emotions

  • age_detection (boolean) – Estimate speaker age group

  • gender_detection (boolean) – Estimate speaker gender


Supported Languages (32+)

Italian, Spanish, English, Portuguese, Hindi, German, French, Ukrainian, Russian, Kannada, Malayalam, Polish, Marathi, Gujarati, Czech, Slovak, Telugu, Odia, Dutch, Bengali, Latvian, Estonian, Romanian, Punjabi, Finnish, Swedish, Bulgarian, Tamil, Hungarian, Danish, Lithuanian, Maltese, and auto-detection (multi).
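
To handle audio in an unknown or mixed language, set language to multi and let Pulse auto-detect; a minimal tweak to the request parameters used in the examples below:

# Auto-detect the spoken language instead of pinning it to English.
params = {
    "model": "pulse",
    "language": "multi",  # automatic language detection
    "word_timestamps": "true",
}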


cURL Example


curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true&word_timestamps=true&emotion_detection=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/audio.wav"


Python Example


import os
import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
audio_file = "meeting_recording.wav"

url = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
params = {
    "model": "pulse",
    "language": "en",
    "word_timestamps": "true",
    "diarize": "true",
    "emotion_detection": "true"
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "audio/wav"
}

with open(audio_file, "rb") as f:
    audio_data = f.read()

response = requests.post(url, params=params, headers=headers, data=audio_data)
result = response.json()

# Print transcription
print("Transcription:", result.get("transcription"))

# Print word-level details with speaker info
for word in result.get("words", []):
    speaker = word.get("speaker", "N/A")
    print(f"  [Speaker {speaker}] [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']}")

# Check emotions
if "emotions" in result:
    print("\nEmotions detected:")
    for emotion, score in result["emotions"].items():
        if score > 0.1:
            print(f"  {emotion}: {score:.1%}")


Node.js Example


const fs = require('fs');
const axios = require('axios');

const API_KEY = process.env.SMALLEST_API_KEY;
const audioFile = 'meeting_recording.wav';

const url = 'https://waves-api.smallest.ai/api/v1/pulse/get_text';
const params = new URLSearchParams({
  model: 'pulse',
  language: 'en',
  word_timestamps: 'true',
  diarize: 'true',
  emotion_detection: 'true'
});

const audioData = fs.readFileSync(audioFile);

axios.post(`${url}?${params}`, audioData, {
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'audio/wav'
  }
})
.then(res => {
  console.log('Transcription:', res.data.transcription);
  
  // Print words with speaker info
  res.data.words?.forEach(word => {
    console.log(`  [Speaker ${word.speaker}] [${word.start}s - ${word.end}s] ${word.word}`);
  });
})
.catch(err => {
  console.error('Error:', err.response?.data || err.message);
});


Example Response


{
  "status": "success",
  "transcription": "Hello, this is a test transcription.",
  "words": [
    {"start": 0.0, "end": 0.88, "word": "Hello,", "confidence": 0.82, "speaker": 0, "speaker_confidence": 0.61},
    {"start": 0.88, "end": 1.04, "word": "this", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.76},
    {"start": 1.04, "end": 1.20, "word": "is", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.20, "end": 1.36, "word": "a", "confidence": 1.0, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.36, "end": 1.68, "word": "test", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99},
    {"start": 1.68, "end": 2.16, "word": "transcription.", "confidence": 0.99, "speaker": 0, "speaker_confidence": 0.99}
  ],
  "utterances": [
    {"start": 0.0, "end": 2.16, "text": "Hello, this is a test transcription.", "speaker": 0}
  ],
  "age": "adult",
  "gender": "female",
  "emotions": {
    "happiness": 0.28,
    "sadness": 0.0,
    "anger": 0.0,
    "fear": 0.0,
    "disgust": 0.0
  },
  "metadata": {
    "duration": 1.97,
    "fileSize": 63236
  }
}
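
Because every word carries start and end times, the response maps directly onto subtitle formats. A minimal sketch (not part of the API) that turns the words array from the batch response above into SRT captions, a few words per cue:

def to_srt(words, words_per_cue=7):
    """Convert Pulse word timestamps into a simple SRT string (illustrative helper)."""
    def fmt(t):
        total_ms = int(round(t * 1000))
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i in range(0, len(words), words_per_cue):
        chunk = words[i:i + words_per_cue]
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{i // words_per_cue + 1}\n{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n{text}\n")
    return "\n".join(cues)

# e.g. open("captions.srt", "w").write(to_srt(result.get("words", [])))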


Part 2: Real-Time Streaming (WebSocket API)

For live audio – voice assistants, live captioning, call center analytics – use the WebSocket API for sub-second latency with partial results as audio streams in.

WebSocket Endpoint

wss://waves-api.smallest.ai/api/v1/pulse/get_text

Query Parameters

  • language (string, default en) – Language code or multi for auto-detect

  • encoding (string, default linear16) – Audio format: linear16, linear32, alaw, mulaw, opus

  • sample_rate (string, default 16000) – Sample rate: 8000, 16000, 22050, 24000, 44100, 48000

  • word_timestamps (string, default true) – Include word-level timestamps

  • full_transcript (string, default false) – Include cumulative transcript

  • sentence_timestamps (string, default false) – Include sentence-level timestamps

  • redact_pii (string, default false) – Redact personal information

  • redact_pci (string, default false) – Redact payment card information

  • diarize (string, default false) – Enable speaker diarization


Python Streaming Example

From the official cookbook:

import asyncio
import json
import os
import sys
import numpy as np
import websockets
import librosa
from urllib.parse import urlencode

WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text"

# Configurable features
LANGUAGE = "en"
ENCODING = "linear16"
SAMPLE_RATE = 16000
WORD_TIMESTAMPS = False
FULL_TRANSCRIPT = True
SENTENCE_TIMESTAMPS = False
DIARIZE = False
REDACT_PII = False
REDACT_PCI = False

async def transcribe(audio_file: str, api_key: str):
    params = {
        "language": LANGUAGE,
        "encoding": ENCODING,
        "sample_rate": SAMPLE_RATE,
        "word_timestamps": str(WORD_TIMESTAMPS).lower(),
        "full_transcript": str(FULL_TRANSCRIPT).lower(),
        "sentence_timestamps": str(SENTENCE_TIMESTAMPS).lower(),
        "diarize": str(DIARIZE).lower(),
        "redact_pii": str(REDACT_PII).lower(),
        "redact_pci": str(REDACT_PCI).lower(),
    }

    url = f"{WS_URL}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {api_key}"}

    # Load audio with librosa (handles any format)
    audio, _ = librosa.load(audio_file, sr=SAMPLE_RATE, mono=True)
    chunk_duration = 0.1  # 100ms chunks
    chunk_size = int(chunk_duration * SAMPLE_RATE)

    async with websockets.connect(url, additional_headers=headers) as ws:
        print("✅ Connected to Pulse STT WebSocket")
        
        async def send_audio():
            for i in range(0, len(audio), chunk_size):
                chunk = audio[i:i + chunk_size]
                pcm16 = (chunk * 32768.0).astype(np.int16).tobytes()
                await ws.send(pcm16)
                await asyncio.sleep(chunk_duration)
            await ws.send(json.dumps({"type": "end"}))
            print("📤 Sent end signal")

        async def receive_responses():
            async for message in ws:
                result = json.loads(message)
                
                if result.get("is_final"):
                    print(f"✓ {result.get('transcript')}")
                    
                    if result.get("is_last"):
                        if result.get("full_transcript"):
                            print(f"\n{'='*60}")
                            print("FULL TRANSCRIPT")
                            print(f"{'='*60}")
                            print(result.get("full_transcript"))
                        break

        await asyncio.gather(send_audio(), receive_responses())

# Usage: python transcribe.py recording.wav
if __name__ == "__main__":
    api_key = os.environ.get("SMALLEST_API_KEY")
    audio_path = sys.argv[1] if len(sys.argv) > 1 else "recording.wav"
    asyncio.run(transcribe(audio_path, api_key))

Install dependencies:

pip install websockets numpy librosa

Run:

export SMALLEST_API_KEY="your-api-key"
python transcribe.py recording.wav


Node.js Streaming Example

From the official cookbook:

const fs = require("fs");
const WebSocket = require("ws");
const wav = require("wav");

const WS_URL = "wss://waves-api.smallest.ai/api/v1/pulse/get_text";

// Configurable features
const LANGUAGE = "en";
const ENCODING = "linear16";
const SAMPLE_RATE = 16000;
const WORD_TIMESTAMPS = false;
const FULL_TRANSCRIPT = true;
const DIARIZE = false;
const REDACT_PII = false;
const REDACT_PCI = false;

async function loadAudio(audioFile) {
  return new Promise((resolve, reject) => {
    const reader = new wav.Reader();
    const chunks = [];

    reader.on("format", (format) => {
      reader.on("data", (chunk) => chunks.push(chunk));
      reader.on("end", () => {
        const buffer = Buffer.concat(chunks);
        const samples = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / 2);
        resolve(samples);
      });
    });

    reader.on("error", reject);
    fs.createReadStream(audioFile).pipe(reader);
  });
}

async function transcribe(audioFile, apiKey) {
  const params = new URLSearchParams({
    language: LANGUAGE,
    encoding: ENCODING,
    sample_rate: SAMPLE_RATE,
    word_timestamps: WORD_TIMESTAMPS,
    full_transcript: FULL_TRANSCRIPT,
    diarize: DIARIZE,
    redact_pii: REDACT_PII,
    redact_pci: REDACT_PCI,
  });

  const url = `${WS_URL}?${params}`;
  const audio = await loadAudio(audioFile);
  const chunkDuration = 0.1; // 100ms
  const chunkSize = Math.floor(chunkDuration * SAMPLE_RATE);

  return new Promise((resolve, reject) => {
    const ws = new WebSocket(url, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });

    ws.on("open", async () => {
      console.log("✅ Connected to Pulse STT WebSocket");
      
      for (let i = 0; i < audio.length; i += chunkSize) {
        const chunk = audio.slice(i, i + chunkSize);
        ws.send(Buffer.from(chunk.buffer, chunk.byteOffset, chunk.byteLength));
        await new Promise((r) => setTimeout(r, chunkDuration * 1000));
      }
      ws.send(JSON.stringify({ type: "end" }));
      console.log("📤 Sent end signal");
    });

    ws.on("message", (data) => {
      const result = JSON.parse(data.toString());
      
      if (result.is_final) {
        console.log(`✓ ${result.transcript}`);
        
        if (result.is_last) {
          if (result.full_transcript) {
            console.log("\n" + "=".repeat(60));
            console.log("FULL TRANSCRIPT");
            console.log("=".repeat(60));
            console.log(result.full_transcript);
          }
          ws.close();
        }
      }
    });

    ws.on("close", resolve);
    ws.on("error", reject);
  });
}

// Usage: node transcribe.js recording.wav
const apiKey = process.env.SMALLEST_API_KEY;
const audioFile = process.argv[2] || "recording.wav";
transcribe(audioFile, apiKey).then(() => console.log("Done!"));

Install dependencies:

npm install ws wav

Run:


export SMALLEST_API_KEY="your-api-key"
node transcribe.js recording.wav


WebSocket Response Format


{
  "session_id": "sess_12345abcde",
  "transcript": "Hello, how are you?",
  "full_transcript": "Hello, how are you?",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {"word": "Hello,", "start": 0.0, "end": 0.5, "confidence": 0.98, "speaker": 0},
    {"word": "how", "start": 0.5, "end": 0.7, "confidence": 0.99, "speaker": 0},
    {"word": "are", "start": 0.7, "end": 0.9, "confidence": 0.97, "speaker": 0},
    {"word": "you?", "start": 0.9, "end": 1.2, "confidence": 0.99, "speaker": 0}
  ]
}


Key Response Fields

  • is_final – false = partial/interim transcript; true = finalized segment

  • is_last – true when the entire session is complete

  • transcript – Current segment text

  • full_transcript – Accumulated text from the entire session (if enabled)

  • words – Word-level timestamps (if enabled)
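
Putting these fields to work: the streaming examples above print only finalized segments, but interim results are what make live UIs feel responsive. A minimal sketch of an alternative receive handler (same websockets connection and json import as the Python streaming example) that shows interim text as it refines and keeps finalized segments:

async def receive_live(ws):
    """Show interim transcripts as they refine, then keep finalized segments."""
    async for message in ws:
        result = json.loads(message)
        text = result.get("transcript", "")
        if result.get("is_final"):
            print(f"FINAL: {text}")
            if result.get("is_last"):
                break  # the whole session is complete
        else:
            print(f"  interim: {text}")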


Part 3: Advanced Features


Speaker Diarization

Enable diarize=true to identify different speakers:


params = {"model": "pulse", "language": "en", "diarize": "true"}

Response includes speaker labels:


{
  "words": [
    {"word": "Hello", "speaker": 0, "speaker_confidence": 0.95},
    {"word": "Hi", "speaker": 1, "speaker_confidence": 0.92}
  ],
  "utterances": [
    {"text": "Hello, how can I help?", "speaker": 0},
    {"text": "I have a question.", "speaker": 1}
  ]
}
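
A small sketch that turns the utterances array into a readable dialogue transcript (result is the parsed JSON from the Python example in Part 1; speaker labels are numeric indices, so mapping them to names is up to your application):

def format_dialogue(utterances):
    """Render diarized utterances as 'Speaker N: text' lines."""
    return "\n".join(
        f"Speaker {u.get('speaker', '?')}: {u['text']}" for u in utterances
    )

print(format_dialogue(result.get("utterances", [])))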


Emotion Detection

Enable emotion_detection=true to analyze speaker sentiment:


{
  "emotions": {
    "happiness": 0.28,
    "sadness": 0.0,
    "anger": 0.0,
    "fear": 0.0,
    "disgust": 0.0
  }
}
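
Scores are per-emotion values between 0 and 1 (the Part 1 Python example prints them as percentages). A quick call-center style sketch that picks the dominant emotion and flags angry calls; the 0.5 threshold is an arbitrary choice, not an API recommendation:

emotions = result.get("emotions", {})
if emotions:
    dominant = max(emotions, key=emotions.get)
    print(f"Dominant emotion: {dominant} ({emotions[dominant]:.1%})")
    if emotions.get("anger", 0) > 0.5:
        print("High anger detected - consider flagging this call for review")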


PII/PCI Redaction

For compliance (HIPAA, PCI-DSS), enable redact_pii=true or redact_pci=true:


{
  "transcript": "My credit card is [CREDITCARD_1] and SSN is [SSN_1]",
  "redacted_entities": ["[CREDITCARD_1]", "[SSN_1]"]
}


Age and Gender Detection

Enable age_detection=true and gender_detection=true:


{
  "age": "adult",
  "gender": "female"
}


Comparing STT Providers

  • Pulse STT – ~64ms latency, 32+ languages, competitive pricing

  • Google Cloud STT – 200-300ms latency, 125+ languages, ~$16 per 1,000 min

  • Deepgram – 100-200ms latency, 36+ languages, ~$4-5 per 1,000 min

  • AssemblyAI – 200-400ms latency, 30+ languages, ~$3.50 per 1,000 min

  • OpenAI Whisper – Batch only, 99+ languages, ~$6 per 1,000 min

Why Pulse STT stands out:

  • Fastest time-to-first-transcript (64ms)

  • All-in-one features (no separate services needed)

  • Competitive accuracy across diverse accents

  • Built for real-time voice AI applications

Best Practices

Audio Quality

  • Use 16kHz, mono, 16-bit PCM for best results (a conversion sketch follows this list)

  • WAV or FLAC formats are ideal

  • Minimize background noise when possible
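
If your source audio is in another format, resampling it first is straightforward; a sketch assuming the librosa package (already used in the streaming example) plus soundfile for writing 16-bit PCM:

import librosa
import soundfile as sf

# Convert any input file to 16 kHz, mono, 16-bit PCM WAV before uploading.
audio, _ = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("output_16k_mono.wav", audio, 16000, subtype="PCM_16")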


Error Handling

import time
import requests

MAX_RETRIES = 3

for retry_count in range(MAX_RETRIES):
    try:
        response = requests.post(url, params=params, headers=headers, data=audio_data, timeout=120)
        response.raise_for_status()
        break
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            # Rate limited - back off exponentially before retrying
            time.sleep(2 ** retry_count)
        elif e.response.status_code == 401:
            # Invalid API key - retrying will not help
            raise ValueError("Invalid API key") from e
        else:
            raise

Rate Limiting

  • Add 500ms+ delay between batch requests (a pacing sketch follows this list)

  • Use webhooks for long audio files

  • Implement exponential backoff for 429 errors
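
A sketch of the batch pacing advice above, reusing the Part 1 endpoint to transcribe every WAV file in a folder with a short pause between requests (the recordings directory name is just an example):

import os
import time
from pathlib import Path

import requests

API_KEY = os.getenv("SMALLEST_API_KEY")
URL = "https://waves-api.smallest.ai/api/v1/pulse/get_text"
PARAMS = {"model": "pulse", "language": "en"}
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "audio/wav"}

for audio_path in sorted(Path("recordings").glob("*.wav")):
    resp = requests.post(URL, params=PARAMS, headers=HEADERS,
                         data=audio_path.read_bytes(), timeout=120)
    resp.raise_for_status()
    print(f"{audio_path.name}: {resp.json().get('transcription', '')}")
    time.sleep(0.5)  # keep at least 500ms between batch requests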

Bonus: Full Demo Application

Want to see everything working together? Check out the demo app in the code samples repository — a complete Next.js web application featuring:

  • File upload transcription with word-level timestamps (hover to see timing)

  • Real-time microphone streaming with live transcript display

  • Secure WebSocket proxy that keeps your API key server-side

  • Modern UI with Smallest AI brand colors

  • Language selection (English, Hindi, Spanish, French, German, Portuguese, Auto-detect)

  • Emotion detection and speaker diarization display


Quick Start

cd demo-app
npm install

Create a .env.local file with your API key:

echo 'SMALLEST_API_KEY=your-api-key' > .env.local

Start both servers (Next.js + WebSocket proxy):

npm run dev:all

Then open http://localhost:3000 in Chrome or Safari (for microphone access).

How It Works

The demo runs two servers:

  1. Next.js (port 3000) — Serves the React UI and handles file upload via /api/transcribe

  2. WebSocket Proxy (port 3001) — Securely proxies audio from browser to Pulse STT WebSocket API


This architecture keeps your API key secure on the server while enabling real-time streaming.
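
The demo's proxy is written in Node.js, but the pattern is language-agnostic. Purely as an illustration (not the demo's code), here is a minimal Python sketch of the same idea using the websockets package: the browser connects to your server, which holds the API key and relays audio upstream and transcript events back.

import asyncio
import os

import websockets

PULSE_URL = (
    "wss://waves-api.smallest.ai/api/v1/pulse/get_text"
    "?language=en&encoding=linear16&sample_rate=16000"
)
API_KEY = os.environ["SMALLEST_API_KEY"]  # stays on the server, never reaches the browser

async def handle_browser(client):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with websockets.connect(PULSE_URL, additional_headers=headers) as upstream:

        async def browser_to_pulse():
            async for chunk in client:      # raw PCM chunks (or the end message) from the browser
                await upstream.send(chunk)

        async def pulse_to_browser():
            async for message in upstream:  # JSON transcript events back to the browser
                await client.send(message)

        await asyncio.gather(browser_to_pulse(), pulse_to_browser())

async def main():
    # Assumes a recent websockets release (the same package used in the streaming example).
    async with websockets.serve(handle_browser, "localhost", 3001):
        await asyncio.Future()  # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())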

Project Structure


Scripts

  • npm run dev – Start Next.js only

  • npm run dev:ws – Start WebSocket proxy only

  • npm run dev:all – Start both (recommended)

This architecture pattern is recommended for production apps — API keys stay server-side while the React frontend provides a smooth user experience with both file upload and real-time microphone transcription.


Conclusion

Integrating voice and speech capabilities into your workflow and apps can greatly enhance user experience. With Pulse STT, developers can achieve high-accuracy, low-latency transcription with just a few API calls.

When to use REST API:

  • Podcast transcription

  • Meeting recordings

  • Voicemail processing

  • Batch analytics

When to use WebSocket API:

  • Live captioning

  • Voice assistants

  • Call center real-time analytics

  • Interactive voice applications

The code patterns in this guide translate directly to production. Start with the REST API for prototyping, then add WebSocket streaming when real-time interaction becomes a requirement.
