Building an AI Agent for Sales Calls: A Complete Guide

Build an AI agent for sales calls with real-time STT, LLMs, and human-like TTS. Learn how to create low-latency voice agents with Smallest.ai.

Prithvi Bharadwaj

Building an AI agent for sales calls no longer requires an enterprise budget or a dedicated ML team. The models are fast, the tooling is mature, and the voices are startlingly good. But most teams hit the same wall: the voice itself. Your agent can nail objection handling, follow a perfect script, and qualify leads accurately, yet if it sounds like a robot reading a teleprompter, prospects hang up in seconds.

This guide covers how to build a production-ready AI sales agent using Smallest.ai's speech models as the voice layer. The specific goal: output that sounds more natural than what you'd get from ElevenLabs or other mainstream TTS providers. That means lower latency, better prosody, and emotional range that doesn't feel like someone toggling between preset moods.

It's written for developers and technical sales leaders who want to ship a voice agent that prospects actually stay on the line with. You should be comfortable with APIs, basic Python or Node.js, and have a working understanding of LLMs. By the end, you'll have a complete pipeline from speech recognition through reasoning to voice output.

The full process at a glance:

  • Step 1: Define your sales call architecture and conversation flow.

  • Step 2: Choose and configure your speech-to-text layer for real-time transcription.

  • Step 3: Build the reasoning engine (LLM + prompt design for sales conversations).

  • Step 4: Integrate Smallest.ai's text-to-speech for natural, human-like voice output.

  • Step 5: Wire everything together with a speech-to-speech pipeline for low-latency calls.

  • Step 6: Add emotional intelligence and dynamic voice control.

  • Step 7: Test, iterate, and deploy to production.

The stakes are real. According to Salesforce's 2024 State of Sales Report, 81% of sales teams are either experimenting with or have fully implemented AI (Autobound, 2026). McKinsey's 2025 research found that AI tools can increase sales leads by up to 50% and reduce customer acquisition costs by up to 60% (AnyBiz.io, 2026). The teams that get voice right will capture a disproportionate share of that upside.

Prerequisites and What You'll Need

Before you start building, make sure you have the following. Some of these are accounts you'll create along the way, but it helps to see the full picture upfront.

  • A Smallest.ai account with API access (sign up at app.smallest.ai; check Smallest.ai pricing for plan details).

  • A telephony provider or SIP trunk for handling phone calls (Twilio, Vonage, or similar).

  • An LLM API key (OpenAI GPT-4o, Anthropic Claude, or an open-source model you're hosting).

  • Python 3.10+ or Node.js 18+ on your development machine.

  • Basic familiarity with WebSocket connections and streaming audio.

  • A sales script or call flow document. Even a rough outline of your ideal conversation helps enormously.

If you've built chatbots but never worked with real-time audio, the biggest conceptual shift is this: voice agents operate on streams, not request-response cycles. Everything happens concurrently. The prospect is speaking while your agent is listening, transcribing, reasoning, and preparing a response, all overlapping in time.
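The shift from request-response to overlapping streams can be sketched with a toy example. This is purely illustrative, with no real audio: an asyncio producer stands in for the prospect's audio stream and a consumer stands in for transcription, running concurrently over a shared queue.

```python
import asyncio

# Toy illustration of the concurrency model: the "prospect" produces chunks
# while the agent consumes and processes them at the same time. Strings stand
# in for audio frames; a real pipeline would carry PCM bytes.

async def prospect_speaks(queue: asyncio.Queue) -> None:
    for chunk in ["hi", "I'm", "interested"]:
        await queue.put(chunk)
        await asyncio.sleep(0)          # yield so the consumer can overlap
    await queue.put(None)               # end-of-stream marker

async def agent_listens(queue: asyncio.Queue) -> list[str]:
    transcript = []
    while (chunk := await queue.get()) is not None:
        transcript.append(chunk)        # processed per chunk, not per utterance
    return transcript

async def main() -> list[str]:
    q: asyncio.Queue = asyncio.Queue()
    # Both coroutines run concurrently -- the core shift from request/response.
    _, transcript = await asyncio.gather(prospect_speaks(q), agent_listens(q))
    return transcript

transcript = asyncio.run(main())
```

In a real agent, three or four of these stages (telephony in, STT, LLM, TTS out) overlap this way, each feeding the next through a queue.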

Step 1: Define Your Sales Call Architecture

Every effective AI agent for sales calls starts with a clear architecture. You're building a pipeline with three core stages: listening (speech-to-text), thinking (LLM reasoning), and speaking (text-to-speech). How well these stages coordinate, especially around latency, determines the quality of the final product.

For a sales call, your conversation flow needs to handle several distinct phases: the opening (you have about 3 seconds to sound credible), qualification questions, objection handling, value proposition delivery, and a closing or next-step request. Map these out before writing any code. Each phase has different vocal requirements. The opening needs warmth and confidence. Objection handling needs to feel responsive, not rehearsed. The close needs measured assertiveness.

MIT Sloan research from 2025 explored how AI agents perform in negotiation scenarios and found that a combination of warmth and dominance leads to the most effective outcomes in simulated sales conversations (MIT Sloan, 2025). This isn't academic trivia. It should directly inform how you design your agent's personality and voice parameters at each stage of the call.

Decide early whether you're building a fully autonomous agent or an agent-assist system where a human can take over. For outbound cold calls, fully autonomous agents are increasingly common. For high-value enterprise deals, a hybrid approach where the AI handles the first 60 seconds of qualification before routing to a human often works better. If you're exploring the personalization angle for outbound, the Smallest.ai team wrote a useful piece on making cold calling more personal with AI.

Step 2: Configure Real-Time Speech-to-Text

Your agent needs to hear and understand the prospect in real time. The critical metrics for speech-to-text (STT) are accuracy (especially on names, numbers, and industry jargon) and latency (how quickly you get a usable transcript after the prospect stops speaking).

Smallest.ai offers a speech-to-text API optimized for real-time conversational use cases. Set up a WebSocket connection to the STT endpoint and stream audio frames as they arrive from your telephony provider. The key configuration decisions:

  • Sample rate: Match your telephony input (typically 8kHz for PSTN, 16kHz for WebRTC).

  • Language and accent model: Select the appropriate model for your target market.

  • Endpointing sensitivity: Controls how quickly the system decides the speaker has finished a thought. Too aggressive and you'll cut people off. Too lenient and your agent pauses awkwardly. Start with the default and tune from there.

  • Interim results: Enable these so your LLM can start processing before the speaker finishes. Critical for reducing perceived latency.

A common mistake is treating STT as a set-and-forget component. In sales calls, prospects mumble, talk over your agent, use slang, and drop company names the model has never encountered. Maintain a custom vocabulary list of product names, competitor names, and industry terms. This single optimization can dramatically improve downstream reasoning quality, because if the transcript is wrong, the LLM reasons about the wrong thing.

Step 3: Build the Reasoning Engine

The reasoning layer is where your agent decides what to say. This is typically an LLM (GPT-4o, Claude, Llama, or similar) wrapped in a carefully designed prompt that encodes your sales methodology, product knowledge, and conversational style.

Prompt engineering for sales agents is fundamentally different from chatbot prompts. You're not optimizing for general helpfulness. You're optimizing for a specific conversational outcome: qualifying the prospect, booking a meeting, or closing a deal. Your system prompt should include the agent's persona (name, role, company), the call objective, qualification criteria (BANT or MEDDIC frameworks), approved responses to common objections, and strict guardrails on what the agent should never say.

A structural pattern that works well for sales agent prompts:

  • Identity block: Who the agent is, who they work for, their communication style.

  • Objective block: The specific goal of this call (e.g., 'Book a 30-minute demo with the decision maker').

  • Knowledge block: Product details, pricing tiers, competitive positioning.

  • Conversation rules: Maximum response length (1 to 2 sentences per turn), when to ask questions vs. make statements, how to handle 'not interested' responses.

  • Escalation rules: When to transfer to a human, when to end the call gracefully.

Keep responses short. This is the single most important rule for voice agents. A response that reads well as text sounds terrible spoken aloud if it runs more than two sentences. Sales conversations are rapid exchanges, not monologues. Aim for 15 to 30 words per agent turn during qualification, and no more than 50 words even during value proposition delivery.
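The block structure and the short-turn rule can be sketched together. The prompt below is illustrative only (the persona, product, and wording are invented for the example, not a canonical template), and the word-count check shows one way to enforce turn length before text reaches TTS.

```python
# Illustrative prompt: block names and copy are assumptions for the example.
AGENT_PROMPT = """\
## Identity
You are Sarah, an SDR for Acme Corp. Warm, direct, never pushy.

## Objective
Book a 30-minute demo with the decision maker.

## Knowledge
Acme sells workflow automation; plans start at $99/month.

## Conversation rules
- Keep every reply to 1-2 sentences (15-30 words during qualification).
- Ask one question per turn; let the prospect do most of the talking.
- If the prospect says "not interested" twice, thank them and end the call.

## Escalation rules
- Transfer to a human if the prospect asks for legal or contract terms.
"""

def within_turn_budget(reply: str, max_words: int = 30) -> bool:
    """Guardrail applied to LLM output before it is sent to TTS."""
    return len(reply.split()) <= max_words
```

Enforcing the budget in code, not just in the prompt, matters: LLMs drift toward verbosity over long calls, and a hard check lets you truncate or regenerate before the prospect hears a monologue.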

Gartner projects that by 2027, 95% of a seller's research workflows will start with AI, up from less than 20% in 2024 (AnyBiz.io, 2026). Your reasoning engine should pull real-time context about the prospect (company size, recent news, tech stack) and weave it into the conversation naturally. Generic pitches get ignored. Personalized ones get meetings.

Step 4: Integrate Smallest.ai Text-to-Speech for Human-Like Voice

This step makes or breaks the entire experience. The voice is the first thing a prospect evaluates, consciously or not. If it sounds synthetic, trust evaporates before your agent finishes saying hello.

Smallest.ai's text-to-speech API is purpose-built for real-time conversational applications. The Lightning model delivers sub-100ms latency with voice quality that consistently outperforms legacy TTS providers in blind listening tests. For a detailed comparison against alternatives, see this piece on the most realistic text-to-speech AI.

Integration is straightforward: send the text output from your LLM to the TTS endpoint and stream the resulting audio back to the telephony connection. The parameters worth tuning carefully:

Voice selection matters more than most teams realize. For B2B sales calls in North America, a confident, mid-range voice with moderate pacing tends to perform best. Avoid voices that sound overly enthusiastic (prospects pattern-match this to telemarketing) or overly flat (sounds like an IVR system). Smallest.ai provides a range of pre-built voices, and you can clone custom voices if your brand requires a specific sound.

Speed and pitch controls let you fine-tune delivery. Slow down slightly for key value statements. Speed up marginally during transitions ('Great, let me tell you a bit about what we do'). These micro-adjustments separate a voice agent that sounds like it's reading from one that sounds like it's thinking.

The real differentiator with Smallest.ai is streaming synthesis. Rather than waiting for the entire response to generate before speaking, the TTS model begins producing audio as soon as the first tokens arrive from the LLM. Your agent starts speaking within milliseconds of the LLM producing output, creating natural conversational timing that makes prospects forget they're talking to an AI. For more on the emotional dimension of voice synthesis, see a complete guide to human-like AI voices.
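The client side of streaming synthesis can be sketched as a helper that flushes LLM tokens to TTS at sentence boundaries instead of waiting for the full response. This is an assumption about one reasonable chunking strategy, not the internal behavior of the Smallest.ai API.

```python
import re
from typing import Iterable, Iterator

# Flush text to TTS per sentence so audio starts while the LLM is still
# generating. The boundary regex is a simplification (no abbreviation handling).

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield synthesizable chunks as soon as a sentence boundary arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if re.search(r"[.!?]\s*$", buffer):  # end of sentence -> flush to TTS
            yield buffer.strip()
            buffer = ""
    if buffer.strip():                        # flush any trailing fragment
        yield buffer.strip()

# Each yielded chunk would be sent to the TTS endpoint immediately.
chunks = list(sentence_chunks(["Hi ", "there. ", "How ", "are ", "you?"]))
```

With this pattern, the first sentence is already playing over the phone line while the second is still being generated, which is where most of the perceived-latency win comes from.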

Step 5: Wire the Full Speech-to-Speech Pipeline

Now you connect the three stages into a single, low-latency pipeline. Audio comes in from the phone call and is transcribed by the Speech-to-Text (STT) engine. The transcript then feeds into the Large Language Model (LLM). The LLM's text output streams to the Text-to-Speech (TTS) engine, and the synthesized audio streams back to the caller. For a conversational AI agent for sales calls, the entire round trip needs to feel instantaneous.

Smallest.ai offers APIs for each component that simplify this process. Instead of managing three separate API connections from different vendors and handling the orchestration yourself, you can use our integrated components. This simplifies your codebase and, more importantly, reduces latency by eliminating network hops between separate services.

Building the pipeline with individual components gives you more control over each stage. For example, you can inject CRM lookups or sentiment analysis between the STT output and LLM input. The trade-off is complexity for flexibility. For most sales call use cases, starting with a tightly integrated set of components is the right approach. You can add more complex logic later when you need to.

Interruption handling deserves special attention. In real sales conversations, prospects interrupt constantly. Your pipeline needs to detect when the prospect starts speaking mid-response, stop TTS output immediately, process the new input, and respond appropriately. This is called "barge-in" support, and it's non-negotiable for sales calls. Without it, your agent talks over prospects, which is the fastest way to lose a deal. Smallest.ai's voice pipeline handles barge-in natively. If you're building the pipeline manually, you'll need voice activity detection (VAD) on the incoming audio stream to trigger TTS cancellation.
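If you are building barge-in manually, the core logic is a small state machine: when VAD flags speech while the agent is mid-utterance, cancel TTS playback and route the new audio to STT. The sketch below is illustrative; `is_speech` is assumed to come from a real VAD scoring incoming frames.

```python
from dataclasses import dataclass, field

# Minimal barge-in controller (illustrative). A real VAD would score audio
# frames; is_speech stands in for that signal.

@dataclass
class BargeInController:
    agent_speaking: bool = False
    cancelled_turns: int = 0
    pending_audio: list = field(default_factory=list)

    def on_frame(self, frame: bytes, is_speech: bool) -> None:
        if is_speech and self.agent_speaking:
            # Prospect started talking over the agent: cut TTS immediately
            # (in production, also cancel the in-flight synthesis request).
            self.agent_speaking = False
            self.cancelled_turns += 1
        if is_speech:
            self.pending_audio.append(frame)  # forward to STT

ctrl = BargeInController(agent_speaking=True)
ctrl.on_frame(b"\x00" * 160, is_speech=True)   # prospect barges in
```

The tricky production details sit outside this sketch: flushing any audio already buffered at the telephony layer, and debouncing so a cough doesn't cancel a turn.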

Step 6: Add Emotional Intelligence and Dynamic Voice Control

This is where you move from "functional voice agent" to "agent that actually sounds human." Most TTS systems produce a single emotional register. The voice sounds the same whether the agent is greeting someone, handling a pricing objection, or closing the call. Humans don't talk like that. We modulate tone, pace, and energy based on context constantly.

Smallest.ai's models support dynamic emotion and style tags embedded directly in the text sent to the TTS API. Your LLM can output not just what to say, but how to say it. When your agent detects frustration in the prospect's voice (via sentiment analysis on the STT output), it can shift to a calmer, more empathetic tone. When the prospect expresses interest, the agent can inject subtle enthusiasm without overdoing it.

Implement this by adding a lightweight sentiment classification step between STT and LLM. Use the prospect's tone and word choice to tag the emotional context of each turn, then include that context in your LLM prompt so the model adjusts both its language and its voice directives. A prospect who says "I'm not sure this is the right time" in a hesitant tone needs a different response (and different vocal delivery) than one who says the same words in an annoyed tone.
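A toy version of that classification step looks like the following. The cue words and labels are invented for illustration; in production you would swap in a proper sentiment model that also uses acoustic features, not just text.

```python
# Toy keyword classifier -- an illustrative stand-in for a real sentiment
# model. Labels and cue phrases are assumptions for the example.

HESITANT = {"not sure", "maybe", "think about"}
ANNOYED = {"stop calling", "waste of time", "not interested again"}

def tag_sentiment(utterance: str) -> str:
    text = utterance.lower()
    if any(cue in text for cue in ANNOYED):
        return "annoyed"
    if any(cue in text for cue in HESITANT):
        return "hesitant"
    return "neutral"

def turn_context(utterance: str) -> str:
    """Context line prepended to the LLM prompt for this turn."""
    return f"[prospect sentiment: {tag_sentiment(utterance)}] {utterance}"
```

The tag then steers both the LLM's wording and the emotion directives it emits for TTS, so the same words get different vocal delivery depending on how they were said.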

Sales teams that effectively use AI have seen revenue growth 1.3 times higher than teams that don't (Autobound, 2026). Emotional intelligence in the voice layer is a significant contributor to that gap. It's the difference between a prospect thinking "this is clearly a bot" and "wait, is this a real person?"

Step 7: Test, Iterate, and Deploy

Testing a voice agent is fundamentally different from testing a text-based application. You can't just read the output. You have to listen to it, repeatedly, in simulated call scenarios that mirror real conditions.

Start with internal testing. Have team members call the agent and try to break it. Give them specific scenarios: "Pretend you're a CFO who's been burned by a competitor product." "Pretend you're interested but keep asking about pricing before the agent has qualified you." "Interrupt the agent mid-sentence and change the subject." Record every call and review them as a team.

Quality signals to watch for:

  • Response latency: Is the gap between the prospect finishing and the agent starting to speak under 500ms? Under 300ms is ideal.

  • Turn-taking naturalness: Does the agent ever talk over the prospect? Does it pause too long?

  • Voice consistency: Does quality degrade during longer responses or under load?

  • Objection handling: Does the agent recover gracefully from unexpected pushback?

  • Graceful failure: When the agent doesn't understand something, does it ask a clarifying question or hallucinate a response?

Once internal testing is solid, run a limited pilot with real prospects. Start with lower-priority leads so you're not burning your best opportunities while tuning. Track conversion rates, average call duration, and prospect sentiment (you can automate this with speech analytics). Compare against your human sales team's baseline.

For deployment, set up monitoring and alerting on latency, error rates, and call completion rates. Enable automatic call recording and transcription for quality review. Build a human escalation path so that if the agent gets stuck, it can smoothly transfer to a live rep. The Smallest.ai blog has a practical walkthrough on building efficient AI voice bots that covers production deployment patterns in more detail.

Common Mistakes and How to Avoid Them

1. Making the Agent Too Verbose

The number one mistake: agents that talk too much per turn. In text, a 3-sentence response feels concise. Spoken aloud, it's a 15-second monologue, an eternity in a phone conversation. Constrain your LLM output to 1 to 2 sentences per turn during the first half of the call. Let the prospect do most of the talking. You're qualifying them, not pitching at them.

2. Ignoring Latency Until It's Too Late

Latency compounds across your pipeline. If STT takes 200ms, LLM inference takes 400ms, and TTS takes 300ms, you're at 900ms before network overhead. That's noticeable and uncomfortable. Optimize each stage independently, but also measure end-to-end. Smallest.ai's speech-to-speech pipeline exists specifically to collapse these stages and reduce total latency.
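The arithmetic above is worth encoding as an explicit budget check in your instrumentation. A minimal sketch, assuming stage timings come from timers wrapped around each API call (the 100ms network figure is an illustrative assumption):

```python
# End-to-end latency budget check using the example numbers from the text.
# Stage timings would come from instrumentation around each API call.

def total_latency_ms(stages: dict, network_overhead_ms: float = 100) -> float:
    """Sum per-stage latencies plus an assumed network overhead."""
    return sum(stages.values()) + network_overhead_ms

stages = {"stt": 200.0, "llm": 400.0, "tts": 300.0}
budget_ms = 500  # target gap before the agent starts speaking
over_budget = total_latency_ms(stages) > budget_ms  # True for these numbers
```

Alerting when `over_budget` flips true per call, rather than on averages, catches the tail-latency cases that actually make prospects hang up.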

3. Using a Generic Voice for Every Market

A voice that works for selling SaaS to US tech companies will not work for selling financial services in the UK. Accent, pacing, formality, and even pitch expectations vary significantly across markets and verticals. Test your voice selection with people from your target demographic, not just your engineering team.

4. Skipping the Rejection Paths

Most teams spend 80% of their prompt engineering on the happy path: prospect is interested, asks good questions, agrees to a meeting. But the majority of real calls involve rejection, deflection, or confusion. Build out at least 10 distinct objection-handling paths in your prompt. Include graceful exit scripts for when the prospect is clearly not interested. An agent that handles rejection well actually builds brand equity, even on calls that don't convert.

5. Not Disclosing That the Caller Is an AI

Depending on your jurisdiction, you may be legally required to disclose that the caller is an AI agent. Even where it's not mandated, transparency builds trust. A simple "Hi, this is Sarah, an AI assistant calling on behalf of [Company]" at the start of the call is honest and, counterintuitively, often increases engagement. Prospects appreciate the novelty and are curious enough to stay on the line.

Why Voice Quality Is the Competitive Moat for AI Sales Agents

By 2028, AI agents are predicted to outnumber human sellers by a factor of ten (Gartner, 2025). Prospects will be fielding multiple AI-driven sales calls per week. The agents that convert will be the ones that sound indistinguishable from skilled human reps. Voice quality isn't a nice-to-have. It's the primary differentiator.

ElevenLabs popularized high-quality TTS for content creation, audiobooks, and media production. Those are excellent use cases. But real-time sales conversations have fundamentally different requirements: sub-100ms synthesis latency (not sub-second), barge-in support (not one-shot generation), dynamic emotional modulation within a single call (not a static voice preset). Smallest.ai's models were designed from the ground up for these conversational, real-time use cases. That architectural difference shows up in every call.

Forrester's 2025 report on AI agents in B2B go-to-market functions emphasizes that the organizations seeing real ROI are the ones investing in the quality of the agent interaction, not just the automation of it (Forrester, 2025). Automating a bad sales call at scale just means you're annoying more people faster. The voice layer is where "automated" becomes "effective."

If you're evaluating when AI agents make sense versus human reps, the analysis in AI call centers vs. human agents provides a useful framework for thinking about that balance.

Quick Reference: All 7 Steps at a Glance

The table below summarizes each step, its primary focus, and the key tools involved. Use it as a checklist when building your own AI sales call agent.

| Step | Focus Area | Key Tools / Components | Primary Output |
| --- | --- | --- | --- |
| 1. Define Sales Call Architecture | Conversation design and pipeline planning | Architecture diagram, call flow document | Mapped call phases (opening, qualification, objection handling, close) |
| 2. Configure Real-Time STT | Speech-to-text transcription | Smallest.ai STT API, WebSocket connection, custom vocabulary | Accurate, low-latency transcript stream |
| 3. Build the Reasoning Engine | LLM prompt design for sales | GPT-4o, Claude, or Llama; BANT/MEDDIC frameworks | Short, goal-oriented agent responses per turn |
| 4. Integrate TTS for Human-Like Voice | Voice synthesis and quality | Smallest.ai Lightning TTS, voice cloning, speed/pitch controls | Natural-sounding audio output with sub-100ms latency |
| 5. Wire Speech-to-Speech Pipeline | End-to-end pipeline orchestration | Smallest.ai speech-to-speech API, VAD for barge-in | Single integration point with minimal round-trip latency |
| 6. Add Emotional Intelligence | Dynamic tone and sentiment adaptation | Sentiment classifier, emotion/style tags in TTS | Context-aware vocal delivery that shifts with prospect mood |
| 7. Test, Iterate, and Deploy | Quality assurance and production readiness | Call recordings, speech analytics, monitoring dashboards | Production-grade agent with human escalation path |

Summary and Next Steps

You've walked through the complete process: defining call architecture, configuring real-time speech-to-text, building a sales-specific reasoning engine, integrating Smallest.ai's TTS for natural voice output, wiring the speech-to-speech pipeline, adding emotional intelligence through dynamic voice control, and testing rigorously before production deployment.

The difference between an AI sales agent that gets hung up on and one that books meetings comes down to three things: response speed, conversational intelligence, and voice quality. Smallest.ai gives you a significant edge on the first and third. The second depends on your prompt engineering and sales methodology.

From here, sign up for a Smallest.ai account and experiment with the TTS and STT APIs using sample sales scripts. Build a proof of concept with a single call flow (like a meeting-booking agent) before expanding to more complex scenarios. Record and review your first 50 test calls obsessively. The patterns in those recordings will teach you more than any tutorial.

For teams ready to build more sophisticated multi-agent systems where different AI agents handle different stages of the sales funnel, the guide on building multi-agent voice AI is a strong next step. The future of sales is voice AI that prospects can't distinguish from your best human reps. The tools to build it are available now.

Frequently Asked Questions

Have more questions? Contact our sales team to get the answers you're looking for.

How much does it cost to build an AI agent for sales calls?

Costs vary based on call volume and the models you choose. Smallest.ai offers tiered pricing (see Smallest.ai pricing) that covers TTS and STT usage. You'll also need an LLM API (OpenAI, Anthropic, or a self-hosted model) and a telephony provider like Twilio. For a small-scale proof of concept handling a few hundred calls per month, total API costs typically range from a few hundred to low thousands of dollars.

Can an AI sales agent handle objections as well as a human rep?

For common, predictable objections (pricing concerns, timing hesitations, competitor comparisons), a well-prompted AI agent can match or exceed average human reps. The key is building out at least 10 distinct objection-handling paths in your LLM prompt. For highly nuanced or emotionally charged objections, a hybrid approach where the AI qualifies and then routes to a human closer tends to produce the best results.

What makes Smallest.ai's voice quality better than ElevenLabs for sales calls?

ElevenLabs excels at one-shot audio generation for content like audiobooks and media. Smallest.ai's Lightning model is optimized for real-time conversational use cases, delivering sub-100ms synthesis latency, native barge-in support, and dynamic emotion tags that adjust tone mid-call. These features are critical for sales calls where natural turn-taking and responsive delivery determine whether a prospect stays on the line.

Do I need to disclose that the caller is an AI?

In many jurisdictions, yes. Regulations vary by region, so check your local laws. Even where disclosure isn't legally required, transparency tends to increase engagement. A brief, honest introduction ("Hi, this is Sarah, an AI assistant calling on behalf of [Company]") sets expectations and often makes prospects more willing to continue the conversation.

How do I measure whether my AI sales agent is performing well?

Track response latency (target under 300ms), conversion rate (meetings booked or deals progressed), average call duration, and prospect sentiment. Automate sentiment tracking with speech analytics. Compare all metrics against your human sales team's baseline to identify where the agent outperforms and where it needs tuning.

Can the AI agent integrate with my CRM?

Yes. Between the STT output and the LLM input, you can inject custom logic that pulls prospect data from your CRM (Salesforce, HubSpot, or similar) in real time. This lets the agent reference the prospect's company size, recent interactions, or deal stage during the conversation, making calls feel personalized rather than generic.

What happens if the AI agent gets confused or stuck during a call?

Build a human escalation path into your pipeline. When the agent's confidence drops below a threshold (for example, after two consecutive unclear inputs), it should gracefully transfer the call to a live rep with a brief context handoff. This prevents the agent from hallucinating responses or frustrating the prospect.
