Research

Recent progress in voice AI has been driven by scaling pipeline architectures and fusing speech-to-speech models, producing commercially useful systems but leaving fundamental limitations unresolved. These approaches inherit the sequential, turn-based, half-duplex constraints of text-centric processing, and cannot replicate the concurrent nature of human conversation. In this paper, we argue that passing the Turing test in real-time spoken dialogue requires not scale alone, but structure and scale together. We propose Hydra, an asynchronous voice architecture built around a latent-space world model, where listening, thinking, speaking, and tool-calling operate as independent concurrent processes over a shared compressed representation. Drawing on evidence from cognitive science and neuroscience, we argue that voice intelligence is better characterized by structural properties such as asynchronous processing, separation of intelligence from memory, and inference-time adaptation than by parameter count alone. Hydra achieves conversational dynamics structurally equivalent to human conversation for the first time, with parameter efficiency gains of 100–1000× over token-based models, and enables a new capability we term Artificial Special Intelligence: compact reasoning cores that rapidly self-specialize to deployment domains through use.
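The idea of listening, thinking, and speaking as independent concurrent processes over a shared representation can be illustrated with a toy sketch. This is not the Hydra implementation: `SharedState`, `listen`, `think`, `speak`, and `run_conversation` are hypothetical names, the "latent" state is a plain Python object, tool-calling is omitted, and cooperative `asyncio` tasks stand in for whatever concurrency mechanism the real system uses.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical stand-in for the shared compressed representation."""
    heard: list = field(default_factory=list)
    thoughts: list = field(default_factory=list)
    spoken: list = field(default_factory=list)
    log: list = field(default_factory=list)  # event order, for inspection
    done_listening: bool = False
    done_thinking: bool = False


async def listen(state, frames):
    # Write each incoming frame into the shared state, then yield control.
    for frame in frames:
        state.heard.append(frame)
        state.log.append(("listen", frame))
        await asyncio.sleep(0)
    state.done_listening = True


async def think(state):
    # Process whatever has been heard so far; never wait for the full input.
    seen = 0
    while True:
        while seen < len(state.heard):
            state.thoughts.append(f"idea:{state.heard[seen]}")
            state.log.append(("think", state.heard[seen]))
            seen += 1
        if state.done_listening:
            break
        await asyncio.sleep(0)
    state.done_thinking = True


async def speak(state):
    # Voice each thought as soon as it exists.
    said = 0
    while True:
        while said < len(state.thoughts):
            state.spoken.append(state.thoughts[said].replace("idea:", "say:"))
            state.log.append(("speak", state.spoken[-1]))
            said += 1
        if state.done_thinking:
            break
        await asyncio.sleep(0)


def run_conversation(frames):
    async def main():
        state = SharedState()
        await asyncio.gather(listen(state, frames), think(state), speak(state))
        return state
    return asyncio.run(main())
```

Calling `run_conversation(["hi", "can", "you"])` and inspecting `state.log` shows speech events interleaved with listening events, rather than one phase completing before the next begins as in a turn-based pipeline.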

The Smallest AI Thesis

Voice is a physical signal. It exists in the real world as pressure waves propagating through air, captured by the ear and transduced into neural activity. It is continuous, analog, and fundamentally temporal: every syllable, every pause, every shift in pitch carries meaning that is inseparable from when it occurs. Voice is how humans evolved to communicate, negotiate, teach, and reason. It is the oldest and most natural interface for intelligence. Text, by contrast, does not exist in the physical world. Text is a human invention, a discrete symbolic encoding system created to preserve speech across time and space. Critically, text is consumed through vision, not through audition. When you read a sentence, your eyes scan symbols on a surface; your visual cortex processes shapes and patterns. Text has no time domain. The word "hello" on a page does not take 400 milliseconds the way the spoken word does. It has no prosody, no emphasis, no emotional coloring. It is an abstraction, a lossy compression of the rich, continuous, temporally grounded signal that is voice.

Topics of our research

Compute-Memory Separation

LLMs memorize more information as they grow larger. This creates an illusion of intelligence, because evaluations are gamed through overfitting. We train models more than 100x smaller than LLMs and complement them with infinite memory.
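A minimal sketch of the separation, under stated assumptions: the "core" below is a fixed function that stores no facts, and everything it knows comes from an external store that can grow without retraining. `ExternalMemory`, `cosine`, and `small_core` are hypothetical names, and bag-of-words cosine similarity is a stand-in for whatever retrieval mechanism a real system would use.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExternalMemory:
    """Hypothetical memory layer: grows by appending, not by training."""

    def __init__(self):
        self.entries = []

    def write(self, text):
        self.entries.append((text, Counter(text.lower().split())))

    def read(self, query, k=1):
        q = Counter(query.lower().split())
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def small_core(query, memory):
    """Hypothetical fixed-size core: holds no facts of its own."""
    facts = memory.read(query, k=1)
    return facts[0] if facts else "unknown"


memory = ExternalMemory()
memory.write("the clinic opens at 9 am")
memory.write("the invoice is due on friday")
answer = small_core("when is the invoice due", memory)
```

Adding a new fact is a `write` call; the core itself does not change size.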

Asynchronous Thinking

Humans do not wait for sensory inputs to provide full context before starting to form thoughts or reply. LLMs, however, operate on full context, which leads to slower outputs and inefficient compute utilization. Asynchronous thinking enables decoding over streaming inputs without waiting for full context.
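The contrast between full-context and streaming decoding can be shown with a toy example. The "decoder" here simply upper-cases words, which is an assumption made for illustration; `decode_full_context`, `decode_streaming`, and `counted` are hypothetical names. The point is only when the first output becomes available.

```python
from typing import Iterable, Iterator


def decode_full_context(chunks: Iterable[str]) -> str:
    """Baseline: consume every chunk, then produce a single output."""
    words = list(chunks)  # blocks until the input is complete
    return " ".join(w.upper() for w in words)


def decode_streaming(chunks: Iterable[str]) -> Iterator[str]:
    """Emit a partial output as soon as each chunk arrives."""
    partial = []
    for chunk in chunks:
        partial.append(chunk.upper())
        yield " ".join(partial)


def counted(chunks, counter):
    """Wrap an input stream and record how many chunks were consumed."""
    for chunk in chunks:
        counter["consumed"] += 1
        yield chunk


counter = {"consumed": 0}
stream = decode_streaming(counted(["book", "a", "table"], counter))
first_output = next(stream)                 # available after one chunk
consumed_at_first_output = counter["consumed"]
remaining_outputs = list(stream)
```

Both functions reach the same final string, but the streaming version has something to say after the first chunk, while the baseline is silent until the last one.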

Evaluation of Intelligence

Intelligence of agents is directly proportional to the GDP generated by the agents. We are moving from a world of static, academic, train-time evaluations to dynamic, real-world, test-time evaluations. This is non-trivial because of distribution shift during inference.

Continual Learning

The intelligence and memory layers need to stay constantly relevant to the task at hand. Train-time backpropagation is time-consuming, and human-in-the-loop RL introduces bias. Hence it is important to learn continually during inference, where one does not have the luxury of exploration in a simulated environment.
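One simple way to picture learning during inference is a predict-then-update loop, where the model answers each input first and adjusts itself only afterwards, using the feedback that arrives in deployment. The sketch below uses a single-weight linear model with a least-mean-squares update; this is an illustrative assumption, not the method described here, and `online_update` and `run_stream` are hypothetical names.

```python
def online_update(weight, x, target, lr=0.1):
    """Predict with the current weight, then take one small corrective step."""
    prediction = weight * x
    error = target - prediction
    new_weight = weight + lr * error * x
    return prediction, new_weight


def run_stream(stream, weight=0.0, lr=0.1):
    """Process (input, feedback) pairs one at a time, with no separate training phase."""
    predictions = []
    for x, target in stream:
        prediction, weight = online_update(weight, x, target, lr)
        predictions.append(prediction)  # the answer given before learning from it
    return weight, predictions


# Toy deployment stream where the true relationship is target = 2 * x.
stream = [(1.0, 2.0), (2.0, 4.0)] * 10
final_weight, predictions = run_stream(stream)
```

There is no exploration and no replay: every example is seen once, in the order it occurs, and the prediction made on it counts.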

Modality Fusion

Babies learn to speak before they learn to write. Speech and text can be learnt independently. Hence the mapping learnt between speech and text is, in some sense, non-scientific, which makes fusing these modalities a hard problem. Additionally, audio is a dense signal, which makes it non-trivial to tokenize and map to text.

Top Research

World Models for Voice

April 2026

SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS

January 2026

Artificial Special Intelligence: Beyond Scaling Laws Towards Structured Intelligence

January 2026

Evaluating the Lightning-v2 Multilingual TTS Model

October 2025

CLIPDraw++: Text-to-Sketch Synthesis with Simple Primitives

July 2025

Low Resource Indic Language Translation Shared Task

June 2024

Build the future of voice agent orchestration

Contact sales

311 California Street, Suite 320
San Francisco, CA 94104

Models

Text to Speech

Speech to Text

Speech to Speech

Voice cloning

Agents

Overview

On Prem

Industries

Debt Collection

Healthcare

Real Estate

Small business

E-commerce

Documentation

For Agents

For Models

Resources

Pricing

Blogs

Research

Careers

Voice AI apps

Initiatives

Startup Grants

Legals

MSA

Privacy notice

HIPAA Agreement

Terms and conditions

Data processing

User Policy

Twitter

Instagram

Youtube

Discord

Substack

Medium

System status operational

We are SOC 2, GDPR, and HIPAA compliant
