Research

Recent progress in voice AI has been driven by scaling pipeline architectures and fusing speech-to-speech models, producing commercially useful systems but leaving fundamental limitations unresolved. These approaches inherit the sequential, turn-based, half-duplex constraints of text-centric processing, and cannot replicate the concurrent nature of human conversation. In this paper, we argue that passing the Turing test in real-time spoken dialogue requires not scale alone, but structure and scale together. We propose Hydra, an asynchronous voice architecture built around a latent-space world model, where listening, thinking, speaking, and tool-calling operate as independent concurrent processes over a shared compressed representation. Drawing on evidence from cognitive science and neuroscience, we argue that voice intelligence is better characterized by structural properties such as asynchronous processing, separation of intelligence from memory, and inference-time adaptation than by parameter count alone. Hydra achieves conversational dynamics structurally equivalent to human conversation for the first time, with parameter efficiency gains of 100–1000× over token-based models, and enables a new capability we term Artificial Special Intelligence: compact reasoning cores that rapidly self-specialize to deployment domains through use.

Topics of our research

Compute-Memory Separation

LLMs memorize more information as they grow larger. This creates an illusion of intelligence, as evaluations are gamed through overfitting. We train models more than 100x smaller than LLMs and complement them with infinite memory.
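
As a rough illustration of keeping memory outside the model, the sketch below pairs a fixed, tiny "core" with an append-only store that can grow without changing the core. Everything here is an assumption made for illustration: the names `featurize` and `ExternalMemory`, the word-set features, and the Jaccard-overlap lookup are hypothetical stand-ins, not a description of Hydra's actual memory layer.

```python
from dataclasses import dataclass, field


def featurize(text: str) -> frozenset:
    """Toy stand-in for a compact model's encoder: a set of lowercase words."""
    return frozenset(text.lower().split())


@dataclass
class ExternalMemory:
    """Append-only store; it grows without touching the core's parameters."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def write(self, text: str, value: str) -> None:
        self.keys.append(featurize(text))
        self.values.append(value)

    def read(self, query: str):
        """Return the value whose key has the highest Jaccard overlap with the query."""
        if not self.keys:
            return None
        q = featurize(query)

        def score(key: frozenset) -> float:
            union = q | key
            return len(q & key) / len(union) if union else 0.0

        best = max(range(len(self.keys)), key=lambda i: score(self.keys[i]))
        return self.values[best]


memory = ExternalMemory()
memory.write("opening hours", "9am to 5pm")
memory.write("refund policy", "30 days")
answer = memory.read("what are the opening hours")
```

Adding more entries changes only the store, so the amount of remembered information is decoupled from the size of the component that does the processing.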

Asynchronous Thinking

Humans do not wait for sensory inputs to provide full context before starting to form thoughts or reply. LLMs, however, operate on full context, which leads to slower outputs and inefficient compute utilization. Asynchronous thinking enables decoding over streaming inputs without waiting for full context.
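
The idea of decoding over a stream can be sketched with three concurrent coroutines connected by queues. This is a minimal sketch under stated assumptions: `listen`, `think`, `speak`, and `run` are hypothetical names, and the "thinking" step is a placeholder that only formats the chunks seen so far rather than running a model.

```python
import asyncio


async def listen(chunks, inbox):
    """Push input chunks as they 'arrive', then signal end of stream."""
    for chunk in chunks:
        await inbox.put(chunk)
        await asyncio.sleep(0)  # yield control to simulate streaming arrival
    await inbox.put(None)


async def think(inbox, outbox):
    """Emit a partial output per chunk instead of waiting for the full input."""
    seen = []
    while (chunk := await inbox.get()) is not None:
        seen.append(chunk)
        await outbox.put(f"partial({' '.join(seen)})")
    await outbox.put(None)


async def speak(outbox, spoken):
    """Consume partial outputs as soon as they are available."""
    while (item := await outbox.get()) is not None:
        spoken.append(item)


async def run(chunks):
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    spoken = []
    # All three stages run concurrently; none waits for the whole input.
    await asyncio.gather(
        listen(chunks, inbox), think(inbox, outbox), speak(outbox, spoken)
    )
    return spoken


result = asyncio.run(run(["book", "a", "table"]))
```

Each stage starts work as soon as its first item arrives, so the first partial output is produced after one chunk rather than after the full utterance.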

Evaluation of Intelligence

Intelligence of agents is directly proportional to the GDP generated by the agents. We are moving from a world of static, academic, train-time evaluations to dynamic, real-world, test-time evaluations. This is non-trivial due to distribution shift during inference.

Continual Learning

Intelligence and Memory layers need to constantly stay relevant to the task at hand. Train-time backpropagation is time-consuming, and human-in-the-loop RL introduces bias. Hence it is important to learn continually during inference, where one does not have the luxury of exploration in a simulated environment.
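
One simple way to picture learning during inference is a model that takes a single small update each time feedback arrives, with no separate training phase. The sketch below uses a per-example online gradient step on a linear model. This technique is swapped in purely for illustration and is not claimed to be the method used here. The class name `OnlineLinearModel`, the learning rate, and the toy feedback stream are all assumptions.

```python
class OnlineLinearModel:
    """Linear predictor updated one example at a time, as feedback arrives."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, target: float) -> float:
        """Take one gradient step on squared error; return the pre-update error."""
        error = self.predict(x) - target
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        return error


# Hypothetical feedback stream generated by the rule y = 2*x0 - x1.
stream = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]

model = OnlineLinearModel(n_features=2)
for _ in range(200):
    for x, y in stream:
        model.predict(x)    # serve the prediction first...
        model.update(x, y)  # ...then adapt from the observed outcome
```

The model only ever sees examples it encounters while serving predictions, which mirrors the constraint of adapting without a simulated environment to explore.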

Modality Fusion

Babies learn to speak before they learn to write. Speech and text can be learnt independently. Hence the mapping learnt between speech and text is, in some sense, non-scientific, which makes fusing these modalities a hard problem. Additionally, audio is a dense signal, so it is non-trivial to tokenize and map to text.