
Thu Jul 03 2025 · 13 min read

The Limits of Large Fused Kernels on Nvidia GPUs: Why Real-Time AI Inference Needs More

Despite their theoretical benefits, large fused kernels on Nvidia GPUs fall short for real-time AI inference. Learn why GPU architectural limits, such as memory latency, poor SM utilization, and the lack of on-chip communication, make sub-50ms latency nearly impossible.


Ranjith

Senior AI Inference Performance Engineer


In our previous blog, we discussed why Nvidia GPUs, despite their raw compute power, often fall short when it comes to real-time inference workloads—particularly at low batch sizes, where latency is critical. We briefly touched on kernel fusion as a technique that’s commonly used to optimize inference performance by reducing memory access overhead. In this follow-up, we’ll take a deeper look at why even the most advanced kernel fusion fails to deliver the low-latency execution required for real-time systems—and why these techniques can’t overcome the fundamental architectural limitations present in today’s GPU designs.

What Are Large Fused Kernels, and Why Do They Seem Promising?

One of the most commonly used strategies to reduce inference latency on GPUs is kernel fusion—combining multiple layers or operations into a single GPU kernel to avoid the overhead of launching many smaller ones. When this is taken further, you end up with large, fused kernels that attempt to execute significant portions—or even the entirety—of the model’s forward pass in one go. 
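As a toy sketch of the idea (hypothetical kernels written directly in CUDA, not taken from any particular framework): instead of launching one kernel for a bias add and a second one for the activation, with the intermediate result round-tripping through DRAM in between, the two element-wise operations can be fused so the intermediate value never leaves a register.

    #include <cuda_runtime.h>

    // Unfused version: two kernel launches. The intermediate y = x + b is
    // written to global memory by the first kernel and read back by the second.
    __global__ void bias_add(const float* x, const float* b, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = x[i] + b[i];
    }

    __global__ void relu(const float* y, float* z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = fmaxf(y[i], 0.0f);
    }

    // Fused version: one launch, one read and one write per element; the
    // intermediate sum stays in a register the whole time.
    __global__ void bias_add_relu(const float* x, const float* b, float* z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = fmaxf(x[i] + b[i], 0.0f);
    }

Large fused kernels push this idea much further, chaining matmuls, normalizations, and activations and staging intermediates in shared memory rather than registers, but the motivation is the same: cut launches and avoid trips to DRAM.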

At first glance, this seems like an effective optimization. These large kernels aim to:

  • Reduce global DRAM usage by keeping intermediate activations in shared memory,
  • Eliminate the latency overhead from launching many discrete kernels,
  • Avoid memory stalls that typically occur between layers during conventional layer-by-layer execution.

In scenarios with high batch sizes and uniform input shapes—such as training pipelines or offline inference—this approach can bring solid performance gains. The ability to fuse multiple operations helps with better hardware utilization and improves throughput by minimizing data movement and scheduling overheads.

Complexities in Developing Large Fused Kernels

While large fused kernels can offer modest gains in training or high-batch inference scenarios—where inputs are uniform and compute-heavy—the same strategies break down in real-time, low-batch inference. When processing a single sample at a time (as is typical in voice agents, online translation, and other latency-sensitive applications), the complexity and limitations of GPU architecture become far more pronounced.

1. Dynamic Input Shapes

Real-time workloads often deal with variable input sizes, such as different context lengths in speech or NLP models. Large fused kernels typically assume static shapes for efficiency, making them brittle in practice.

For example, a softmax over a long context length may not fit on a single SM due to on-chip memory constraints, as the quick calculation below shows. Without batching, spreading the operation across SMs becomes necessary—and that leads to the next issue.
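To put a rough number on that memory constraint (the ~100 KB figure is an assumed, rounded value for the usable shared memory of a recent Nvidia SM; the exact limit varies by architecture): keeping just one fp32 attention score per token for a single query row already gives

    32,768 tokens × 4 bytes ≈ 128 KB  >  ~100 KB of shared memory per SM

so at long context lengths even a single softmax row overflows one SM's fast memory and must be tiled across SMs or spilled to DRAM.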

2. Synchronization and Deadlocks

Large kernels often involve shared memory or intermediate sync points between operations. But standard CUDA barriers only synchronize the threads of a single block, which runs on a single SM; there is no general-purpose barrier across SMs. If a fused computation is split across SMs and requires synchronization (e.g., between attention steps), you risk deadlocks where some SMs are stalled waiting for others.

To avoid this, developers often restrict the number of SMs used, limiting parallelism on otherwise massive GPUs. Ironically, this underutilization creates a performance ceiling that is especially harmful for single-batch inference, where every millisecond counts.
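A minimal sketch of the failure mode, assuming a hand-rolled barrier built from atomics (not any library's API): each block signals its arrival in global memory and then spins until every other block has arrived. If the grid is larger than the number of blocks the GPU can keep resident at once, the resident blocks spin forever and the remaining blocks never get scheduled, so the kernel deadlocks. (CUDA's cooperative-launch grid sync is the sanctioned alternative, but it enforces the same condition: every block of the grid must be co-resident, which again caps how many blocks, and therefore SMs, you can use.)

    #include <cuda_runtime.h>

    __device__ unsigned int arrived = 0;

    __global__ void two_phase_kernel(float* out)
    {
        // Phase 1: each block would compute its partial result here.

        if (threadIdx.x == 0) {
            atomicAdd(&arrived, 1u);
            // Naive cross-SM "barrier": spin until every block has arrived.
            // This only terminates if ALL blocks of the grid are resident on
            // the SMs at the same time; otherwise the resident blocks wait
            // forever for blocks that can never be scheduled.
            while (atomicAdd(&arrived, 0u) < gridDim.x) { /* busy-wait */ }
        }
        __syncthreads();  // block-level barrier: this is what CUDA does support

        // Phase 2: work that assumes every block finished phase 1.
        if (blockIdx.x == 0 && threadIdx.x == 0) out[0] = 1.0f;
    }

    int main()
    {
        float* d_out = nullptr;
        cudaMalloc(&d_out, sizeof(float));

        // A small grid like this happens to fit entirely on the device, so the
        // spin loop terminates. Make the grid larger than the number of blocks
        // that can be co-resident and this call never returns.
        two_phase_kernel<<<8, 128>>>(d_out);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }

This is why, in practice, fused kernels that need cross-SM coordination are written to use at most as many blocks as can be co-resident, deliberately leaving part of the GPU idle.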

The Latency Wall: Why Real-Time Inference Can’t Reach Silicon Peak

Even after navigating the complexities of large kernel fusion—dynamic input handling, synchronization challenges, and careful SM allocation—the architectural limits of GPUs still hold back real-time inference.

Fusion techniques aim to reduce memory access and kernel launch overhead, and they can yield moderate gains in high-throughput or batched scenarios. But they fundamentally rely on one assumption: that there's enough parallel work per SM to hide memory latency.

To mask latency effectively, Nvidia GPUs require each SM to have multiple active warps. This way, while one warp waits on memory, others can compute. But in batch-1 real-time workloads, the amount of available work is too small. The result? There just aren’t enough warps to keep the SMs busy—so memory latency becomes visible, and performance collapses well below the hardware's theoretical peak.
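As a rough illustration (the specific numbers here are assumptions; real values depend on the architecture and the kernel): if a global-memory access costs on the order of 400 to 600 cycles and a warp can only issue a couple of dozen cycles of independent work before it stalls on that access, each SM needs very roughly

    required warps ≈ memory latency / independent work per warp ≈ 500 / 20 ≈ 25 resident warps

to keep its pipelines fed. A batch-1 transformer layer spread across 100+ SMs often leaves each SM with only a handful of warps, so the stalls land directly on the end-to-end latency.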

A Real Example: Small Matrix Multiplication on a Large GPU

Let’s consider a simplified example that highlights the same core issue seen with large fused kernels. Imagine performing a 1024x1024 matrix multiplication using a highly optimized Triton kernel—similar to the one described here. This operation involves a fixed number of FLOPs and is a great case to isolate kernel behavior on modern GPUs.

When the Triton kernel is tuned for performance, it automatically chooses a block size—the sub-tiles into which the input matrices are divided across the M, N, and K dimensions (rows, columns, and reduction axis, respectively). This block size determines:

  • How much work each thread block handles,
  • How many blocks (and therefore how many SMs) are active at once.
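For a 1024×1024 output like the one above, the tile (block) size directly sets how many thread blocks the launch produces. Two hypothetical tilings make the point:

    128×128 tiles: (1024/128) × (1024/128) =  8 ×  8 =  64 blocks
     64×64  tiles: (1024/64)  × (1024/64)  = 16 × 16 = 256 blocks

So the autotuner's tile choice decides whether the kernel under- or over-subscribes the 100+ SMs of a large GPU.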

On an Nvidia L40S, with 142 SMs, the optimized Triton kernel selects a configuration that uses only 128 blocks—meaning 14 SMs remain idle. Despite this underutilization, the kernel achieves an execution time of ~13 microseconds and a throughput of approximately 150 TFLOPS, less than half of the GPU's rated 362 TFLOPS. Interestingly, this performance is on par with Nvidia's highly tuned cuBLAS library, indicating that even the most optimized matrix multiplication implementations hit the same ceiling when dealing with small, standalone workloads.
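Those numbers are easy to sanity-check, treating the operation as a full 1024×1024×1024 GEMM (each of the 1024² outputs needs 1024 multiply-adds) and the ~13 microseconds as an approximate figure:

    total FLOPs = 2 × 1024³ ≈ 2.15 × 10⁹
    2.15 × 10⁹ FLOPs / ~13–14 × 10⁻⁶ s ≈ 150–165 TFLOPS

Either way, the kernel lands well under half of the quoted 362 TFLOPS peak.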

Now here’s the critical insight: when we force Triton to pick block sizes that result in ≥142 active blocks—fully engaging every SM—the performance actually gets worse, dropping to ~17 microseconds. That’s because with smaller blocks per SM, the amount of work each SM does shrinks, and there are fewer warps available to hide memory latency. The GPU spends more time waiting and less time computing.

This demonstrates a key limitation: Nvidia GPUs are not optimized for low-granularity workloads, where small units of work must be done with extreme speed. Instead, they are designed to deliver high throughput over large, distributed workloads—the exact opposite of what real-time inference requires.

This problem of memory latency exposure—amplified by small batch sizes and low granularity—is what prevents even the most optimized kernels, including those built with aggressive fusion, from getting anywhere close to silicon peak in real-time applications. Whether it’s a standalone matrix multiplication or a large fused forward pass combining multiple layers into a single kernel, the result is the same: latency bottoms out well before performance peaks. These limitations aren’t simply due to lack of optimization—they’re a direct consequence of how GPU architectures are designed, and they persist even with the most carefully engineered large kernel fusion strategies.

Why Even Fused Kernels Can’t Overcome GPU Architectural Limits

Despite the theoretical benefits of large kernel fusion, several fundamental limitations still remain—especially in the context of single-batch, real-time inference. These issues arise not from suboptimal programming, but from inherent GPU architectural constraints that fusion simply can’t bypass:

1. Deadlocks and Underutilization from Low Granularity

When a small workload—like a single batch—is distributed across a large number of SMs, each SM ends up handling only a tiny fragment of the overall computation. For example, just a few rows or tokens per SM in a speech model. This severely limits granularity, and in turn, the number of SMs that can be used safely without risking synchronization deadlocks.

Even with fusion, the GPU cannot safely scale this computation across all its SMs:

  • Too few warps per SM = memory latency is exposed.
  • Too many SMs = inter-SM coordination risks deadlocks.

This leads to forced underutilization of the hardware. In practice, you hit a hard latency floor that keeps you well short of the sub-50ms bar required for smooth, human-like interactions.

2. Memory Latency Can’t Be Hidden at Small Scales

Even the L1 cache, considered the fastest local memory on the SM, has a latency of ~33 cycles. When workloads are split too finely—like in large fused kernels handling batch-1 inference—these latencies become visible and aren’t effectively hidden by warp scheduling, because there aren’t enough active warps to swap in and out.

This means that even the best-placed data in shared or L1 memory still adds latency, especially when memory access is irregular or fragmented across layers.

3. No Network-on-Chip (NOC) for Cross-SM Coordination

Each SM, even in a fused kernel, typically ends up working on the same layer at any point in time—because there’s no mechanism to let SMs communicate directly. Unlike some AI-specific chips that include a Network-on-Chip (NOC) for fast intra-chip data exchange, Nvidia GPUs rely on global DRAM for all inter-SM communication.

This makes it impossible to parallelize the model across layers—for instance, assigning SM groups to different parts of the pipeline (e.g., encoder vs decoder), which could reduce overall latency. Without fast on-chip communication, such pipelining becomes infeasible.

4. Occupancy Bottlenecked by L1 Size and Warp Context

Every SM has to maintain the context of all its active warps—registers, local memory, and thread state. But this is gated by the amount of available L1/shared memory per SM. So, to maintain high occupancy (i.e., more warps to hide memory latency), you need more L1 capacity.

In fused kernels where multiple operations share memory-heavy buffers or intermediate states, the L1/shared memory becomes the bottleneck, reducing the number of active warps and, once again, exposing memory latency instead of hiding it.
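One way to see this pressure directly (a generic sketch, not tied to any particular fused kernel): CUDA's occupancy API reports how many blocks of a given kernel can be resident on one SM for a given dynamic shared-memory request. As the per-block shared-memory footprint grows, the number of resident blocks, and with it the number of warps available to hide latency, drops.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in for one stage of a fused forward pass whose scratch space is
    // supplied as dynamic shared memory. It is never launched below; we only
    // ask the occupancy calculator what it could co-schedule per SM.
    __global__ void fused_stage(float* buf)
    {
        extern __shared__ float scratch[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        scratch[threadIdx.x] = buf[i];
        __syncthreads();
        buf[i] = scratch[threadIdx.x] + 1.0f;
    }

    int main()
    {
        const int block_size = 256;  // 8 warps per block

        // Sweep the per-block shared-memory request and ask the runtime how
        // many blocks (and therefore warps) one SM could keep resident.
        for (size_t smem = 0; smem <= 48 * 1024; smem += 16 * 1024) {
            int blocks_per_sm = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &blocks_per_sm, fused_stage, block_size, smem);
            printf("%2zu KB shared per block -> %d resident blocks/SM (%d warps)\n",
                   smem / 1024, blocks_per_sm, blocks_per_sm * (block_size / 32));
        }
        return 0;
    }

On real fused kernels the same squeeze comes from static shared-memory buffers and register pressure, but the trend is identical: the larger each block's on-chip footprint, the fewer warps the SM can juggle.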

Together, these challenges explain why even large fused kernels, carefully designed and optimized, fail to unlock the GPU’s full potential for real-time workloads. The fundamental mismatch between GPU architecture and low-latency, small-batch execution leads to underutilization, deadlocks, and memory stalls—putting a hard floor on inference latency, regardless of how sophisticated the kernel design is.

Conclusion: Fusion Isn’t Enough. The Future Demands Architectural Rethinking

Despite aggressive optimizations like large kernel fusion, the fundamental architecture of Nvidia GPUs limits their effectiveness for real-time, low-batch inference. From memory latency that can’t be hidden at small workloads, to deadlocks from over-distributed compute, to the absence of inter-SM communication—these bottlenecks form a hard boundary that fusion techniques alone simply cannot break through.

To truly unlock sub-50ms latency for natural, fluid conversations, we need more than better software—we need fundamentally different hardware. Specifically, architectures that support in-memory compute, where operations are performed directly where the data resides—eliminating costly round-trips to global memory. Combined with a Network-on-Chip (NoC) that enables fast, low-latency communication between compute units, this allows for fine-grained parallelism, layer-level pipelining, and synchronized execution across modules—all of which are essential for breaking through the latency floor imposed by conventional GPUs.

At Smallest AI, we’re building exactly that. Our next-generation platform is being developed on in-memory compute architectures, with the goal of pushing real-time voice agents below the 50ms latency mark, a significant leap forward even from our current best-in-class Lightning V2, which already leads the market in minutes-per-dollar performance.

We’re excited to be moving toward true real-time AI conversations, where latency becomes imperceptible and responsiveness finally matches human expectations.