Mon Jun 30 2025 • 13 min Read
Inference Optimization - How to Optimize a Model for Latency?
Speed is good, but flow is better. Learn how to optimize your AI model inference with PyTorch—from torch.compile() to CUDA Graphs, dynamic shapes, memory pre-allocation, and kernel fusion. Real code, real speed-ups, real-world insights.
Nityanand Mathur
Data Scientist
Imagine if you had to check your map every single day just to get from home to the office. Every turn, every lane, every signal is a constant stream of decisions, each one slowing you down just a little. Now imagine learning the route on day one. No map, no second-guessing, just smooth, effortless movement. You’re not just faster, you’re freer. Your mind is lighter, focused on the journey, not the directions.
That’s optimizing for latency. It’s not just about speed. It’s about removing every unnecessary check, every wasted pause, every hesitation. It’s about creating flow. When your system knows exactly where it’s going, it moves with purpose, with clarity. Optimization is about making the work disappear.
Let’s get practical!
Let's begin with a perfectly functional transformer model that will act as our baseline, one that gets the job done. Think of it as a dependable vehicle that successfully transports you from point A to point B. There's nothing wrong with it; it works exactly as designed. But like any good engineer, we can see opportunities to make it even better.
Here’s a minimal implementation of DistilBERT inference using the transformers library.
import torch
from transformers import AutoModel, AutoTokenizer
import time

# Small transformer model - our baseline implementation
model_name = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_name).to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Sample text
text = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to('cuda')

# Baseline inference - our current working implementation
model.eval()
with torch.no_grad():
    for i in range(100):
        if i == 0:
            start_time = time.time()
        outputs = model(**inputs)
    torch.cuda.synchronize()  # wait for queued GPU work so the timing is accurate
    baseline_time = time.time() - start_time

print(f"Baseline time for 100 inferences: {baseline_time:.4f} seconds")
>>> Baseline time for 100 inferences: 0.3587 seconds
torch.compile() - Learning the Route
Now let's enhance our baseline with torch.compile() - PyTorch's built-in optimization that transforms our dynamic execution into a more efficient static graph. Think of this as the moment when you become familiar enough with your route that you can drive it smoothly without constantly consulting directions.
# Enhance our baseline with compilation
compiled_model = torch.compile(model)
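# A minimal timing sketch (assumed harness, mirroring the baseline loop):
# the first call triggers compilation, so warm up before starting the clock.
with torch.no_grad():
    compiled_model(**inputs)                     # warm-up: compilation happens here
    torch.cuda.synchronize()
    start_time = time.time()
    for _ in range(100):
        outputs = compiled_model(**inputs)
    torch.cuda.synchronize()
    compiled_time = time.time() - start_time
print(f"Compiled time for 100 inferences: {compiled_time:.4f} seconds")
print(f"Improvement over baseline: {baseline_time / compiled_time:.2f}x")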
>>> Compiled time for 100 inferences: 0.1874 seconds
>>> Improvement over baseline: 1.91x
torch.compile() traces through your model's execution once, recording every operation, every tensor shape, and every decision point. It then optimizes this trace, eliminating redundant checks and creating a streamlined execution path, as shown in figure 1. The dynamic Python execution is turned into a sequence of static operations, much like compiled C code: the countless if-else branches Python normally evaluates (checking tensor shapes, choosing CUDA kernels, deciding memory layouts) are replaced by a predetermined sequence of operations.
CUDA Graphs - The Highway System for Your Computations
torch.compile() has an even more powerful trick up its sleeve: CUDA Graphs. If regular CUDA kernel launches are like city driving, where you stop at lights, yield to traffic, and deal with unpredictable conditions, then CUDA Graphs are like having your own private highway with all green lights.
# Enable CUDA Graphs for maximum throughput
compiled_model_with_cudagraphs = torch.compile(model, mode="max-autotune")
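# Timed with the same warm-up-then-measure loop as above; the first call is much
# slower here because max-autotune benchmarks candidate kernels before settling on one.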
>>> CUDA Graph time for 100 inferences: 0.0259 seconds
>>> Speedup over baseline: 13.09x
CUDA Graphs eliminate the overhead of launching individual kernels. Normally, each operation requires a round trip between CPU and GPU - the CPU instructs the GPU what to do, waits for acknowledgment, then sends the next instruction. With CUDA Graphs, the entire sequence of GPU operations is recorded once and then replayed as a single unit as shown in figure 2.
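Under the hood, torch.compile() manages this capture-and-replay for you, but you can see the mechanism directly through PyTorch's CUDA Graph API. Here's a minimal, self-contained sketch using a toy model rather than our DistilBERT, following the standard warm-up-on-a-side-stream pattern; the exact model and shapes are illustrative only.
import torch

toy_model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
).cuda().eval()
static_input = torch.randn(32, 512, device="cuda")

# Warm up on a side stream so lazy initialization doesn't end up in the graph
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        toy_model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Record the whole forward pass once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = toy_model(static_input)

# ...then replay it: copy new data into the captured input buffer and replay
static_input.copy_(torch.randn(32, 512, device="cuda"))
graph.replay()               # one launch replays the entire recorded sequence
print(static_output.sum())   # static_output now holds the result for the new input
The key constraint is visible here: the graph only knows about the exact buffers and shapes it was recorded with, which is why new inputs are copied into the same static tensor rather than passed in fresh.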
Max-autotune - Is a CUDA Graph All It Generates?
The max-autotune mode doesn't just optimize your existing operations; it generates entirely new ones. Think of it as having an expert mechanic who doesn't just tune your current engine but builds you a completely custom engine designed specifically for your route. It generates custom Triton kernels specifically for your model's operations.
The process works like this: PyTorch generates multiple Triton kernel implementations for each operation, benchmarks them all, and selects the fastest one. It's like having a team of expert drivers test different routes to your destination and then teach you the optimal path they discovered. These custom kernels often outperform hand-written CUDA implementations because they're generated with perfect knowledge of your specific use case. They know exactly what tensor shapes to expect, what memory access patterns will occur, and how to minimize data movement between GPU memory hierarchies.
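If you're curious what those generated kernels actually look like, you can ask PyTorch to print the code Inductor produces. This is a quick sketch assuming PyTorch 2.x's torch._logging interface (the TORCH_LOGS="output_code" environment variable does the same thing):
import torch._logging

# Enable output-code logging so generated Triton/C++ kernels are printed
torch._logging.set_logs(output_code=True)

inspect_model = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    inspect_model(**inputs)   # kernels are generated (and logged) on the first call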
But wait, what if my route changes daily?
What happens when your model needs to handle different input sizes? Imagine if your daily commute route changed based on the weather, traffic, or your mood. This is the dynamic shape problem: your transformer needs to process sentences of varying lengths, or your computer vision model handles images of different resolutions. The challenge is that CUDA Graphs are inherently static - they record a specific sequence of operations with fixed tensor shapes. If your input shapes change, the graph becomes invalid, like having memorized directions that only work when traffic flows in a specific pattern.
This is where max-autotune-no-cudagraphs shines. It's like having a highly skilled driver who knows multiple optimal routes and can adapt in real time without losing the benefits of experience and expertise.
# Compile with max-autotune but without CUDA graphs for dynamic shapes
flexible_model = torch.compile(model, mode="max-autotune-no-cudagraphs")
The max-autotune-no-cudagraphs mode gives you the best of both worlds. You still get the custom Triton kernels generated for your specific operations, the graph optimizations that eliminate redundant computations, and the fused operations that reduce memory bandwidth. What you lose is the CUDA Graph replay mechanism, but what you gain is the ability to handle varying input shapes without recompilation.
This approach is particularly powerful for production systems where you can't predict input sizes in advance. It's like having an expert taxi driver who doesn't need to memorize one specific route but has internalized the principles of efficient navigation and can apply them to any destination.
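To make this concrete, here's a small sketch (reusing the tokenizer and flexible_model from above) that pushes inputs of three different lengths through the same compiled model. torch.compile may recompile for the first couple of new shapes, after which it typically switches to shape-generic kernels instead of invalidating a recorded graph:
texts = [
    "Short sentence.",
    "A somewhat longer sentence that produces a different number of tokens.",
    "An even longer input, padded differently, to exercise yet another sequence length in the same compiled model.",
]

with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to('cuda')
        outputs = flexible_model(**inputs)
        # Different input lengths, same compiled model, no graph invalidation
        print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)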
That's a lot of talk about optimizing operations for efficiency. What else can we do?
The Long Drive Advantage - How Pre-Allocating Tensors Is Your Full Tank
Imagine you're out on a long trip with a full tank: you won't have to stop for refills and can cover the journey in one go. Similarly, pre-allocating tensors helps your model cruise through inference without the stuttering stops of memory allocation. Every time your model asks the system for memory mid-inference, it's like pulling over to search for a gas station: the journey halts, precious time ticks away, and the smooth rhythm breaks.
Each time PyTorch creates a new tensor during inference, it's like encountering an unexpected toll booth on your highway. The request seems instant, but beneath the surface lurks a complex dance: finding available memory, checking alignment requirements, updating allocation tables, and sometimes even defragmenting existing memory blocks.
Traditional tensor allocation suffers from an average fragmentation of 21.3% at larger batch sizes. Memory optimization techniques eliminate this fragmentation entirely while reducing overall memory usage by 20-70%, depending on the neural network architecture. But the real transformation happens when you combine pre-allocation with intelligent memory pooling strategies.
Let’s make this concrete with a practical example demonstrating the difference between allocating a new output array on every call and reusing a pre-allocated one.
import time
import numpy as np

def model_inference(input_array, pre_allocated_array=None):
    # Simulate a simple model operation using matrix multiplication
    weight = np.random.randn(input_array.shape[1], 3)
    if pre_allocated_array is None:
        # Allocate the output array dynamically (like stopping for fuel every time)
        output = np.dot(input_array, weight)
    else:
        # Write straight into the pre-allocated array (like starting with a full tank)
        np.dot(input_array, weight, out=pre_allocated_array)
        output = pre_allocated_array
    return output

# Input array simulating a batch of data (batch_size=1000, features=512)
input_array = np.random.randn(1000, 512)

# Pre-allocate the output array once (batch_size=1000, output_features=3)
pre_allocated_array = np.empty((1000, 3))

# Measure time without pre-allocation
start_time = time.time()
for _ in range(100):
    output_dynamic = model_inference(input_array)
dynamic_time = time.time() - start_time

# Measure time with pre-allocation
start_time = time.time()
for _ in range(100):
    output_pre_alloc = model_inference(input_array, pre_allocated_array)
pre_alloc_time = time.time() - start_time

print(f"Time without pre-allocation: {dynamic_time:.4f} seconds")
print(f"Time with pre-allocation: {pre_alloc_time:.4f} seconds")
>>> Time without pre-allocation: 0.1738 seconds
>>> Time with pre-allocation: 0.0353 seconds
Every memory allocation requires system calls, permission checks, and administrative bookkeeping. Pre-allocation handles this once at the start instead of thousands of times per inference, as shown in figure 5. Modern GPUs and CPUs have sophisticated prediction mechanisms that work best with consistent patterns. It's like cruise control: once the system knows what's coming, it can optimize accordingly.
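The same idea carries over to GPU tensors in PyTorch: allocate the output buffer once with torch.empty and write into it through the out= argument. This is a minimal sketch with made-up shapes; note that PyTorch's caching allocator already hides much of the raw allocation cost, but reusing a buffer still avoids per-call bookkeeping and keeps memory usage predictable.
import torch

a = torch.randn(1000, 512, device="cuda")
w = torch.randn(512, 3, device="cuda")

# Allocate the result buffer once, up front
result = torch.empty(1000, 3, device="cuda")

for _ in range(100):
    # Writes directly into the pre-allocated buffer - no new output tensor per call
    torch.matmul(a, w, out=result)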
So far so good, can we squeeze out more?
Yes! Your GPU can learn to combine multiple operations into one fluid motion. This is the magic of kernel fusion. It is the art of teaching your system to stop making unnecessary trips to memory and instead keep everything flowing smoothly. Your GPU has two types of memory: the lightning-fast registers right next to the compute cores, and the slower global memory that requires a trip across town. Traditional operations are like making separate trips to the store for each ingredient. Fused operations? That's your smart shopping list that gets everything in one smooth journey.
Let’s try to optimize simple addition and multiplication operations using fused kernels.
# Unfused Operations
def scenic_route(a, b, c, value):
    temp = b * c          # Trip 1: Compute, write to memory
    temp = temp * value   # Trip 2: Read, compute, write back
    result = a + temp     # Trip 3: Read both, compute, write result
    return result

# Fused Operations
def express_lane(a, b, c, value):
    return torch.addcmul(a, b, c, value=value)  # One smooth journey
Instead of launching three separate GPU kernels, each with its own memory overhead, addcmul does it all in a single operation. Our intermediate results never touch the slow global memory. They flow directly through the GPU's fast registers like water through a perfectly engineered pipeline as shown in figure 6.
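To see the difference on your own hardware, a rough benchmark sketch (assuming a CUDA device and the two functions defined above; exact numbers will vary with tensor size and GPU) could look like this:
import time
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = torch.randn(4096, 4096, device="cuda")

def benchmark(fn, iters=100):
    # Synchronize around the loop so we time actual GPU work, not just kernel launches
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(a, b, c, 0.5)
    torch.cuda.synchronize()
    return time.time() - start

print(f"Unfused (scenic_route): {benchmark(scenic_route):.4f} seconds")
print(f"Fused (express_lane):   {benchmark(express_lane):.4f} seconds")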
End of our journey
From roaming city streets to custom-built highways, from pit stops for fuel to seamless, uninterrupted cruises, this journey through inference optimization has been about more than just speed. It's been about understanding the terrain, tuning the engine, and finally, letting the system flow with purpose and precision.
We started with a reliable model and turned it into a performance machine, using torch.compile() to learn the route, CUDA Graphs to remove the traffic lights, max-autotune to craft the perfect engine, and memory optimizations to squeeze every bit of fuel.
But remember, optimization is never truly over. Like any road, new conditions arise, shapes change, models evolve, workloads shift. The key is not just in knowing one path, but in mastering the skill of adapting swiftly and intelligently.
So as we wrap this drive, take with you not just the tools, but the mindset of clarity, of efficiency, and of creating systems that don’t just work, but flow.
Happy optimizing!