The transformer architecture generates tokens one at a time during inference. Done naively, each new token means recomputing the keys and values, and therefore attention, for every previous token. This is straightforward to implement and theoretically pure, but computationally catastrophic: with a 7B model and a 100K token context, generation that is impractically slow without KV caching becomes feasible with it.
KV cache is one of those elegant optimizations that produces an order-of-magnitude speedup with a single insight. Understanding it reveals why some inference engines scale to long contexts while others hit memory cliffs.
What Is KV Cache?
In the standard transformer attention mechanism, given an input sequence, you compute three matrices: Query (Q), Key (K), and Value (V). Attention is then:
Attention(Q, K, V) = softmax(QK^T / √d) V
During autoregressive decoding (generating one token at a time), you compute attention for each new token against all previous tokens. But here's the key insight:
The Key and Value vectors of previous tokens don't change. Each step adds only one new Query, Key, and Value, all belonging to the newest token.
This is because attention is causal: earlier tokens never attend to later ones, so their K and V are unaffected when a token is appended. Instead of recomputing K and V for the entire sequence every time, we cache them, compute K and V for the new token only, append those to the cache, and attend with the new token's Q.
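To make the idea concrete, here is a minimal single-head sketch in plain Python. It is a toy, not a real inference engine: the names `decode_without_cache` and `decode_with_cache` are made up for illustration, the "tokens" are random embedding vectors, and there is only one attention layer. It shows that appending to a cache gives the same outputs as recomputing K and V for the whole prefix at every step.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head attention of one query over lists of key and value vectors."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def decode_without_cache(tokens, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [matvec(Wk, x) for x in prefix]
        values = [matvec(Wv, x) for x in prefix]
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs

def decode_with_cache(tokens, Wq, Wk, Wv):
    """Compute K and V once per token and append them to a cache."""
    key_cache, value_cache = [], []
    outputs = []
    for x in tokens:
        key_cache.append(matvec(Wk, x))
        value_cache.append(matvec(Wv, x))
        outputs.append(attend(matvec(Wq, x), key_cache, value_cache))
    return outputs

random.seed(0)
d = 4  # toy embedding size

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

full = decode_without_cache(tokens, Wq, Wk, Wv)
cached = decode_with_cache(tokens, Wq, Wk, Wv)
max_diff = max(abs(a - b) for ra, rb in zip(full, cached) for a, b in zip(ra, rb))
print("largest difference between the two methods:", max_diff)
```

The uncached version calls `matvec` for K and V a number of times that grows quadratically with sequence length; the cached version calls it once per token.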
The Memory Mathematics
For a 7B model with hidden dimension 4096, split into 32 attention heads of 128 dimensions each, and assuming standard multi-head attention (every head keeps its own K and V):
- Each token, per layer: 2 × 4096 values (K and V) × 2 bytes (FP16) = 16,384 bytes ≈ 16KB
- 100K token context: 100,000 × 16,384 bytes ≈ 1.6GB per layer
- 32 transformer layers: ≈ 1.6GB × 32 ≈ 52GB
This is substantial: added to roughly 14GB of FP16 weights, it fits on a single 80GB GPU with little room to spare. Without KV cache, every generated token would require recomputing K and V for all 100K previous tokens, which means pushing the entire context through the model again at each step.
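The arithmetic above is easy to wrap in a small helper. This is a sketch with a made-up function name, `kv_cache_bytes`. The `num_kv_heads` parameter is an addition not discussed above: models using grouped-query attention keep fewer K/V heads than query heads, which shrinks the cache proportionally.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache in bytes (the leading 2 covers K and V)."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return per_token_per_layer * num_layers * seq_len * batch_size

# One token in one layer, FP16, 32 heads of 128 dimensions.
per_token_layer = kv_cache_bytes(1, 32, 128, 1)
print(per_token_layer)            # 16384

# Full 32-layer model at 100K tokens.
total = kv_cache_bytes(32, 32, 128, 100_000)
print(total / 1e9, "GB")          # about 52.4 GB, in line with the roughly 50GB estimate above

# Hypothetical variant with 8 K/V heads (grouped-query attention): a quarter of the cache.
total_gqa = kv_cache_bytes(32, 8, 128, 100_000)
print(total_gqa / 1e9, "GB")
```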
Why This Matters for Inference Speed
Without KV cache:
- Process 100K token context: one forward pass over the full sequence (expensive but single pass)
- Generate token 100,001: run the full 100K+ token sequence through the model again, recomputing K and V for every position
- Generate token 100,002: do it all again, now over one more token
- Total work: one full-sequence forward pass per generated token
With KV cache:
- Process 100K token context: one forward pass over the full sequence (expensive, unavoidable)
- Cache K and V for all 100K tokens
- Generate token 100,001: compute Q, K, and V for only the new token, append K and V to the cache, and attend over the cached entries (a small computation)
- Generate token 100,002: same again, one token's worth of work
- Total work: 1 full pass + one single-token pass per generated token
In compute terms, the saving per generated token grows roughly in proportion to context length, so at 100K tokens of context it amounts to several orders of magnitude rather than a fixed factor. The wall-clock gain is smaller, because cached decoding has to read the whole cache at every step and is usually limited by memory bandwidth, but the advantage grows with context rather than diminishing.
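For a rough sense of scale, here is a back-of-the-envelope FLOP estimate for producing one new token. It counts arithmetic only, not wall-clock time, and leans on two common approximations that are assumptions here: about 2 FLOPs per model parameter per processed token, and about 4 × layers × hidden FLOPs per query-key pair for attention. The function name and default values are illustrative.

```python
def decode_step_flops(context_len, use_cache, params=7_000_000_000,
                      num_layers=32, hidden=4096):
    """Very rough FLOPs to produce one new token after `context_len` tokens."""
    n = context_len + 1                       # positions visible to the new token
    attn_per_pair = 4 * num_layers * hidden   # QK^T plus weights @ V, 2 FLOPs per multiply-add
    if use_cache:
        # One token through the weights, one query against n cached keys.
        return 2 * params + attn_per_pair * n
    # No cache: all n positions go through the weights again, and position i
    # attends to i keys under causal masking, giving n(n+1)/2 query-key pairs.
    causal_pairs = n * (n + 1) // 2
    return 2 * params * n + attn_per_pair * causal_pairs

ratio_10k = decode_step_flops(10_000, use_cache=False) / decode_step_flops(10_000, use_cache=True)
ratio_100k = decode_step_flops(100_000, use_cache=False) / decode_step_flops(100_000, use_cache=True)
print(f"compute ratio at 10K context:  {ratio_10k:,.0f}x")
print(f"compute ratio at 100K context: {ratio_100k:,.0f}x")
```

Under these assumptions the ratio is in the thousands at 10K tokens and larger still at 100K. Real speedups are smaller because cached decoding is bound by memory bandwidth, not FLOPs.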
The Memory-Latency Tradeoff
KV cache trades memory for speed. This creates interesting engineering decisions:
Precision-First Strategy: Store full precision K and V (FP32). Maximum memory consumption, but attention scores have higher precision.
Speed-First Strategy: Quantize K and V to INT8. Saves ~75% memory relative to FP32 (~50% relative to FP16), usually with small accuracy loss, though the impact is task-dependent. A smaller cache also means less data to read at each decoding step.
Hybrid Strategy: Store K and V in FP16, compute Q in FP32, and upcast K and V to perform attention in FP32. Balances precision and memory.
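As an illustration of the INT8 option, here is a minimal sketch of symmetric per-vector quantization in plain Python. It is one simple scheme among many, chosen for brevity; production systems differ in how they group values and store scales. The function names and sample values are made up.

```python
def quantize_int8(vector):
    """Symmetric quantization: map floats to integers in [-127, 127] plus one scale."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in vector]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

key_vector = [0.12, -1.5, 0.73, 2.54, -0.002, 0.0, -2.1, 1.01]
quantized, scale = quantize_int8(key_vector)
restored = dequantize_int8(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print("scale:", scale, "max rounding error:", max_error)

# Storage for one 128-dimensional head vector under each format.
head_dim = 128
bytes_fp32 = 4 * head_dim
bytes_fp16 = 2 * head_dim
bytes_int8 = 1 * head_dim + 4   # one byte per value plus one FP32 scale
saving_vs_fp32 = 1 - bytes_int8 / bytes_fp32
saving_vs_fp16 = 1 - bytes_int8 / bytes_fp16
print(f"saving vs FP32: {saving_vs_fp32:.0%}, vs FP16: {saving_vs_fp16:.0%}")
```

The rounding error per value is bounded by half the scale, so vectors with a few large outliers lose more precision on their small entries. That is one reason aggressive quantization needs task-specific benchmarking.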
Practical Implementations
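Flash Attention: Optimized CUDA kernels that fuse the attention operations and avoid materializing the full attention score matrix in GPU memory. It doesn't eliminate or shrink the KV cache, but it makes the attention computation over it faster and more memory-efficient. Particularly valuable for long-context models.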
Paged Attention (vLLM): Instead of allocating contiguous memory for each sequence's KV cache, use paging similar to OS page tables. Dramatically reduces memory fragmentation and enables better scheduling of multiple requests.
Continuous Batching with Paging: Process multiple sequences in parallel, each with its own KV cache pages. Eliminates the "static batch" requirement that wastes GPU compute on finished sequences.
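The paging idea can be sketched with a toy block allocator. This is not vLLM's API: the class `PagedKVAllocator` and its methods are hypothetical, and the sketch tracks only which physical blocks belong to which sequence, not the K and V tensors themselves.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to a list of fixed-size cache blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:     # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    allocator.append_token("a")       # 20 tokens need 2 blocks
for _ in range(5):
    allocator.append_token("b")       # 5 tokens need 1 block
blocks_a = len(allocator.block_tables["a"])
blocks_b = len(allocator.block_tables["b"])
free_before = len(allocator.free_blocks)
allocator.release("a")                # "a" finishes; its blocks are reusable immediately
free_after = len(allocator.free_blocks)
print(blocks_a, blocks_b, free_before, free_after)
```

Because blocks need not be contiguous, a finished sequence's memory can be handed to any waiting request, which is what makes continuous batching practical.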
The Long-Context Problem
As context windows grow from 4K to 32K to 200K, KV cache becomes the bottleneck. Cache size grows linearly with context length, so a 200K context window needs 50x the KV cache memory of a 4K window.
This creates tension:
- Users want longer contexts (full document processing, memory of entire conversations)
- Longer contexts require more KV cache
- More KV cache means fewer sequences can fit in memory
- Fewer sequences mean lower GPU utilization and higher cost per token
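To see how quickly this tension bites, here is a rough count of how many sequences fit at once. The setup is an assumption for illustration: a single 80GB GPU, 14GB of FP16 weights for a 7B model, and the FP16 multi-head cache of 2 × 4096 × 2 bytes × 32 layers per token used earlier. Activations and other overheads are ignored.

```python
GPU_BYTES = 80_000_000_000          # assumed 80GB device
WEIGHT_BYTES = 14_000_000_000       # 7B parameters in FP16
KV_BYTES_PER_TOKEN = 2 * 4096 * 2 * 32   # K and V, FP16, 32 layers = 524,288 bytes

def max_concurrent_sequences(context_len):
    """How many full-length sequences fit in the memory left after the weights."""
    available = GPU_BYTES - WEIGHT_BYTES
    return available // (KV_BYTES_PER_TOKEN * context_len)

fit_4k = max_concurrent_sequences(4_000)
fit_32k = max_concurrent_sequences(32_000)
fit_200k = max_concurrent_sequences(200_000)
print(fit_4k, fit_32k, fit_200k)    # 31 3 0
```

Under these assumptions a single 200K sequence does not fit at all without a smaller cache (fewer K/V heads, quantization) or more than one GPU.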
Solutions being explored:
- Sparse Attention: Cache only recent tokens and periodic tokens from history. Cuts cache memory by roughly the ratio of total to retained tokens (10x is plausible), at some cost in accuracy.
- Attention Head Pruning: Some attention heads are redundant. Remove low-importance heads and corresponding KV cache.
- KV Cache Compression: Quantize K and V more aggressively or use lossy compression.
- Hierarchical Attention: Process documents in chunks, summarize earlier chunks, attend over summaries instead of full cache.
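As a small illustration of the first option, here is one possible retention policy: keep a window of recent tokens plus every `stride`-th older token. The function name and parameter values are hypothetical, and real sparse-attention schemes choose which tokens to keep in more sophisticated ways.

```python
def retained_positions(seq_len, recent_window, stride):
    """Positions whose K and V stay cached: periodic older tokens plus a recent window."""
    recent_start = max(0, seq_len - recent_window)
    periodic = range(0, recent_start, stride)
    recent = range(recent_start, seq_len)
    return list(periodic) + list(recent)

kept = retained_positions(seq_len=20, recent_window=4, stride=5)
print(kept)   # [0, 5, 10, 15, 16, 17, 18, 19]

# At 100K tokens with a 4096-token window and every 16th older token kept:
kept_long = retained_positions(seq_len=100_000, recent_window=4096, stride=16)
reduction = 100_000 / len(kept_long)
print(len(kept_long), f"{reduction:.1f}x smaller cache")
```

The memory saving is simply the fraction of positions dropped. The accuracy cost depends on whether the dropped tokens mattered, which the policy cannot know in advance.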
Common Pitfalls
Pitfall 1: Not Pre-allocating Cache. If you allocate KV cache dynamically as tokens are processed, memory allocation becomes a bottleneck. Reserve the cache memory once at setup, either as a maximum-size buffer or as a pool of blocks, rather than growing it token by token.
Pitfall 2: Cache Thrashing. Processing sequences with varying lengths in batches causes memory fragmentation. Use paged attention or size-aware scheduling.
Pitfall 3: Ignoring Attention Precision. Quantizing K and V aggressively can hurt output quality on reasoning tasks. Benchmark your specific use case.
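Pitfall 4: Caching for All Layers Unnecessarily. Not every layer necessarily needs its own full-size cache. Early layers' attention patterns are sometimes argued to be less task-specific, and some experimental designs shrink or share the KV cache for selected layers. Whether this helps is model-dependent, so benchmark before relying on it.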
The Future
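KV cache may prove to be a transitional optimization. It is tied to attention: as model architectures evolve toward state-space models or entirely new paradigms that replace attention with a fixed-size state, KV cache may become irrelevant. Mixture-of-experts models, by contrast, still use attention and still need it.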
Until then, it's the foundation enabling long-context models. Understanding it is understanding why 200K context windows are possible but 2M token windows require architectural rethinking.