# KV Cache Eviction for MLA-based Long-Context LLMs
A reference implementation of H2O-style heavy-hitter + recency KV cache eviction for transformer architectures that use Multi-head Latent Attention (MLA), the attention variant used in DeepSeek V3 and Kimi K2 / K2.6.
Maintained by GENOMA LABS / research.
## What this is

A drop-in monkey-patch for `DeepseekV3Attention` layers in HuggingFace transformers that adds importance-based KV cache eviction at inference time. No retraining required. The model loads normally; you call `install_kv_eviction(model, budget=4096)` and from that point eviction runs automatically during generation.
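Conceptually, the patch wraps the forward pass of every `DeepseekV3Attention` module so an eviction step can run after each decode step. A minimal sketch of that wrapping pattern is below; the function and helper names are hypothetical illustrations, not the repository's internals.

```python
import types

def install_eviction_sketch(model, budget=4096, n_sink=4, n_recent=512):
    """Hypothetical illustration of the monkey-patch idea, not the real code."""
    for module in model.modules():
        if type(module).__name__ != "DeepseekV3Attention":
            continue
        orig_forward = module.forward  # bound method of this specific layer

        def patched(self, *args, _orig=orig_forward, **kwargs):
            out = _orig(*args, **kwargs)
            # Here the real patch would score the cached tokens and trim this
            # layer's slice of the KV cache to n_sink + budget + n_recent entries.
            return out

        module.forward = types.MethodType(patched, module)
    return model
```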
The technique is the canonical H2O recipe from Zhang et al., 2023 (NeurIPS 2023, arXiv:2306.14048) adapted to MLA's specific cache layout. Heavy hitters (tokens with the highest accumulated attention mass across all heads and layers) plus a small set of attention-sink tokens at the start and a recency window at the end are retained; the rest are evicted when the cache exceeds the budget.
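Stripped of the cache plumbing, the selection rule is a top-k over accumulated attention scores with the sinks and the recency window exempted. A minimal sketch, assuming per-layer scores pooled into a flat `[seq_len]` tensor (names and layout are illustrative, not this repository's API):

```python
import torch

def h2o_keep_indices(acc_scores: torch.Tensor, budget: int,
                     n_sink: int = 4, n_recent: int = 512) -> torch.Tensor:
    """Return the cache positions to retain under an H2O-style policy.

    acc_scores: [seq_len] attention mass accumulated by each cached token
                (summed over heads and over all query steps so far).
    Keeps the n_sink leading tokens, the n_recent trailing tokens, and the
    `budget` highest-scoring tokens in between; everything else is evicted.
    """
    seq_len = acc_scores.shape[0]
    if seq_len <= n_sink + n_recent + budget:
        return torch.arange(seq_len)              # under budget: keep everything

    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:n_sink] = True                          # attention sinks
    keep[seq_len - n_recent:] = True              # recency window

    middle = acc_scores.clone()
    middle[keep] = float("-inf")                  # exclude already-kept positions
    keep[middle.topk(budget).indices] = True      # heavy hitters
    return keep.nonzero(as_tuple=True)[0]         # sorted indices of survivors
```

The surviving indices are then used to slice each layer's cached K/V along the sequence dimension.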
## Why this matters
HuggingFace's reference MLA implementation caches the expanded K/V tensors, not the compressed latent. The cache footprint grows quickly:
| Context length | Full KV cache (canonical 61-layer DeepseekV3 layout) |
|---|---|
| 32K | ~82 GB |
| 128K | ~328 GB |
| 1M | ~2.5 TB |
These numbers exceed available VRAM well before the model's nominal context window is reached. KV cache eviction is the lever that makes long-context inference economically viable on existing dense models without architectural changes or retraining.
| Eviction budget | Cache size (61 layers) |
|---|---|
| `budget=4096` | ~10.2 GB |
| `budget=8192` | ~20.4 GB |
| `budget=16384` | ~40.8 GB |
Relative to the full 32K cache, a 4096-token budget frees roughly 70 GB that would otherwise be cache. That headroom is where you fit longer prompts, larger batch sizes, or simply more concurrent requests on the same hardware.
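The rows in both tables can be reproduced with back-of-envelope arithmetic. The head count and per-head dimensions below are assumptions chosen so the sketch matches the figures above (64 heads, 192-d expanded keys, 128-d values, bf16); check them against the actual model config before relying on the output.

```python
# Back-of-envelope cache sizing. All dims below are assumptions that reproduce
# the tables above, not values read from a published config.
N_LAYERS   = 61    # DeepseekV3-style layer count
N_HEADS    = 64    # attention heads (assumed)
K_HEAD_DIM = 192   # expanded key dim per head (assumed: 128 nope + 64 rope)
V_HEAD_DIM = 128   # expanded value dim per head (assumed)
BYTES      = 2     # bf16

def kv_cache_gb(n_tokens: int) -> float:
    """Expanded (non-latent) K/V cache size across all layers, in GB."""
    per_token = N_LAYERS * N_HEADS * (K_HEAD_DIM + V_HEAD_DIM) * BYTES
    return n_tokens * per_token / 1e9

for n in (32_768, 131_072, 1_000_000, 4_096, 8_192, 16_384):
    print(f"{n:>9} tokens kept -> {kv_cache_gb(n):7.1f} GB")
# ~82 GB @ 32K, ~328 GB @ 128K, ~2.5 TB @ 1M; ~10.2 / 20.4 / 40.8 GB at the budgets
```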
## Compatible architectures
Verified against HuggingFace reference implementations of:
- DeepSeek V3 and V3.2 family
- Kimi K2 and K2.6 (both use the `DeepseekV3Attention` layer pattern)
- Any future MLA-based model that subclasses `DeepseekV3Attention`
For non-MLA architectures (standard MHA, GQA, MQA), the same H2O recipe applies, but the cache-management code paths differ and the patch in `src/kv_eviction_mla.py` would need adapting. Pull requests welcome.
## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kv_eviction_mla import install_kv_eviction, reset_eviction_scores

model_id = "deepseek-ai/DeepSeek-V3"  # or moonshotai/Kimi-K2 etc.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tok = AutoTokenizer.from_pretrained(model_id)

# Patch the model in-place. Eviction runs automatically from this point.
install_kv_eviction(
    model,
    budget=4096,    # max KV tokens kept per layer (excluding sinks + recent)
    n_sink=4,       # tokens at start always kept
    n_recent=512,   # tokens at end always kept
    evict_every=1,  # evict every N generated tokens (1 = every step)
)

# Generate normally
inputs = tok("Your long prompt goes here...", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0]))

# Between independent generations, reset accumulated scores
reset_eviction_scores(model)
```
## How to choose a budget
The right budget depends on your workload's distribution of attention mass. Some recipes from the H2O paper and follow-on literature:
- Streaming chat / code completion: `budget=2048, n_sink=4, n_recent=256`. The recent window dominates; the budget acts as a memory of earlier-session context.
- RAG / long-document QA: `budget=4096, n_sink=4, n_recent=512`. The larger budget preserves the heavy-hitter facts in the middle of the document.
- Code review / agent loops over large diffs: `budget=8192, n_sink=4, n_recent=1024`. More budget for long technical context where the relevant facts can be anywhere.
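These recipes drop straight into the quick-start call. A small convenience sketch; the preset dict is this README's illustration, not part of the library API:

```python
# Workload presets matching the recipes above (illustrative, not a library API).
EVICTION_PRESETS = {
    "streaming_chat": dict(budget=2048, n_sink=4, n_recent=256),
    "rag_long_doc":   dict(budget=4096, n_sink=4, n_recent=512),
    "agent_loops":    dict(budget=8192, n_sink=4, n_recent=1024),
}

install_kv_eviction(model, **EVICTION_PRESETS["rag_long_doc"])
```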
Stress-test on RULER 128K NIAH or your own task-level evaluation before deploying. Eviction policies are workload-sensitive and the elbow of the quality-vs-budget curve moves with the distribution.
## What this is NOT

- Not a from-scratch attention reimplementation. The forward path of `DeepseekV3Attention` runs as-is; only the cache management changes.
- Not a training-time technique. This applies at inference; training-time KV compression is a different toolset.
- Not an architecture replacement. This composes with sub-quadratic attention, hybrids, sparse mechanisms, etc. The point is to address the cache-bandwidth bottleneck, which is orthogonal to the choice of attention mechanism.
For background on where this sits in the broader long-context-optimization landscape, see GENOMA LABS' research handbook on sub-quadratic attention (link will activate once the handbook flips public; sibling repository to this one).
## What's in this repository

```
src/
  kv_eviction_mla.py               # the patch + smoke test (277 LOC)
notebooks/
  01_smoke_test_walkthrough.md     # annotated explanation of the smoke test + memory model
docs/
  HOW_IT_WORKS.md                  # architectural notes: where eviction hooks into the layer
LICENSE                            # Apache 2.0
README.md                          # this file
```
## Roadmap

- H2O eviction patch for `DeepseekV3Attention` (transformers 4.x API)
- Smoke-test on a fake-attention layer (no GPU required)
- Multi-step validation across 1,000 generation steps × 4 layers: eviction logic verified, cache stabilizes at the expected bound, no overshoot, 913 steps/sec on CPU. See `notebooks/02_validation_results.md` and `results/validate_eviction_random_init.csv`.
- Real-weights demo on Kimi K2.6 layer 0: loaded the actual published Kimi K2.6 attention weights (101M params, all 7 weights: q_a/b_proj, q_a_layernorm, kv_a_proj_with_mqa, kv_b_proj, kv_a_layernorm, o_proj), ran a single full-prefix forward over 256 tokens on a TITAN RTX, and applied the H2O policy. Result: kept heavy hitters score 3.66x higher than evicted tokens on real Kimi attention distributions. See `notebooks/03_kimi_real_weights_demo.md` and `results/kimi_layer_eviction_demo.csv`.
- API port to transformers 5.x: the patch currently targets the `DynamicCache.key_cache` / `value_cache` list API; transformers 5.x uses `DynamicCache.layers[i]`. The eviction logic is unchanged across versions; only the cache plumbing differs.
- RULER 128K benchmark on a real MLA model with eviction at 4 budget levels: planned target is Kimi K2.6 (BF16) once full-model integration via the 5.x port lands. Will publish CSV + analysis as a sibling repository (`GenomaLabs-com/h2o-eviction-ruler-bench`).
- SnapKV-style prompt-end compression composed on top of H2O eviction.
- Port to standard MHA / GQA attention classes (Llama, Qwen, Mistral, Gemma).
Pull requests for any of these are welcome. Issues, especially with reproducible failure cases, are even more welcome.
## transformers version compatibility

Currently aligned with the transformers 4.x KV cache API. The validation script `scripts/validate_eviction_random_init.py` checks the eviction logic against a mock cache and runs on any Python 3.10+ environment with PyTorch, with no transformers dependency. Real-model integration (`install_kv_eviction(model, ...)`) on transformers 5.x is on the roadmap; the rework is mechanical (port the cache-slicing code paths from `key_cache[i]` to `layers[i]`) but not yet shipped.
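For reference, the port amounts to a thin access shim over the two cache layouts. A sketch, assuming the 5.x per-layer objects expose `keys` / `values` attributes (verify against the installed transformers release before use):

```python
def get_layer_kv(cache, layer_idx):
    """Fetch one layer's cached K/V from either DynamicCache layout (sketch;
    the 5.x attribute names here are an assumption, not verified)."""
    if hasattr(cache, "key_cache"):                       # transformers 4.x lists
        return cache.key_cache[layer_idx], cache.value_cache[layer_idx]
    layer = cache.layers[layer_idx]                       # transformers 5.x layout
    return layer.keys, layer.values
```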
## Citation

If this implementation is useful for your work, please cite the underlying H2O paper:

```bibtex
@inproceedings{zhang2023h2o,
  title     = {H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models},
  author    = {Zhang, Zhenyu and Sheng, Ying and Zhou, Tianyi and others},
  booktitle = {NeurIPS},
  year      = {2023}
}
```
And optionally this implementation:
GENOMA LABS / research. KV Cache Eviction for MLA-based Long-Context LLMs.
HuggingFace, 2026. https://huggingface.co/GenomaLabs-com/kv-cache-eviction-mla
## License
Apache License 2.0. See LICENSE.