---
license: apache-2.0
datasets:
- open-r1/OpenR1-Math-220k
base_model:
- Qwen/Qwen3-8B
tags:
- math
- trimkv
- KV
- Cache
- Compression
---

> TRIM-KV is a learnable key–value (KV) cache eviction strategy that improves the efficiency of large language models (LLMs) in long-horizon inference. The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic standard inference with eviction. The retention score is query-agnostic and captures the long-term utility of tokens. This differs from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.

A toy sketch of this retention-with-decay scoring is shown after the teasers below.

### Why TRIM-KV?

It's fast.
*(teaser figure)*

It's smart.

*(teaser figure)*

And it's interpretable.

*(teaser figures)*
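To make the retention-with-decay idea above concrete, here is a toy Python sketch. The retention values, decay rate, budget, and variable names are invented for illustration; this is not the learned scoring or the actual eviction rule from the released code.

```python
import math

# Toy illustration of retention-with-decay scoring (invented numbers;
# NOT the released implementation).
decay_rate = 0.05  # assumed exponential decay rate
budget = 3         # assumed cache budget: keep 3 of the 4 cached tokens

# learned retention score for each token, keyed by its creation step
retention = {0: 0.9, 5: 0.2, 10: 0.7, 15: 0.4}

current_step = 20
# importance decays exponentially with the token's age
scores = {
    created: r * math.exp(-decay_rate * (current_step - created))
    for created, r in retention.items()
}

# Keep the `budget` highest-scoring tokens; evict the rest.
kept = sorted(scores, key=scores.get, reverse=True)[:budget]
print("scores:", {t: round(s, 3) for t, s in scores.items()})
print("kept tokens created at steps:", sorted(kept))
```

Note how the old token created at step 0 outlives the newer token created at step 5: a high learned retention can outweigh recency, which query-dependent attention scores cannot express.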
---

## Getting Started

### Requirements

- Python 3.11 or higher (tested with 3.12)
- PyTorch 2.7.0 or higher (tested with 2.8.0)
- FlashAttention 2.7.2.post1 or higher (tested with 2.8.0)
- Transformers 4.57.1

```sh
pip install -r requirements.txt
```

This is the minimal set of requirements for training. Additional dependencies may be needed for running specific experiments; we provide a full example of the environment used in our experiments in [`examples/env.yaml`](examples/env.yaml).

### Installation

Clone the repository and install it in editable mode:

```sh
git clone https://github.com/ngocbh/trimkv.git
cd trimkv
pip install -e .
```

---

## Quick Start

```python
import torch
from transformers import AutoTokenizer

from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
from trimkv.cache_utils import TrimKVCache

model_path = ""  # a TRIM-KV checkpoint, e.g. one from the Released Models table below
download_from = "huggingface"  # options: "wandb", "local", "huggingface"

model = TrimKVQwen3ForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    load_trimkv_weights=True,
    download_from=download_from,
    use_cache=True,
    device_map="cuda",
)

# Configure TRIM-KV settings
model.config._attn_implementation = "flash_attention_2"
model.config.compress_memory = True
model.config.memory_size = 512
model.config.buffer_size = 128

tokenizer = AutoTokenizer.from_pretrained(
    model.config.base_model,
    use_fast=True,
    padding_side="left",
)

# Use model.generate as normal. Note: TRIM-KV uses TrimKVCache under the
# hood, so pass a TrimKVCache instance to model.generate (a hedged sketch
# is shown after the Released Models table).
```

For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_qwen3.py).

## Released Models

| Base Model | TRIM-KV Checkpoint | Training Datasets | Training Context Length | Training $M$ |
|------------|---------------------|-------------------|-------------------------|--------------|
| Qwen3-1.7B | [TRIM-KV-Qwen3-1.7B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-1.7B-Math) | OpenR1-Math-220k | 16K | 512 |
| Qwen3-4B | [TRIM-KV-Qwen3-4B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-4B-Math) | OpenR1-Math-220k | 16K | 512 |
| Qwen3-8B | [TRIM-KV-Qwen3-8B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-8B-Math) | OpenR1-Math-220k | 16K | 512 |
| Qwen3-14B | [TRIM-KV-Qwen3-14B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-14B-Math) | OpenR1-Math-220k | 16K | 512 |
| Qwen3-4B-Instruct-2507 | [TrimKV-Qwen3-4B-Instruct-2507](https://huggingface.co/ngocbh/TrimKV-Qwen3-4B-Instruct-2507) | Synth-Long, BookSum, Buddhi | 128K | 4096 |
| Phi-3-mini-128k-instruct | [TrimKV-Phi-3-mini-128k-instruct](https://huggingface.co/ngocbh/TrimKV-Phi-3-mini-128k-instruct) | LongAlpaca | 128K | 2048 |

---
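## Example: Generating with `TrimKVCache`

The Quick Start stops just before generation, so here is a minimal sketch of that last step. The prompt and `max_new_tokens` are placeholders, and the `TrimKVCache(...)` constructor arguments are an assumption (check `trimkv/cache_utils.py` or [`examples/test_qwen3.py`](examples/test_qwen3.py) for the actual signature); everything else follows the Quick Start code.

```python
# Continues from the Quick Start above (model and tokenizer already loaded).
prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# ASSUMPTION: TrimKVCache is constructed from the model config; the real
# constructor may take different arguments (see trimkv/cache_utils.py).
past_key_values = TrimKVCache(model.config)

outputs = model.generate(
    **inputs,
    past_key_values=past_key_values,  # standard `generate` kwarg in transformers
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```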