nielsr HF Staff committed on
Commit 345e687 · verified · 1 Parent(s): 6170bc9

Improve model card: Add InfLLM-V2 paper details and comprehensive citations


This PR improves the model card for `MiniCPM4.1-8B` by:

- Updating the main title to reflect the model's foundation in the `InfLLM-V2` framework.
- Adding a prominent introductory sentence linking directly to the foundational paper "[InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation](https://huggingface.co/papers/2509.24663)".
- Clarifying the navigation links by relabeling the existing "Technical Report" to "MiniCPM4 Technical Report" and adding a new distinct link for the "InfLLM-V2 Paper".
- Updating the "What's New" section to explicitly mention the `InfLLM-V2` framework in relation to the MiniCPM4.1 series.
- Enhancing the "Citation" section to include both the foundational `InfLLM-V2` paper and the existing `MiniCPM4` technical report, ensuring all relevant research is easily citable.

These changes provide clearer context and more complete references for users and researchers.

Files changed (1)
  1. README.md +20 -284
README.md CHANGED
@@ -1,18 +1,24 @@
  ---
- license: apache-2.0
  language:
  - zh
  - en
- pipeline_tag: text-generation
  library_name: transformers
  ---
  <div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
  </div>

  <p align="center">
  <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
- <a href="https://arxiv.org/abs/2506.07900" target="_blank">Technical Report</a> |
  <a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
  </p>
  <p align="center">
@@ -20,7 +26,7 @@ library_name: transformers
  </p>

  ## What's New
- - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
  - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥

  ## Highlights
@@ -187,285 +193,6 @@ You can apply the LongRoPE factor modification by modifying the model files. Spe
  }
  ```

- ### Inference with [SGLang](https://github.com/sgl-project/sglang)
-
- You can run inference with SGLang in either standard mode or speculative decoding mode.
-
- #### Speculative Decoding
-
- For accelerated inference with speculative decoding, follow these steps:
-
- ##### 1. Download MiniCPM4.1 Draft Model
-
- First, download the MiniCPM4.1 draft model:
-
- ```bash
- cd /your_path
- git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
- ```
-
- ##### 2. Install EAGLE3-Compatible SGLang
-
- The EAGLE3 adaptation PR has been submitted. For now, use our repository for installation:
-
- ```bash
- git clone https://github.com/LDLINGLINGLING/sglang.git
- cd sglang
- pip install -e "python[all]"
- ```
-
- ##### 3. Launch SGLang Server with Speculative Decoding
-
- Start the SGLang server with speculative decoding enabled:
-
- ```bash
- python -m sglang.launch_server \
-     --model-path "openbmb/MiniCPM4.1-8B" \
-     --host "127.0.0.1" \
-     --port 30002 \
-     --mem-fraction-static 0.9 \
-     --speculative-algorithm EAGLE3 \
-     --speculative-draft-model-path "your/path/MiniCPM4_1-8B-Eagle3-bf16" \
-     --speculative-num-steps 3 \
-     --speculative-eagle-topk 1 \
-     --speculative-num-draft-tokens 32 \
-     --temperature 0.7
- ```
-
- ##### 4. Client Usage
-
- The client usage remains the same for both standard and speculative decoding:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
- )
-
- print(response.choices[0].message.content)
- ```
-
- Note: Make sure to update the port number in the client code to match the server port (30002 in the speculative decoding example).
-
- ##### Configuration Parameters
-
- - `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- - `--speculative-draft-model-path`: Path to the draft model for speculation
- - `--speculative-num-steps`: Number of speculative steps (default: 3)
- - `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- - `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- - `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)
-
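To sanity-check the speedup from speculative decoding, you can time the same request against the server launched in step 3 and compare it with a server started without the `--speculative-*` flags. Below is a minimal sketch, assuming the server is listening locally on port 30002; the `timed_generation` helper is illustrative only, not an SGLang API.

```python
import time

import openai

# Assumes the SGLang server from step 3 is listening on port 30002.
client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")


def timed_generation(prompt: str) -> float:
    """Send one chat request and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="openbmb/MiniCPM4.1-8B",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=1024,
    )
    elapsed = time.perf_counter() - start
    print(f"{len(response.choices[0].message.content)} characters in {elapsed:.2f}s")
    return elapsed


# Compare this against the same request on a server launched without speculative decoding.
timed_generation("Write an article about Artificial Intelligence.")
```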
- #### Standard Inference (Without Speculative Decoding)
-
- For now, you need to install our forked version of SGLang.
-
- ```bash
- git clone -b openbmb https://github.com/OpenBMB/sglang.git
- cd sglang
-
- pip install --upgrade pip
- pip install -e "python[all]"
- ```
-
- You can start the inference server by running the following command:
-
- ```bash
- python -m sglang.launch_server --model openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml
- ```
-
- Then you can use the chat interface by running the following code:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
- )
-
- print(response.choices[0].message.content)
- ```
-
- ### Inference with [vLLM](https://github.com/vllm-project/vllm)
- You can run inference with vLLM in either standard mode or speculative decoding mode.
-
- #### Speculative Decoding
-
- For accelerated inference with speculative decoding using vLLM, follow these steps:
-
- ##### 1. Download MiniCPM4.1 Draft Model
-
- First, download the MiniCPM4.1 draft model and change the `architectures` field in its `config.json` to `LlamaForCausalLM`.
-
- ```bash
- cd /your_path
- git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
- ```
-
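If you prefer to script the `config.json` change from step 1, a minimal sketch is shown below. The clone path is an assumption (adjust it to wherever you downloaded the draft model); `architectures` is a list field in Hugging Face configs, so the value is written as a one-element list.

```python
import json
from pathlib import Path

# Assumed location of the draft-model clone from step 1; adjust to your own path.
config_path = Path("/your_path/MiniCPM4.1-8B-Eagle3/config.json")

config = json.loads(config_path.read_text(encoding="utf-8"))
config["architectures"] = ["LlamaForCausalLM"]  # value required by the vLLM EAGLE3 setup below
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8")
print("architectures ->", config["architectures"])
```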
- ##### 2. Install EAGLE3-Compatible vLLM
-
- The EAGLE3 vLLM PR has been submitted. For now, use our repository for installation:
-
- ```bash
- git clone https://github.com/LDLINGLINGLING/vllm.git
- cd vllm
- pip install -e .
- ```
-
- ##### 3. Launch vLLM Server with Speculative Decoding
-
- Start the vLLM inference server with speculative decoding enabled. Make sure to update the model path in the speculative config to point to your downloaded MiniCPM4_1-8B-Eagle3-bf16 folder:
-
- ```bash
- VLLM_USE_V1=1 \
- vllm serve openbmb/MiniCPM4.1-8B \
-     --seed 42 \
-     --trust-remote-code \
-     --speculative-config '{
-         "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",
-         "num_speculative_tokens": 3,
-         "method": "eagle3",
-         "draft_tensor_parallel_size": 1
-     }'
- ```
-
- ##### 4. Client Usage Example
-
- The client usage remains the same for both standard and speculative decoding:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
-     extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for the chat template
- )
-
- print(response.choices[0].message.content)
- ```
-
- ##### vLLM Configuration Parameters
-
- - `VLLM_USE_V1=1`: Enables the vLLM v1 engine
- - `--speculative-config`: JSON configuration for speculative decoding
-   - `model`: Path to the draft model for speculation
-   - `num_speculative_tokens`: Number of speculative tokens (default: 3)
-   - `method`: Speculative decoding method (eagle3)
-   - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- - `--seed`: Random seed for reproducibility
- - `--trust-remote-code`: Allow execution of remote code for custom models
-
- #### Standard Inference (Without Speculative Decoding)
-
- For now, you need to install the latest version of vLLM.
-
- ```bash
- pip install -U vllm \
-     --pre \
-     --extra-index-url https://wheels.vllm.ai/nightly
- ```
-
- Then you can run inference on MiniCPM4.1-8B with vLLM:
- ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
-
- model_name = "openbmb/MiniCPM4.1-8B"
- prompt = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
-
- tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
- input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
-
- llm = LLM(
-     model=model_name,
-     trust_remote_code=True,
-     max_num_batched_tokens=65536,
-     dtype="bfloat16",
-     gpu_memory_utilization=0.8,
- )
- sampling_params = SamplingParams(top_p=0.95, temperature=0.6, max_tokens=32768)
-
- outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
-
- print(outputs[0].outputs[0].text)
- ```
-
- Also, you can start the inference server by running the following command:
- > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens, such as the beginning-of-sequence (BOS) token, will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
-
- ```bash
- vllm serve openbmb/MiniCPM4.1-8B --trust-remote-code
- ```
-
- Then you can use the chat interface by running the following code:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
-     extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for the chat template
- )
-
- print(response.choices[0].message.content)
- ```
-
-
- ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
-
- We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for the inference of MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB, which integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
-
- You can install CPM.cu by running the following command:
-
- ```bash
- git clone https://github.com/OpenBMB/cpm.cu.git --recursive
- cd cpm.cu
- python3 setup.py install
- ```
-
- MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. To reproduce the long-text acceleration effect in the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in the `config.json` file as follows to enable LongRoPE.
- ```json
- {
- ...,
- "rope_scaling": {
- "rope_type": "longrope",
- "long_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
- "short_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
- "original_max_position_embeddings": 65536
- }
- }
- ```
-
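After editing `config.json`, you can quickly confirm that the LongRoPE settings are picked up by reloading the config. Below is a minimal sketch, assuming the modified model sits in a local directory (the path is a placeholder) and the field names follow the JSON above.

```python
from transformers import AutoConfig

# Placeholder path to the local copy of MiniCPM4.1-8B whose config.json was edited above.
config = AutoConfig.from_pretrained("/your_path/MiniCPM4.1-8B", trust_remote_code=True)

rope_scaling = getattr(config, "rope_scaling", None) or {}
print("rope_type:", rope_scaling.get("rope_type"))  # expected: "longrope"
print("long_factor entries:", len(rope_scaling.get("long_factor", [])))
print("original_max_position_embeddings:", rope_scaling.get("original_max_position_embeddings"))
```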
  After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from HuggingFace)
  ```bash
  python3 tests/test_generate.py
@@ -514,7 +241,8 @@ prompt_text = tokenizer.apply_chat_template(
  - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

  ## Citation
- - Please cite our [paper](https://arxiv.org/abs/2506.07900) if you find our work valuable.

  ```bibtex
  @article{minicpm4,
@@ -522,4 +250,12 @@ prompt_text = tokenizer.apply_chat_template(
  author={MiniCPM Team},
  year={2025}
  }
  ```
  ---
  language:
  - zh
  - en
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
  ---
+
+ # MiniCPM4.1-8B: InfLLM-V2-based Dense-Sparse Switchable Attention Model
+
+ This model is presented in the paper [InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation](https://huggingface.co/papers/2509.24663).
+
  <div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
  </div>

  <p align="center">
  <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
+ <a href="https://arxiv.org/abs/2506.07900" target="_blank">MiniCPM4 Technical Report</a> |
+ <a href="https://huggingface.co/papers/2509.24663" target="_blank">InfLLM-V2 Paper</a> |
  <a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
  </p>
  <p align="center">

  </p>

  ## What's New
+ - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which is designed with the [InfLLM-V2 framework](https://huggingface.co/papers/2509.24663) and can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
  - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
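A minimal quick-start sketch with 🤗 Transformers for the released checkpoints is shown below. It assumes the usual `trust_remote_code` loading path for MiniCPM models; the `enable_thinking` chat-template switch used here to toggle deep-reasoning mode is an assumption, so check the usage section of this card for the exact flag.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openbmb/MiniCPM4.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
# enable_thinking (assumed flag) toggles deep-reasoning vs. non-reasoning mode.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```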
 
  ## Highlights

  }
  ```

  After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from HuggingFace)
  ```bash
  python3 tests/test_generate.py
 
  - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

  ## Citation
+ - Please cite our paper [InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation](https://huggingface.co/papers/2509.24663) if you find our work valuable.
+ - Also, consider citing the MiniCPM4 technical report for details specific to the MiniCPM4 series:

  ```bibtex
  @article{minicpm4,
 
  author={MiniCPM Team},
  year={2025}
  }
+
+ @article{infllmv2,
+ title={{InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation}},
+ author={{The InfLLM-V2 Authors}},
+ journal={arXiv preprint arXiv:2509.24663},
+ year={2025},
+ url={https://huggingface.co/papers/2509.24663}
+ }
  ```