nielsr HF Staff committed on
Commit 345e687 · verified · 1 Parent(s): 6170bc9

Improve model card: Add InfLLM-V2 paper details and comprehensive citations


This PR improves the model card for `MiniCPM4.1-8B` by:

- Updating the main title to reflect the model's foundation in the `InfLLM-V2` framework.
- Adding a prominent introductory sentence linking directly to the foundational paper "[InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation](https://huggingface.co/papers/2509.24663)".
- Clarifying the navigation links by relabeling the existing "Technical Report" to "MiniCPM4 Technical Report" and adding a new distinct link for the "InfLLM-V2 Paper".
- Updating the "What's New" section to explicitly mention the `InfLLM-V2` framework in relation to the MiniCPM4.1 series.
- Enhancing the "Citation" section to include both the foundational `InfLLM-V2` paper and the existing `MiniCPM4` technical report, ensuring all relevant research is easily citable.

These changes provide clearer context and more complete references for users and researchers.

Files changed (1)
  1. README.md +20 -284
README.md CHANGED
@@ -1,18 +1,24 @@
  ---
- license: apache-2.0
  language:
  - zh
  - en
- pipeline_tag: text-generation
  library_name: transformers
  ---
  <div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
  </div>

  <p align="center">
  <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
- <a href="https://arxiv.org/abs/2506.07900" target="_blank">Technical Report</a> |
  <a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
  </p>
  <p align="center">
@@ -20,7 +26,7 @@ library_name: transformers
  </p>

  ## What's New
- - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
  - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥

  ## Highlights
@@ -187,285 +193,6 @@ You can apply the LongRoPE factor modification by modifying the model files. Spe
  }
  ```

- ### Inference with [SGLang](https://github.com/sgl-project/sglang)
-
- You can run inference with SGLang in either standard mode or speculative decoding mode.
-
- #### Speculative Decoding
-
- For accelerated inference with speculative decoding, follow these steps:
-
- ##### 1. Download MiniCPM4.1 Draft Model
-
- First, download the MiniCPM4.1 draft model:
-
- ```bash
- cd /your_path
- git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
- ```
-
- ##### 2. Install EAGLE3-Compatible SGLang
-
- The EAGLE3 adaptation PR has been submitted. For now, use our repository for installation:
-
- ```bash
- git clone https://github.com/LDLINGLINGLING/sglang.git
- cd sglang
- pip install -e "python[all]"
- ```
-
- ##### 3. Launch SGLang Server with Speculative Decoding
-
- Start the SGLang server with speculative decoding enabled:
-
- ```bash
- python -m sglang.launch_server \
-     --model-path "openbmb/MiniCPM4.1-8B" \
-     --host "127.0.0.1" \
-     --port 30002 \
-     --mem-fraction-static 0.9 \
-     --speculative-algorithm EAGLE3 \
-     --speculative-draft-model-path "your/path/MiniCPM4_1-8B-Eagle3-bf16" \
-     --speculative-num-steps 3 \
-     --speculative-eagle-topk 1 \
-     --speculative-num-draft-tokens 32 \
-     --temperature 0.7
- ```
-
- ##### 4. Client Usage
-
- The client usage remains the same for both standard and speculative decoding:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
- )
-
- print(response.choices[0].message.content)
- ```
-
- Note: Make sure to update the port number in the client code to match the server port (30002 in the speculative decoding example).
-
- ##### Configuration Parameters
-
- - `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- - `--speculative-draft-model-path`: Path to the draft model for speculation
- - `--speculative-num-steps`: Number of speculative steps (default: 3)
- - `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- - `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- - `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)
-
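To sanity-check the speedup from speculative decoding, you can time the same request against the server launched in step 3 and compare it with a server started without the `--speculative-*` flags. Below is a minimal sketch, assuming the server is listening locally on port 30002; the `timed_generation` helper is illustrative only, not an SGLang API.

```python
import time

import openai

# Assumes the SGLang server from step 3 is listening on port 30002.
client = openai.Client(base_url="http://localhost:30002/v1", api_key="None")


def timed_generation(prompt: str) -> float:
    """Send one chat request and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="openbmb/MiniCPM4.1-8B",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=1024,
    )
    elapsed = time.perf_counter() - start
    print(f"{len(response.choices[0].message.content)} characters in {elapsed:.2f}s")
    return elapsed


# Compare this against the same request on a server launched without speculative decoding.
timed_generation("Write an article about Artificial Intelligence.")
```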
- #### Standard Inference (Without Speculative Decoding)
-
- For now, you need to install our forked version of SGLang.
-
- ```bash
- git clone -b openbmb https://github.com/OpenBMB/sglang.git
- cd sglang
-
- pip install --upgrade pip
- pip install -e "python[all]"
- ```
-
- You can start the inference server by running the following command:
-
- ```bash
- python -m sglang.launch_server --model openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml
- ```
-
- Then you can use the chat interface by running the following code:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
- )
-
- print(response.choices[0].message.content)
- ```
-
- ### Inference with [vLLM](https://github.com/vllm-project/vllm)
- You can run inference with vLLM in either standard mode or speculative decoding mode.
-
- #### Speculative Decoding
-
- For accelerated inference with speculative decoding using vLLM, follow these steps:
-
- ##### 1. Download MiniCPM4.1 Draft Model
-
- First, download the MiniCPM4.1 draft model and change the `architectures` field in its `config.json` to `LlamaForCausalLM`.
-
- ```bash
- cd /your_path
- git clone https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3
- ```
-
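If you prefer to script the `config.json` change from step 1, a minimal sketch is shown below. The clone path is an assumption (adjust it to wherever you downloaded the draft model); `architectures` is a list field in Hugging Face configs, so the value is written as a one-element list.

```python
import json
from pathlib import Path

# Assumed location of the draft-model clone from step 1; adjust to your own path.
config_path = Path("/your_path/MiniCPM4.1-8B-Eagle3/config.json")

config = json.loads(config_path.read_text(encoding="utf-8"))
config["architectures"] = ["LlamaForCausalLM"]  # value required by the vLLM EAGLE3 setup below
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8")
print("architectures ->", config["architectures"])
```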
- ##### 2. Install EAGLE3-Compatible vLLM
-
- The EAGLE3 vLLM PR has been submitted. For now, use our repository for installation:
-
- ```bash
- git clone https://github.com/LDLINGLINGLING/vllm.git
- cd vllm
- pip install -e .
- ```
-
- ##### 3. Launch vLLM Server with Speculative Decoding
-
- Start the vLLM inference server with speculative decoding enabled. Make sure to update the model path in the speculative config to point to your downloaded MiniCPM4_1-8B-Eagle3-bf16 folder:
-
- ```bash
- VLLM_USE_V1=1 \
- vllm serve openbmb/MiniCPM4.1-8B \
-     --seed 42 \
-     --trust-remote-code \
-     --speculative-config '{
-         "model": "your/path/MiniCPM4_1-8B-Eagle3-bf16",
-         "num_speculative_tokens": 3,
-         "method": "eagle3",
-         "draft_tensor_parallel_size": 1
-     }'
- ```
-
- ##### 4. Client Usage Example
-
- The client usage remains the same for both standard and speculative decoding:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
-     extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for the chat template
- )
-
- print(response.choices[0].message.content)
- ```
-
- ##### vLLM Configuration Parameters
-
- - `VLLM_USE_V1=1`: Enables the vLLM v1 engine
- - `--speculative-config`: JSON configuration for speculative decoding
-   - `model`: Path to the draft model for speculation
-   - `num_speculative_tokens`: Number of speculative tokens (default: 3)
-   - `method`: Speculative decoding method (eagle3)
-   - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- - `--seed`: Random seed for reproducibility
- - `--trust-remote-code`: Allow execution of remote code for custom models
-
- #### Standard Inference (Without Speculative Decoding)
-
- For now, you need to install the latest version of vLLM.
-
- ```bash
- pip install -U vllm \
-     --pre \
-     --extra-index-url https://wheels.vllm.ai/nightly
- ```
-
- Then you can run inference on MiniCPM4.1-8B with vLLM:
- ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
-
- model_name = "openbmb/MiniCPM4.1-8B"
- prompt = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
-
- tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
- input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
-
- llm = LLM(
-     model=model_name,
-     trust_remote_code=True,
-     max_num_batched_tokens=65536,
-     dtype="bfloat16",
-     gpu_memory_utilization=0.8,
- )
- sampling_params = SamplingParams(top_p=0.95, temperature=0.6, max_tokens=32768)
-
- outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
-
- print(outputs[0].outputs[0].text)
- ```
-
- Also, you can start the inference server by running the following command:
- > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens, such as the beginning-of-sequence (BOS) token, will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
-
- ```bash
- vllm serve openbmb/MiniCPM4.1-8B --trust-remote-code
- ```
-
- Then you can use the chat interface by running the following code:
-
- ```python
- import openai
-
- client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
-
- response = client.chat.completions.create(
-     model="openbmb/MiniCPM4.1-8B",
-     messages=[
-         {"role": "user", "content": "Write an article about Artificial Intelligence."},
-     ],
-     temperature=0.6,
-     max_tokens=32768,
-     extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for the chat template
- )
-
- print(response.choices[0].message.content)
- ```
-
-
- ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
-
- We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for the inference of MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB, which integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
-
- You can install CPM.cu by running the following command:
-
- ```bash
- git clone https://github.com/OpenBMB/cpm.cu.git --recursive
- cd cpm.cu
- python3 setup.py install
- ```
-
- MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. To reproduce the long-text acceleration effect in the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in the `config.json` file as follows to enable LongRoPE.
- ```json
- {
- ...,
- "rope_scaling": {
- "rope_type": "longrope",
- "long_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
- "short_factor": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],
- "original_max_position_embeddings": 65536
- }
- }
- ```
-
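After editing `config.json`, you can quickly confirm that the LongRoPE settings are picked up by reloading the config. Below is a minimal sketch, assuming the modified model sits in a local directory (the path is a placeholder) and the field names follow the JSON above.

```python
from transformers import AutoConfig

# Placeholder path to the local copy of MiniCPM4.1-8B whose config.json was edited above.
config = AutoConfig.from_pretrained("/your_path/MiniCPM4.1-8B", trust_remote_code=True)

rope_scaling = getattr(config, "rope_scaling", None) or {}
print("rope_type:", rope_scaling.get("rope_type"))  # expected: "longrope"
print("long_factor entries:", len(rope_scaling.get("long_factor", [])))
print("original_max_position_embeddings:", rope_scaling.get("original_max_position_embeddings"))
```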
  After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from HuggingFace)
  ```bash
  python3 tests/test_generate.py
@@ -514,7 +241,8 @@ prompt_text = tokenizer.apply_chat_template(
  - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

  ## Citation
- - Please cite our [paper](https://arxiv.org/abs/2506.07900) if you find our work valuable.

  ```bibtex
  @article{minicpm4,
@@ -522,4 +250,12 @@ prompt_text = tokenizer.apply_chat_template(
  author={MiniCPM Team},
  year={2025}
  }
  ```
  ---
  language:
  - zh
  - en
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
  ---
+
+ # MiniCPM4.1-8B: InfLLM-V2-based Dense-Sparse Switchable Attention Model
+
+ This model is presented in the paper [InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation](https://huggingface.co/papers/2509.24663).
+
  <div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
  </div>

  <p align="center">
  <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
+ <a href="https://arxiv.org/abs/2506.07900" target="_blank">MiniCPM4 Technical Report</a> |
+ <a href="https://huggingface.co/papers/2509.24663" target="_blank">InfLLM-V2 Paper</a> |
  <a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
  </p>
  <p align="center">

  </p>

  ## What's New
+ - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which is designed with the [InfLLM-V2 framework](https://huggingface.co/papers/2509.24663) and can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
  - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
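A minimal quick-start sketch with 🤗 Transformers for the released checkpoints is shown below. It assumes the usual `trust_remote_code` loading path for MiniCPM models; the `enable_thinking` chat-template switch used here to toggle deep-reasoning mode is an assumption, so check the usage section of this card for the exact flag.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openbmb/MiniCPM4.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
# enable_thinking (assumed flag) toggles deep-reasoning vs. non-reasoning mode.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```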
 
  ## Highlights

  }
  ```

  After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from HuggingFace)
  ```bash
  python3 tests/test_generate.py
 
  - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

  ## Citation
+ - Please cite our paper [InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation](https://huggingface.co/papers/2509.24663) if you find our work valuable.
+ - Also, consider citing the MiniCPM4 technical report for details specific to the MiniCPM4 series:

  ```bibtex
  @article{minicpm4,
 
  author={MiniCPM Team},
  year={2025}
  }
+
+ @article{infllmv2,
+ title={{InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation}},
+ author={{The InfLLM-V2 Authors}},
+ journal={arXiv preprint arXiv:2509.24663},
+ year={2025},
+ url={https://huggingface.co/papers/2509.24663}
+ }
  ```