Model Card for iw-SFT-32B

This model was trained with our algorithm iw-SFT (importance-weighted SFT). For more details on the algorithm, please refer to our paper below.

At a high level, the algorithm is motivated by first showing that SFT and RL are connected: the SFT objective lower bounds the RL objective. We show that by adaptively reweighting the curated dataset during training, i.e. iw-SFT, we can obtain a much tighter bound on the RL objective than SFT alone.
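
To make the reweighting concrete, here is a minimal sketch of a sequence-level importance-weighted SFT loss in PyTorch. It assumes the weights are the (detached) likelihood ratio between the current policy and a frozen reference model, with clipping for stability; `iw_sft_loss` and these choices are illustrative assumptions, not the exact objective from the paper.

```python
import torch

def iw_sft_loss(logits, ref_logits, tokens, mask):
    """Illustrative importance-weighted SFT loss (an assumption, not the paper's exact form).

    logits, ref_logits: (batch, seq, vocab) from the trained policy and a frozen reference.
    tokens: (batch, seq) target token ids; mask: (batch, seq) 1.0 on supervised positions.
    """
    # Per-token log-probabilities of the target tokens under each model.
    logp = torch.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    ref_logp = torch.log_softmax(ref_logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

    # Sequence-level log importance weight: log pi_theta(y|x) - log pi_ref(y|x).
    log_w = ((logp - ref_logp) * mask).sum(dim=-1)

    # Detach so gradients flow only through the SFT term; clamp for stability (assumption).
    w = log_w.exp().detach().clamp(max=10.0)

    # Importance-weighted negative log-likelihood, averaged over the batch.
    nll = -(logp * mask).sum(dim=-1)
    return (w * nll).mean()
```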

We benchmark our algorithm on maths reasoning tasks; see the table below.

Authors

Chongli Qin and Jost Tobias Springenberg

Model Details

Blog Post: Supervised Fine Tuning is Reinforcement Learning (and can be improved)

Repository: iw-sft

Paper: Supervised Fine Tuning is Reinforcement Learning (and can be improved)

For all links and general information, see the blog post and repository above.
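
A minimal usage sketch with the Hugging Face `transformers` library. It assumes the checkpoint loads as a standard causal LM (`device_map="auto"` additionally requires `accelerate`); the Qwen2.5 base may also expect its chat template for best results.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ChongliQin/iw-SFT-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```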

Evaluation

| Metric | s1-32B | s1.1-32B | iw-SFT-32B | o1-preview | o1 | DeepSeek-R1 | DeepSeek-R1-Distill-Qwen-32B |
|---|---|---|---|---|---|---|---|
| # examples | 1K | 1K | 1K | ? | ? | >800K | 800K |
| AIME2024 | 56.7 | 56.7 | 66.7 | 40.0 | 74.4 | 79.8 | 72.6 |
| AIME2025 I | 26.7 | 60.0 | 53.3 | 37.5 | ? | 65.0 | 46.1 |
| MATH500 | 93.0 | 95.4 | 94.8 | 81.4 | 94.8 | 97.3 | 94.3 |
| GPQA-Diamond | 59.6 | 63.6 | 64.1 | 75.2 | 77.3 | 71.5 | 62.1 |
Model size: 33B parameters (safetensors, F32)

Base model: Qwen/Qwen2.5-32B