Research-EAI committed
Commit 7b62e5f · verified · 1 Parent(s): f2c7da2

Add details on long-context extrapolation

Files changed (1): README.md (+47 -2)
README.md CHANGED
@@ -70,11 +70,12 @@ Rnj-1 is a family of 8B parameter open-weight, dense models trained from scratch
 
 # Changelog
 
-* Update December 18, 2025:
+* Update December 20, 2025:
   - System prompt and temperature recommendations: Resolve premature truncations and mitigate unprompted code outputs.
   - Updates to default chat template.
-  - Updated STEM and comparables tables.
+  - Updated evaluation results.
   - Links to model generations for evals.
+  - Instructions for long-context extrapolation.
 
 * Initial version: December 8, 2025
 
@@ -169,6 +170,50 @@ The global batch sizes used were:
 - 24M tokens for mid-training.
 - 16M tokens for SFT.
 
+### Long-Context Extrapolation (up to 128k)
+Although Rnj-1-Instruct was trained with context lengths up to 32k, the model can be extrapolated to a 128k context using YaRN RoPE scaling. This requires the following updates to `config.json`:
+```diff
+@@
+- "max_position_embeddings": 32768,
++ "max_position_embeddings": 131072,
+
+@@
+- "sliding_window": 32768,
++ "sliding_window": 131072,
+
+@@
+  "rope_scaling": {
+    "attn_factor": 1.0,
+    "beta_fast": 64.0,
+    "beta_slow": 1.0,
+    "extrapolation_factor": 1.0,
+-   "factor": 4.0,
++   "factor": 16.0,
+    "original_max_position_embeddings": 8192,
+    "rope_type": "yarn"
+  },
+```
+
+Overall, most capabilities are preserved under 128k extrapolation, with performance remaining stable on many coding, math, SWE, and FIM benchmarks. However, we do observe some regressions, particularly on science and performance-focused evaluations.
+
+| Category  | Evals               | Rnj-1-Instruct | Rnj-1-Instruct (128k) |
+|-----------|---------------------|----------------|-----------------------|
+| Coding    | MBPP+               | 75.7           | 75.7                  |
+| Coding    | HE+                 | 83.5           | 82.3                  |
+| Coding    | BigCodeBench-full   | 57.1           | 55.3                  |
+| Math      | AIME 25             | 43.3           | 53.3                  |
+| Math      | GSM8k               | 92.6           | 91.1                  |
+| Math      | Minerva-MATH-500    | 88.4           | 89.4                  |
+| Science   | MMLU-STEM           | 81.8           | 69.4                  |
+| Science   | GPQA-Diamond        | 38.9           | 41.4                  |
+| Env evals | SWE-bench (bash)    | 20.8           | 20.1                  |
+| Env evals | Performance: Enamel | 49.0           | 39.9                  |
+| FIM       | HE single-line      | 94.9           | 93.5                  |
+| FIM       | HE multi-line       | 77.6           | 76.5                  |
+| FIM       | HE random-span      | 86.1           | 85.1                  |
+
+We are actively investigating mitigations (including improved scaling strategies and targeted long-context tuning) and expect to close much of this gap in future updates.
+
 # Recommendations
 
 ### System Prompt & Temperature
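The `config.json` edits shown in the diff above can be sketched as a small helper that applies the 128k overrides programmatically. This is a minimal illustration, not part of the commit: the `baseline` dict and `apply_128k_overrides` helper are hypothetical, with values copied from the diff. Note that the YaRN `factor` equals the target context divided by `original_max_position_embeddings` (131072 / 8192 = 16), which is why it moves from 4.0 (for 32k) to 16.0 (for 128k).

```python
import json

# Baseline values as they appear in the 32k config.json (per the diff above).
baseline = {
    "max_position_embeddings": 32768,
    "sliding_window": 32768,
    "rope_scaling": {
        "attn_factor": 1.0,
        "beta_fast": 64.0,
        "beta_slow": 1.0,
        "extrapolation_factor": 1.0,
        "factor": 4.0,
        "original_max_position_embeddings": 8192,
        "rope_type": "yarn",
    },
}

def apply_128k_overrides(config: dict) -> dict:
    """Return a copy of config with the YaRN 128k extrapolation settings applied."""
    cfg = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    target_context = 131072
    cfg["max_position_embeddings"] = target_context
    cfg["sliding_window"] = target_context
    # YaRN factor = target context / original pre-training context (131072 / 8192 = 16)
    cfg["rope_scaling"]["factor"] = target_context / cfg["rope_scaling"]["original_max_position_embeddings"]
    return cfg

cfg_128k = apply_128k_overrides(baseline)
print(cfg_128k["rope_scaling"]["factor"])  # 16.0
```

In practice the same result can be achieved by editing `config.json` in place as the diff shows; the helper simply makes the factor arithmetic explicit and leaves the original 32k config untouched.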