gem-3-small
Model Description
This model is a surgically optimized and distilled version of google/gemma-3-270m, created as the final challenge of Chapter 6 in the book "Rearchitecting LLMs".
- Book: Rearchitecting LLMs
- Framework: OptiPFair
- Technique: Depth Pruning + Knowledge Distillation (Logits-Only with Skew KL Divergence)
- Chapter: Chapter 6 - Knowledge Recovery
Performance & Retention Metrics
The goal of this optimization was to maximize parameter efficiency while maintaining the highest possible retention of the Teacher's capabilities.
Retention Summary (vs Teacher Baseline)
| Metric | Value | Description |
|---|---|---|
| PPL Retention | 77.90% | Linguistic quality preserved (Teacher PPL / Student PPL × 100) |
| Capabilities Retention | 92.86% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| Overall Retention | 85.21% | Combined health score (average of PPL + Capabilities retention) |
Capability Benchmarks (LM Evaluation Harness)
Recovery measures how much of the degradation caused by pruning was won back through distillation: Recovery = (Student − Pruned) / (Teacher − Pruned) × 100.
| Benchmark | Teacher | Pruned (No KD) | Student (After KD) | Recovery |
|---|---|---|---|---|
| ARC-Easy | 60.0% | 60.0% | 60.0% | 0.0% |
| WinoGrande | 60.0% | 40.0% | 80.0% | 200.0% |
| HellaSwag | 60.0% | 60.0% | 60.0% | 0.0% |
| LAMBADA (OpenAI) | 40.0% | 0.0% | 0.0% | 0.0% |
| PIQA | 60.0% | 60.0% | 60.0% | 0.0% |
| Average | 56.0% | 44.0% | 52.0% | 66.7% |
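The retention and recovery figures follow directly from the reported scores. A minimal sketch of the arithmetic (values copied from the tables above; the helper function is illustrative, not the card's evaluation script):

```python
def recovery(teacher: float, pruned: float, student: float) -> float:
    """Share of the accuracy lost to pruning that distillation won back."""
    lost = teacher - pruned        # degradation caused by pruning
    regained = student - pruned    # improvement delivered by distillation
    return 100.0 * regained / lost if lost else 0.0

print(recovery(60.0, 40.0, 80.0))              # WinoGrande row: 200.0
print(round(recovery(56.0, 44.0, 52.0), 1))    # Average row: 66.7
print(round(100.0 * 52.0 / 56.0, 2))           # Capabilities retention: 92.86
```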
Linguistic Quality
- Final Perplexity (PPL): 17.02
- Teacher Baseline PPL: 13.26
- Pruned (No KD) PPL: 123.63
- Final Training Loss: 2.8342
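A perplexity number of this kind can be reproduced with the published checkpoint by exponentiating the model's mean cross-entropy loss. A minimal sketch, assuming a single short evaluation text (the figures above were measured on a held-out validation split, not on this toy sample):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oopere/gem-3-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Paris is the capital of France and one of the most visited cities in the world."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels=input_ids, the model returns the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])

ppl = torch.exp(outputs.loss).item()
print(f"Perplexity on this sample: {ppl:.2f}")
```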
Architecture Details
- Teacher Model: google/gemma-3-270m (18 transformer blocks, 268,098,176 parameters)
- Student Model: Pruned to 14 transformer blocks (245,803,648 parameters)
- Layers Removed: 4 layers (indices: [9, 8, 14, 16])
- Parameter Reduction: 8.32%
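The pruning itself was performed with OptiPFair. Purely as an illustration of the idea, a naive depth-pruning pass in plain Transformers could look like the sketch below; the `model.model.layers` path is an assumption about Gemma's internals, and a real tool also has to fix up per-layer bookkeeping (layer indices, attention-type schedule), which this sketch skips:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Illustration only: drop the four blocks listed above from the teacher
layers_to_remove = {8, 9, 14, 16}

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

# Assumption: decoder blocks live in model.model.layers (typical Gemma layout)
kept = [block for i, block in enumerate(model.model.layers) if i not in layers_to_remove]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

print(f"Student now has {model.config.num_hidden_layers} transformer blocks")  # 14
```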
Training Procedure
Dataset
- Source: Cosmopedia-v2
- Samples: 2,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples)
- Train/Val Split: 80% / 20%
Hyperparameters
- Epochs: 5
- Batch Size: 16 (effective: 64 with gradient accumulation)
- Learning Rate: 4e-05
- Loss Function: α·CrossEntropy + β·Skew-KLD
- Task Loss Weight (α): 0.5
- Logits Loss Weight (β): 0.5
- Skew Interpolation Factor: 0.0
- Temperature: 2.0
- Optimizer: AdamW
- Gradient Clipping: 1.0
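A minimal sketch of a loss with this shape (α·CrossEntropy + β·Skew-KLD over temperature-scaled logits). It assumes the common skew definition KL(teacher ‖ λ·teacher + (1−λ)·student); with the skew interpolation factor λ = 0.0 listed above, the term reduces to standard forward KL. This is an illustration, not the book's exact implementation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, beta=0.5, temperature=2.0, skew=0.0):
    """alpha * cross-entropy on labels + beta * skew-KLD on temperature-scaled logits.

    Skew-KLD is taken here as KL(teacher || skew * teacher + (1 - skew) * student);
    with skew = 0.0 this is plain forward KL (an assumption about the exact definition).
    """
    # Hard-label task loss
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))

    # Soft-label distillation loss on temperature-scaled distributions
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_probs = F.softmax(student_logits / temperature, dim=-1)
    mix = skew * t_probs + (1.0 - skew) * s_probs
    kld = F.kl_div(mix.clamp_min(1e-9).log(), t_probs, reduction="batchmean") * temperature ** 2

    return alpha * ce + beta * kld

# Toy shapes: batch of 2 sequences, 4 positions, vocabulary of 10
student_logits = torch.randn(2, 4, 10)
teacher_logits = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))
print(distillation_loss(student_logits, teacher_logits, labels))
```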
Hardware & Training Time
- GPU: NVIDIA A100-SXM4-80GB
- Training Time: 205.9s (3.43 minutes)
- Avg Time per Epoch: 41.2s
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/gem-3-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Limitations & Intended Use
Intended Use
This is an educational model created as part of the Hands-on Lab in Chapter 6 of "Rearchitecting LLMs". It demonstrates:
- Surgical depth pruning using data-driven layer importance analysis
- Knowledge recovery through logits-only distillation with Skew KL Divergence
- The complete optimization pipeline: Prune → Distill → Evaluate
Not intended for production use. This model serves as a learning artifact and baseline for readers to improve upon.
Limitations
- Training Data: General-purpose Cosmopedia corpus (not domain-specialized)
- Knowledge Coverage: Reduced compared to full-scale models due to structural pruning
- Capabilities: Best suited for simple completion tasks; complex reasoning may be degraded
- Language: English only
Citation
If you use this model or the techniques described in your research or projects, please cite:
Book
```bibtex
@book{martra2026rearchitecting,
  author    = {Pere Martra},
  title     = {Rearchitecting LLMs: Structural techniques for efficient models},
  publisher = {Manning Publications},
  year      = {2026},
  url       = {https://hubs.la/Q040tvtp0}
}
```
Framework
```bibtex
@software{optipfair2024,
  author = {Pere Martra},
  title  = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
  year   = {2024},
  url    = {https://github.com/peremartra/optipfair}
}
```
Acknowledgments
This model was created following the methodologies taught in "Rearchitecting LLMs" (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.
Challenge for readers: Can you improve the retention metrics beyond 85.2%? Try adjusting:
- Layer selection strategy (use cosine similarity analysis; see the sketch after this list)
- Distillation dataset (domain-specific data)
- Loss function weights (α, β, temperature)
- Training epochs and learning rate
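As a starting point for the first knob, here is a minimal sketch of a cosine-similarity layer-importance analysis (a ShortGPT-style heuristic: blocks whose output is most similar to their input change the representation least and are candidates for removal). The prompts and the choice of metric are placeholders, not the book's exact procedure:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"  # analyze the teacher before pruning
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompts = [
    "Paris is the capital of",
    "Photosynthesis is the process by which",
]

num_layers = model.config.num_hidden_layers
scores = torch.zeros(num_layers)

with torch.no_grad():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        # hidden_states is a tuple: embedding output plus one entry per block
        hidden = model(**inputs, output_hidden_states=True).hidden_states
        for i in range(num_layers):
            # Similarity between block i's input and output, averaged over tokens
            sim = F.cosine_similarity(hidden[i], hidden[i + 1], dim=-1).mean()
            scores[i] += sim / len(prompts)

# Blocks that barely change their input are the least "important"
candidates = torch.argsort(scores, descending=True)[:4].tolist()
print("Removal candidates (highest input/output similarity):", candidates)
```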
Share your results in the book's discussion forum!