gem-3-small

Model Description

This model is a surgically optimized and distilled version of google/gemma-3-270m, created as the final challenge of Chapter 6 in the book "Rearchitecting LLMs".

  • Book: Rearchitecting LLMs
  • Framework: OptiPFair
  • Technique: Depth Pruning + Knowledge Distillation (Logits-Only with Skew KL Divergence)
  • Chapter: Chapter 6 - Knowledge Recovery

Performance & Retention Metrics

The goal of this optimization was to maximize parameter efficiency while maintaining the highest possible retention of the Teacher's capabilities.

Retention Summary (vs Teacher Baseline)

| Metric                 | Value  | Description                                                                  |
|------------------------|--------|------------------------------------------------------------------------------|
| PPL Retention          | 77.90% | Linguistic quality preserved (Teacher PPL / Student PPL × 100)               |
| Capabilities Retention | 92.86% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| Overall Retention      | 85.21% | Combined health score (average of PPL + Capabilities retention)              |
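
For reference, the two retention ratios follow directly from the perplexities and benchmark averages reported elsewhere in this card; a minimal sketch (variable names are illustrative):

# Reproduce the retention metrics from the evaluation numbers reported in this card.
teacher_ppl, student_ppl = 13.26, 17.02   # perplexities (lower is better)
teacher_avg, student_avg = 56.0, 52.0     # mean benchmark accuracy, in percent

ppl_retention = teacher_ppl / student_ppl * 100          # ~77.9%
capability_retention = student_avg / teacher_avg * 100   # ~92.9%
overall_retention = (ppl_retention + capability_retention) / 2  # simple mean of the two

print(f"PPL retention:        {ppl_retention:.2f}%")
print(f"Capability retention: {capability_retention:.2f}%")
print(f"Overall retention:    {overall_retention:.2f}%")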

Capability Benchmarks (LM Evaluation Harness)

Recovery = How much of the pruning degradation was recovered through distillation.

| Benchmark      | Teacher | Pruned (No KD) | Student (After KD) | Recovery |
|----------------|---------|----------------|--------------------|----------|
| ARC Easy       | 60.0%   | 60.0%          | 60.0%              | 0.0%     |
| WinoGrande     | 60.0%   | 40.0%          | 80.0%              | 200.0%   |
| HellaSwag      | 60.0%   | 60.0%          | 60.0%              | 0.0%     |
| LAMBADA OpenAI | 40.0%   | 0.0%           | 0.0%               | 0.0%     |
| PIQA           | 60.0%   | 60.0%          | 60.0%              | 0.0%     |
| Average        | 56.0%   | 44.0%          | 52.0%              | 66.7%    |
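
The recovery column is consistent with defining recovery per benchmark as the share of the pruning-induced drop that distillation won back; a small sketch under that assumption:

def recovery(teacher: float, pruned: float, student: float) -> float:
    """Percentage of the pruning-induced accuracy drop recovered by distillation."""
    degradation = teacher - pruned
    if degradation <= 0:  # nothing was lost by pruning, so nothing to recover
        return 0.0
    return (student - pruned) / degradation * 100

print(recovery(60.0, 40.0, 80.0))  # WinoGrande -> 200.0
print(recovery(56.0, 44.0, 52.0))  # Average    -> 66.66...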

Linguistic Quality

  • Final Perplexity (PPL): 17.02
  • Teacher Baseline PPL: 13.26
  • Pruned (No KD) PPL: 123.63
  • Final Training Loss: 2.8342
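
Perplexity here is the exponential of the mean next-token cross-entropy on held-out text. A minimal way to check it with Transformers (the sample text is a placeholder, not the evaluation corpus behind the figures above):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oopere/gem-3-small"  # or "google/gemma-3-270m" for the teacher baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "The Eiffel Tower was completed in 1889 and quickly became a symbol of Paris."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels=input_ids the model returns the mean next-token cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")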

Architecture Details

  • Teacher Model: google/gemma-3-270m (18 transformer blocks, 268,098,176 parameters)
  • Student Model: Pruned to 14 transformer blocks (245,803,648 parameters)
  • Layers Removed: 4 layers (indices: [9, 8, 14, 16])
  • Parameter Reduction: 8.32%
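
Conceptually, the depth pruning drops the selected blocks from the decoder's layer list and updates the config. A hand-rolled sketch with plain Transformers, not the OptiPFair call used to build this model; the model.model.layers / num_hidden_layers attribute paths are assumptions that hold for Gemma-style text models:

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
layers_to_remove = {8, 9, 14, 16}  # indices reported above

# Keep only the surviving transformer blocks and record the new depth.
kept = [layer for i, layer in enumerate(model.model.layers) if i not in layers_to_remove]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)  # 18 -> 14 blocks
# (A full pipeline would also reindex per-layer attributes such as layer_idx.)

print(sum(p.numel() for p in model.parameters()))  # ~245.8M parameters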

Training Procedure

Dataset

  • Source: Cosmopedia-v2
  • Samples: 2,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples)
  • Train/Val Split: 80% / 20%
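
For readers who want to rebuild a similar distillation corpus, a minimal sketch; the dataset ID, the text column, and streaming access are assumptions about the public Cosmopedia-v2 release, and the balanced sampling across the four subsets is omitted:

from datasets import load_dataset

# Cosmopedia-v2 is published as part of the HuggingFaceTB/smollm-corpus collection.
ds = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
                  split="train", streaming=True)

# Take a small distillation corpus and split it 80/20 into train/validation.
samples = [row["text"] for _, row in zip(range(2000), ds)]
split = int(0.8 * len(samples))
train_texts, val_texts = samples[:split], samples[split:]
print(len(train_texts), len(val_texts))  # 1600 400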

Hyperparameters

  • Epochs: 5
  • Batch Size: 16 (effective 64 with 4 gradient accumulation steps)
  • Learning Rate: 4e-05
  • Loss Function: α·CrossEntropy + β·Skew-KLD (sketched after this list)
    • Task Loss Weight (α): 0.5
    • Logits Loss Weight (β): 0.5
    • Skew Interpolation Factor: 0.0
    • Temperature: 2.0
  • Optimizer: AdamW
  • Gradient Clipping: 1.0
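
The combined objective can be sketched as follows, assuming the skew KL form KL(p_T ‖ λ·p_T + (1−λ)·p_S) on temperature-scaled distributions; with the skew factor λ = 0.0 used here it reduces to the standard forward KL of classic logits distillation. Function and variable names are illustrative, not the OptiPFair API:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, beta=0.5, temperature=2.0, skew=0.0):
    """alpha * CrossEntropy(student, labels) + beta * Skew-KL(teacher || student)."""
    vocab = student_logits.size(-1)

    # Hard-label task loss on the student's raw logits.
    task_loss = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1),
                                ignore_index=-100)

    # Temperature-softened distributions from both models (teacher logits should
    # be computed under torch.no_grad()).
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    p_student = F.softmax(student_logits / temperature, dim=-1)

    # Skew KL: KL(p_T || skew*p_T + (1-skew)*p_S); skew=0.0 gives plain forward KL.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    mixture = skew * p_teacher + (1.0 - skew) * p_student
    kld = F.kl_div(mixture.log(), p_teacher, reduction="batchmean") * temperature ** 2

    return alpha * task_loss + beta * kld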

Hardware & Training Time

  • GPU: NVIDIA A100-SXM4-80GB
  • Training Time: 205.9s (3.43 minutes)
  • Avg Time per Epoch: 41.2s

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/gem-3-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations & Intended Use

Intended Use

This is an educational model created as part of the Hands-on Lab in Chapter 6 of "Rearchitecting LLMs". It demonstrates:

  • Surgical depth pruning using data-driven layer importance analysis
  • Knowledge recovery through logits-only distillation with Skew KL Divergence
  • The complete optimization pipeline: Prune → Distill → Evaluate

Not intended for production use. This model serves as a learning artifact and baseline for readers to improve upon.

Limitations

  • Training Data: General-purpose Cosmopedia corpus (not domain-specialized)
  • Knowledge Coverage: Reduced compared to full-scale models due to structural pruning
  • Capabilities: Best suited for simple completion tasks; complex reasoning may be degraded
  • Language: English only

Citation

If you use this model or the techniques it demonstrates in your research or projects, please cite:

Book

@book{martra2026rearchitecting,
  author    = {Pere Martra},
  title     = {Rearchitecting LLMs: Structural techniques for efficient models},
  publisher = {Manning Publications},
  year      = {2026},
  url       = {https://hubs.la/Q040tvtp0}
}

Framework

@software{optipfair2024,
  author = {Pere Martra},
  title  = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
  year   = {2024},
  url    = {https://github.com/peremartra/optipfair}
}

Acknowledgments

This model was created following the methodologies taught in "Rearchitecting LLMs" (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.

Challenge for readers: Can you improve the retention metrics beyond 85.2%? Try adjusting:

  • Layer selection strategy (use cosine similarity analysis)
  • Distillation dataset (domain-specific data)
  • Loss function weights (α, β, temperature)
  • Training epochs and learning rate

Share your results in the book's discussion forum!
