	---
	language:
	- en
	license: mit
	tags:
	- phi-3
	- distillation
	- knowledge-distillation
	- lora
	- code-generation
	- python
	datasets:
	- Shuu12121/python-codesearch-dataset-open
	model-index:
	- name: FStudent
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      type: custom
	      name: Distillation Evaluation
	    metrics:
	    - name: Speedup Factor
	      type: speedup
	      value: 2.5x
	      verified: false
	---

	# FStudent: Distilled Phi-3 Model

	FStudent is Microsoft's Phi-3-mini-4k-instruct model fine-tuned through knowledge distillation from a larger Phi-4 teacher. The distillation pipeline combines teacher-student learning with a self-study stage.

	## Model Description

	FStudent was created using a multi-stage distillation pipeline that transfers knowledge from a larger teacher model (Phi-4) to the smaller Phi-3-mini-4k-instruct model. The model was trained using LoRA adapters, which were then merged with the base model to create this standalone version.

	### Training Data

	The model was trained on a diverse set of data sources:

	1. PDF Documents: Technical documentation and domain-specific knowledge
	2. Python Code Dataset: Code examples from the [Shuu12121/python-codesearch-dataset-open](https://huggingface.co/datasets/Shuu12121/python-codesearch-dataset-open) dataset
	3. Teacher-Generated Examples: High-quality examples generated by the Phi-4 teacher model

	### Training Process

	The distillation pipeline consisted of six sequential steps:

	1. Content Extraction & Enrichment: PDF files were processed to extract and enrich text data
	2. Teacher Pair Generation: Training pairs were generated using the Phi-4 teacher model
	3. Distillation Training: The student model (Phi-3) was trained using LoRA adapters with the following parameters:
	   - Learning rate: 1e-4
	   - Batch size: 4
	   - Gradient accumulation steps: 8
	   - Mixed precision training
	   - 4-bit quantization during training
	4. Model Merging: The trained LoRA adapters were merged with the base Phi-3 model
	5. Student Self-Study: The model performed self-directed learning on domain-specific content
	6. Model Evaluation: The model was evaluated against the teacher model for performance
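
As a quick check on the step 3 hyperparameters, the effective batch size per optimizer update is the batch size multiplied by the number of gradient accumulation steps. The sketch below assumes a single device, since the card does not state the device count. `DistillConfig` and its field names are hypothetical and only mirror the values listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillConfig:
    """Hypothetical container for the step 3 hyperparameters listed above."""
    learning_rate: float = 1e-4   # from the card
    batch_size: int = 4           # from the card
    grad_accum_steps: int = 8     # from the card
    num_devices: int = 1          # assumption: device count is not stated

    @property
    def effective_batch_size(self) -> int:
        # Gradients from grad_accum_steps micro-batches are accumulated before
        # each optimizer update, so one update sees this many examples.
        return self.batch_size * self.grad_accum_steps * self.num_devices


config = DistillConfig()
print(config.effective_batch_size)  # 32 under the single-device assumption
```

With more than one device in data-parallel training, the same formula scales linearly, for example `DistillConfig(num_devices=2).effective_batch_size` gives 64.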

	### Model Architecture

	- Base Model: microsoft/Phi-3-mini-4k-instruct
	- Parameter-Efficient Fine-Tuning: LoRA adapters (merged into this model)
	- Context Length: 4K tokens
	- Architecture: Transformer-based language model
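
To illustrate what "merged into this model" means, the sketch below applies the standard LoRA merge rule, W' = W + (alpha / r) · B·A, to a toy 2x2 weight matrix in plain Python. The matrices, rank, and alpha are illustrative values only, because the card does not state the adapter rank or scaling. In PEFT this step is typically done with `merge_and_unload()`, though the card does not say how the author performed the merge.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA merge rule."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: (out_features, in_features)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter; all numbers are illustrative.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape: (rank, in_features)
B = [[0.5], [0.25]]     # shape: (out_features, rank)

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the model loads as a standalone checkpoint.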

	## Intended Uses

	This model is designed for:

	- General text generation tasks
	- Python code understanding and generation
	- Technical documentation analysis
	- Question answering on domain-specific topics

	## Performance and Limitations

	### Strengths

	- Faster inference than larger models (approximately 2.5x speedup; this figure is self-reported and unverified)
	- Maintains much of the capability of the teacher model
	- Enhanced code understanding due to training on Python code datasets
	- Good performance on technical documentation analysis

	### Limitations

	- May not match the full capabilities of larger models on complex reasoning tasks
	- Context window is limited to 4K tokens, shorter than that of many larger models
	- Performance on specialized domains not covered in training data may be reduced

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load the model and tokenizer
	model = AutoModelForCausalLM.from_pretrained("forge1825/FStudent")
	tokenizer = AutoTokenizer.from_pretrained("forge1825/FStudent")

	# Generate text
	input_text = "Write a Python function to calculate the Fibonacci sequence:"
	inputs = tokenizer(input_text, return_tensors="pt")
	outputs = model.generate(**inputs, max_length=512)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	### Quantized Usage

	For more efficient inference, you can load the model with quantization:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
	import torch

	# 4-bit quantization configuration
	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_compute_dtype=torch.float16
	)

	# Load the model with quantization
	model = AutoModelForCausalLM.from_pretrained(
	    "forge1825/FStudent",
	    device_map="auto",
	    quantization_config=quantization_config
	)
	tokenizer = AutoTokenizer.from_pretrained("forge1825/FStudent")
	```
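
For a rough sense of what 4-bit loading saves, the sketch below estimates weight storage only. It assumes about 3.8 billion parameters, the published size of the Phi-3-mini base model, and it ignores quantization constants, any layers kept in higher precision, activations, and the KV cache, so real memory use will be higher. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9  # approximate, per the base model's published size

fp16_gb = weight_memory_gb(PHI3_MINI_PARAMS, 16)
int4_gb = weight_memory_gb(PHI3_MINI_PARAMS, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions, 4-bit weights take roughly a quarter of the fp16 footprint.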

	## Training Details

	- Training Framework: Hugging Face Transformers with PEFT
	- Optimizer: AdamW
	- Learning Rate Schedule: Linear warmup followed by linear decay
	- Training Hardware: NVIDIA GPUs
	- Distillation Method: Knowledge distillation with teacher-student architecture
	- Self-Study Mechanism: Curiosity-driven exploration with hierarchical context
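
The learning rate schedule named above can be sketched as a plain function of the step count. The peak rate of 1e-4 comes from this card, while the warmup and total step counts below are hypothetical because the card does not report them. In Transformers, `get_linear_schedule_with_warmup` implements a schedule of this shape; the card does not say which scheduler was actually used.

```python
def linear_warmup_decay(step: int, peak_lr: float,
                        warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


PEAK_LR = 1e-4       # from the card
WARMUP_STEPS = 100   # hypothetical
TOTAL_STEPS = 1000   # hypothetical

for s in (0, 50, 100, 550, 1000):
    print(s, linear_warmup_decay(s, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS))
```

The rate rises from 0 to the peak over the warmup steps, then falls linearly back to 0 at the final step.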

	## Ethical Considerations

	This model inherits the capabilities and limitations of its base model (Phi-3-mini-4k-instruct). While efforts have been made to ensure responsible behavior, the model may still:

	- Generate incorrect or misleading information
	- Produce biased content reflecting biases in the training data
	- Create code that contains bugs or security vulnerabilities

	Users should validate and review the model's outputs, especially for sensitive applications.
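
As one small, partial example of such validation, generated Python can at least be checked for syntax errors before anyone runs it. The helper below is a hypothetical sketch using the standard library's `ast` module. It does not detect logic bugs or security vulnerabilities, so it complements human review rather than replacing it.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; this checks syntax only."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


generated = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
print(is_valid_python(generated))
print(is_valid_python("def fib(:"))
```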

	## Citation and Attribution

	If you use this model in your research or applications, please cite:

	```
	@misc{forge1825_fstudent,
	  author = {Forge1825},
	  title = {FStudent: Distilled Phi-3 Model},
	  year = {2025},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/forge1825/FStudent}}
	}
	```

	## Acknowledgements

	- Microsoft for the Phi-3-mini-4k-instruct base model
	- Hugging Face for the infrastructure and tools
	- The creators of the Python code dataset used in training