# Qwen3-1.7B-FlashHead-W4A16
Optimized version of Qwen3-1.7B using quantization and FlashHead, Embedl's efficient replacement for the language model head, which reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:
- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via the `embedl-models` package
FlashHead matches the Qwen3-1.7B baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
## Quickstart
Launch an interactive chat window (with `/reset` and `/exit` commands) with:

```bash
pip install embedl-models
python3 -m embedl.models.vllm.demo --model embedl/Qwen3-1.7B-FlashHead-W4A16
```
## Model Details
| Field | Value |
|---|---|
| Base Model | Qwen3-1.7B |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Apache 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations
- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- Quantization (W4A16) - large reduction in memory footprint and latency.
- Custom Runtime Integration - compatible with vLLM (0.10.2) via the `embedl-models` package.
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 100 | 1.0× |
| FlashHead (Embedl) | 114 | 1.14× |
| W4A16 baseline | 206 | 2.06× |
| FlashHead W4A16 (Embedl) | 271 | 2.71× |
FlashHead improves end-to-end speed by 1.32× over the state-of-the-art W4A16 baseline (271 vs. 206 tokens/sec) while maintaining accuracy parity.
Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
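The benchmarking script itself is not included here; the sketch below shows one way to reproduce this setup, assuming the `embedl.models.vllm.LLM` wrapper accepts the same arguments and returns the same outputs as `vllm.LLM`. The prompt construction and the reduced `max_model_len` are illustrative choices, not part of the measured configuration.

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Qwen3-1.7B-FlashHead-W4A16"

if __name__ == "__main__":
    # max_model_len reduced here only to keep the KV cache small; not part of the original setup.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)

    # Force exactly 128 generated tokens per run, matching max_new_tokens=128 above.
    sampling = SamplingParams(max_tokens=128, ignore_eos=True, temperature=0.0)

    prompt = " ".join(["benchmark"] * 32)  # roughly a 32-token prompt

    for _ in range(10):  # warm-up runs, excluded from timing
        llm.generate([prompt], sampling)

    n_runs = 100
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        out = llm.generate([prompt], sampling)
        total_tokens += len(out[0].outputs[0].token_ids)
    elapsed = time.perf_counter() - start

    print(f"Average throughput: {total_tokens / elapsed:.1f} tokens/sec")
```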
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|
| Baseline | 0.38 | 0.24 | 0.45 | 0.47 | 0.13 |
| FlashHead | 0.38 | 0.25 | 0.45 | 0.47 | 0.12 |
FlashHead closely matches baseline accuracy.
## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
## Usage Examples

Note (vLLM context length): `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
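For example (a minimal sketch; `gpu_memory_utilization` is a standard vLLM engine argument, and it is assumed here that the `embedl.models.vllm.LLM` wrapper forwards it to the underlying engine):

```python
from embedl.models.vllm import LLM

# More conservative settings for GPUs with limited free VRAM:
# a shorter context window and a slightly larger share of GPU memory for the engine.
llm = LLM(
    model="embedl/Qwen3-1.7B-FlashHead-W4A16",
    trust_remote_code=True,
    max_model_len=32768,          # smaller KV cache than the 131072 used below
    gpu_memory_utilization=0.95,  # raised slightly above the vLLM default of 0.90
)
```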
### vLLM Inference
```python
from vllm import SamplingParams
from transformers import AutoTokenizer
from embedl.models.vllm import LLM

model_id = "embedl/Qwen3-1.7B-FlashHead-W4A16"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    messages = [{"role": "user", "content": "Write a haiku about coffee."}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    sampling = SamplingParams(
        max_tokens=1024,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    output = llm.generate([text], sampling)
    print(output[0].outputs[0].text)
```
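The sampling values above (temperature 0.6, top-p 0.95, top-k 20) match the upstream Qwen3 recommendations for thinking mode. To generate without the reasoning trace, the chat template can be rendered with `enable_thinking=False`, a Qwen3 tokenizer option rather than anything specific to this package; a minimal variation of the example above:

```python
# Variation of the example above: render the prompt without Qwen3's
# <think> ... </think> reasoning block.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```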
### Interactive REPL Example
The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.
```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Qwen3-1.7B-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
### ⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face `transformers` pipeline. Generation through `transformers` will fall back to the standard dense LM head, disabling FlashHead acceleration. For now, we strongly recommend using the vLLM integration (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference. Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released in the coming days.
## Limitations
- Limited to vLLM 0.10.2 (pinned dependency)
- Batch size = 1 (real-time generation)
- Currently optimized for NVIDIA RTX GPUs
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face `transformers` generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License

- Upstream: Apache License 2.0
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: sales@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/embedl-models
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: sales@embedl.com