# Qwen3-1.7B-FlashHead-W4A16
Optimized version of Qwen3-1.7B using quantization and FlashHead, Embedl's efficient replacement for the language model head, which reduces model size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:
- FlashHead
- Quantization (W4A16)
- Custom vLLM generation via the `embedl-models` package
FlashHead matches the Qwen3-1.7B baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
## Quickstart
Launch an interactive chat window (with `/reset` and `/exit` commands) with:

```bash
pip install embedl-models
python3 -m embedl.models.vllm.demo --model embedl/Qwen3-1.7B-FlashHead-W4A16
```
## Model Details
| Field | Value |
|---|---|
| Base Model | Qwen3-1.7B |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head, Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: Apache 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations
- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- Quantization (W4A16) - large reduction in memory footprint and latency.
- Custom Runtime Integration - compatible with vLLM (0.10.2) via the `embedl-models` package.
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 100 | 1.0× |
| FlashHead (Embedl) | 114 | 1.14× |
| W4A16 baseline | 206 | 2.06× |
| FlashHead W4A16 (Embedl) | 271 | 2.71× |
FlashHead improves end-to-end speed by 1.32× over the state-of-the-art W4A16 baseline (271 vs. 206 tokens/sec) while maintaining accuracy parity.
Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
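The benchmarking script itself is not included here; the sketch below shows one way to reproduce this setup, assuming the `embedl.models.vllm.LLM` wrapper accepts the same arguments and returns the same outputs as `vllm.LLM`. The prompt construction and the reduced `max_model_len` are illustrative choices, not part of the measured configuration.

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Qwen3-1.7B-FlashHead-W4A16"

if __name__ == "__main__":
    # max_model_len reduced here only to keep the KV cache small; not part of the original setup.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)

    # Force exactly 128 generated tokens per run, matching max_new_tokens=128 above.
    sampling = SamplingParams(max_tokens=128, ignore_eos=True, temperature=0.0)

    prompt = " ".join(["benchmark"] * 32)  # roughly a 32-token prompt

    for _ in range(10):  # warm-up runs, excluded from timing
        llm.generate([prompt], sampling)

    n_runs = 100
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        out = llm.generate([prompt], sampling)
        total_tokens += len(out[0].outputs[0].token_ids)
    elapsed = time.perf_counter() - start

    print(f"Average throughput: {total_tokens / elapsed:.1f} tokens/sec")
```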
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | IFEval | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|
| Baseline | 0.38 | 0.24 | 0.45 | 0.47 | 0.13 |
| FlashHead | 0.38 | 0.25 | 0.45 | 0.47 | 0.12 |
FlashHead closely matches baseline accuracy.
## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
## Usage Examples

Note (vLLM context length): `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
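For example (a minimal sketch; `gpu_memory_utilization` is a standard vLLM engine argument, and it is assumed here that the `embedl.models.vllm.LLM` wrapper forwards it to the underlying engine):

```python
from embedl.models.vllm import LLM

# More conservative settings for GPUs with limited free VRAM:
# a shorter context window and a slightly larger share of GPU memory for the engine.
llm = LLM(
    model="embedl/Qwen3-1.7B-FlashHead-W4A16",
    trust_remote_code=True,
    max_model_len=32768,          # smaller KV cache than the 131072 used below
    gpu_memory_utilization=0.95,  # raised slightly above the vLLM default of 0.90
)
```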
### vLLM Inference
```python
from vllm import SamplingParams
from transformers import AutoTokenizer
from embedl.models.vllm import LLM

model_id = "embedl/Qwen3-1.7B-FlashHead-W4A16"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    messages = [{"role": "user", "content": "Write a haiku about coffee."}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    sampling = SamplingParams(
        max_tokens=1024,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    output = llm.generate([text], sampling)
    print(output[0].outputs[0].text)
```
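The sampling values above (temperature 0.6, top-p 0.95, top-k 20) match the upstream Qwen3 recommendations for thinking mode. To generate without the reasoning trace, the chat template can be rendered with `enable_thinking=False`, a Qwen3 tokenizer option rather than anything specific to this package; a minimal variation of the example above:

```python
# Variation of the example above: render the prompt without Qwen3's
# <think> ... </think> reasoning block.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```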
### Interactive REPL Example
The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.
```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Qwen3-1.7B-FlashHead-W4A16"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
### ⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face `transformers` pipeline. Generation through `transformers` will fall back to the standard dense LM head, disabling FlashHead acceleration. For now, we strongly recommend using the vLLM integration (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference. Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released in the coming days.
## Limitations
- Limited to vLLM 0.10.2 (pinned dependency)
- Batch size = 1 (real-time generation)
- Currently optimized for NVIDIA RTX GPUs
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face `transformers` generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License

- Upstream: Apache License 2.0
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: sales@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/embedl-models
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: sales@embedl.com