inference-optimization/Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Head

compressed-tensors


Accuracy

Task	Context Length	meta-llama/Llama-3.1-8B-Instruct	Llama-3.1-8B-Instruct-FP8-dynamic-QKV-Cache-FP8-Per-Head	Llama-3.1-8B-Instruct-FP8-dynamic-QKV-Cache-FP8-Per-Tensor	Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Head	Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Tensor
NIAH Single 2	4096	100.00	100.00	100.00	100.00	100.00
NIAH Single 2	16384	100.00	100.00	100.00	100.00	100.00
NIAH Single 2	32768	100.00	100.00	100.00	100.00	100.00
NIAH Single 2	65536	100.00	100.00	100.00	100.00	100.00
NIAH Single 2	131072	99.2	99.6	99.4	99.4	99.0
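The column names distinguish "Per-Head" from "Per-Tensor" variants. Assuming these refer to the granularity of the FP8 quantization scale (one scale per attention head versus one scale shared by the whole tensor), the difference can be sketched roughly as below. This is a toy illustration, not the method used to produce these checkpoints: `fake_fp8`, `amax_scale`, and `quant_dequant` are hypothetical helpers, the FP8 rounding is a simplified E4M3 emulation, and the input values are made up and deliberately exaggerated.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format
MIN_NORMAL_EXP = -6    # smallest normal exponent in E4M3

def fake_fp8(x):
    """Round x to the nearest point on an E4M3-like grid (toy emulation, no NaN handling)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)        # saturate at the format maximum
    _, e = math.frexp(a)                 # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)     # below this exponent the grid is subnormal
    step = 2.0 ** (exp - 3)              # 3 mantissa bits: 8 steps per power of two
    return math.copysign(round(a / step) * step, x)

def amax_scale(values):
    """Scale that maps the largest magnitude onto the FP8 maximum."""
    return max(abs(v) for v in values) / FP8_E4M3_MAX

def quant_dequant(values, scale):
    """Quantize with the given scale, then map back to the original range."""
    return [fake_fp8(v / scale) * scale for v in values]

def max_abs_error(original, restored):
    return max(abs(o - r) for o, r in zip(original, restored))

# Made-up data: two "heads" with very different magnitudes (assumption for illustration).
heads = [
    [0.0001, -0.0002, 0.00015, 0.00005],
    [50.0, -120.0, 80.0, 30.0],
]

# Per-tensor: a single scale computed over every head.
tensor_scale = amax_scale([v for head in heads for v in head])
per_tensor = [quant_dequant(head, tensor_scale) for head in heads]

# Per-head: each head gets its own scale.
per_head = [quant_dequant(head, amax_scale(head)) for head in heads]

for i, head in enumerate(heads):
    print(f"head {i}: "
          f"per-tensor err={max_abs_error(head, per_tensor[i]):.3g}, "
          f"per-head err={max_abs_error(head, per_head[i]):.3g}")
```

In this toy setup the small-magnitude head underflows to zero under the shared scale, while its own scale preserves it. Real key/value tensors rarely differ this much between heads, which would be consistent with the near-identical accuracy figures in the table above.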


Safetensors

Model size: 8B params
Tensor type: BF16


Model tree for inference-optimization/Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Head

Base model: meta-llama/Llama-3.1-8B
Finetuned: meta-llama/Llama-3.1-8B-Instruct
Quantized (543): this model

Collection including inference-optimization/Llama-3.1-8B-Instruct-QKV-Cache-FP8-Per-Head

KV Cache Quantization: collection on FP8 quantization of weights, activations, and KV cache (12 items)