Faster TG (token generation) using a Draft Model!
I just did some limited testing and it seems like some TG speed-ups might be possible using jukofyork's draft models here: https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF/discussions/2#687e91bec6e493d1b5c6fc8d
I included an example command there, and it shows a possible 5%+ uplift in some configurations. More might be possible, but I've never tried speculative decoding before and am just experimenting.
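For anyone curious, here is a hedged sketch of what that kind of command can look like. The draft-model flags follow mainline llama.cpp conventions (-md/--model-draft, --draft-max, --draft-min) and the file paths are made up, so treat all of that as assumptions and check llama-server --help in your actual build:

$ # hypothetical paths; draft flags assumed to match mainline llama.cpp
$ ./build/bin/llama-server \
    -m /models/Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf \
    -md /models/Kimi-K2-Instruct-DRAFT-0.6B-v2.0.gguf \
    --draft-max 16 --draft-min 4 \
    -c 8192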
Cool!
Is there any way to actually know it's working with ik_llama.cpp? I tried the verbose output flag but don't see anything there either. Mainline prints out an acceptance-rate percentage for the draft model. Your numbers there are legit speed-ups, but on my side it's pretty hard to even tell it's working just running random tests.
Ahh, I also didn't see any printouts, but yes, I measured via A/B testing with llama-sweep-bench using my best configuration for llama-server (e.g. layer offloads etc.), holding everything constant except whether the draft model was loaded.
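In case anyone wants to reproduce the A/B, the idea is just two otherwise-identical llama-sweep-bench runs, then comparing the TG tokens/s columns. Whether sweep-bench in your ik_llama.cpp build accepts the draft-model flag is an assumption on my part, and the paths are placeholders:

$ # baseline, no draft model
$ ./build/bin/llama-sweep-bench -m Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf -c 8192
$ # identical configuration plus the draft model (flag name assumed from mainline)
$ ./build/bin/llama-sweep-bench -m Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf -c 8192 -md Kimi-K2-Instruct-DRAFT-0.6B-v2.0.gguf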
Since the ik_llama.cpp fork is down/missing right now, once it is back up (possibly at its new location) you can ask there, and if that doesn't turn anything up, open an issue to see if acceptance-rate printouts would be easy to add. Sorry, I didn't even glance at the code as I'm busy cooking new Qwen-235B quants since the update hahah...
Follow along here for news on the ik_llama.cpp GitHub repo's progress: https://huggingface.co/ikawrakow/Qwen3-30B-A3B/discussions/2
Thanks!
No worries! Yeah, the repo being down hurts; I would've checked myself with that handy repo search feature.
Looking forward to the new Qwen-235B quants, thanks so much for these!
Thanks for the IQ1_KT. It's slower than UD-IQ1_S but saves me around 20 GB of system memory, so I can run other things without swap.
Yeah, seems to be CPU-bound. I'm downloading the tiny quant since it's got similar PPL to IQ1_S but should let me fit more on GPUs.
Question about that chart (I was confused by it earlier): Where is that IQ2_KS quant that's < 300GB in size?
Yeah, the smol one is ik's latest IQ1_KT quant, brand new, merged the day his repo disappeared! SOTA 1.75 BPW quant! Good stuff!
Here is that IQ2_KS: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/tree/main/IQ2_KS
You can look at the recipe here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF#-v02-iq2_ks-290327-gib-2429-bpw and see it uses another week-old SOTA quant, the IQ2_KL, which is a slightly bigger 2.69 BPW quant with great PPL for the size.
I'm reporting values in GiB (gibibytes): 1 gibibyte (GiB) is exactly 1.073741824 gigabytes (GB).
1 GiB = 1024 * 1024 * 1024 Bytes
1 GB = 1000 * 1000 * 1000 Bytes
$ du -hc /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/*.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00002-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00003-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00004-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00005-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00006-of-00007.gguf
41G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00007-of-00007.gguf
291G total # this is in GiB
So about 290.327 GiB is 311.7362 GB (metric units).
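You can sanity-check that conversion with a quick one-liner (assuming python3 is on hand):

$ python3 -c 'print(round(290.327 * 1024**3 / 1000**3, 4))'  # GiB -> GB
311.7362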
That is probably the confusion. I try to use GiB and specify it, but I do get lazy and sometimes use GB interchangeably. It can and does matter at larger sizes though!

