Faster TG (token generation) using a Draft Model!
I just did some limited testing and it seems like some TG speed-ups might be possible using jukofyork's draft models here: https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF/discussions/2#687e91bec6e493d1b5c6fc8d
I included an example command there, and it shows a possible 5%+ uplift in some configurations. More might be possible, but I've never tried speculative decoding before and am just experimenting.
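For anyone curious, here is a hedged sketch of what that kind of command can look like. The draft-model flags follow mainline llama.cpp conventions (-md/--model-draft, --draft-max, --draft-min) and the file paths are made up, so treat all of that as assumptions and check llama-server --help in your actual build:

$ # hypothetical paths; draft flags assumed to match mainline llama.cpp
$ ./build/bin/llama-server \
    -m /models/Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf \
    -md /models/Kimi-K2-Instruct-DRAFT-0.6B-v2.0.gguf \
    --draft-max 16 --draft-min 4 \
    -c 8192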
Cool!
Is there any way to actually know it's working with ik_llama.cpp? I tried the verbose output flag but don't see anything there either. Mainline prints out an acceptance-rate percentage for the draft model. Your numbers there are legit speed-ups, but on my side it's pretty hard to even tell it's working just running random tests.
Ahh, I also didn't see any printouts, but yes, I measured via A/B testing with llama-sweep-bench using my best configuration for llama-server (e.g. layer offloads etc.), holding everything constant except whether the draft model was loaded.
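In case anyone wants to reproduce the A/B, the idea is just two otherwise-identical llama-sweep-bench runs, then comparing the TG tokens/s columns. Whether sweep-bench in your ik_llama.cpp build accepts the draft-model flag is an assumption on my part, and the paths are placeholders:

$ # baseline, no draft model
$ ./build/bin/llama-sweep-bench -m Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf -c 8192
$ # identical configuration plus the draft model (flag name assumed from mainline)
$ ./build/bin/llama-sweep-bench -m Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf -c 8192 -md Kimi-K2-Instruct-DRAFT-0.6B-v2.0.gguf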
Since the ik_llama.cpp fork is down/missing right now, once it is back up (possibly at its new location) you can ask there, and if that doesn't turn anything up, open an issue to see if acceptance-rate printouts would be easy to add. Sorry, I didn't even glance at the code as I'm busy cooking new Qwen-235B quants since the update hahah...
Follow along here for news on the ik_llama.cpp GitHub repo's progress: https://huggingface.co/ikawrakow/Qwen3-30B-A3B/discussions/2
Thanks!
No worries! Yeah, the repo being down hurts; I would've checked myself with that handy repo search feature.
Looking forward to the new Qwen-235B quants, thanks so much for these!
Thanks for the IQ1_KT. It's slower than UD-IQ1_S but saves me around 20 GB of system memory, so I can run other things without swap.
Yeah, seems to be CPU-bound. I'm downloading the tiny quant since it's got similar PPL to IQ1_S but should let me fit more on GPUs.
Question about that chart (I was confused by it earlier): Where is that IQ2_KS quant that's < 300GB in size?
Yeah, the smol one is ik's latest IQ1_KT quant, brand new, merged the day his repo disappeared! SOTA 1.75 BPW quant! Good stuff!
Here is that IQ2_KS: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/tree/main/IQ2_KS
You can look at the recipe here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF#-v02-iq2_ks-290327-gib-2429-bpw and see it uses another week-old SOTA quant, the IQ2_KL, which is a slightly bigger 2.69 BPW quant with great PPL for the size.
I'm reporting values in GiB (gibibytes): 1 gibibyte (GiB) is exactly 1.073741824 gigabytes (GB).
1 GiB = 1024 * 1024 * 1024 Bytes
1 GB = 1000 * 1000 * 1000 Bytes
$ du -hc /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/*.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00001-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00002-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00003-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00004-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00005-of-00007.gguf
42G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00006-of-00007.gguf
41G /mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KS/Kimi-K2-Instruct-IQ2_KS-00007-of-00007.gguf
291G total # this is in GiB
So about 290.327 GiB is 311.7362 GB (metric units).
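You can sanity-check that conversion with a quick one-liner (assuming python3 is on hand):

$ python3 -c 'print(round(290.327 * 1024**3 / 1000**3, 4))'  # GiB -> GB
311.7362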
That is probably the confusion. I try to use GiB and specify it, but I do get lazy and sometimes use GB interchangeably. It can and does matter at larger sizes though!

