Can't wait!
I have 512 GB RAM and would love to run maybe IQ3 on it. Can't wait!!
I have also 512 GB RAM and 48 Gb VRAM, can't wait for the ggufs!
haha yeah this giant is pretty exciting to try out! big thanks once again to Wendell at level1techs for the huge rig with plenty of fast kioxia flash drives as I'm chewing through TBs of disk and loading ~1TB of weights into RAM takes a hot minute lol...
Also keep an eye on https://huggingface.co/anikifoss as he likes the bigger size quants! Thanks @anikifoss again for the PR to ik_llama.cpp! haha
I just jumped to your recent Kimi-K2 branch; does the unsloth Kimi-K2 quant work with it?
I can load unsloth's UD IQ3_XXS on my system, but tokens/second for both prompt processing and generation is bad, like 5 tk/sec when I get about 12 tk/sec on DeepSeek R1 with its 37 billion active parameters. With 32B active here, I was hopeful it would do better :)
I'll wait for @ubergarm to finish uploading his model so I can try this out...
fwiw I have heard some folks are having issues with mainline/unsloth quants right now, though I've also seen unsloth Daniel show it working for him in the mainline llama.cpp PR discussions. I personally have not tried it.
I'm currently uploading my first SOTA quant using the brand new IQ2_KL quant, which is pretty amazing for the size. This blend is looking good so far; I have some folks testing it currently and it is coherent and replying correctly. You will need to get the latest version of ik_llama.cpp including this PR, which fixes the built-in completions endpoint chat template. EDIT: This PR just landed in main, so just build the latest tip of main of ik_llama.cpp for this one now!
Upload should complete within an hour!!
Don't worry, I'll follow up with some bigger quants for you guys with monster rigs!!! ;p
Also I'll give some more usage examples, but it is basically the same as running deepseek, except instead of the first 3 layers being dense (e.g. [0-2]) there is only a single dense layer. So instead of starting from 3 to offload routed exps, start from 1, e.g.:
-ngl 99 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
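To make that concrete, here is a rough single-GPU sketch of how those overrides slot into a full command; the model path, blk ranges, and thread count are placeholders, not a tested recipe:

# hedged sketch only: tune the model path, blk ranges, and threads for your rig
./build/bin/llama-server \
    --model /path/to/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf \
    --ctx-size 32768 \
    -fa -fmoe -mla 3 \
    -ngl 99 \
    -ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    --threads 32 \
    --host 127.0.0.1 --port 8080

For a deepseek-style model with three dense layers the first override would start at blk.3 instead of blk.1; everything else stays the same.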
any quant in the 250GB range?
I personally hope your IQ4 quant is more like a low 4-bit like IQ4_XS as opposed to a high 4-bit like Ktransformers' own Q4_K_M quant. Factoring in context, a 612GB file size isn't practical for me, so I'm really hoping for something 550GB or less.
Also will IQ3 be on the menu after IQ4?
any quant in the 250GB range?
That is getting into sub-2 bpw range, which is tricky... I've heard a rumor that ik is working on an IQ1_KT 1.75 BPW trellis-style quant (QTIP-like, kinda like exl3) that could possibly be faster at inference than IQ1_S-type stuff... so that might give some hope at that size soon, outside of the IQ1_S-type stuff which I could try...
Next I might try IQ2_KL for down and IQ2_KS for (gate|up), which would put us around 3.0 BPW... we'll see!
That said, if there was say a 550GB file size and I have only 512GB RAM but a fair bit of VRAM, like two 6000 Adas or an RTX Pro 6000, could we use the VRAM so that I can load less into DRAM? With your DeepSeek V3 0324 IQ4 quant it takes like 420GB of RAM and 30-40GB of VRAM with my config, but I am not sure if offloading more layers to GPU can lessen the RAM load.
@mtcl sorry for cc maybe you know about this?
if offloading more layers to GPU can lessen the RAM load
Absolutely! Say you have a single 96GB VRAM GPU and 512GB RAM, that is 608GB total. So you offload enough layers onto the GPU and the rest on CPU/RAM and it will fit. It can take some trial and error to dial in the tensor -ot overrides. Check out the new example for multi-GPU in the model card I just updated. It might be possible to do 512GB RAM + 48GB VRAM but it is gonna be very tight and probably not much context or increased -ub size for extra PP speed.
Basically just add until you OOM then back off by one layer:
-ngl 99 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
Oops, I made a typo on the model card and need to close my </details> tag; this info is buried in the second "secret recipe" fold, lol... I'll fix it as soon as the weights finish uploading in an hour or so...
Getting great speeds even CPU only with these big models! I have some more example commands and details on how I'm running it here: https://github.com/ikawrakow/ik_llama.cpp/pull/612#issuecomment-3076539817
Shown optimized for PP. You can probably get a little more TG going with -rtr and smaller batch sizes.
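For anyone curious, the TG-leaning variation I mean looks roughly like this; a sketch only, assuming the CPU-only setup above, and the exact gains will vary:

# hedged sketch: -rtr (--run-time-repack) repacks tensors at load time for the
# CPU backend, and smaller -b/-ub batches generally trade PP speed for TG
model=/path/to/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model "$model" \
    -fa -fmoe -mla 3 \
    -rtr \
    -b 512 -ub 512 \
    --ctx-size 32768 \
    --numa numactl \
    --threads 128 \
    --host 127.0.0.1 --port 8080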
following along for a possible smolboi version, just built a sapphire rapids setup with 2x 3090s and 1x 4090, but i could only muster up enough clams for 256gb ddr5 rdimms at the moment.
for the past few weeks, honestly the deepseek v3 0324 iq1 has been great for my use case, but i found i was limited by my old 7950x memory controllers. i expect a large uplift especially in prompt processing with my new setup.
excited to see what's to come!
Guys who have 512GB RAM and 48GB VRAM - what exact PCs do you have, what operating system?
Usually Threadrippers and decommissioned servers with Linux.
MavenDE: Guys who have 512GB RAM and 48GB VRAM - what exact PCs do you have, what operating system?
Dell R740 with an Nvidia 4090D 48GB and 480GB of RAM.
Proxmox / Debian LXC with Nvidia Toolkit.
Real-world numbers from IQ2_KL:
INFO [ print_timings] prompt eval time = 73369.96 ms / 3004 tokens ( 24.42 ms per token, 40.94 tokens per second) | tid="131974847598592" timestamp=1752671392 id_slot=0 id_task=0 t_prompt_processing=73369.96 n_prompt_tokens_processed=3004 t_token=24.424087882822906 n_tokens_second=40.9431871027325
INFO [ print_timings] generation eval time = 6233.78 ms / 24 runs ( 259.74 ms per token, 3.85 tokens per second) | tid="131974847598592" timestamp=1752671392 id_slot=0 id_task=0 t_token_generation=6233.783 n_decoded=24 t_token=259.74095833333337 n_tokens_second=3.8499896451320166
...
INFO [ print_timings] prompt eval time = 10518.76 ms / 80 tokens ( 131.48 ms per token, 7.61 tokens per second) | tid="131974847598592" timestamp=1752673182 id_slot=1 id_task=423 t_prompt_processing=10518.764 n_prompt_tokens_processed=80 t_token=131.48454999999998 n_tokens_second=7.605456306463384
INFO [ print_timings] generation eval time = 5119.91 ms / 19 runs ( 269.47 ms per token, 3.71 tokens per second) | tid="131974847598592" timestamp=1752673182 id_slot=1 id_task=423 t_token_generation=5119.906 n_decoded=19 t_token=269.4687368421053 n_tokens_second=3.7110056317440203
if offloading more layers to GPU can lessen the RAM load

Absolutely! Say you have a single 96GB VRAM GPU and 512GB RAM, that is 608GB total. So you offload enough layers onto the GPU and the rest on CPU/RAM and it will fit. It can take some trial and error to dial in the tensor -ot overrides. Check out the new example for multi-GPU in the model card I just updated. It might be possible to do 512GB RAM + 48GB VRAM but it is gonna be very tight and probably not much context or increased -ub size for extra PP speed. Basically just add until you OOM then back off by one layer:
-ngl 99 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
Oops, I made a typo on the model card and need to close my </details> tag; this info is buried in the second "secret recipe" fold, lol... I'll fix it as soon as the weights finish uploading in an hour or so...
I was able to just run the model like this without worry on your branch:
CUDA_VISIBLE_DEVICES="0" LLAMA_ARG_NUMA="numactl" GGML_CUDA_ENABLE_UNIFIED_MEMORY=0 numactl --cpunodebind=0 --membind=0 ./bin/llama-server --model "/media/gopinath-s/C2A4E757A4E74D0B1/llama-cpp/models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS-00001-of-00005.gguf" --ctx-size 102144 -mla 2 -fa -amb 512 -fmoe --n-gpu-layers 95 --override-tensor exps=CPU -b 2048 -ub 2048 --parallel 1 --threads 28 --threads-batch 28 --temp 0.7 --min-p 0.05 -ser 4,1 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080
Now I switched to the main branch for running Kimi and it's throwing OOM memory errors. Is there any PR of yours that didn't get merged into main? I also enabled 2 NUMA nodes to see whether I'd get a speed increase, could that be causing the error?
Btw, it looks like ik_llama supports dots too, any quant for that?
Nvm, it was just a NUMA error.
yes ik_llama.cpp can run the mainline quants from bartowski, unsloth, mradermacher etc.
Now I switched to the main branch for running Kimi and it's throwing OOM memory errors. Is there any PR of yours that didn't get merged into main? I also enabled 2 NUMA nodes to see whether I'd get a speed increase, could that be causing the error?
Everything for Kimi-K2 is merged into main on ik_llama.cpp. I have to go now, but if you run in a single NUMA node, that node must have enough RAM for the entire model. If not, glue together all your nodes with numactl --interleave=all llama-server --numa distribute ...
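In other words, roughly one of these two shapes (sketch, substitute your own model path and thread counts):

# model fits in a single NUMA node's RAM: pin CPU and memory to that node
numactl --cpunodebind=0 --membind=0 \
./build/bin/llama-server --model "$model" --numa numactl --threads 28 ...

# model needs RAM from all nodes: interleave allocations across them
numactl --interleave=all \
./build/bin/llama-server --model "$model" --numa distribute --threads 56 ...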
any quant in the 250GB range?
I'm working on a sub 250GB model using experimental new IQ1_KT discussed here: https://github.com/ikawrakow/ik_llama.cpp/pull/616
Good stuff everyone.
Here are some stats from my rig (7995wx, 384gb DDR5-6800, 1x RTX Pro 6000 96gb 600w@450w):
INFO [ print_timings] prompt eval time = 41322.38 ms / 13323 tokens ( 3.10 ms per token, 322.42 tokens per second) | tid="140385625272320" timestamp=1752695437 id_slot=0 id_task=641 t_prompt_processing=41322.38 n_prompt_tokens_processed=13323 t_token=3.1015822262253243 n_tokens_second=322.4160854239277
INFO [ print_timings] generation eval time = 27466.76 ms / 634 runs ( 43.32 ms per token, 23.08 tokens per second) | tid="140385625272320" timestamp=1752695437 id_slot=0 id_task=641 t_token_generation=27466.758 n_decoded=634 t_token=43.32296214511041 n_tokens_second=23.08244751710413
Kimi IQ2_KS. Can't complain about 322pp/23ts on this large of a model!
Will be testing IQ2_KL shortly with the same prompt.
I bought new ram sticks yesterday, 192GB DDR5 now. I'm able to run the IQ1_S unsloth quant in mainline.
yes ik_llama.cpp can run the mainline quants from bartowski, unsloth, mradermacher etc.
I know this isn't tech support but I don't suppose anyone else gets this when trying to run the IQ1_S with a fresh clone/build?
/home/gghfez/apps/ik_llama.cpp/ggml/src/ggml.c:15254: fatal error
(the same ggml.c:15254 fatal error, repeated and interleaved across threads)
Fully kills the driver
[113188.653626] sh (81407): drop_caches: 3
[114284.401772] llama-server[81743]: segfault at 21ba03fd0 ip 00007fe49e824cd7 sp 00007fa2317cf2a0 error 4 in libcuda.so.550.144.03[424cd7,7fe49e4de000+4d5000] likely on CPU 21 (core 27, socket 0)
[114284.401782] Code: ef e8 fd 9c cb ff 83 3d 3e 0c 74 01 01 49 8b 1c 24 76 0a 8b 05 46 0c 74 01 85 c0 74 56 49 8b 44 24 10 41 8b 4c 24 24 48 8b 13 <8b> 00 41 39 c6 74 52 8b b3 40 40 00 00 48 89 f0 89 8c b3 44 40 00
I know this isn't tech support but I don't suppose anyone else gets this when trying to run the IQ1_S with a fresh clone/build?
IQ1_S is not a great quant, but yeah it's hard finding sub-2bpw stuff. I'm just getting back to my desk and gonna see how the IQ1_KT is looking on ik's fork, and may release something.
How are you compiling ik's fork? This is what I use for hybrid CUDA+CPU inferencing for deepseek/kimi-k2 MLA stuff:
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
If that doesn't help, what command are you running with?
I know this isn't tech support but I don't suppose anyone else gets this when trying to run the IQ1_S with a fresh clone/build?
Are you using the "fused MoE" option?
Here are some stats from my machine.
2x NVIDIA RTX 6000 Pro and 2x NVIDIA RTX 5090
IQ2_KL - 2x 6000 with the -ot parameter, 128K context
CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
--model /media/mukul/backup/models/ubergarm/Kimi-K2-Instruct-GGUF/IQ2_KL/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf \
--alias ubergarm/Kimi-K2-Instruct-GGUF \
--ctx-size 131072 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
-ngl 99 \
-b 4096 -ub 4096 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--threads-batch 64 \
--host 0.0.0.0 \
--port 10002
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 56, n_threads_batch = 64
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 35.934 | 113.99 | 91.367 | 11.21 |
| 4096 | 1024 | 4096 | 35.510 | 115.35 | 85.530 | 11.97 |
| 4096 | 1024 | 8192 | 35.780 | 114.48 | 91.210 | 11.23 |
| 4096 | 1024 | 12288 | 36.087 | 113.50 | 90.702 | 11.29 |
| 4096 | 1024 | 16384 | 36.427 | 112.44 | 94.757 | 10.81 |
| 4096 | 1024 | 20480 | 36.914 | 110.96 | 93.763 | 10.92 |
| 4096 | 1024 | 24576 | 37.213 | 110.07 | 88.753 | 11.54 |
| 4096 | 1024 | 28672 | 37.288 | 109.85 | 84.589 | 12.11 |
| 4096 | 1024 | 32768 | 37.556 | 109.06 | 88.703 | 11.54 |
| 4096 | 1024 | 36864 | 37.891 | 108.10 | 69.181 | 14.80 |
| 4096 | 1024 | 40960 | 38.097 | 107.52 | 95.880 | 10.68 |
| 4096 | 1024 | 45056 | 38.771 | 105.65 | 96.001 | 10.67 |
| 4096 | 1024 | 49152 | 41.426 | 98.88 | 114.007 | 8.98 |
| 4096 | 1024 | 53248 | 38.912 | 105.26 | 85.312 | 12.00 |
| 4096 | 1024 | 57344 | 39.081 | 104.81 | 101.482 | 10.09 |
| 4096 | 1024 | 61440 | 39.276 | 104.29 | 101.727 | 10.07 |
| 4096 | 1024 | 65536 | 39.675 | 103.24 | 95.502 | 10.72 |
| 4096 | 1024 | 69632 | 39.829 | 102.84 | 97.575 | 10.49 |
| 4096 | 1024 | 73728 | 40.495 | 101.15 | 100.179 | 10.22 |
| 4096 | 1024 | 77824 | 40.771 | 100.46 | 103.442 | 9.90 |
| 4096 | 1024 | 81920 | 40.858 | 100.25 | 105.901 | 9.67 |
| 4096 | 1024 | 86016 | 41.056 | 99.77 | 100.122 | 10.23 |
| 4096 | 1024 | 90112 | 41.556 | 98.57 | 103.497 | 9.89 |
| 4096 | 1024 | 94208 | 41.813 | 97.96 | 107.443 | 9.53 |
| 4096 | 1024 | 98304 | 42.142 | 97.19 | 111.211 | 9.21 |
| 4096 | 1024 | 102400 | 41.810 | 97.97 | 107.267 | 9.55 |
| 4096 | 1024 | 106496 | 42.192 | 97.08 | 104.101 | 9.84 |
| 4096 | 1024 | 110592 | 42.494 | 96.39 | 105.393 | 9.72 |
| 4096 | 1024 | 114688 | 42.732 | 95.85 | 112.332 | 9.12 |
| 4096 | 1024 | 118784 | 43.053 | 95.14 | 109.134 | 9.38 |
| 4096 | 1024 | 122880 | 43.748 | 93.63 | 114.136 | 8.97 |
| 4096 | 1024 | 126976 | 58.711 | 69.77 | 113.304 | 9.04 |
IQ4_KS - 2x 5090 and 2x 6000 with the -ot parameter, 128K context (best)
CUDA_VISIBLE_DEVICES="0,1,2,3" ./build/bin/llama-server \
--model /media/mukul/extra/models/ubergarm/Kimi-K2-Instruct-GGUF/IQ4_KS/Kimi-K2-Instruct-IQ4_KS-00001-of-00013.gguf \
--alias ubergarm/Kimi-K2-Instruct-IQ4_KS \
--ctx-size 131072 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
-ngl 99 \
-b 4096 -ub 4096 \
-ot "blk\.(1|2|3|4|5|6|7)\.ffn_.*=CUDA0" \
-ot "blk\.(8|9|10|11|12|13|14)\.ffn_.*=CUDA1" \
-ot "blk\.(15)\.ffn_.*=CUDA2" \
-ot "blk\.(17)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--threads-batch 64 \
--host 0.0.0.0 \
--port 10002
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 56, n_threads_batch = 64
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 48.541 | 84.38 | 80.945 | 12.65 |
| 4096 | 1024 | 4096 | 46.096 | 88.86 | 73.455 | 13.94 |
| 4096 | 1024 | 8192 | 46.168 | 88.72 | 76.755 | 13.34 |
| 4096 | 1024 | 12288 | 46.280 | 88.51 | 81.525 | 12.56 |
| 4096 | 1024 | 16384 | 46.535 | 88.02 | 85.476 | 11.98 |
| 4096 | 1024 | 20480 | 47.028 | 87.10 | 83.644 | 12.24 |
| 4096 | 1024 | 24576 | 47.278 | 86.64 | 78.471 | 13.05 |
| 4096 | 1024 | 28672 | 46.223 | 88.61 | 82.925 | 12.35 |
| 4096 | 1024 | 32768 | 48.233 | 84.92 | 94.340 | 10.85 |
| 4096 | 1024 | 36864 | 60.244 | 67.99 | 83.756 | 12.23 |
| 4096 | 1024 | 40960 | 47.019 | 87.11 | 99.698 | 10.27 |
| 4096 | 1024 | 45056 | 47.172 | 86.83 | 77.333 | 13.24 |
| 4096 | 1024 | 49152 | 47.439 | 86.34 | 78.607 | 13.03 |
| 4096 | 1024 | 53248 | 47.452 | 86.32 | 78.862 | 12.98 |
| 4096 | 1024 | 57344 | 47.551 | 86.14 | 90.759 | 11.28 |
| 4096 | 1024 | 61440 | 47.661 | 85.94 | 91.470 | 11.19 |
| 4096 | 1024 | 65536 | 47.980 | 85.37 | 85.413 | 11.99 |
| 4096 | 1024 | 69632 | 48.122 | 85.12 | 83.660 | 12.24 |
| 4096 | 1024 | 73728 | 48.568 | 84.34 | 83.934 | 12.20 |
| 4096 | 1024 | 77824 | 48.791 | 83.95 | 84.952 | 12.05 |
| 4096 | 1024 | 81920 | 49.091 | 83.44 | 85.207 | 12.02 |
| 4096 | 1024 | 86016 | 50.188 | 81.61 | 86.625 | 11.82 |
| 4096 | 1024 | 90112 | 49.560 | 82.65 | 87.692 | 11.68 |
| 4096 | 1024 | 94208 | 49.815 | 82.22 | 87.924 | 11.65 |
| 4096 | 1024 | 98304 | 49.734 | 82.36 | 89.837 | 11.40 |
| 4096 | 1024 | 102400 | 92.639 | 44.21 | 110.396 | 9.28 |
| 4096 | 1024 | 106496 | 94.440 | 43.37 | 117.939 | 8.68 |
| 4096 | 1024 | 110592 | 62.571 | 65.46 | 119.668 | 8.56 |
| 4096 | 1024 | 114688 | 50.397 | 81.28 | 93.243 | 10.98 |
| 4096 | 1024 | 118784 | 50.644 | 80.88 | 93.611 | 10.94 |
| 4096 | 1024 | 122880 | 51.543 | 79.47 | 95.323 | 10.74 |
| 4096 | 1024 | 126976 | 53.588 | 76.43 | 95.850 | 10.68 |
Are you using the "fused MoE" option?
tl;dr:
Update: yes, it seems like at least one UD quant is broken; removing -fmoe fixes it, though with a speed penalty. Thanks juk, and interesting work you're up to with the tiny draft model!
details
Huh I'm getting some issues too running ik's fork trying to test perplexity on https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF?show_file_info=UD-IQ1_S%2FKimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf
# compiled CPU only backend
cmake -B build -DGGML_CUDA=0 -DGGML_BLAS=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)
# command
model=/mnt/raid/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ1_S/Kimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf
numactl -N 0 -m 0 \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
--numa numactl \
--threads 128 \
--threads-batch 192
llama_model_loader: - type f32: 365 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q4_K: 56 tensors
llama_model_loader: - type q5_K: 35 tensors
llama_model_loader: - type q6_K: 18 tensors
llama_model_loader: - type iq2_xxs: 24 tensors
llama_model_loader: - type iq3_xxs: 49 tensors
llama_model_loader: - type iq1_s: 82 tensors
llama_model_loader: - type iq3_s: 158 tensors
llama_model_loader: - type iq2_s: 34 tensors
llama_model_loader: - type iq4_xs: 139 tensors
llama_model_loader: - type iq1_m: 14 tensors
llm_load_print_meta: model size = 261.979 GiB (2.192 BPW)
Computed blk.53.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.54.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.55.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.56.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.57.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.58.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.59.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CPU KV buffer size = 137.25 MiB
llama_new_context_with_model: KV self size = 137.25 MiB, c^KV (f16): 137.25 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 2.50 MiB
llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
llama_new_context_with_model: graph nodes = 3525
llama_new_context_with_model: graph splits = 1
/home/w/projects/ik_llama.cpp/ggml/src/ggml.c:15254: fatal error
(the same ggml.c:15254 fatal error, repeated and interleaved across threads)
I'll try to disable -fmoe and see if it runs. I've heard some other folks having issues specifically with the UD quants on both ik/mainline, but it was working okay with unsloth's "non-UD" versions. It wasn't this error, but over here: https://github.com/ggml-org/llama.cpp/pull/14654#issuecomment-3071665949
UPDATE
Yup, this UD quant seems broken with -fmoe. Removing that, it is running now, albeit at a speed penalty:
Computed blk.59.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CPU KV buffer size = 137.25 MiB
llama_new_context_with_model: KV self size = 137.25 MiB, c^KV (f16): 137.25 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 2.50 MiB
llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
llama_new_context_with_model: graph nodes = 3645
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 (n_threads_batch = 192) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 646.754 ms
perplexity: calculating perplexity over 568 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 16.55 seconds per pass - ETA 39.15 minutes
Here are some stats from my machine.
2x NVIDIA RTX 6000 Pro and 2x NVIDIA RTX 5090

Are you offloading as many layers as possible in both scenarios? If not, you could probably go faster, right? Maybe open a separate discussion for benchmarking, please.
Also keep an eye out for these smaller quants which you'll probably be able to fully offload https://github.com/ikawrakow/ik_llama.cpp/pull/616
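And if one of those tiny quants does fit entirely in VRAM, the exps=CPU override goes away; something roughly like this untested sketch, with a placeholder path and the default layer split across the visible GPUs:

# untested sketch: full GPU offload, so no -ot exps=CPU override needed
CUDA_VISIBLE_DEVICES="0,1,2,3" ./build/bin/llama-server \
    --model /path/to/Kimi-K2-Instruct-smol-IQ1_KT.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -fmoe -mla 3 \
    -ngl 99 \
    --parallel 1 \
    --host 0.0.0.0 --port 10002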
Thanks jukofyork and ubergarm, -fmoe was the issue.
Thanks juk, and interesting work you're up to with the tiny draft model!
Managed to get it to train:
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF
looks like around 10% increase in acceptance rate over the untrained version too!
https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF/tree/main/UD-TQ1_0
I was using this quant since it fits under my 256GB RAM; my hard disk is very slow (like 600 MB/s writing and 200 MB/s reading) so I can't use the 286GB one. The memory bandwidth I'm getting within RAM is 10 GB/s, so it's fine and the responses are somewhat okay. Just remove the default system prompt (the "helpful assistant" one) because it causes low-quality responses, and also remove the -ser; I just ordered a new disk. And does anyone know whether ik_llama supports function calling and tool calling?
Thanks jukofyork and ubergarm, -fmoe was the issue.
Please check out and keep an eye on this new ik_llama.cpp issue regarding -fmoe support for IQ1_M ffn tensors: https://github.com/ikawrakow/ik_llama.cpp/issues/626
EDIT Okay, ik now supports that IQ1_M quant type with MMQ so you can get the speed boost with -fmoe again! Pull tip of main and rebuild.
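If it helps, "pull tip of main and rebuild" is just something like the following, reusing the same cmake flags as the hybrid CUDA+CPU build above:

cd ik_llama.cpp
git checkout main && git pull
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)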
I've measured PPL on that UD-TQ1_0 and it is a bit high for its size. Though I've not released my IQ1_KT sub-2.0 BPW models yet, as that PR is still open on ik's fork.
does anyone know ik_llama support functional calling and tool calling ?
Give it a try and let us know. The API endpoint is not as updated as mainline's with that 3rd-party OpenAI-compliant library. There is some chatter about using a front end to handle it properly here: https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943 by @mtcl . Please update that github issue if you try it out!
EDIT Also @gopi87 there is a new PR for tool calling if you want to try it: https://github.com/ikawrakow/ik_llama.cpp/pull/628
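If anyone does test that PR, the thing to throw at it is the usual OpenAI-style tools payload against the server's /v1/chat/completions endpoint; a sketch below with a made-up get_weather tool, and whether tool_calls come back parsed depends on the PR and/or your front end:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'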
Also thanks @jukofyork for the draft model. I've not tried it yet, but one guy used your earlier one for deepseek kinda like so, pretty sure:
-md ~/Qwen3-0.6B-Base.i1-IQ4_XS.gguf \
-ngld 99 \
--draft 16 \
I might have to try out some llama-sweep-bench with and without the draft model and see how it looks!
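For reference, bolting the draft model onto the usual hybrid command would look roughly like this; a sketch only, with placeholder paths, and I haven't measured acceptance rates myself:

# hedged sketch: main Kimi-K2 quant split CPU+GPU as usual, plus the small
# draft model fully offloaded to GPU for speculative decoding
./build/bin/llama-server \
    --model /path/to/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf \
    -fa -fmoe -mla 3 \
    -ngl 99 \
    -ot exps=CPU \
    -md /path/to/Kimi-K2-Instruct-DRAFT-0.6B-v2.0.gguf \
    -ngld 99 \
    --draft 16 \
    --host 127.0.0.1 --port 8080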
Thanks sir, will try this with the newly updated GGUF from unsloth.
W790E Sage + QYFS + 512G + RTX5090
IQ3_KS:
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 44382.00 MiB
llm_load_tensors: CPU buffer size = 630.00 MiB
llm_load_tensors: CUDA0 buffer size = 7618.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 196608
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CUDA0 KV buffer size = 6999.78 MiB
llama_new_context_with_model: KV self size = 6999.75 MiB, c^KV (q8_0): 6999.75 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 13015.16 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 3183.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122
main: n_kv_max = 196608, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 47.114 | 86.81 | 69.047 | 14.80 |
| 4090 | 1022 | 4090 | 47.603 | 85.92 | 70.267 | 14.54 |
| 4090 | 1022 | 8180 | 48.209 | 84.84 | 86.623 | 11.80 |
| 4090 | 1022 | 12270 | 81.340 | 50.28 | 82.109 | 12.45 |
| 4090 | 1022 | 16360 | 82.138 | 49.79 | 85.950 | 11.89 |
You might have enough VRAM to run the smallest smol-IQ1_KT fully offloaded on VRAM haha... Would be curious to see your benchmarks if you do.
Also keep in mind ik added a fix for that broken UD quant, so you can now run those small UD quants with -fmoe for full bonus speed. But you can get better perplexity in less RAM with the new IQ1_KT quants.
