Can't wait!
I have 512 GB RAM and would love to run maybe IQ3 on it. Can't wait!!
I have also 512 GB RAM and 48 Gb VRAM, can't wait for the ggufs!
haha yeah this giant is pretty exciting to try out! big thanks once again to Wendell at level1techs for the huge rig with plenty of fast kioxia flash drives as I'm chewing through TBs of disk and loading ~1TB of weights into RAM takes a hot minute lol...
Also keep an eye on https://huggingface.co/anikifoss as he likes the bigger size quants! Thanks @anikifoss again for the PR to ik_llama.cpp! haha
I just jumped to your recent Kimi-K2 branch; does the unsloth Kimi-K2 quant work with it?
I can load unsloth's UD IQ3_XXS on my system, but tokens/second for both prompt processing and generation is bad, like 5 tk/sec when I get about 12 tk/sec on DeepSeek R1 with its 37 billion active parameters. With 32B active here, I was hopeful it would do better :)
I'll wait for @ubergarm to finish uploading his model so I can try this out...
fwiw I have heard some folks are having issues with mainline/unsloth quants right now, though I've also seen unsloth Daniel show it working for him in the mainline llama.cpp PR discussions. I personally have not tried it.
I'm currently uploading my first SOTA quant using the brand new IQ2_KL quant, which is pretty amazing for the size. This blend is looking good so far; I have some folks testing it currently and it is coherent and replying correctly. You will need to get the latest version of ik_llama.cpp including this PR, which fixes the built-in completions endpoint chat template. EDIT: This PR just landed in main, so just build the latest tip of main of ik_llama.cpp for this one now!
Upload should complete within an hour!!
Don't worry, I'll follow up with some bigger quants for you guys with monster rigs!!! ;p
Also I'll give some more usage examples, but it is basically the same as running deepseek, except instead of the first 3 layers being dense (e.g. [0-2]) there is only a single dense layer. So instead of starting from 3 to offload routed exps, start from 1, e.g.:
-ngl 99 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
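To make that concrete, here is a rough single-GPU sketch of how those overrides slot into a full command; the model path, blk ranges, and thread count are placeholders, not a tested recipe:

# hedged sketch only: tune the model path, blk ranges, and threads for your rig
./build/bin/llama-server \
    --model /path/to/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf \
    --ctx-size 32768 \
    -fa -fmoe -mla 3 \
    -ngl 99 \
    -ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    --threads 32 \
    --host 127.0.0.1 --port 8080

For a deepseek-style model with three dense layers the first override would start at blk.3 instead of blk.1; everything else stays the same.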
any quant in the 250GB range?
I personally hope your IQ4 quant is more like a low 4-bit like IQ4_XS as opposed to a high 4-bit like Ktransformers' own Q4_K_M quant. Factoring in context, a 612GB file size isn't practical for me, so I'm really hoping for something 550GB or less.
Also will IQ3 be on the menu after IQ4?
any quant in the 250GB range?
That is getting into sub-2 bpw range, which is tricky... I've heard a rumor that ik is working on an IQ1_KT 1.75 BPW trellis-style quant (QTIP-like, kinda like exl3) that could possibly be faster at inference than IQ1_S-type stuff... so that might give some hope at that size soon, outside of the IQ1_S-type stuff which I could try...
Next I might try IQ2_KL for down and IQ2_KS for (gate|up), which would put us around 3.0 BPW... we'll see!
That said, if there was say a 550GB file size and I have only 512GB RAM but a fair bit of VRAM, like two 6000 Adas or an RTX Pro 6000, could we use the VRAM so that I can load less into DRAM? With your DeepSeek V3 0324 IQ4 quant it takes like 420GB of RAM and 30-40GB of VRAM with my config, but I am not sure if offloading more layers to GPU can lessen the RAM load.
@mtcl sorry for cc maybe you know about this?
if offloading more layers to GPU can lessen the RAM load
Absolutely! Say you have a single 96GB VRAM GPU and 512GB RAM, that is 608GB total. So you offload enough layers onto the GPU and the rest on CPU/RAM and it will fit. It can take some trial and error to dial in the tensor -ot overrides. Check out the new example for multi-GPU in the model card I just updated. It might be possible to do 512GB RAM + 48GB VRAM but it is gonna be very tight and probably not much context or increased -ub size for extra PP speed.
Basically just add until you OOM then back off by one layer:
-ngl 99 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
Oops, I made a typo on the model card and need to close my </details> tag; this info is buried in the second "secret recipe" fold, lol... I'll fix it as soon as the weights finish uploading in an hour or so...
Getting great speeds even CPU only with these big models! I have some more example commands and details on how I'm running it here: https://github.com/ikawrakow/ik_llama.cpp/pull/612#issuecomment-3076539817
Shown optimized for PP. You can probably get a little more TG going with -rtr and smaller batch sizes.
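For anyone curious, the TG-leaning variation I mean looks roughly like this; a sketch only, assuming the CPU-only setup above, and the exact gains will vary:

# hedged sketch: -rtr (--run-time-repack) repacks tensors at load time for the
# CPU backend, and smaller -b/-ub batches generally trade PP speed for TG
model=/path/to/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model "$model" \
    -fa -fmoe -mla 3 \
    -rtr \
    -b 512 -ub 512 \
    --ctx-size 32768 \
    --numa numactl \
    --threads 128 \
    --host 127.0.0.1 --port 8080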
following along for a possible smolboi version, just built a sapphire rapids setup with 2x 3090s and 1x 4090, but i could only muster up enough clams for 256gb ddr5 rdimms at the moment.
for the past few weeks, honestly the deepseek v3 0324 iq1 has been great for my use case, but i found i was limited by my old 7950x memory controllers. i expect a large uplift especially in prompt processing with my new setup.
excited to see what's to come!
Guys who have 512GB RAM and 48GB VRAM - what exact PCs do you have, what operating system?
Usually Threadrippers and decommissioned servers with Linux.
MavenDE: Guys who have 512GB RAM and 48GB VRAM - what exact PCs do you have, what operating system?
Dell R740 with an Nvidia 4090D 48GB and 480GB of RAM.
Proxmox / Debian LXC with Nvidia Toolkit.
Real-world numbers from IQ2_KL:
INFO [ print_timings] prompt eval time = 73369.96 ms / 3004 tokens ( 24.42 ms per token, 40.94 tokens per second) | tid="131974847598592" timestamp=1752671392 id_slot=0 id_task=0 t_prompt_processing=73369.96 n_prompt_tokens_processed=3004 t_token=24.424087882822906 n_tokens_second=40.9431871027325
INFO [ print_timings] generation eval time = 6233.78 ms / 24 runs ( 259.74 ms per token, 3.85 tokens per second) | tid="131974847598592" timestamp=1752671392 id_slot=0 id_task=0 t_token_generation=6233.783 n_decoded=24 t_token=259.74095833333337 n_tokens_second=3.8499896451320166
...
INFO [ print_timings] prompt eval time = 10518.76 ms / 80 tokens ( 131.48 ms per token, 7.61 tokens per second) | tid="131974847598592" timestamp=1752673182 id_slot=1 id_task=423 t_prompt_processing=10518.764 n_prompt_tokens_processed=80 t_token=131.48454999999998 n_tokens_second=7.605456306463384
INFO [ print_timings] generation eval time = 5119.91 ms / 19 runs ( 269.47 ms per token, 3.71 tokens per second) | tid="131974847598592" timestamp=1752673182 id_slot=1 id_task=423 t_token_generation=5119.906 n_decoded=19 t_token=269.4687368421053 n_tokens_second=3.7110056317440203
if offloading more layers to GPU can lessen the RAM load

Absolutely! Say you have a single 96GB VRAM GPU and 512GB RAM, that is 608GB total. So you offload enough layers onto the GPU and the rest on CPU/RAM and it will fit. It can take some trial and error to dial in the tensor -ot overrides. Check out the new example for multi-GPU in the model card I just updated. It might be possible to do 512GB RAM + 48GB VRAM but it is gonna be very tight and probably not much context or increased -ub size for extra PP speed. Basically just add until you OOM then back off by one layer:
-ngl 99 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
Oops, I made a typo on the model card and need to close my </details> tag; this info is buried in the second "secret recipe" fold, lol... I'll fix it as soon as the weights finish uploading in an hour or so...
I was able to just run the model like this without worry on your branch:
CUDA_VISIBLE_DEVICES="0" LLAMA_ARG_NUMA="numactl" GGML_CUDA_ENABLE_UNIFIED_MEMORY=0 numactl --cpunodebind=0 --membind=0 ./bin/llama-server --model "/media/gopinath-s/C2A4E757A4E74D0B1/llama-cpp/models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS-00001-of-00005.gguf" --ctx-size 102144 -mla 2 -fa -amb 512 -fmoe --n-gpu-layers 95 --override-tensor exps=CPU -b 2048 -ub 2048 --parallel 1 --threads 28 --threads-batch 28 --temp 0.7 --min-p 0.05 -ser 4,1 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080
Now I switched to the main branch for running Kimi and it's throwing OOM memory errors. Is there any PR of yours that didn't get merged into main? I also enabled 2 NUMA nodes to see whether I'd get a speed increase, could that be causing the error?
Btw, it looks like ik_llama supports dots too, any quant for that?
Nvm, it was just a NUMA error.
yes ik_llama.cpp can run the mainline quants from bartowski, unsloth, mradermacher etc.
Now I switched to the main branch for running Kimi and it's throwing OOM memory errors. Is there any PR of yours that didn't get merged into main? I also enabled 2 NUMA nodes to see whether I'd get a speed increase, could that be causing the error?
Everything for Kimi-K2 is merged into main on ik_llama.cpp. I have to go now, but if you run in a single NUMA node, that node must have enough RAM for the entire model. If not, glue together all your nodes with numactl --interleave=all llama-server --numa distribute ...
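In other words, roughly one of these two shapes (sketch, substitute your own model path and thread counts):

# model fits in a single NUMA node's RAM: pin CPU and memory to that node
numactl --cpunodebind=0 --membind=0 \
./build/bin/llama-server --model "$model" --numa numactl --threads 28 ...

# model needs RAM from all nodes: interleave allocations across them
numactl --interleave=all \
./build/bin/llama-server --model "$model" --numa distribute --threads 56 ...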
any quant in the 250GB range?
I'm working on a sub 250GB model using experimental new IQ1_KT discussed here: https://github.com/ikawrakow/ik_llama.cpp/pull/616
Good stuff everyone.
Here are some stats from my rig (7995wx, 384gb DDR5-6800, 1x RTX Pro 6000 96gb 600w@450w):
INFO [ print_timings] prompt eval time = 41322.38 ms / 13323 tokens ( 3.10 ms per token, 322.42 tokens per second) | tid="140385625272320" timestamp=1752695437 id_slot=0 id_task=641 t_prompt_processing=41322.38 n_prompt_tokens_processed=13323 t_token=3.1015822262253243 n_tokens_second=322.4160854239277
INFO [ print_timings] generation eval time = 27466.76 ms / 634 runs ( 43.32 ms per token, 23.08 tokens per second) | tid="140385625272320" timestamp=1752695437 id_slot=0 id_task=641 t_token_generation=27466.758 n_decoded=634 t_token=43.32296214511041 n_tokens_second=23.08244751710413
Kimi IQ2_KS. Can't complain about 322pp/23ts on this large of a model!
Will be testing IQ2_KL shortly with the same prompt.
I bought new ram sticks yesterday, 192GB DDR5 now. I'm able to run the IQ1_S unsloth quant in mainline.
yes ik_llama.cpp can run the mainline quants from bartowski, unsloth, mradermacher etc.
I know this isn't tech support but I don't suppose anyone else gets this when trying to run the IQ1_S with a fresh clone/build?
/home/gghfez/apps/ik_llama.cpp/ggml/src/ggml.c:15254: fatal error
(the same ggml.c:15254 fatal error, repeated and interleaved across threads)
Fully kills the driver
[113188.653626] sh (81407): drop_caches: 3
[114284.401772] llama-server[81743]: segfault at 21ba03fd0 ip 00007fe49e824cd7 sp 00007fa2317cf2a0 error 4 in libcuda.so.550.144.03[424cd7,7fe49e4de000+4d5000] likely on CPU 21 (core 27, socket 0)
[114284.401782] Code: ef e8 fd 9c cb ff 83 3d 3e 0c 74 01 01 49 8b 1c 24 76 0a 8b 05 46 0c 74 01 85 c0 74 56 49 8b 44 24 10 41 8b 4c 24 24 48 8b 13 <8b> 00 41 39 c6 74 52 8b b3 40 40 00 00 48 89 f0 89 8c b3 44 40 00
I know this isn't tech support but I don't suppose anyone else gets this when trying to run the IQ1_S with a fresh clone/build?
IQ1_S is not a great quant, but yeah it's hard finding sub-2bpw stuff. I'm just getting back to my desk and gonna see how the IQ1_KT is looking on ik's fork, and may release something.
How are you compiling ik's fork? This is what I use for hybrid CUDA+CPU inferencing for deepseek/kimi-k2 MLA stuff:
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
If that doesn't help, what command are you running with?
I know this isn't tech support but I don't suppose anyone else gets this when trying to run the IQ1_S with a fresh clone/build?
Are you using the "fused MoE" option?
Here are some stats from my machine.
2x NVIDIA RTX 6000 Pro and 2x NVIDIA RTX 5090
IQ2_KL - 2x 6000 with the -ot parameter, 128K context
CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
--model /media/mukul/backup/models/ubergarm/Kimi-K2-Instruct-GGUF/IQ2_KL/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf \
--alias ubergarm/Kimi-K2-Instruct-GGUF \
--ctx-size 131072 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
-ngl 99 \
-b 4096 -ub 4096 \
-ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
-ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--threads-batch 64 \
--host 0.0.0.0 \
--port 10002
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 56, n_threads_batch = 64
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 35.934 | 113.99 | 91.367 | 11.21 |
| 4096 | 1024 | 4096 | 35.510 | 115.35 | 85.530 | 11.97 |
| 4096 | 1024 | 8192 | 35.780 | 114.48 | 91.210 | 11.23 |
| 4096 | 1024 | 12288 | 36.087 | 113.50 | 90.702 | 11.29 |
| 4096 | 1024 | 16384 | 36.427 | 112.44 | 94.757 | 10.81 |
| 4096 | 1024 | 20480 | 36.914 | 110.96 | 93.763 | 10.92 |
| 4096 | 1024 | 24576 | 37.213 | 110.07 | 88.753 | 11.54 |
| 4096 | 1024 | 28672 | 37.288 | 109.85 | 84.589 | 12.11 |
| 4096 | 1024 | 32768 | 37.556 | 109.06 | 88.703 | 11.54 |
| 4096 | 1024 | 36864 | 37.891 | 108.10 | 69.181 | 14.80 |
| 4096 | 1024 | 40960 | 38.097 | 107.52 | 95.880 | 10.68 |
| 4096 | 1024 | 45056 | 38.771 | 105.65 | 96.001 | 10.67 |
| 4096 | 1024 | 49152 | 41.426 | 98.88 | 114.007 | 8.98 |
| 4096 | 1024 | 53248 | 38.912 | 105.26 | 85.312 | 12.00 |
| 4096 | 1024 | 57344 | 39.081 | 104.81 | 101.482 | 10.09 |
| 4096 | 1024 | 61440 | 39.276 | 104.29 | 101.727 | 10.07 |
| 4096 | 1024 | 65536 | 39.675 | 103.24 | 95.502 | 10.72 |
| 4096 | 1024 | 69632 | 39.829 | 102.84 | 97.575 | 10.49 |
| 4096 | 1024 | 73728 | 40.495 | 101.15 | 100.179 | 10.22 |
| 4096 | 1024 | 77824 | 40.771 | 100.46 | 103.442 | 9.90 |
| 4096 | 1024 | 81920 | 40.858 | 100.25 | 105.901 | 9.67 |
| 4096 | 1024 | 86016 | 41.056 | 99.77 | 100.122 | 10.23 |
| 4096 | 1024 | 90112 | 41.556 | 98.57 | 103.497 | 9.89 |
| 4096 | 1024 | 94208 | 41.813 | 97.96 | 107.443 | 9.53 |
| 4096 | 1024 | 98304 | 42.142 | 97.19 | 111.211 | 9.21 |
| 4096 | 1024 | 102400 | 41.810 | 97.97 | 107.267 | 9.55 |
| 4096 | 1024 | 106496 | 42.192 | 97.08 | 104.101 | 9.84 |
| 4096 | 1024 | 110592 | 42.494 | 96.39 | 105.393 | 9.72 |
| 4096 | 1024 | 114688 | 42.732 | 95.85 | 112.332 | 9.12 |
| 4096 | 1024 | 118784 | 43.053 | 95.14 | 109.134 | 9.38 |
| 4096 | 1024 | 122880 | 43.748 | 93.63 | 114.136 | 8.97 |
| 4096 | 1024 | 126976 | 58.711 | 69.77 | 113.304 | 9.04 |
IQ4_KS - 2x 5090 and 2x 6000 with the -ot parameter, 128K context (best)
CUDA_VISIBLE_DEVICES="0,1,2,3" ./build/bin/llama-server \
--model /media/mukul/extra/models/ubergarm/Kimi-K2-Instruct-GGUF/IQ4_KS/Kimi-K2-Instruct-IQ4_KS-00001-of-00013.gguf \
--alias ubergarm/Kimi-K2-Instruct-IQ4_KS \
--ctx-size 131072 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
-ngl 99 \
-b 4096 -ub 4096 \
-ot "blk\.(1|2|3|4|5|6|7)\.ffn_.*=CUDA0" \
-ot "blk\.(8|9|10|11|12|13|14)\.ffn_.*=CUDA1" \
-ot "blk\.(15)\.ffn_.*=CUDA2" \
-ot "blk\.(17)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--threads-batch 64 \
--host 0.0.0.0 \
--port 10002
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 56, n_threads_batch = 64
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 48.541 | 84.38 | 80.945 | 12.65 |
| 4096 | 1024 | 4096 | 46.096 | 88.86 | 73.455 | 13.94 |
| 4096 | 1024 | 8192 | 46.168 | 88.72 | 76.755 | 13.34 |
| 4096 | 1024 | 12288 | 46.280 | 88.51 | 81.525 | 12.56 |
| 4096 | 1024 | 16384 | 46.535 | 88.02 | 85.476 | 11.98 |
| 4096 | 1024 | 20480 | 47.028 | 87.10 | 83.644 | 12.24 |
| 4096 | 1024 | 24576 | 47.278 | 86.64 | 78.471 | 13.05 |
| 4096 | 1024 | 28672 | 46.223 | 88.61 | 82.925 | 12.35 |
| 4096 | 1024 | 32768 | 48.233 | 84.92 | 94.340 | 10.85 |
| 4096 | 1024 | 36864 | 60.244 | 67.99 | 83.756 | 12.23 |
| 4096 | 1024 | 40960 | 47.019 | 87.11 | 99.698 | 10.27 |
| 4096 | 1024 | 45056 | 47.172 | 86.83 | 77.333 | 13.24 |
| 4096 | 1024 | 49152 | 47.439 | 86.34 | 78.607 | 13.03 |
| 4096 | 1024 | 53248 | 47.452 | 86.32 | 78.862 | 12.98 |
| 4096 | 1024 | 57344 | 47.551 | 86.14 | 90.759 | 11.28 |
| 4096 | 1024 | 61440 | 47.661 | 85.94 | 91.470 | 11.19 |
| 4096 | 1024 | 65536 | 47.980 | 85.37 | 85.413 | 11.99 |
| 4096 | 1024 | 69632 | 48.122 | 85.12 | 83.660 | 12.24 |
| 4096 | 1024 | 73728 | 48.568 | 84.34 | 83.934 | 12.20 |
| 4096 | 1024 | 77824 | 48.791 | 83.95 | 84.952 | 12.05 |
| 4096 | 1024 | 81920 | 49.091 | 83.44 | 85.207 | 12.02 |
| 4096 | 1024 | 86016 | 50.188 | 81.61 | 86.625 | 11.82 |
| 4096 | 1024 | 90112 | 49.560 | 82.65 | 87.692 | 11.68 |
| 4096 | 1024 | 94208 | 49.815 | 82.22 | 87.924 | 11.65 |
| 4096 | 1024 | 98304 | 49.734 | 82.36 | 89.837 | 11.40 |
| 4096 | 1024 | 102400 | 92.639 | 44.21 | 110.396 | 9.28 |
| 4096 | 1024 | 106496 | 94.440 | 43.37 | 117.939 | 8.68 |
| 4096 | 1024 | 110592 | 62.571 | 65.46 | 119.668 | 8.56 |
| 4096 | 1024 | 114688 | 50.397 | 81.28 | 93.243 | 10.98 |
| 4096 | 1024 | 118784 | 50.644 | 80.88 | 93.611 | 10.94 |
| 4096 | 1024 | 122880 | 51.543 | 79.47 | 95.323 | 10.74 |
| 4096 | 1024 | 126976 | 53.588 | 76.43 | 95.850 | 10.68 |
Are you using the "fused MoE" option?
tl;dr:
Update: yes, it seems like at least one UD quant is broken; removing -fmoe fixes it, though with a speed penalty. Thanks juk, and interesting work you're up to with the tiny draft model!
details
Huh I'm getting some issues too running ik's fork trying to test perplexity on https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF?show_file_info=UD-IQ1_S%2FKimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf
# compiled CPU only backend
cmake -B build -DGGML_CUDA=0 -DGGML_BLAS=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)
# command
model=/mnt/raid/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ1_S/Kimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf
numactl -N 0 -m 0 \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
--numa numactl \
--threads 128 \
--threads-batch 192
llama_model_loader: - type f32: 365 tensors
llama_model_loader: - type q8_0: 122 tensors
llama_model_loader: - type q4_K: 56 tensors
llama_model_loader: - type q5_K: 35 tensors
llama_model_loader: - type q6_K: 18 tensors
llama_model_loader: - type iq2_xxs: 24 tensors
llama_model_loader: - type iq3_xxs: 49 tensors
llama_model_loader: - type iq1_s: 82 tensors
llama_model_loader: - type iq3_s: 158 tensors
llama_model_loader: - type iq2_s: 34 tensors
llama_model_loader: - type iq4_xs: 139 tensors
llama_model_loader: - type iq1_m: 14 tensors
llm_load_print_meta: model size = 261.979 GiB (2.192 BPW)
Computed blk.53.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.54.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.55.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.56.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.57.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.58.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.59.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CPU KV buffer size = 137.25 MiB
llama_new_context_with_model: KV self size = 137.25 MiB, c^KV (f16): 137.25 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 2.50 MiB
llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
llama_new_context_with_model: graph nodes = 3525
llama_new_context_with_model: graph splits = 1
/home/w/projects/ik_llama.cpp/ggml/src/ggml.c:15254: fatal error
(the same ggml.c:15254 fatal error, repeated and interleaved across threads)
I'll try to disable -fmoe and see if it runs. I've heard some other folks having issues specifically with the UD quants on both ik/mainline, but it was working okay with unsloth's "non-UD" versions. It wasn't this error, but over here: https://github.com/ggml-org/llama.cpp/pull/14654#issuecomment-3071665949
UPDATE
Yup, this UD quant seems broken with -fmoe. Removing that, it is running now, albeit at a speed penalty:
Computed blk.59.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CPU
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CPU KV buffer size = 137.25 MiB
llama_new_context_with_model: KV self size = 137.25 MiB, c^KV (f16): 137.25 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 2.50 MiB
llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
llama_new_context_with_model: graph nodes = 3645
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 128 (n_threads_batch = 192) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 646.754 ms
perplexity: calculating perplexity over 568 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 16.55 seconds per pass - ETA 39.15 minutes
Here are some stats from my machine.
2x NVIDIA RTX 6000 Pro and 2x NVIDIA RTX 5090

Are you offloading as many layers as possible in both scenarios? If not, you could probably go faster, right? Maybe open a separate discussion for benchmarking, please.
Also keep an eye out for these smaller quants which you'll probably be able to fully offload https://github.com/ikawrakow/ik_llama.cpp/pull/616
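And if one of those tiny quants does fit entirely in VRAM, the exps=CPU override goes away; something roughly like this untested sketch, with a placeholder path and the default layer split across the visible GPUs:

# untested sketch: full GPU offload, so no -ot exps=CPU override needed
CUDA_VISIBLE_DEVICES="0,1,2,3" ./build/bin/llama-server \
    --model /path/to/Kimi-K2-Instruct-smol-IQ1_KT.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -fmoe -mla 3 \
    -ngl 99 \
    --parallel 1 \
    --host 0.0.0.0 --port 10002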
Thanks jukofyork and ubergarm, -fmoe was the issue.
Thanks juk, and interesting work you're up to with the tiny draft model!
Managed to get it to train:
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF
looks like around 10% increase in acceptance rate over the untrained version too!
https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF/tree/main/UD-TQ1_0
I was using this quant since it fits under my 256GB RAM; my hard disk is very slow (like 600 MB/s writing and 200 MB/s reading) so I can't use the 286GB one. The memory bandwidth I'm getting within RAM is 10 GB/s, so it's fine and the responses are somewhat okay. Just remove the default system prompt (the "helpful assistant" one) because it causes low-quality responses, and also remove the -ser; I just ordered a new disk. And does anyone know whether ik_llama supports function calling and tool calling?
Thanks jukofyork and ubergarm, -fmoe was the issue.
Please check out and keep an eye on this new ik_llama.cpp issue regarding -fmoe support for IQ1_M ffn tensors: https://github.com/ikawrakow/ik_llama.cpp/issues/626
EDIT Okay, ik now supports that IQ1_M quant type with MMQ so you can get the speed boost with -fmoe again! Pull tip of main and rebuild.
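If it helps, "pull tip of main and rebuild" is just something like the following, reusing the same cmake flags as the hybrid CUDA+CPU build above:

cd ik_llama.cpp
git checkout main && git pull
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)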
I've measured PPL on that UD-TQ1_0 and it is a bit high for its size. Though I've not released my IQ1_KT sub-2.0 BPW models yet, as that PR is still open on ik's fork.
does anyone know ik_llama support functional calling and tool calling ?
Give it a try and let us know. The API endpoint is not as updated as mainline's with that 3rd-party OpenAI-compliant library. There is some chatter about using a front end to handle it properly here: https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943 by @mtcl . Please update that github issue if you try it out!
EDIT Also @gopi87 there is a new PR for tool calling if you want to try it: https://github.com/ikawrakow/ik_llama.cpp/pull/628
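If anyone does test that PR, the thing to throw at it is the usual OpenAI-style tools payload against the server's /v1/chat/completions endpoint; a sketch below with a made-up get_weather tool, and whether tool_calls come back parsed depends on the PR and/or your front end:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'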
Also thanks @jukofyork for the draft model. I've not tried it yet, but one guy used your earlier one for deepseek kinda like so, pretty sure:
-md ~/Qwen3-0.6B-Base.i1-IQ4_XS.gguf \
-ngld 99 \
--draft 16 \
I might have to try out some llama-sweep-bench with and without the draft model and see how it looks!
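For reference, bolting the draft model onto the usual hybrid command would look roughly like this; a sketch only, with placeholder paths, and I haven't measured acceptance rates myself:

# hedged sketch: main Kimi-K2 quant split CPU+GPU as usual, plus the small
# draft model fully offloaded to GPU for speculative decoding
./build/bin/llama-server \
    --model /path/to/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf \
    -fa -fmoe -mla 3 \
    -ngl 99 \
    -ot exps=CPU \
    -md /path/to/Kimi-K2-Instruct-DRAFT-0.6B-v2.0.gguf \
    -ngld 99 \
    --draft 16 \
    --host 127.0.0.1 --port 8080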
Thanks sir, will try this with the newly updated GGUF from unsloth.
W790E Sage + QYFS + 512G + RTX5090
IQ3_KS:
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 43454.48 MiB
llm_load_tensors: CPU buffer size = 44382.00 MiB
llm_load_tensors: CPU buffer size = 630.00 MiB
llm_load_tensors: CUDA0 buffer size = 7618.07 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 196608
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CUDA0 KV buffer size = 6999.78 MiB
llama_new_context_with_model: KV self size = 6999.75 MiB, c^KV (q8_0): 6999.75 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 13015.16 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 3183.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122
main: n_kv_max = 196608, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 47.114 | 86.81 | 69.047 | 14.80 |
| 4090 | 1022 | 4090 | 47.603 | 85.92 | 70.267 | 14.54 |
| 4090 | 1022 | 8180 | 48.209 | 84.84 | 86.623 | 11.80 |
| 4090 | 1022 | 12270 | 81.340 | 50.28 | 82.109 | 12.45 |
| 4090 | 1022 | 16360 | 82.138 | 49.79 | 85.950 | 11.89 |
You might have enough VRAM to run the smallest smol-IQ1_KT fully offloaded on VRAM haha... Would be curious to see your benchmarks if you do.
Also keep in mind ik added a fix for that broken UD quant, so you can now run those small UD quants with -fmoe for full bonus speed. But you can get better perplexity in less RAM with the new IQ1_KT quants.
