Mitko Vasilev
mitkox
AI & ML interests
Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it.
Recent Activity
posted an update 9 days ago
134,614 tok/sec input prefill max
1,031 tok/sec output generation max
At these local AI speeds, there is no user interface for humans. My human UI is the Radicle distributed Git issues queue.
On my GPU workstation:
- Z8 Fury G5 4x A6000
- MiniMax-M2.5
- Claude Code to localhost:8000
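For anyone wondering what "Claude Code to localhost:8000" means in practice: the agent just talks to a local OpenAI-compatible endpoint instead of the Anthropic cloud. A minimal sketch of that wiring, assuming a vLLM server on port 8000 and MiniMax-M2.5 as the served model name (the client code below is illustrative, not taken from the post):

# Minimal sketch: one request to a local OpenAI-compatible endpoint
# (e.g. vLLM serving on localhost:8000). Requires `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="MiniMax-M2.5",  # assumed served model name
    messages=[{"role": "user", "content": "Triage the open issues in this repo."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)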
posted an update 19 days ago
I just pushed a Claude Code Agent Swarm with 20 coding agents on my desktop GPU workstation.
With local AI, I don’t have the /fast CC switch, but I have /absurdlyfast:
- 100,499 tokens/sec read (yeah, 100k, not a typo) | 811 tok/sec generation
- KV cache: 707,200 tokens
- Hardware: 5+ year-old GPUs, 4x A6000 gen 1. It’s not the car. It’s the driver.
Qwen3 Coder Next AWQ with the KV cache at BF16. It scores 82.1% in C# on a 29-years-in-development codebase vs Opus 4.5 at only 57.5%. When your codebase predates Stack Overflow, you don't need the biggest model; you need the one that actually remembers Windows 95.
My current bottleneck is my 27" monitor. I can't fit all 20 Theos on screen without squinting.
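Mechanically, a 20-agent swarm is just 20 concurrent request streams hitting one local endpoint. A minimal sketch under that assumption, with an OpenAI-compatible server on localhost:8000 and an illustrative served model name and task list (none of these identifiers come from the post):

# Sketch: fan out 20 "agents" as concurrent chat requests against one local endpoint.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")

async def agent(task: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen3-Coder-Next-AWQ",  # assumed served model name
        messages=[{"role": "user", "content": task}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def main() -> None:
    tasks = [f"Agent {i}: refactor one module and write its unit tests." for i in range(20)]
    results = await asyncio.gather(*(agent(t) for t in tasks))
    for i, text in enumerate(results):
        print(f"agent {i}: {text[:80]}")

asyncio.run(main())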
posted an update 29 days ago
▐▛██▜▌ Claude Code v2.1.23
▝████▘ Kimi-K2.5 · API Usage Billing
▘▘ ▝▝ ~/dev/vllm
/model to try Opus 4.5
❯ hey
● Hello! How can I help you today?
❯ what model are you?
● I'm Claude Kimi-K2.5, running in a local environment on Linux.
It took some time to download, plus some vLLM hybrid-inferencing magic, to get it running on my desktop workstation.
posted an update about 1 month ago
GLM-4.7-Flash is fast, good and cheap.
3,074 tokens/sec peak at 200k tokens context window on my desktop PC.
Works with Claude Code and opencode for hours. No errors, a drop-in replacement for the Anthropic cloud AI.
MIT licensed, open weights, free for commercial use and modifications.
Supports speculative decoding using MTP, which is highly effective in mitigating latency.
Great for on device AI coding as AWQ 4bit at 18.5 GB. Hybrid inference on a single consumer GPU + CPU RAM.
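If you want to sanity-check tokens/sec claims like these on your own box, the simplest method is to stream a completion from the local endpoint and divide output chunks by wall-clock time. A rough sketch, assuming an OpenAI-compatible server on localhost:8000 and the served model name shown (both assumptions):

# Sketch: rough decode-throughput measurement against a local OpenAI-compatible server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="GLM-4.7-Flash",  # assumed served model name
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=1024,
    stream=True,
)
chunks = sum(1 for event in stream if event.choices and event.choices[0].delta.content)
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} chunks/sec (approximately tokens/sec for single-token deltas)")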
posted an update about 2 months ago
I just stress-tested the Beast: MiniMax-M2.1 on Z8 Fury G5.
2101 tokens/sec. FORTY concurrent clients. That's 609 t/s out, 1492 t/s in. The model outputs fire faster than I can type, but feeds on data like a black hole on cheat day.
But wait, there's more! Threw it into Claude Code torture testing with 60+ tools, 8 agents (7 sub-agents because apparently one wasn't enough chaos). It didn't even flinch. Extremely fast, scary good at coding. The kind of performance that makes you wonder if the model's been secretly reading Stack Overflow in its spare time lol
3 months ago, these numbers lived in my "maybe in 2030" dreams. Today it's running on my desk AND heats my home office during the winter!
posted an update 3 months ago
I run 20 AI coding agents locally on my desktop workstation at 400+ tokens/sec with MiniMax-M2. It’s a Sonnet drop-in replacement in my Cursor, Claude Code, Droid, Kilo, and Cline setups, peaking at 11k tok/sec input and 433 tok/sec output; it can generate 1B+ tokens per month, all with a 196k context window. I’ve been running it for 6 days now with this config.
Today’s max performance was stable at 490.2 tokens/sec across 48 concurrent clients with MiniMax-M2.
Z8 Fury G5, Xeon 3455, 4x A6000. Aibrix 0.5.0, vLLM 0.11.2.
posted an update 4 months ago
I just threw Qwen3-0.6B in BF16 into an on device AI drag race on AMD Strix Halo with vLLM:
564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons
TL;DR You don't just run AI on AMD. You negotiate with it.
The hardware absolutely delivers. Spoiler alert: there is exactly ONE configuration where vLLM + ROCm + Triton + PyTorch + drivers + the Ubuntu kernel all work at the same time. Finding it required the patience of a saint.
Consumer AMD for AI inference is the ultimate "budget warrior" play: insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.
posted an update 4 months ago
I have just vibe coded a feature for ODA on-device AI with MiniMax M2, running locally on my Z8 Fury - and holy silicon, this thing SLAPS!
TL;DR the nerd stuff
- Specialized in coding and agentic work
- 60 tokens/sec
- Ryzen AI is getting some serious ROCm 7.0.2 brain implants
- One extra script to rule them all and bind them to my GPU
- Vibe coding feature implementation that actually worked on the first try. I know, I’m scared too.
posted an update 4 months ago
I’m just reading that the Ryzen AI 395 is supposed to be 30% slower than DGX Spark at LLM inference… and limited to 96GB GPU RAM… good thing I didn’t RTFM upfront, so I made the AMD one faster, with 128GB unified RAM 🫡
Z2 mini G1a can run Qwen3 Coder 30B BF16 at 26.8 tok/sec in ~60GB GPU RAM
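The ~60GB number lines up with basic weight math: 30B parameters at BF16 is 2 bytes each, so roughly 56 GiB of weights before KV cache and runtime overhead. A quick back-of-the-envelope check (the 30B count comes from the model name; the rest is standard arithmetic):

# Rough weight-memory estimate for a 30B-parameter model stored in BF16.
params = 30e9          # parameter count, from the model name
bytes_per_param = 2    # BF16 uses 2 bytes per parameter
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.0f} GiB of weights")  # ~56 GiB, consistent with ~60GB once KV cache is added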
posted an update 4 months ago
Say hello to my little friends! I just unboxed this trio of HP Z2 G1a!
Three is always better than one!
- 3x AMD Ryzen AI Max+ Pro 395
- 384GB RAM
- 24TB of RAID storage
- Ubuntu 24.04
- ROCm 7.0.2
- llama.cpp, vLLM and Aibrix
Small, cheap GPUs are about to become the Raspberry Pi of edge AI inference. Sprinkle some kubectl fairy dust on top, and suddenly it's a high-availability, self-healing, cloud-native, enterprise-grade AI cluster camping in a closet.
Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.
posted an update 4 months ago
I see all the Chinese labs are turning TL;DR into TL;DRGB.
Problem: 1M text tokens == 1M opportunities for your GPU to file a workers’-comp claim.
Solution: don’t feed the model War & Peace, feed it the movie poster.
This is Glyph, Zai’s new visual-text compression voodoo:
• 10k words → 3 PNGs ≈ 3k visual tokens
• Compression ratio: 4.3× (quick sanity check below)
• Throughput: 40-60 tok/s, i.e. your context window now finishes before my coffee does
So I did the only reasonable thing: asked GLM-4.6 to port Glyph for Qwen3-VL-8B-Thinking.
Translation: I made one model compress a novel into a comic strip, then made another model read the comic strip and still ace QA.
It’s basically passing notes in class, except the note is a 1920×1080 meme and the teacher is a transformer.
We've gone from "Attention is All You Need" to "Attention is Too Expensive, Just Use Your Eyes." Remember kids: in 2025 literacy is optional, but JPEG is forever.
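The 4.3× number is easy to sanity-check: English prose runs on the order of 1.3 text tokens per word, so 10k words is roughly 13k text tokens versus about 3k visual tokens for the rendered pages. A quick check (the 1.3 tokens/word factor is a common rule of thumb, not a figure from the post):

# Sanity check of the quoted Glyph compression ratio.
words = 10_000
tokens_per_word = 1.3      # rough rule of thumb for English tokenization (assumed)
text_tokens = words * tokens_per_word
visual_tokens = 3_000      # ~3 PNGs' worth of visual tokens, per the post
print(f"compression ≈ {text_tokens / visual_tokens:.1f}x")  # ≈ 4.3x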
posted an update 6 months ago
I’ve built my own blocker for AI-generated content. It’s a local AI running on my laptop with a browser extension that classifies and scrubs synthetic content from my eyeballs. I’m too old for this synthetic noise.
TL;DR I’m going full John Connor on the AI content apocalypse
Think of it as an on device AI ad-blocker, but for:
- Em-dash overdose. Seriously, why is everything suddenly revolutionary—disruptive—life-changing?
- AI influencers’ auto-generated posts and images, auto-posted, all hands-free.
- Fake news, fake images, fake people... puff.
Surprisingly, it works. I suppose it will block some human-generated content. However, I would rather read a 2007 Myspace blog than another “10 Growth Hacks Powered By ChatGPT” post.
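Mechanically, the extension only needs one primitive: send a chunk of page text to the local model and get back a synthetic-or-human verdict. A minimal sketch of that classification call, assuming a local OpenAI-compatible endpoint and a placeholder model name (the actual extension, model, and prompt aren't published in the post):

# Sketch: classify page text as AI-generated or human-written via a local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def looks_synthetic(text: str) -> bool:
    resp = client.chat.completions.create(
        model="local-classifier",  # placeholder served model name
        messages=[
            {"role": "system", "content": "Answer with exactly one word: SYNTHETIC or HUMAN."},
            {"role": "user", "content": text[:4000]},  # truncate long pages
        ],
        max_tokens=3,
    )
    return "SYNTHETIC" in resp.choices[0].message.content.upper()

print(looks_synthetic("In today's fast-paced digital landscape, leveraging synergies is key."))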
posted an update 6 months ago
Hermes4 70B synthetic dataset generation on my desktop Z8 GPU rig:
307 tok/sec
1.1M tok/hour
The bottleneck for generating massive, high-quality reinforcement learning datasets is never the GPU compute; it's always the model's willingness to actually answer the darn question.
posted an update 7 months ago
I run Claude Code with Qwen3 Coder Flash locally on my MacBook Air. It works offline, zero cloud, zero internet, zero EU AI Act anxiety. No limits, all tokens on the house.
It’s not great, not terrible: adequate performance for an on device AI agent chewing through code on a 1.24 kg laptop. I wrote an interpreter to broker peace between Claude Code and my local AI runtime.
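The "interpreter to broker peace" is essentially a translation proxy: Claude Code speaks the Anthropic Messages API, while most local runtimes speak the OpenAI chat-completions API. A heavily simplified sketch of that idea, with hypothetical ports and model name (the author's actual interpreter, including streaming and tool-call handling, isn't published in the post):

# Simplified sketch: accept an Anthropic-style /v1/messages request and relay it
# to a local OpenAI-compatible runtime. Real Claude Code traffic also needs
# streaming and tool-use handling, which this sketch omits.
from fastapi import FastAPI, Request
from openai import OpenAI

app = FastAPI()
local = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # assumed local runtime

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    msgs = [{"role": m["role"], "content": m["content"]} for m in body["messages"]]
    if body.get("system"):
        msgs.insert(0, {"role": "system", "content": body["system"]})
    resp = local.chat.completions.create(
        model="qwen3-coder-flash",   # assumed served model name
        messages=msgs,
        max_tokens=body.get("max_tokens", 1024),
    )
    return {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "text", "text": resp.choices[0].message.content}],
        "stop_reason": "end_turn",
    }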
Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.
posted an update 7 months ago
XBai o4 claims to beat Claude Opus 4 and o3-mini, and they provide verifiable proof. My skepticism circuits overloaded, but my local AI FOMO module screamed louder.
I've thrown this 33B monoblock LLM onto a single GPU and used Roo Code for some… let’s call it “vibe testing”. It’s terrifyingly competent. As an architect, it’s the best open-weight model I’ve touched this side of 2025.
posted an update 7 months ago
We’ve reached a point where on device AI coding that is free, offline, and capable isn’t just a theoretical possibility; it’s sitting on my lap, barely warming my thighs.
My local MacBook Air setup includes Qwen3 Coder Flash with a 1M context and Cline in the VS Code IDE. No internet, no cloud, no ID verification: this is the forbidden tech.
Current stats:
- All agentic tools work great: local, sandboxed, and MCP
- OK model output precision
- 17 tokens/sec. Not great, not terrible
- 65K tokens context; the model can do 1M, but let’s be real, my MacBook Air would probably achieve fusion before hitting that smoothly
- Standard backend and cache off for the test
- All inference and function calling happen locally, offline, untethered. The cloud didn’t even get a memo.