Nacrith: a 135M model that out-compresses everything on natural language
What if a tiny LM could compress English text better than _every_ compressor out there, classical or neural, small or large?
Nacrith pairs SmolLM2-135M with an ensemble of online predictors and high-precision arithmetic coding.
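The post does not include code, so here is a minimal sketch of the core loop that sentence describes: the LM predicts a distribution for the next token, and an arithmetic coder spends bits according to that distribution. The `encoder.encode_symbol(cdf, token)` call is a hypothetical placeholder for an arithmetic coder, and the ensemble predictors, N-gram gate, and NC06 container described below are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M").eval()

def compress(text: str, encoder) -> None:
    # `encoder` is a hypothetical arithmetic coder exposing encode_symbol(cdf, symbol).
    ids = tok(text, return_tensors="pt").input_ids[0]
    past = None  # KV cache so each step costs a single-token forward pass
    for i in range(len(ids) - 1):
        with torch.no_grad():
            out = model(ids[i].view(1, 1), past_key_values=past, use_cache=True)
        past = out.past_key_values
        probs = torch.softmax(out.logits[0, -1].float(), dim=-1)
        cdf = torch.cumsum(probs, dim=-1)             # model CDF over the vocabulary
        encoder.encode_symbol(cdf, ids[i + 1].item())
    # The first token is not covered by the loop and must be stored separately
    # (e.g., under a uniform prior).
```

Decompression is the mirror image: the decoder replays the same model in lockstep, gets the same CDF at every step, and asks the coder for the next token, which is why both sides need bit-identical probability computation.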
What's inside
The standard LLM + arithmetic coding approach wastes ~75% of CDF precision on large vocabularies; our CDF-24 fix alone recovers 0.5 bpb. On top of that: a token N-gram that skips the GPU on predictable tokens, an adaptive bias head, a llama.cpp backend (7× faster than PyTorch), multi-GPU parallel compression, and a binary file format (NC06), the first LLM-based binary compressor we know of.
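One plausible reading of the precision claim, offered as my interpretation rather than the author's: a 16-bit coder has 65,536 probability slots, and reserving a nonzero count for each of SmolLM2's ~49K vocabulary entries already consumes roughly 75% of them, leaving little resolution for the model's actual predictions. A minimal sketch of one way to build a 24-bit integer CDF (illustrative only, not Nacrith's CDF-24 code):

```python
import numpy as np

def quantize_cdf(probs: np.ndarray, precision_bits: int = 24) -> np.ndarray:
    """Turn float token probabilities into an integer CDF for an arithmetic coder."""
    total = 1 << precision_bits
    vocab = probs.shape[0]
    # Give every token at least one count so any symbol stays encodable,
    # then distribute the remaining range proportionally to the model.
    freqs = np.floor(probs * (total - vocab)).astype(np.int64) + 1
    # Absorb rounding drift into the most likely token so counts sum to `total`.
    freqs[np.argmax(probs)] += total - freqs.sum()
    cdf = np.concatenate(([0], np.cumsum(freqs)))
    return cdf  # symbol s occupies the interval [cdf[s], cdf[s + 1]) out of `total`
```

At 24 bits the per-token floor costs only about 0.3% of the range (49,152 / 2^24) instead of ~75% at 16 bits, which is consistent with a sizeable bpb recovery, though the exact 0.5 bpb figure depends on details not shown here.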
Runs on a GTX 1050 Ti. ~500 MB weights, ~1.2 GB VRAM per worker.
Try it, break it, share your results; all feedback welcome. A star on the repo is appreciated!
Results across all systems we tested (a quick bpb conversion example follows the list):
- alice29.txt: 0.918 bpb (−44% vs CMIX, −20% vs ts_zip), below the 2nd-order Shannon entropy bound
- enwik8 (100 MB): 0.9389 bpb (−8% vs FineZip/LLMZip's 8B model, −15% vs ts_zip)
- Unseen text: 0.723 bpb on a document published after the training cutoff (no memorization), 26% better than FineZip/LLMZip on the same model
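For readers new to the metric: bpb is compressed output bits divided by original input bytes, so lower is better and 8.0 means no compression. A quick conversion, assuming the standard Canterbury-corpus copy of alice29.txt at 152,089 bytes:

```python
original_bytes = 152_089                 # alice29.txt (Canterbury corpus; assumed size)
bpb = 0.918                              # figure reported above
compressed_bytes = original_bytes * bpb / 8
print(f"{compressed_bytes:.0f} bytes")   # roughly 17,452 bytes
```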
In 2017, my RNNs were babbling. Today, they are hallucinating beautifully.
Ten years ago, getting an LSTM to output coherent English was a struggle. Today, after a "cure" based on FineWeb-EDU and a custom synthetic mix for causal conversation, the results are fascinating.
We trained this on ~10B tokens on a single AMD GPU (ROCm). It is not a Transformer: Echo-DSRN (400M) is a novel recurrent architecture inspired by Hymba, RWKV, and xLSTM, designed to challenge the "Attention is All You Need" monopoly on the Edge.
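Echo-DSRN itself is unpublished, so the block below is explicitly not its architecture. It is only a generic gated linear recurrence in the spirit of the RWKV/xLSTM family named above, included to show the structural difference from attention: a fixed-size state is updated token by token, so memory and compute per token do not grow with context length.

```python
import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    """Illustrative recurrent token mixer (NOT Echo-DSRN, whose internals are unreleased)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.decay = nn.Linear(d_model, d_model)   # per-channel forget gate
        self.write = nn.Linear(d_model, d_model)   # candidate written into the state
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        state = x.new_zeros(batch, d_model)        # O(1) memory in sequence length
        ys = []
        for t in range(seq_len):
            xt = x[:, t]
            keep = torch.sigmoid(self.decay(xt))   # how much of the old state survives
            state = keep * state + (1.0 - keep) * self.write(xt)
            ys.append(self.out(state))
        return torch.stack(ys, dim=1)
```

That constant per-token footprint, rather than attention's cost growing with context length, is the usual argument for this model family on edge hardware.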
The ambitious goal is to build a small instruct model with RAG and tool-usage capabilities (ethicalabs/Kurtis-EON1).
The Benchmarks (Size: 400M)
For a model this size (trained on <10B tokens), the specialized performance is surprising:
- *SciQ*: 73.8% (this rivals billion-parameter models in pure fact retrieval)
- *PIQA*: 62.3% (solid physical intuition for a sub-1B model)
The Reality Check:
HellaSwag (29.3%) and Winogrande (50.2%) show the limits of 400M parameters and 10B training tokens.
We are hitting the "Reasoning Wall", which confirms we need to scale up to (hopefully) unlock deeper common sense. As you can see in the visualization (to be released soon on HF), the FineWeb-EDU bias is strong: the model is convinced it is in a classroom ("In this course, we explore...").
The Instruct Model is not ready yet, and we are currently using curriculum learning to test model plasticity.
Source code and weights will not be released yet. This is not a fork or a fine-tune: the base model is built in-house at https://www.ethicalabs.ai/, with novel components that do not exist in current open libraries.
Call for Collaboration: I am looking for Peer Reviewers interested in recurrent/hybrid architectures. If you want to explore what lies beyond Transformers, let's connect!