miLLi 1.0: Model Integrating Local Linguistic Insights for Efficient Tokenization
miLLi 1.0 is a linguistically informed hybrid tokenizer designed to address the specific morphological and phonological challenges of the Azerbaijani language within Natural Language Processing (NLP) frameworks. By integrating a rule-based root dictionary with statistical Byte-Pair Encoding (BPE), the model aims to establish an optimal balance between token efficiency and semantic preservation.
Methodology and Design
The architecture of miLLi 1.0 incorporates a dynamic Phonological Restoration algorithm. During the pre-tokenization phase, this mechanism maps allomorphic surface variants, such as vowel loss (syncope) and consonant mutations (e.g., q → ğ, k → y), back to their canonical root forms.
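As a rough illustration of this restoration step, the sketch below reverses the q → ğ and k → y mutations on a candidate stem. The mutation table, rule, and function name are simplified assumptions for illustration, not the actual miLLi 1.0 implementation.

```python
# Simplified sketch of consonant restoration (illustrative only; not the
# actual miLLi 1.0 rules or code).
MUTATIONS = {"ğ": "q", "y": "k"}  # surface final consonant -> canonical form

def restore_candidate(stem: str) -> str:
    """Map a surface stem to a canonical root candidate, e.g. 'bayrağ' -> 'bayraq'."""
    if stem and stem[-1] in MUTATIONS:
        return stem[:-1] + MUTATIONS[stem[-1]]
    return stem

print(restore_candidate("bayrağ"))  # bayraq
```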
Unlike standard statistical approaches that may fragment words based solely on frequency, miLLi 1.0 employs a "Longest Restored Match" strategy. This ensures that words are segmented into linguistically valid roots and suffixes (e.g., _bayraq + ##ı), preserving the semantic link between the stem and its inflected forms.
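A minimal sketch of the "Longest Restored Match" idea follows, using a toy root dictionary and the same simplified restoration rule; the names and data are illustrative assumptions rather than the shipped implementation.

```python
# Toy "Longest Restored Match": take the longest prefix whose restored form
# is in the root dictionary; words without a dictionary root fall back to
# plain BPE. Illustrative only.
ROOTS = {"bayraq", "vətən", "yüksəklik"}   # toy root dictionary
MUTATIONS = {"ğ": "q", "y": "k"}           # surface -> canonical final consonant

def restore(stem: str) -> str:
    return stem[:-1] + MUTATIONS[stem[-1]] if stem and stem[-1] in MUTATIONS else stem

def longest_restored_match(word: str):
    for cut in range(len(word), 1, -1):    # try the longest prefix first
        root = restore(word[:cut])
        if root in ROOTS:
            return root, word[cut:]        # (canonical root, surface suffix)
    return None                            # no dictionary root -> BPE fallback

print(longest_restored_match("bayrağı"))   # ('bayraq', 'ı')
```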
Empirical Observations and Benchmarks
To evaluate the efficacy of the hybrid architecture, comparative benchmarks were conducted using a curated evaluation set comprising 100 Azerbaijani sentences across various registers (literary, scientific, and colloquial).
1. Comparison with Global Industry Standards
Against multilingual and English-centric models, miLLi 1.0 demonstrates superior compression efficiency, significantly reducing the "token inflation" problem:
| Competitor Tokenizer | Observed Token Reduction with miLLi 1.0 |
|---|---|
| GPT-3.5 (cl100k_base) | 57% fewer tokens |
| mBERT (multilingual-cased) | 40% fewer tokens |
| GPT-4o (o200k_base) | 37% fewer tokens |
| XLM-RoBERTa (base) | 14% fewer tokens |
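A per-sentence comparison of this kind can be reproduced roughly as sketched below. This is not the benchmark script used for the table above, and it additionally assumes the tiktoken package is installed.

```python
# Rough token-count comparison for a single sentence (illustrative only).
import tiktoken
from transformers import AutoTokenizer

text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

gpt35 = tiktoken.get_encoding("cl100k_base")
milli = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

n_gpt35 = len(gpt35.encode(text))
n_milli = len(milli.encode(text, add_special_tokens=False))

print(f"cl100k_base: {n_gpt35} tokens | miLLi 1.0: {n_milli} tokens")
print(f"reduction vs cl100k_base: {1 - n_milli / n_gpt35:.0%}")
```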
2. Comparison with Local Statistical Models
When compared to local models trained on massive corpora (e.g., aLLMA, CustomAz), miLLi 1.0 produces somewhat more tokens (1.13x to 1.31x the count of those models). This difference is a deliberate design choice:
- Statistical Models: Tend to "memorize" frequent inflected forms as single tokens (e.g., gələcəyəm as one token). While this reduces token count, it can lead to vocabulary sparsity and weaker morphological generalization for rare words.
- miLLi 1.0: Prioritizes Morphological Boundary Accuracy (MBA) and Root Consistency (RCR). By enforcing a split between the root and the suffix, miLLi 1.0 ensures that the model recognizes the underlying root regardless of the inflection, promoting better generalization at the cost of a marginal increase in sequence length (see the sketch after this list).
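The toy check below is one way to make the root-consistency idea concrete; it is an illustrative approximation, and the exact MBA and RCR definitions used in the benchmarks are not reproduced here.

```python
# Toy root-consistency check: do inflected forms share the root's first piece?
# Illustrative approximation, not the official RCR metric.
from transformers import AutoTokenizer

def root_consistency(tokenizer, root: str, inflections: list) -> float:
    root_piece = tokenizer.tokenize(root)[0]
    hits = sum(tokenizer.tokenize(word)[0] == root_piece for word in inflections)
    return hits / len(inflections)

tok = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)
print(root_consistency(tok, "bayraq", ["bayrağı", "bayraqlar", "bayrağımız"]))
```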
Synthesis of Results
Experiments confirm that the primary advantage of miLLi 1.0 lies not just in statistical compression, but in linguistic robustness. Unlike statistical models, miLLi 1.0 restores word roots and preserves morphological boundaries, paving the way for future language models built on this tokenizer to possess deeper semantic perception.
The ultimate goal is to make the complex morphological structure of the Azerbaijani language transparent and understandable for the model, while minimizing token waste. These findings suggest that miLLi 1.0 facilitates an optimal balance—the "Golden Mean"—between statistical compression and semantic preservation.
Usage
This model is compatible with the transformers library. Because it utilizes custom Python logic for phonological restoration, the trust_remote_code=True parameter must be enabled.
Installation
```bash
pip install transformers tokenizers pyahocorasick
```
```python
from transformers import AutoTokenizer

# Initialize the tokenizer
# 'trust_remote_code=True' is required to load the custom phonological logic
tokenizer = AutoTokenizer.from_pretrained(
    "elshadrahimov/miLLi-1.0",
    trust_remote_code=True
)

# Example text: "Our flag is waving on the heights."
# Note the restoration of 'bayrağı' -> 'bayraq'
text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# Encode to IDs
input_ids = tokenizer.encode(text)
print(f"Token IDs: {input_ids}")
```
Limitations
- Dictionary Dependence: The effectiveness of the phonological restoration depends on the coverage of the underlying root dictionary. Neologisms or specialized terms not present in the dictionary fall back to standard BPE segmentation.
- Tokenization Speed: Due to the additional linguistic processing layer (trie lookup and restoration logic), tokenization time is slightly higher than that of purely statistical tokenizers with optimized C++ implementations.