miLLi 1.0: Model Integrating Local Linguistic Insights for Efficient Tokenization

miLLi 1.0 is a linguistically informed hybrid tokenizer designed to address the specific morphological and phonological challenges of the Azerbaijani language within Natural Language Processing (NLP) frameworks. By integrating a rule-based root dictionary with statistical Byte-Pair Encoding (BPE), the model aims to establish an optimal balance between token efficiency and semantic preservation.

Methodology and Design

The architecture of miLLi 1.0 incorporates a dynamic Phonological Restoration algorithm. This mechanism maps allomorphic variations, such as vowel loss (syncope) and consonant mutations (e.g., q → ğ, k → y), back to their canonical root forms during the pre-tokenization phase.
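
The sketch below illustrates the idea behind such restoration with a hypothetical rule table and a tiny root set; the actual rules and root dictionary used by miLLi 1.0 are far more extensive.

# Minimal restoration sketch (illustrative only; ROOTS and the rule table are made-up examples)
ROOTS = {"bayraq", "çörək", "ağız"}

# surface-final consonant -> canonical root-final consonant
CONSONANT_RESTORATION = {"ğ": "q", "y": "k"}

AZ_VOWELS = "aeəıioöuü"

def restore_root(surface_stem):
    """Return the canonical root for a surface stem, or None if it cannot be restored."""
    if surface_stem in ROOTS:
        return surface_stem
    # Consonant mutation: bayrağ- -> bayraq, çörəy- -> çörək
    last = surface_stem[-1]
    if last in CONSONANT_RESTORATION:
        candidate = surface_stem[:-1] + CONSONANT_RESTORATION[last]
        if candidate in ROOTS:
            return candidate
    # Vowel loss: ağz- -> ağız (try re-inserting each vowel before the final consonant)
    for vowel in AZ_VOWELS:
        candidate = surface_stem[:-1] + vowel + surface_stem[-1]
        if candidate in ROOTS:
            return candidate
    return None

print(restore_root("bayrağ"))  # bayraq
print(restore_root("ağz"))     # ağız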

Unlike standard statistical approaches that may fragment words based solely on frequency, miLLi 1.0 employs a "Longest Restored Match" strategy. This ensures that words are segmented into linguistically valid roots and suffixes (e.g., _bayraq + ##ı), preserving the semantic link between the stem and its inflected forms.
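
Continuing the sketch above (and reusing its restore_root helper), a simplified version of the "Longest Restored Match" idea can be written as follows. The _ and ## piece markers mirror the example in this card; the real implementation is trie-backed and falls back to BPE when no dictionary root matches.

def longest_restored_match(word):
    """Try progressively shorter prefixes of `word` as the stem; restore each to a
    canonical root and split into root + suffix pieces, or return None if no root matches."""
    for cut in range(len(word), 0, -1):
        root = restore_root(word[:cut])
        if root is not None:
            pieces = ["_" + root]
            if word[cut:]:
                pieces.append("##" + word[cut:])
            return pieces
    return None  # no dictionary root found; a real tokenizer would fall back to BPE here

print(longest_restored_match("bayrağı"))  # ['_bayraq', '##ı']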

Empirical Observations and Benchmarks

To evaluate the efficacy of the hybrid architecture, comparative benchmarks were conducted using a curated evaluation set comprising 100 Azerbaijani sentences across various registers (literary, scientific, and colloquial).

1. Comparison with Global Industry Standards

Against multilingual and English-centric models, miLLi 1.0 demonstrates superior compression efficiency, significantly reducing the "token inflation" problem:

Competitor Tokenizer       | Observed Token Reduction with miLLi 1.0
GPT-3.5 (cl100k_base)      | 57% fewer tokens
mBERT (multilingual-cased) | 40% fewer tokens
GPT-4o (o200k_base)        | 37% fewer tokens
XLM-RoBERTa (base)         | 14% fewer tokens
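
As a rough illustration of how such reductions can be measured, the sketch below compares token counts against GPT-3.5's cl100k_base encoding via the tiktoken package (an extra dependency, installed with pip install tiktoken). The sentences are placeholders, not the actual 100-sentence evaluation set.

import tiktoken
from transformers import AutoTokenizer

milli = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)
gpt35 = tiktoken.get_encoding("cl100k_base")

# Placeholder sentences; the reported figures come from the 100-sentence evaluation set
sentences = [
    "Vətənimizin bayrağı yüksəkliklərdə dalğalanır.",
    "Azərbaycan dili zəngin morfoloji quruluşa malikdir.",
]

milli_total = sum(len(milli.encode(s)) for s in sentences)
gpt_total = sum(len(gpt35.encode(s)) for s in sentences)
print(f"Token reduction vs cl100k_base: {1 - milli_total / gpt_total:.0%}")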

2. Comparison with Local Statistical Models

When compared to local models trained on massive corpora (e.g., aLLMA, CustomAz), miLLi 1.0 produces a slightly higher token count (1.13x–1.31x that of those models). This difference is a deliberate design choice:

  • Statistical Models: Tend to "memorize" frequent inflected forms as single tokens (e.g., gələcəyəm as one token). While this reduces token count, it can lead to vocabulary sparsity and weaker morphological generalization for rare words.
  • miLLi 1.0: Prioritizes Morphological Boundary Accuracy (MBA) and Root Consistency (RCR). By enforcing a split between the root and the suffix, miLLi 1.0 ensures that the model recognizes the underlying root regardless of the inflection, promoting better generalization at the cost of a marginal increase in sequence length (see the sketch after this list).
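
The sketch below shows one way a root-consistency check can be run: tokenize several inflected forms of the same root and verify that the same root piece appears in each. The forms and the expected _bayraq piece follow the examples in this card; this is an illustrative check, not the official evaluation script.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

# Inflected forms of the root 'bayraq' (flag)
forms = ["bayraq", "bayrağı", "bayraqlar", "bayrağımız"]

root_piece = "_bayraq"  # expected canonical root piece, per the segmentation example above
consistent = sum(root_piece in tokenizer.tokenize(form) for form in forms)
print(f"Root consistency: {consistent}/{len(forms)} forms contain {root_piece}")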

Synthesis of Results

Experiments confirm that the primary advantage of miLLi 1.0 lies not just in statistical compression, but in linguistic robustness. Unlike purely statistical models, miLLi 1.0 restores word roots and preserves morphological boundaries, paving the way for future language models built on this tokenizer to develop a deeper semantic understanding.

The ultimate goal is to make the complex morphological structure of the Azerbaijani language transparent and understandable for the model, while minimizing token waste. These findings suggest that miLLi 1.0 facilitates an optimal balance—the "Golden Mean"—between statistical compression and semantic preservation.

Usage

This model is compatible with the transformers library. Because it utilizes custom Python logic for phonological restoration, the trust_remote_code=True parameter must be enabled.

Installation

pip install transformers tokenizers pyahocorasick

from transformers import AutoTokenizer

# Initialize the tokenizer
# 'trust_remote_code=True' is required to load the custom phonological logic
tokenizer = AutoTokenizer.from_pretrained(
    "elshadrahimov/miLLi-1.0", 
    trust_remote_code=True
)

# Example text: "Our flag is waving on the heights."
# Note the restoration of 'bayrağı' -> 'bayraq'
text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# Encode to IDs
input_ids = tokenizer.encode(text)
print(f"Token IDs: {input_ids}")

Limitations

Dictionary Dependence: The effectiveness of the phonological restoration is dependent on the coverage of the underlying root dictionary. Neologisms or specific terms not present in the dictionary will default to standard BPE segmentation.

Inference Speed: Due to the additional linguistic processing layer (trie lookup and restoration logic), tokenization is slightly slower than with purely C++-optimized statistical tokenizers.
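
If tokenization latency matters for a particular workload, it can be measured directly. The sketch below is a minimal timing loop for that purpose, not an official benchmark.

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)
text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

n = 1000
start = time.perf_counter()
for _ in range(n):
    tokenizer.encode(text)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} sentences/sec")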
