Spaces: HeshamHaroon / Arabic_Tokenizer


Arabic_Tokenizer / README.md

HeshamHaroon

Add HuggingFace Spaces YAML configuration

751def7 20 days ago


metadata

title: Arabic Tokenizer Arena
emoji: 🏟️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false

🏟️ Arabic Tokenizer Arena Pro

Advanced research & production platform for Arabic tokenization analysis.

Features

📊 Comprehensive Metrics: Fertility, compression, STRR, OOV rate, and more
🌍 Arabic-Specific Analysis: Dialect support, diacritic preservation
⚖️ Side-by-Side Comparison: Compare multiple tokenizers instantly
🎨 Beautiful Visualization: Token-by-token display with IDs
🏆 Leaderboard: Evaluate on real HuggingFace Arabic datasets
📖 Multi-Variant Support: MSA, dialectal, and Classical Arabic
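
As an illustration of the diacritic preservation feature listed above, the check could be approximated as follows. This is a minimal sketch, not the app's actual code: the names `count_diacritics` and `diacritic_preservation` are hypothetical, and `roundtrip` stands in for a tokenizer's encode-then-decode step.

```python
from typing import Callable

# Arabic tashkeel: fathatan through sukun (U+064B-U+0652) plus superscript alef (U+0670).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_diacritics(text: str) -> int:
    """Count Arabic diacritic marks in the text."""
    return sum(1 for ch in text if ch in ARABIC_DIACRITICS)


def diacritic_preservation(text: str, roundtrip: Callable[[str], str]) -> float:
    """Fraction of diacritics that survive a tokenizer round trip.

    `roundtrip` is a hypothetical callable that encodes the text and decodes
    it back. Returns 1.0 when the input has no diacritics to lose.
    """
    before = count_diacritics(text)
    if before == 0:
        return 1.0
    after = count_diacritics(roundtrip(text))
    # Cap at 1.0 in case decoding introduces extra marks.
    return min(after, before) / before
```

A tokenizer whose normalizer strips tashkeel would score 0.0 on fully vocalized text, while one that leaves the text untouched would score 1.0.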

Project Structure

arabic_tokenizer_arena/
├── app.py                 # Main Gradio application
├── config.py              # Tokenizer registry & dataset configs
├── tokenizer_manager.py   # Tokenizer loading & caching
├── analysis.py            # Tokenization analysis functions
├── leaderboard.py         # Leaderboard with HF datasets
├── ui_components.py       # HTML generation
├── styles.py              # CSS styles
├── utils.py               # Arabic text utilities
├── requirements.txt       # Dependencies
└── README.md              # This file
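
To make the "loading & caching" role of tokenizer_manager.py concrete, here is one way such a cache might look. It is a sketch under assumptions: the class name `TokenizerCache` and its interface are hypothetical and may not match the real module.

```python
from typing import Any, Callable, Dict


class TokenizerCache:
    """Load each tokenizer once and reuse it on later requests."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        # `loader` turns a model ID into a tokenizer object, for example
        # transformers.AutoTokenizer.from_pretrained.
        self._loader = loader
        self._cache: Dict[str, Any] = {}

    def get(self, model_id: str) -> Any:
        """Return the cached tokenizer, loading it on first use."""
        if model_id not in self._cache:
            self._cache[model_id] = self._loader(model_id)
        return self._cache[model_id]
```

With transformers installed, `TokenizerCache(AutoTokenizer.from_pretrained)` would download a given tokenizer only on the first `get` call and serve later comparisons from memory.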

Installation

pip install -r requirements.txt

Usage

Local Development

python app.py

HuggingFace Spaces

Upload all .py files, along with requirements.txt and README.md, to your Space
Add an HF_TOKEN secret in the Space settings if you use gated models
The app will start automatically
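
Spaces expose secrets to the app as environment variables, so the HF_TOKEN step above could be consumed roughly like this. The helper name `auth_kwargs` is hypothetical, and the sketch assumes a recent transformers release whose `from_pretrained` accepts a `token` argument.

```python
import os
from typing import Any, Dict, Mapping, Optional


def auth_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for authenticated Hub downloads.

    Returns {"token": ...} when HF_TOKEN is set and non-empty, else {}.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return {"token": token} if token else {}
```

A loader could then call `AutoTokenizer.from_pretrained(model_id, **auth_kwargs())`, which falls back to anonymous access for public models when no secret is configured.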

Available Tokenizers

Arabic BERT Models

AraBERT v2 (AUB MIND Lab)
CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
MARBERT & ARBERT (UBC NLP)

Arabic LLMs

Jais 13B/30B (Inception/MBZUAI)
SILMA 9B (SILMA AI)
Fanar 9B (QCRI)
Yehia 7B (Navid AI)
Atlas-Chat (MBZUAI Paris)

Arabic Tokenizers

Aranizer PBE/SP 32K/86K (RIOTU Lab)

Multilingual Models

Qwen 2.5 (Alibaba)
Gemma 2 (Google)
Mistral (Mistral AI)
XLM-RoBERTa (Meta)

Leaderboard Datasets

Dataset	Source	Category
ArabicMMLU	MBZUAI	MSA Benchmark
ArSenTD-LEV	ramybaly	Levantine Dialect
ATHAR	mohamed-khalil	Classical Arabic
ARCD	arcd	QA Dataset
Ashaar	arbml	Poetry
Hadith	gurgutan	Religious
Arabic Sentiment	arbml	Social Media
SANAD	arbml	News

Metrics

Fertility: Average number of tokens per word (lower = better; 1.0 is the ideal)
Compression: Average number of bytes per token (higher = better)
STRR: Single Token Retention Rate, the share of words kept as a single token (higher = better)
OOV Rate: Percentage of out-of-vocabulary tokens (lower = better)
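
The four metrics above can be sketched in a few lines. This is an illustrative approximation rather than the code in analysis.py: it assumes whitespace-separated words, tokenizes each word in isolation (real subword tokenizers may behave differently on full sentences), and treats `unk_token` as the out-of-vocabulary marker. The function name and parameters are hypothetical.

```python
from typing import Callable, Dict, List


def tokenization_metrics(
    text: str,
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Dict[str, float]:
    """Compute fertility, compression, STRR, and OOV rate for one text."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    per_word = [tokenize(w) for w in words]          # tokens for each word
    tokens = [t for toks in per_word for t in toks]  # flattened token list
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    n_bytes = len(text.encode("utf-8"))
    return {
        # Tokens per word: 1.0 means every word is a single token.
        "fertility": len(tokens) / len(words),
        # UTF-8 bytes covered by each token on average.
        "compression": n_bytes / len(tokens),
        # Share of words that were not split at all.
        "strr": sum(len(toks) == 1 for toks in per_word) / len(words),
        # Percentage of tokens that fell back to the unknown token.
        "oov_rate": 100 * sum(t == unk_token for t in tokens) / len(tokens),
    }
```

Because Arabic letters take two bytes each in UTF-8, compression values for Arabic text are not directly comparable with figures reported for Latin-script text.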

License

MIT License

Contributing

Contributions welcome! Please open an issue or PR.

Built with ❤️ for the Arabic NLP community