title: Arabic Tokenizer Arena
emoji: ποΈ
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
Arabic Tokenizer Arena Pro
Advanced research & production platform for Arabic tokenization analysis.
Features
- Comprehensive Metrics: Fertility, compression, STRR, OOV rate, and more
- Arabic-Specific Analysis: Dialect support, diacritic preservation
- Side-by-Side Comparison: Compare multiple tokenizers instantly
- Beautiful Visualization: Token-by-token display with IDs
- Leaderboard: Evaluate on real HuggingFace Arabic datasets
- Multi-Variant Support: MSA, dialectal, and Classical Arabic
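One way to check diacritic preservation is to compare the diacritic marks in the input with those in the text a tokenizer decodes back. The following is a minimal sketch, not code from this repository: `count_diacritics`, `strip_diacritics`, and `diacritic_preservation` are hypothetical helper names, and the mark set is assumed to cover only the common tashkeel range U+064B to U+0652 plus the superscript alef U+0670.

```python
# Hypothetical helpers; not taken from this project's utils.py.
# Common Arabic diacritics (tashkeel): fathatan..sukun, plus superscript alef.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_diacritics(text: str) -> int:
    """Count the diacritic marks in a string."""
    return sum(1 for ch in text if ch in ARABIC_DIACRITICS)


def strip_diacritics(text: str) -> str:
    """Return the text with all diacritic marks removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def diacritic_preservation(original: str, decoded: str) -> float:
    """Fraction of the original's diacritics still present after decoding.

    Returns 1.0 when the original has no diacritics at all.
    """
    total = count_diacritics(original)
    if total == 0:
        return 1.0
    return min(count_diacritics(decoded), total) / total
```

In practice `decoded` would be the result of encoding and then decoding the input with the tokenizer under test; a tokenizer whose normalizer removes tashkeel would score close to 0.0.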
Project Structure
arabic_tokenizer_arena/
├── app.py                  # Main Gradio application
├── config.py               # Tokenizer registry & dataset configs
├── tokenizer_manager.py    # Tokenizer loading & caching
├── analysis.py             # Tokenization analysis functions
├── leaderboard.py          # Leaderboard with HF datasets
├── ui_components.py        # HTML generation
├── styles.py               # CSS styles
├── utils.py                # Arabic text utilities
├── requirements.txt        # Dependencies
└── README.md               # This file
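tokenizer_manager.py is described as handling loading and caching. One common way to do this, shown only as a sketch and not as this project's actual implementation, is to load each tokenizer once per model ID and reuse it afterwards. `TokenizerCache` and its `loader` parameter are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict


class TokenizerCache:
    """Hypothetical cache: loads each tokenizer once and reuses it."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        # `loader` is any callable that turns a model ID into a tokenizer.
        self._loader = loader
        self._cache: Dict[str, Any] = {}

    def get(self, model_id: str) -> Any:
        # Load on first request only; later requests hit the dictionary.
        if model_id not in self._cache:
            self._cache[model_id] = self._loader(model_id)
        return self._cache[model_id]
```

With the `transformers` package installed, the loader could be `AutoTokenizer.from_pretrained`, for example `TokenizerCache(AutoTokenizer.from_pretrained)`. Injecting the loader keeps the cache testable without network access.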
Installation
pip install -r requirements.txt
Usage
Local Development
python app.py
HuggingFace Spaces
- Upload all .py files to your Space
- Add HF_TOKEN secret if using gated models
- The app will start automatically
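Secrets configured on a Space are exposed to the app as environment variables, so the token can be read at startup. A minimal sketch, assuming the secret is named HF_TOKEN as above; `auth_kwargs` is a hypothetical helper, and `token` is the keyword accepted by `from_pretrained` in recent `transformers` releases (older releases used `use_auth_token`).

```python
import os
from typing import Dict, Optional


def auth_kwargs(env_var: str = "HF_TOKEN") -> Dict[str, str]:
    """Return {'token': ...} if the secret is set, else an empty dict."""
    token: Optional[str] = os.environ.get(env_var)
    return {"token": token} if token else {}


# Example use (requires the transformers package and network access):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("<model-id>", **auth_kwargs())
```

Returning an empty dict when the variable is unset lets the same call work for public models without a token.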
Available Tokenizers
Arabic BERT Models
- AraBERT v2 (AUB MIND Lab)
- CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
- MARBERT & ARBERT (UBC NLP)
Arabic LLMs
- Jais 13B/30B (Inception/MBZUAI)
- SILMA 9B (SILMA AI)
- Fanar 9B (QCRI)
- Yehia 7B (Navid AI)
- Atlas-Chat (MBZUAI Paris)
Arabic Tokenizers
- Aranizer PBE/SP 32K/86K (RIOTU Lab)
Multilingual Models
- Qwen 2.5 (Alibaba)
- Gemma 2 (Google)
- Mistral (Mistral AI)
- XLM-RoBERTa (Meta)
Leaderboard Datasets
| Dataset | Source | Category |
|---|---|---|
| ArabicMMLU | MBZUAI | MSA Benchmark |
| ArSenTD-LEV | ramybaly | Levantine Dialect |
| ATHAR | mohamed-khalil | Classical Arabic |
| ARCD | arcd | QA Dataset |
| Ashaar | arbml | Poetry |
| Hadith | gurgutan | Religious |
| Arabic Sentiment | arbml | Social Media |
| SANAD | arbml | News |
Metrics
- Fertility: Tokens per word (lower = better, 1.0 ideal)
- Compression: Bytes per token (higher = better)
- STRR: Single Token Retention Rate, the share of words kept as a single token (higher = better)
- OOV Rate: Out-of-vocabulary percentage (lower = better)
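The four metrics above can be sketched as follows. This is an illustration under stated assumptions, not the code in analysis.py: words are taken to be whitespace-separated, `tokenize` is any callable mapping one word to a list of token strings, compression counts UTF-8 bytes, STRR is read as the share of words kept as a single token, and OOV rate is taken as the percentage of tokens equal to the unknown token. `compute_metrics` and `toy_tokenize` are hypothetical names.

```python
from typing import Callable, Dict, List


def compute_metrics(
    text: str,
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Dict[str, float]:
    """Compute fertility, compression, STRR and OOV rate for one text."""
    words = text.split()
    per_word = [tokenize(w) for w in words]
    tokens = [t for toks in per_word for t in toks]
    if not words or not tokens:
        raise ValueError("text must contain at least one word and one token")
    return {
        # Tokens per word (lower is better, 1.0 is ideal).
        "fertility": len(tokens) / len(words),
        # UTF-8 bytes per token (higher is better).
        "compression": len(text.encode("utf-8")) / len(tokens),
        # Share of words that stay a single token (higher is better).
        "strr": sum(len(toks) == 1 for toks in per_word) / len(words),
        # Percentage of tokens that are the unknown token (lower is better).
        "oov_rate": 100.0 * tokens.count(unk_token) / len(tokens),
    }


def toy_tokenize(word: str) -> List[str]:
    # Toy stand-in for a real tokenizer: split words longer than 3 chars in two.
    return [word] if len(word) <= 3 else [word[:3], word[3:]]
```

Tokenizing word by word is a simplification; real subword tokenizers often treat a leading space as part of the token, so numbers from a full-text encoding may differ slightly.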
License
MIT License
Contributing
Contributions welcome! Please open an issue or PR.
Built with ❤️ for the Arabic NLP community