metadata
title: Arabic Tokenizer Arena
emoji: 🏟️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false

🏟️ Arabic Tokenizer Arena Pro

Advanced research & production platform for Arabic tokenization analysis.

Features

  • 📊 Comprehensive Metrics: Fertility, compression, STRR, OOV rate, and more
  • 🌍 Arabic-Specific Analysis: Dialect support, diacritic preservation
  • ⚖️ Side-by-Side Comparison: Compare multiple tokenizers instantly
  • 🎨 Beautiful Visualization: Token-by-token display with IDs
  • 🏆 Leaderboard: Evaluate on real HuggingFace Arabic datasets
  • 📖 Multi-Variant Support: MSA, dialectal, and Classical Arabic
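
As a rough illustration of what a diacritic-preservation check could look like, the sketch below compares the diacritics found in a text before and after an encode/decode round trip. The helper names and the chosen Unicode range (U+064B to U+0652, plus the dagger alif U+0670) are assumptions made for this example; they are not taken from utils.py.

```python
import re

# Assumed set of Arabic diacritics (tashkeel): fathatan through sukun,
# plus the dagger alif. Adjust the range if other marks matter to you.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with all diacritics in the assumed range removed."""
    return DIACRITICS.sub("", text)


def diacritics_preserved(original: str, round_tripped: str) -> bool:
    """True if both texts contain the same diacritics in the same order.

    This compares only the sequence of marks, not their positions
    relative to the base letters.
    """
    return DIACRITICS.findall(original) == DIACRITICS.findall(round_tripped)


# Fully vocalized "kataba": kaf, fatha, ta, fatha, ba, fatha
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(strip_diacritics(word))
print(diacritics_preserved(word, word))
print(diacritics_preserved(word, strip_diacritics(word)))
```

In practice `round_tripped` would be the result of decoding the token IDs a tokenizer produced for `original`.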

Project Structure

arabic_tokenizer_arena/
├── app.py                 # Main Gradio application
├── config.py              # Tokenizer registry & dataset configs
├── tokenizer_manager.py   # Tokenizer loading & caching
├── analysis.py            # Tokenization analysis functions
├── leaderboard.py         # Leaderboard with HF datasets
├── ui_components.py       # HTML generation
├── styles.py              # CSS styles
├── utils.py               # Arabic text utilities
├── requirements.txt       # Dependencies
└── README.md              # This file

Installation

pip install -r requirements.txt

Usage

Local Development

python app.py

HuggingFace Spaces

  1. Upload all .py files, along with requirements.txt and this README.md, to your Space
  2. Add an HF_TOKEN secret in the Space settings if you use gated models
  3. The app will start automatically
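
For gated models, the token has to reach the tokenizer loader at runtime. The sketch below shows one way this might be wired up: it reads HF_TOKEN from the environment and passes it to `AutoTokenizer.from_pretrained`, caching loaded tokenizers in memory. The function names `hf_auth_kwargs` and `load_tokenizer` are hypothetical and do not describe the actual contents of tokenizer_manager.py.

```python
import os
from functools import lru_cache
from typing import Dict, Mapping


def hf_auth_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the auth keyword arguments from the HF_TOKEN secret, if set."""
    token = env.get("HF_TOKEN")
    return {"token": token} if token else {}


@lru_cache(maxsize=16)
def load_tokenizer(model_id: str):
    """Load a tokenizer once per model ID and reuse it afterwards."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id, **hf_auth_kwargs())
```

Spaces expose secrets to the app as environment variables, so the same code should work locally after `export HF_TOKEN=...` and on the Space without changes.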

Available Tokenizers

Arabic BERT Models

  • AraBERT v2 (AUB MIND Lab)
  • CAMeLBERT Mix/MSA/DA/CA (CAMeL Lab)
  • MARBERT & ARBERT (UBC NLP)

Arabic LLMs

  • Jais 13B/30B (Inception/MBZUAI)
  • SILMA 9B (SILMA AI)
  • Fanar 9B (QCRI)
  • Yehia 7B (Navid AI)
  • Atlas-Chat (MBZUAI Paris)

Arabic Tokenizers

  • Aranizer PBE/SP 32K/86K (RIOTU Lab)

Multilingual Models

  • Qwen 2.5 (Alibaba)
  • Gemma 2 (Google)
  • Mistral (Mistral AI)
  • XLM-RoBERTa (Meta)

Leaderboard Datasets

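| Dataset          | Source         | Category          |
|------------------|----------------|-------------------|
| ArabicMMLU       | MBZUAI         | MSA Benchmark     |
| ArSenTD-LEV      | ramybaly       | Levantine Dialect |
| ATHAR            | mohamed-khalil | Classical Arabic  |
| ARCD             | arcd           | QA Dataset        |
| Ashaar           | arbml          | Poetry            |
| Hadith           | gurgutan       | Religious         |
| Arabic Sentiment | arbml          | Social Media      |
| SANAD            | arbml          | News              |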

Metrics

  • Fertility: Average tokens per word (lower = better, 1.0 ideal)
  • Compression: Bytes per token (higher = better)
  • STRR: Single Token Retention Rate, the share of words kept as a single token (higher = better)
  • OOV Rate: Out-of-vocabulary percentage (lower = better)
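
To make the definitions above concrete, here is a minimal sketch of how the four metrics could be computed for any tokenizer exposed as a function from a word to a list of tokens. It makes several assumptions that may differ from analysis.py: words are split on whitespace, each word is tokenized on its own, bytes are counted over the UTF-8 encoding of the full text, and OOV is measured as the share of tokens equal to an unknown-token string. The toy `chunk_tokenizer` exists only to make the example runnable.

```python
from typing import Callable, Dict, List


def tokenization_metrics(
    text: str,
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Dict[str, float]:
    """Compute fertility, compression, STRR, and OOV rate for one text."""
    words = text.split()
    per_word = [tokenize(w) for w in words]
    n_tokens = sum(len(tokens) for tokens in per_word)
    if not words or n_tokens == 0:
        raise ValueError("text must contain at least one word and one token")

    n_bytes = len(text.encode("utf-8"))
    single = sum(1 for tokens in per_word if len(tokens) == 1)
    unknown = sum(tokens.count(unk_token) for tokens in per_word)

    return {
        "fertility": n_tokens / len(words),   # tokens per word
        "compression": n_bytes / n_tokens,    # bytes per token
        "strr": single / len(words),          # words kept as one token
        "oov_rate": unknown / n_tokens,       # share of unknown tokens
    }


def chunk_tokenizer(word: str) -> List[str]:
    """Toy tokenizer (assumption): split a word into 3-character pieces."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


# "min as-salam": a 2-letter word and a 6-letter word
sample = "\u0645\u0646 \u0627\u0644\u0633\u0644\u0627\u0645"
print(tokenization_metrics(sample, chunk_tokenizer))
```

A real tokenizer can be plugged in by passing something like `tokenizer.tokenize` from the transformers library as the `tokenize` argument, though subword tokenizers may behave differently on isolated words than on running text.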

License

MIT License

Contributing

Contributions welcome! Please open an issue or PR.


Built with ❤️ for the Arabic NLP community