Evaluating Arabic Tokenizers
Introduction
When working with Arabic language models, choosing the right tokenizer can significantly impact model performance and efficiency. In this post, I’ll share my experience comparing two popular Arabic tokenizers: AraModernBert and AraBERT v2.
Why Tokenizer Evaluation Matters
After reading the comprehensive guide on GPT tokenizers, I realized that tokenization is often overlooked but critically important. Poor tokenization can lead to:
- Inefficient use of limited context windows
- Higher computational costs (you pay per token!)
- Worse model performance, especially for non-English languages
For Arabic specifically, tokenization is challenging because:
- Arabic has rich morphology with prefixes and suffixes
- Different dialects (Egyptian, Levantine, Gulf) have varying vocabulary
- Diacritical marks (tashkeel) add complexity
- Some tasks require tashkeel, such as speech synthesis, grammar checking, and Arabic poetry analysis
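To make the tashkeel point concrete, here is a minimal sketch of my own (not from either tokenizer) showing how diacritics change a string before it ever reaches a tokenizer. It assumes the common harakat live in the Unicode range U+064B-U+0652:

```python
import re

# Common Arabic diacritics (fathatan through sukun): U+064B-U+0652.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def strip_tashkeel(text):
    """Remove short-vowel marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

vocalized = "كَتَبَ"              # "he wrote", fully vocalized
bare = strip_tashkeel(vocalized)  # "كتب"

# The same word costs twice as many bytes (and often more tokens) when vocalized.
print(len(vocalized.encode("utf-8")), len(bare.encode("utf-8")))  # 12 6
```

A tokenizer whose vocabulary was trained mostly on unvocalized text will tend to fragment vocalized input, which is exactly why tashkeel-heavy tasks deserve their own evaluation.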
The Evaluation Framework
I built a simple evaluation function to measure tokenizer quality:
```python
def evaluate_tokenizer(text, tokenizer):
    number_of_tokens = len(tokenizer.tokenize(text))
    number_of_bytes = len(text.encode('utf-8'))
    number_of_words = len(text.split(" "))
    fertility = number_of_tokens / number_of_words
    compression_ratio = number_of_bytes / number_of_tokens
    return {
        "fertility": fertility,
        "compression_ratio": compression_ratio,
        "total_tokens": number_of_tokens
    }
```
Key Metrics
- Fertility Rate (tokens/word): Lower is better. Measures how many tokens are needed per word.
- Compression Ratio (bytes/token): Higher is better. Measures how efficiently the tokenizer compresses text.
- Total Tokens: The raw count for the given text.
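As a sanity check, the metrics can be computed by hand for the Test 1 sentence below. The 5-token count is AraModernBert's result from the table; the word and byte counts come straight from the string:

```python
text = "مرحبا كيف حالك اليوم"  # Test 1 sentence (MSA)

number_of_bytes = len(text.encode("utf-8"))  # 17 Arabic letters x 2 bytes + 3 spaces = 37
number_of_words = len(text.split(" "))       # 4
number_of_tokens = 5                         # AraModernBert's count from the table

fertility = number_of_tokens / number_of_words          # 5 / 4 = 1.25
compression_ratio = number_of_bytes / number_of_tokens  # 37 / 5 = 7.4

print(fertility, compression_ratio)  # 1.25 7.4
```

Note that each Arabic letter takes 2 bytes in UTF-8, so the compression ratio for Arabic naturally sits higher than the tokens-per-character intuition from English.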
The Contenders
AraModernBert
- Vocabulary: 50,280 tokens
- Training Data: 100GB of Arabic text
- Architecture: ModernBERT with transtokenization
- Context Window: 8,192 tokens
AraBERT v2
- Vocabulary: ~30,000 tokens
- Training Data: 77GB of Arabic text
- Architecture: BERT-base
- Pre-segmentation: Uses Farasa segmenter
Test Results
I tested both tokenizers on three different Arabic texts (encoder-only models for now; more comparisons to come later):
Test 1: Modern Standard Arabic
Text: “مرحبا كيف حالك اليوم”
| Tokenizer | Fertility | Compression Ratio | Total Tokens |
|---|---|---|---|
| AraModernBert | 1.25 | 7.4 | 5 |
| AraBERT v2 | 1.5 | 6.17 | 6 |
Test 2: Egyptian Dialect
Text: “إزيك يا صاحبي عامل إيه”
| Tokenizer | Fertility | Compression Ratio | Total Tokens |
|---|---|---|---|
| AraModernBert | 1.2 | 6.67 | 6 |
| AraBERT v2 | 1.4 | 5.71 | 7 |
Test 3: Technical/Formal Text
Text: “الذكاء الاصطناعي يغير العالم بسرعة كبيرة”
| Tokenizer | Fertility | Compression Ratio | Total Tokens |
|---|---|---|---|
| AraModernBert | 1.17 | 10.71 | 7 |
| AraBERT v2 | 2.0 | 6.25 | 12 |
Key Findings
1. AraModernBert is Consistently More Efficient
Across all three tests, AraModernBert showed:
- Lower fertility (15-42% fewer tokens per word)
- Higher compression ratio (14-71% better compression)
- Fewer total tokens needed for the same text
2. The Gap Widens with Complex Text
The most dramatic difference appeared in Test 3 (technical text):
- AraModernBert: 7 tokens
- AraBERT v2: 12 tokens (71% more!)
This means for an 8K context window, AraModernBert can fit significantly more Arabic text.
3. Both Handle Dialects Reasonably Well
The Egyptian dialect test (Test 2) showed both tokenizers maintained similar efficiency to MSA, though AraModernBert still outperformed.
Why AraModernBert Performs Better
Larger Vocabulary (50K vs 30K tokens)
A larger vocabulary means longer, more common Arabic word chunks can be stored as single tokens.
This is especially important for Arabic’s morphologically rich structure.
More Training Data (100GB vs 77GB)
More data leads to better byte-pair encoding merges that reflect actual Arabic usage patterns.
Modern Architecture
AraModernBert uses transtokenization - a technique that optimally initializes embeddings when creating the tokenizer, leading to better learned representations.
Recency Advantage
Trained in 2024 vs 2020, AraModernBert benefits from more recent data and improved training techniques.
Practical Implications
For Model Training
- Context efficiency: AraModernBert lets you fit ~40% more Arabic text in the same context window
- Cost savings: Fewer tokens = lower training costs
- Better performance: More efficient tokenization often correlates with better downstream task performance
For Production Systems
- API costs: Pay per token, so more efficient tokenization = lower costs
- Latency: Fewer tokens to process = faster inference
- Memory: Smaller token sequences = lower memory footprint
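A back-of-envelope sketch of what the fertility gap means at scale. The corpus size and price here are made up, and the fertility figures are rough averages of the three tests above:

```python
words = 500 * 10_000                  # hypothetical corpus: 10k documents x 500 words
fert_modern, fert_arabert = 1.2, 1.6  # rough average fertility from the tests above

tokens_modern = round(words * fert_modern)    # 6,000,000
tokens_arabert = round(words * fert_arabert)  # 8,000,000

price_per_1k_tokens = 0.0005  # hypothetical API price in dollars
extra_cost = (tokens_arabert - tokens_modern) / 1000 * price_per_1k_tokens
print(tokens_arabert - tokens_modern, extra_cost)  # 2,000,000 extra tokens
```

The absolute dollar figure depends entirely on the assumed price, but the ~33% token overhead compounds across every request, training step, and cached context.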
Should You Train Your Own Tokenizer?
After this evaluation, here’s my thinking:
Use AraModernBert if:
- You’re working with Modern Standard Arabic or mixed dialects
- You want state-of-the-art efficiency out of the box
- You don’t have massive compute resources for training
Train your own if:
- You have a very specific domain (medical, legal, etc.)
- You’re working with a specific dialect extensively (pure Egyptian, Levantine, etc.)
- You have unique requirements (handling tashkeel differently, etc.)
For my Egyptian Arabic use case, I’m leaning toward using AraModernBert’s tokenizer as-is, since:
- It already handles Egyptian dialect reasonably well
- The 50K vocabulary is large enough to be flexible
- Training a custom tokenizer requires significant effort and data
Next Steps
- Test on real data: Evaluate on actual Egyptian Arabic corpus (FineWeb Egyptian)
- Compare with SentencePiece: Test a SentencePiece tokenizer trained on Arabic
- Measure downstream performance: Tokenizer efficiency doesn’t always equal better model performance
- Investigate tashkeel handling: How do these tokenizers handle diacritical marks?
Conclusion
Tokenizer evaluation revealed that AraModernBert significantly outperforms AraBERT v2 on efficiency metrics, needing roughly 14-42% fewer tokens for the same Arabic text (equivalently, AraBERT v2 needed up to 71% more tokens on the technical sample).
This translates to real cost savings and performance improvements in production systems.
The key lesson: don’t assume all tokenizers are equal. A few hours of evaluation can save months of headaches and significant costs down the line.
Code Repository
The complete evaluation code is available as a simple function:
```python
def evaluate_tokenizer(text, tokenizer):
    number_of_tokens = len(tokenizer.tokenize(text))
    number_of_bytes = len(text.encode('utf-8'))
    number_of_words = len(text.split(" "))
    fertility = number_of_tokens / number_of_words
    compression_ratio = number_of_bytes / number_of_tokens
    return {
        "fertility": fertility,
        "compression_ratio": compression_ratio,
        "total_tokens": number_of_tokens
    }
```
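For a quick smoke test without downloading any model, the function can be exercised with a stand-in whitespace "tokenizer" (one token per word, so fertility is exactly 1.0). The function body is repeated here so the snippet runs on its own:

```python
from types import SimpleNamespace

def evaluate_tokenizer(text, tokenizer):
    number_of_tokens = len(tokenizer.tokenize(text))
    number_of_bytes = len(text.encode('utf-8'))
    number_of_words = len(text.split(" "))
    fertility = number_of_tokens / number_of_words
    compression_ratio = number_of_bytes / number_of_tokens
    return {
        "fertility": fertility,
        "compression_ratio": compression_ratio,
        "total_tokens": number_of_tokens
    }

# Stand-in tokenizer: splits on whitespace, one token per word.
whitespace_tokenizer = SimpleNamespace(tokenize=str.split)

result = evaluate_tokenizer("مرحبا كيف حالك اليوم", whitespace_tokenizer)
print(result)  # fertility 1.0, compression_ratio 9.25, total_tokens 4
```

Any object exposing a `tokenize(text)` method works here, which is the same interface Hugging Face tokenizers expose, so swapping in a real model's tokenizer is a one-line change.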
This post was built with guidance from the amazing Solveit.