Evaluating Arabic Tokenizers
Introduction
When working with Arabic language models, choosing the right tokenizer can significantly impact model performance and efficiency. In this post, I’ll share my experience comparing two popular Arabic tokenizers: AraModernBert and AraBERT v2.
Why Tokenizer Evaluation Matters
After reading the comprehensive guide on GPT tokenizers, I realized that tokenization is often overlooked but critically important. Poor tokenization can lead to:
- Inefficient use of limited context windows
- Higher computational costs (you pay per token!)
- Worse model performance, especially for non-English languages
For Arabic specifically, tokenization is challenging because:
- Arabic has rich morphology with prefixes and suffixes
- Different dialects (Egyptian, Levantine, Gulf) have varying vocabulary
- Diacritical marks (tashkeel) add complexity
- Some tasks require tashkeel, such as speech synthesis, grammar checking, and Arabic poetry analysis
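To make the tashkeel point concrete, here is a minimal sketch of my own (not from either tokenizer) showing how diacritics change a string before it ever reaches a tokenizer. It assumes the common harakat live in the Unicode range U+064B-U+0652:

```python
import re

# Common Arabic diacritics (fathatan through sukun): U+064B-U+0652.
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def strip_tashkeel(text):
    """Remove short-vowel marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

vocalized = "كَتَبَ"              # "he wrote", fully vocalized
bare = strip_tashkeel(vocalized)  # "كتب"

# The same word costs twice as many bytes (and often more tokens) when vocalized.
print(len(vocalized.encode("utf-8")), len(bare.encode("utf-8")))  # 12 6
```

A tokenizer whose vocabulary was trained mostly on unvocalized text will tend to fragment vocalized input, which is exactly why tashkeel-heavy tasks deserve their own evaluation.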
The Evaluation Framework
I built a simple evaluation function to measure tokenizer quality:
```python
def evaluate_tokenizer(text, tokenizer):
    number_of_tokens = len(tokenizer.tokenize(text))
    number_of_bytes = len(text.encode('utf-8'))
    number_of_words = len(text.split(" "))
    fertility = number_of_tokens / number_of_words
    compression_ratio = number_of_bytes / number_of_tokens
    return {
        "fertility": fertility,
        "compression_ratio": compression_ratio,
        "total_tokens": number_of_tokens
    }
```
Key Metrics
- Fertility Rate (tokens/word): Lower is better. Measures how many tokens are needed per word.
- Compression Ratio (bytes/token): Higher is better. Measures how efficiently the tokenizer compresses text.
- Total Tokens: The raw count for the given text.
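As a sanity check, the metrics can be computed by hand for the Test 1 sentence below. The 5-token count is AraModernBert's result from the table; the word and byte counts come straight from the string:

```python
text = "مرحبا كيف حالك اليوم"  # Test 1 sentence (MSA)

number_of_bytes = len(text.encode("utf-8"))  # 17 Arabic letters x 2 bytes + 3 spaces = 37
number_of_words = len(text.split(" "))       # 4
number_of_tokens = 5                         # AraModernBert's count from the table

fertility = number_of_tokens / number_of_words          # 5 / 4 = 1.25
compression_ratio = number_of_bytes / number_of_tokens  # 37 / 5 = 7.4

print(fertility, compression_ratio)  # 1.25 7.4
```

Note that each Arabic letter takes 2 bytes in UTF-8, so the compression ratio for Arabic naturally sits higher than the tokens-per-character intuition from English.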
The Contenders
AraModernBert
- Vocabulary: 50,280 tokens
- Training Data: 100GB of Arabic text
- Architecture: ModernBERT with transtokenization
- Context Window: 8,192 tokens
AraBERT v2
- Vocabulary: ~30,000 tokens
- Training Data: 77GB of Arabic text
- Architecture: BERT-base
- Pre-segmentation: Uses Farasa segmenter
Test Results
I tested both tokenizers on three different Arabic texts (encoder-only models for now; more comparisons to come later):
Test 1: Modern Standard Arabic
Text: “مرحبا كيف حالك اليوم”
| Tokenizer | Fertility | Compression Ratio | Total Tokens |
|---|---|---|---|
| AraModernBert | 1.25 | 7.4 | 5 |
| AraBERT v2 | 1.5 | 6.17 | 6 |
Test 2: Egyptian Dialect
Text: “إزيك يا صاحبي عامل إيه”
| Tokenizer | Fertility | Compression Ratio | Total Tokens |
|---|---|---|---|
| AraModernBert | 1.2 | 6.67 | 6 |
| AraBERT v2 | 1.4 | 5.71 | 7 |
Test 3: Technical/Formal Text
Text: “الذكاء الاصطناعي يغير العالم بسرعة كبيرة”
| Tokenizer | Fertility | Compression Ratio | Total Tokens |
|---|---|---|---|
| AraModernBert | 1.17 | 10.71 | 7 |
| AraBERT v2 | 2.0 | 6.25 | 12 |
Key Findings
1. AraModernBert is Consistently More Efficient
Across all three tests, AraModernBert showed:
- Lower fertility (15-42% fewer tokens per word)
- Higher compression ratio (14-71% better compression)
- Fewer total tokens needed for the same text
2. The Gap Widens with Complex Text
The most dramatic difference appeared in Test 3 (technical text):
- AraModernBert: 7 tokens
- AraBERT v2: 12 tokens (71% more!)
This means for an 8K context window, AraModernBert can fit significantly more Arabic text.
3. Both Handle Dialects Reasonably Well
The Egyptian dialect test (Test 2) showed both tokenizers maintained similar efficiency to MSA, though AraModernBert still outperformed.
Why AraModernBert Performs Better
Larger Vocabulary (50K vs 30K tokens)
A larger vocabulary means longer, more common Arabic word chunks can be stored as single tokens.
This is especially important for Arabic’s morphologically rich structure.
More Training Data (100GB vs 77GB)
More data leads to better byte-pair encoding merges that reflect actual Arabic usage patterns.
Modern Architecture
AraModernBert uses transtokenization - a technique that optimally initializes embeddings when creating the tokenizer, leading to better learned representations.
Recency Advantage
Trained in 2024 vs 2020, AraModernBert benefits from more recent data and improved training techniques.
Practical Implications
For Model Training
- Context efficiency: AraModernBert lets you fit ~40% more Arabic text in the same context window
- Cost savings: Fewer tokens = lower training costs
- Better performance: More efficient tokenization often correlates with better downstream task performance
For Production Systems
- API costs: Pay per token, so more efficient tokenization = lower costs
- Latency: Fewer tokens to process = faster inference
- Memory: Smaller token sequences = lower memory footprint
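A back-of-envelope sketch of what the fertility gap means at scale. The corpus size and price here are made up, and the fertility figures are rough averages of the three tests above:

```python
words = 500 * 10_000                  # hypothetical corpus: 10k documents x 500 words
fert_modern, fert_arabert = 1.2, 1.6  # rough average fertility from the tests above

tokens_modern = round(words * fert_modern)    # 6,000,000
tokens_arabert = round(words * fert_arabert)  # 8,000,000

price_per_1k_tokens = 0.0005  # hypothetical API price in dollars
extra_cost = (tokens_arabert - tokens_modern) / 1000 * price_per_1k_tokens
print(tokens_arabert - tokens_modern, extra_cost)  # 2,000,000 extra tokens
```

The absolute dollar figure depends entirely on the assumed price, but the ~33% token overhead compounds across every request, training step, and cached context.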
Should You Train Your Own Tokenizer?
After this evaluation, here’s my thinking:
Use AraModernBert if:
- You’re working with Modern Standard Arabic or mixed dialects
- You want state-of-the-art efficiency out of the box
- You don’t have massive compute resources for training
Train your own if:
- You have a very specific domain (medical, legal, etc.)
- You’re working with a specific dialect extensively (pure Egyptian, Levantine, etc.)
- You have unique requirements (handling tashkeel differently, etc.)
For my Egyptian Arabic use case, I’m leaning toward using AraModernBert’s tokenizer as-is, since:
- It already handles Egyptian dialect reasonably well
- The 50K vocabulary is large enough to be flexible
- Training a custom tokenizer requires significant effort and data
Next Steps
- Test on real data: Evaluate on actual Egyptian Arabic corpus (FineWeb Egyptian)
- Compare with SentencePiece: Test a SentencePiece tokenizer trained on Arabic
- Measure downstream performance: Tokenizer efficiency doesn’t always equal better model performance
- Investigate tashkeel handling: How do these tokenizers handle diacritical marks?
Conclusion
Tokenizer evaluation revealed that AraModernBert significantly outperforms AraBERT v2 on efficiency metrics, needing roughly 14-42% fewer tokens for the same Arabic text (equivalently, AraBERT v2 needed up to 71% more tokens on the technical sample).
This translates to real cost savings and performance improvements in production systems.
The key lesson: don’t assume all tokenizers are equal. A few hours of evaluation can save months of headaches and significant costs down the line.
Code Repository
The complete evaluation code is available as a simple function:
```python
def evaluate_tokenizer(text, tokenizer):
    number_of_tokens = len(tokenizer.tokenize(text))
    number_of_bytes = len(text.encode('utf-8'))
    number_of_words = len(text.split(" "))
    fertility = number_of_tokens / number_of_words
    compression_ratio = number_of_bytes / number_of_tokens
    return {
        "fertility": fertility,
        "compression_ratio": compression_ratio,
        "total_tokens": number_of_tokens
    }
```
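For a quick smoke test without downloading any model, the function can be exercised with a stand-in whitespace "tokenizer" (one token per word, so fertility is exactly 1.0). The function body is repeated here so the snippet runs on its own:

```python
from types import SimpleNamespace

def evaluate_tokenizer(text, tokenizer):
    number_of_tokens = len(tokenizer.tokenize(text))
    number_of_bytes = len(text.encode('utf-8'))
    number_of_words = len(text.split(" "))
    fertility = number_of_tokens / number_of_words
    compression_ratio = number_of_bytes / number_of_tokens
    return {
        "fertility": fertility,
        "compression_ratio": compression_ratio,
        "total_tokens": number_of_tokens
    }

# Stand-in tokenizer: splits on whitespace, one token per word.
whitespace_tokenizer = SimpleNamespace(tokenize=str.split)

result = evaluate_tokenizer("مرحبا كيف حالك اليوم", whitespace_tokenizer)
print(result)  # fertility 1.0, compression_ratio 9.25, total_tokens 4
```

Any object exposing a `tokenize(text)` method works here, which is the same interface Hugging Face tokenizers expose, so swapping in a real model's tokenizer is a one-line change.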
This post was built with guidance from the amazing Solveit.