Zaraah model2vec for Arabic NLP with magical power

blogging
embedding
minishlab
model2vec
arabic
Analysis and testing of the Zaraah model2vec model family on Arabic embedding tasks.
Author

kareem

Published

May 15, 2025

Arabic Embedding Models

This blog post describes the Zaraah model2vec models, which are similar to the POTION models: both are built by pre-training Model2Vec models. I will also share quick notes about Zaraah's strengths, its limitations, and its usage for Arabic embedding tasks.

What even are POTION models?

==POTION: bag of tricks leads to better models==

I like to think of it as Bojji from Ousama Ranking: a king small in size, but able to fight giants like the Jina and BGE embedding models. It's better than any set of static embeddings on any task, including GloVe, fastText, and specialized word embeddings.

Its performance is similar to all-MiniLM-L6-v2 on English, surpassing the 50% average MTEB score mark, while being very small, at around 2M to 4M parameters. It is also roughly 55 times smaller than GloVe: these models range from about 8MB to 30MB while still packing much more power.

So you can run these models on a CPU, or in the browser, to build lightweight applications for edge devices and low-resource environments.
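To give a feel for how light the inference path is, here is a sketch of using such a static model. The `StaticModel` API and the `minishlab/potion-base-8M` model name come from minishlab's published library; the download requires a network connection, so it is shown commented out, and the cosine helper is my own:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage (requires `pip install model2vec` plus a model download):
# from model2vec import StaticModel
# model = StaticModel.from_pretrained("minishlab/potion-base-8M")
# emb = model.encode(["the king is small", "the king is tiny"])
# print(cosine(emb[0], emb[1]))

# Quick sanity check of the helper on toy vectors:
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine(a, b), 3))  # ≈ 0.707
```

Since the model is just an embedding lookup, `encode` needs no GPU and no deep-learning framework at inference time, which is exactly what makes CPU and in-browser deployment practical.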

What is the model2vec technique?

How do you make a sentence transformer 500x faster and 15x smaller? Yes, you can, and you should do it with the model2vec technique; it is currently the ==Fast State-of-the-Art Static Embeddings==.

Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor of up to 50 and making the models up to 500 times faster, with only a small drop in performance.

Instead of creating static embeddings the old way, like GloVe, Model2vec tries to distill the knowledge of big sentence-transformer models into uncontextualized vectors. Losing context is a downside, but in exchange you get the size and speed benefits while still keeping word representations that are good enough for most applications.
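At inference time, "uncontextualized" means a sentence embedding is just a table lookup of per-token vectors followed by mean pooling, with no transformer forward pass at all. A toy numpy sketch (the vocabulary and the vectors here are made up purely for illustration):

```python
import numpy as np

# Made-up static vocabulary: token -> row in the embedding matrix.
vocab = {"the": 0, "king": 1, "is": 2, "small": 3}
embeddings = np.array([
    [0.1, 0.2],   # the
    [0.9, 0.1],   # king
    [0.0, 0.3],   # is
    [0.4, 0.8],   # small
])

def encode(sentence: str) -> np.ndarray:
    """Uncontextualized sentence embedding: mean of the token vectors."""
    ids = [t_id for t in sentence.lower().split()
           if (t_id := vocab.get(t)) is not None]
    return embeddings[ids].mean(axis=0)

print(encode("the king is small"))  # mean of the four rows -> [0.35, 0.35]
```

Because every word always maps to the same vector regardless of its neighbors, "bank" in "river bank" and "bank account" would get identical embeddings: that is the trade-off the paragraph above describes.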

For more details, check their website and the GitHub code.

jina-embeddings-v3 for Arabic

If you go to the MTEB leaderboard and filter for the best zero-shot, open-source embedding model for Arabic, you will get jina-embeddings-v3, which is a very strong model across all the MTEB tasks. It has also been battle-tested in production applications for Arabic… more on that later.

So I thought it would be very nice to create an Arabic version using this model and model2vec.

Zaraah Family

Zaraah is the strongest static embedding model for Arabic and the first one trained with tokenlearn.

It comes in several dimension variants:

1. 256D
2. 64D
3. 32D
4. 16D
5. 8D
6. 4D

All of these support float32 down to int8, and the models were trained on the vocabulary of the Arabic subset of allenai/c4.
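The float32-to-int8 support essentially comes down to quantizing the embedding matrix. Here is a rough sketch of symmetric per-matrix int8 quantization; this is a generic illustrative scheme, not necessarily the exact one these models ship with:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric quantization: int8 codes plus a single float scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

# Toy embedding matrix standing in for a static model's weights.
emb = np.random.default_rng(0).normal(size=(1000, 64)).astype(np.float32)
q, scale = quantize_int8(emb)
print(q.nbytes / emb.nbytes)                      # 0.25: int8 is 4x smaller
print(np.abs(dequantize(q, scale) - emb).max())   # small reconstruction error
```

The storage win is exactly 4x versus float32 (one byte per weight plus a single scale), and for mean-pooled static embeddings the rounding error barely moves downstream cosine similarities.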

Zaraah model vs the rest

Okay, enough talking, let's see the performance.

I tested against multiple multilingual sentence-transformers models as well as Arabic-specific models.

Zaraah vs all-MiniLM

Zaraah Applications with minishlab

Arabic Leaderboards for Embedding tasks

Arabic Rag Leaderboard

MTEB benchmark for Arabic

Zaraah model references