Zaraah: a Model2Vec model for Arabic NLP with magical power
Arabic Embedding Models
This blog post describes Zaraah, a Model2Vec model similar to the POTION models, which are based on pre-training Model2Vec models. I will also share quick notes about Zaraah's strengths, its limitations, and its usage for Arabic embedding tasks.
What even are POTION models?
==POTION: bag of tricks leads to better models==
I like to think of it as Bojji from Ousama Ranking: a king small in size, but able to fight giants like the Jina embedding and BGE models. It is better than any set of static embeddings for any task, including GloVe, fastText, and specialized word embeddings.
Its performance is similar to all-MiniLM-L6-v2 on English, surpassing the 50% average MTEB score mark while being very small, at around 4M and 2M parameters. It is about 55 times smaller than GloVe: these models can be around 8MB to 30MB while still packing much more power.
So you can run these models on CPU or in the browser, enabling lightweight applications on edge devices and in low-resource settings.
What is the Model2Vec technique?
How do you make a sentence transformer 500x faster and 15x smaller? Yes, you can, and you should, with the Model2Vec technique — currently the ==Fastest State-of-the-Art Static Embeddings==.
Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor of up to 50 and making the models up to 500 times faster, with only a small drop in performance.
Instead of creating static embeddings the old way, like GloVe, Model2Vec tries to distill the knowledge of big sentence-transformer models into uncontextualized vectors. Losing context is a downside, but it buys the size and speed benefits while still giving word representations that are good enough for most applications.
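The inference side of this idea is almost embarrassingly simple: each token has one fixed vector, and a sentence embedding is just the mean of its token vectors. A toy NumPy illustration (the vocabulary and vectors here are made up for demonstration):

```python
import numpy as np

# Toy illustration of Model2Vec-style inference: no transformer forward
# pass, no context -- just a lookup table and mean pooling.
vocab = {"arabic": 0, "static": 1, "embeddings": 2, "are": 3, "fast": 4}
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(len(vocab), 8))  # tiny 8-dim table

def encode(sentence: str) -> np.ndarray:
    """Look up each known token's vector and average them."""
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    return token_vectors[ids].mean(axis=0)

emb = encode("static embeddings are fast")
print(emb.shape)  # (8,)
```

This is why the models are so fast: encoding a sentence costs a handful of array lookups and one mean, instead of a full attention stack.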
For more details, it's best to check their website and the GitHub code.
Jinaai-v3-embedding for Arabic
If you go to the MTEB leaderboard and select the best zero-shot, open-source embedding model for Arabic, you get jina-embeddings-v3, a very strong and powerful model across all the MTEB tasks. It has also been battle-tested in production applications for Arabic….more to explain later.
So I thought it would be very nice to create an Arabic version of this model with Model2Vec.
Zaraah Family
Zaraah is the strongest static embedding model for Arabic and the first trained with tokenlearn.
It comes in several sizes:

1. 256D
2. 64D
3. 32D
4. 16D
5. 8D
6. 4D

All of these support float32 down to int8, and the model was trained on the vocabulary of the subset.
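To show why the int8 option matters for size, here is a sketch of naive symmetric int8 quantization of a float32 embedding table (the quantization scheme here is an assumption for illustration, not necessarily the one Zaraah ships with):

```python
import numpy as np

# Sketch: symmetric int8 quantization of a float32 embedding table.
rng = np.random.default_rng(0)
table_f32 = rng.normal(size=(1000, 64)).astype(np.float32)

# One global scale so that the largest value maps to 127.
scale = np.abs(table_f32).max() / 127.0
table_i8 = np.round(table_f32 / scale).astype(np.int8)

# Dequantize to approximate the original vectors.
recovered = table_i8.astype(np.float32) * scale

print(table_f32.nbytes // table_i8.nbytes)  # 4x smaller on disk/in memory
```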
Zaraah model vs the rest
Okay, enough talking, let's see the performance.
I tested it against multiple multilingual sentence-transformers models and Arabic-specific models.
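The core of such a comparison is always the same: embed the texts with each model, then score pairs by cosine similarity. A minimal sketch of that scoring loop (the vectors below are random stand-ins for real model outputs):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins: in a real test these come from model.encode(...).
rng = np.random.default_rng(0)
query = rng.normal(size=64)
candidates = rng.normal(size=(3, 64))

# Rank candidates against the query; repeat per model and compare rankings.
scores = [cosine_sim(query, c) for c in candidates]
best = int(np.argmax(scores))
```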