Arabic NLP Fundamentals: From Sparse Embeddings and BM25 to Dynamic Programming

blogging
til
blog/status/published
blog/status/complete
blog/learn/journey
Rebuilding NLP fundamentals for Arabic: TF-IDF, BM25, sparse embeddings, minimum edit distance, and learning problem-solving through LeetCode and Jurafsky’s Speech and Language Processing book.
Author

kareem

Published

December 13, 2025

Fixing the Gap: Classical NLP Fundamentals

For years, I’ve skimmed over foundational NLP concepts:

  • TF-IDF and BM25 (lexical retrieval)
  • Sparse embeddings (n-gram subsampling + MLP compression)
  • Dynamic programming algorithms (edit distance, Viterbi)

I understood the new stuff—BERT, transformers, semantic embeddings—but the classical foundations felt disconnected. The gap became obvious: I never built these algorithms myself. I never saw why sparse embeddings exist (they’re solving speed + interpretability problems dense methods don’t) or how to optimize them.

I decided to rebuild from first principles using:

  1. Jurafsky & Martin’s “Speech and Language Processing” (3rd edition draft)
  2. Implementing the algorithms in Arabic-ready code
  3. Problem-solving practice through LeetCode

Learning Roadmap

Phase 1: Lexical Methods

  • Word2vec and n-gram theory
  • Stemming and tokenization for Arabic
  • TF-IDF and BM25 ranking
  • BMX (dense-sparse hybrid from Mixedbread)
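To make the BM25 part of this phase concrete, here's a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. The function name and structure are my own; I'm using the common `+1` idf smoothing (Lucene-style, so idf stays positive) and the usual defaults `k1=1.5`, `b=0.75`:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (a list of token lists)
    against `query_terms` with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: in how many docs each term appears
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Term frequency saturates via k1; b normalizes by doc length
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

The length normalization is the part that matters for Arabic: after stemming, document lengths shrink dramatically, which shifts what `b` is actually normalizing against.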

Phase 2: Sparse Embeddings

  • SPLADE and learned sparse vectors
  • Training sparse embeddings with SBERT
  • Optimization and interpretability
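The payoff of learned sparse vectors like SPLADE is that scoring collapses to a dot product over a handful of nonzero term weights. A toy sketch, with term-to-weight dicts standing in for the real vocabulary-sized vectors:

```python
def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two term-weighted sparse vectors,
    iterating only over the smaller vector's nonzero entries."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)

# Hypothetical query and document expansions
query = {"تضمين": 1.2, "متجه": 0.8}
doc = {"تضمين": 0.9, "بيانات": 0.4}
score = sparse_dot(query, doc)  # only "تضمين" overlaps
```

This is also where the interpretability claim comes from: every unit of score is attributable to a named term, unlike a dense cosine similarity.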

Phase 3: Dynamic Programming & Alignment (current)

  • Minimum edit distance (Levenshtein)
  • Word alignment algorithms
  • Viterbi and HMMs

Each builds on the last. You can’t understand why sparse embeddings matter without understanding BM25. You can’t implement them efficiently without mastering dynamic programming.

First Chapter: Tokenization, Unicode & Minimum Edit Distance

Chapter 1 of Jurafsky covers:

  • Words and tokens: What counts as a “word” when you’re dealing with Arabic morphology?
  • Unicode and encoding: How different languages store characters differently (critical for Arabic preprocessing)
  • Regular expressions: Practical NLP pattern matching
  • Minimum edit distance: The algorithm that powers spell-checking and text similarity
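As a taste of the Unicode issues Arabic raises: diacritics (tashkeel) are combining marks, so a first-pass normalizer can drop them by Unicode category alone. A minimal sketch (note it strips all combining marks, shadda included, which may be too aggressive for some tasks):

```python
import unicodedata

def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritics by filtering out combining marks
    (Unicode category "Mn"), e.g. fatha, damma, kasra, sukun."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "كَتَبَ" (with fathas) normalizes to the bare consonants "كتب"
```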

The minimum edit distance algorithm is elegant, but you only really understand it once you implement it from scratch. You need to think about:

  • State representation (the matrix)
  • Recurrence relations (how cells depend on neighbors)
  • Backtracking (to recover the actual edits)
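Those three pieces (the matrix, the recurrence, the backtrace) can be sketched as follows, assuming unit costs for all three operations; the textbook's version charges 2 for substitution, which changes the distances but not the structure:

```python
def min_edit_distance(src: str, tgt: str):
    """Levenshtein distance with unit costs, plus backtracking
    to recover one optimal edit script."""
    n, m = len(src), len(tgt)
    # State: dp[i][j] = distance between src[:i] and tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    # Recurrence: each cell depends on its three upper-left neighbors
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack from the bottom-right corner to recover the edits
    i, j, ops = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            if src[i - 1] != tgt[j - 1]:
                ops.append(("sub", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", src[i - 1]))
            i -= 1
        else:
            ops.append(("ins", tgt[j - 1]))
            j -= 1
    return dp[n][m], list(reversed(ops))
```

With unit costs, the book's classic pair "intention" → "execution" comes out at distance 5, and the returned `ops` list contains exactly that many edits.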

I couldn’t implement it cleanly at first. That’s when I realized: my problem-solving muscles are rusty.

Deliberate Practice: NeetCode + ML Problems

I found NeetCode—cleaner and more focused than LeetCode. In the last week:

  • Solved 11 problems (3 easy, 8 medium+)
  • Built Python fluency around collections.Counter, hashmaps, list comprehensions
  • Code works but isn’t always optimal—that’s fine; repetition matters more than perfection at this stage
  • The real win: restored my patience for thinking through problems step-by-step
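For flavor, the kind of pattern that kept recurring in those problems: `collections.Counter` collapses the classic anagram check to a one-liner, replacing a hand-rolled hashmap loop:

```python
from collections import Counter

def is_anagram(s: str, t: str) -> bool:
    """Two strings are anagrams iff their character counts match."""
    return Counter(s) == Counter(t)
```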

The plan:

  • Month 1: Finish 150 NeetCode problems, focus on dynamic programming (edit distance, Viterbi, longest common subsequence)
  • Month 2: Transition to Deep-ML—same structure but for ML algorithms instead of data structures
  • Parallel: Implement each NLP chapter concept in Arabic-ready code

Building: Smart Arabic Keyword Checker

The alignment chapter inspired a concrete project: build a spell checker for Arabic that understands context and domain.

Current keyboards suggest near-misses with no awareness of context or domain. A smarter version would offer:

  1. Technical terminology mapping
    • Suggest Arabic equivalents for ML terms
    • Example: “embedding model” → “نماذج التضمين”
    • Example: “vector database” → “قاعدة بيانات المتجهات”
  2. Persona-aware suggestions
    • Different recommendations for engineers, doctors, writers
    • Your vocabulary profile informs what “correct” means
  3. Federated learning (like Gboard)
    • Learn from your typing without uploading data
  4. Arabic grammar correction
    • Not just spelling; handle diacritics, agreement, tense
  5. Adaptive learning
    • Improve suggestions based on what you accept/reject

The tech stack: dynamic programming for edit distance, sparse embeddings for semantic similarity, possibly Rust for performance.
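As a rough sketch of how the first pieces could fit together (all names and the ranking scheme are hypothetical, not a finished design): rank a domain vocabulary by edit distance to the typed word, breaking ties with a per-persona weight so an engineer and a doctor get different corrections for the same typo:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(word: str, vocab: dict[str, float], max_dist: int = 2) -> list[str]:
    """Rank vocabulary entries by edit distance to `word`,
    breaking ties with a persona weight (higher = more likely)."""
    cands = [(edit_distance(word, w), -weight, w) for w, weight in vocab.items()]
    return [w for d, _, w in sorted(cands) if d <= max_dist]

# e.g. suggest("embeding", {"embedding": 0.9, "encoding": 0.5})
# or an Arabic vocabulary such as {"تضمين": 0.9, "تمكين": 0.5}
```

The real version would swap the flat vocabulary for the persona profiles above and the unit-cost distance for one weighted by keyboard adjacency and diacritic confusions.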

References

  1. Jurafsky, D. & Martin, J. H. — Speech and Language Processing (3rd edition draft)

Internal Resources

If you are interested in more structured deep dives, check out my Blog or my Research Papers. For my work on tools and libraries, visit the Open Source section. You can also explore more daily notes in the TIL Index.