Revisit the Basics of Arabic NLP
blogging
til
Let’s get back into studying Arabic NLP from the basics.
- When I was a student learning about classical machine learning and NLP, I encountered a lot of concepts like:
- TF-IDF
- BM25 and other concepts. I skimmed them and got the general idea, but in all the years since I never applied them to Arabic with practical code.
- I was reading about the following topics:
- Lexical Embedding
- miniCOIL from Qdrant
- StaticEmbedding (MinishLab and SBERT). I understand the newer advances related to BERT and Transformers, but the basics of sparse embeddings are totally missing for me: how they use n-grams to subsample a large vocabulary space, then train an MLP to approximate a dense embedding and get a faster model; what the parameters of that architecture are; how to optimize it; and so on. All this knowledge is not well connected, so I want to revisit these concepts again as a roadmap (there is a small code sketch of the static-embedding idea at the end of this section):
- word2vec
- n-gram
- stemming
- TF-IDF
- BM25
- BMX
- Sparse embedding
- SPLADE
- train a sparse embedding with SBERT

These are my initial thoughts, but I think a better way is to start reading this book and try to implement the code for Arabic NLP: *Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models* (Third Edition draft) by Daniel Jurafsky (Stanford University) and James H. Martin (University of Colorado at Boulder).
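Before moving on, here is how I currently picture that static-embedding trick in code: hash character n-grams into a fixed-size vector, then train a small MLP to approximate ("distill") a dense teacher embedding. This is a minimal sketch of my own understanding, not any library's real architecture; the bucket count, layer sizes, toy Arabic texts, and the random stand-in teacher are all assumptions.

```python
import zlib
import torch
import torch.nn as nn

NUM_BUCKETS = 2 ** 16   # hashed n-gram space size (assumed, not from any paper)
EMBED_DIM = 384         # target dense dimension (assumed)

def ngram_hash_vector(text: str, n: int = 3) -> torch.Tensor:
    """Bag of hashed character n-grams as a fixed-size count vector."""
    vec = torch.zeros(NUM_BUCKETS)
    padded = f" {text} "
    for i in range(len(padded) - n + 1):
        gram = padded[i : i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS] += 1.0
    return vec

class StaticDistillMLP(nn.Module):
    """Small MLP mapping hashed n-gram counts to a dense embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_BUCKETS, 512),  # hidden size is a guess
            nn.ReLU(),
            nn.Linear(512, EMBED_DIM),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Toy distillation loop: pull the student's output toward teacher vectors
# with a cosine loss. The "teacher" here is random noise standing in for
# real SBERT embeddings of the same texts.
texts = ["نماذج التضمين", "تعلم الآلة"]
teacher = torch.randn(len(texts), EMBED_DIM)

model = StaticDistillMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CosineEmbeddingLoss()

inputs = torch.stack([ngram_hash_vector(t) for t in texts])
targets = torch.ones(len(texts))  # label +1 = "make these pairs similar"

for step in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), teacher, targets)
    loss.backward()
    optimizer.step()
```

The speed win comes at inference time: hashing n-grams plus one small MLP pass is much cheaper than a full Transformer forward pass, which is, as I understand it, the whole point of these static models.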
### Chapter Summary and What to Do Next
- The chapter was an introduction to words, tokens, BPE and tokenization, plus some linguistics, and how Unicode and text encoding work for different languages, which was a nice intro.
- How to use regular expressions to do some NLP analysis.
- What I really liked most was Minimum Edit Distance and the alignment of words. The problem I had is that implementing the algorithm yourself takes some problem solving, and I hadn't practiced problem solving in a long time, so my skills are not there yet. I tried for 30 minutes to implement minimum edit distance but failed. I opened NeetCode, a website with a more organized and minimal set of LeetCode-style problems, and solved the first 11: 3 were very easy and the rest were medium. The good thing is that I came up with solutions that were not in the solutions section, and my grip on Python collections helpers like Counter is good. My code solves the problems but is not the fastest. Still, I feel good that my patience and thinking are better. (A DP sketch of minimum edit distance follows this list.) ![[Screenshot_2025-12-14-10-35-27-07_e4424258c8b8649f6e67d283a50a2cbc.jpg]]
- My plan for the month is to finish all 150 problems and do a lot of dynamic programming (DP), because minimum edit distance and the Viterbi algorithm at the end of the first chapter are implemented with DP, and other things I know of also need DP.
- After that I will start practicing on deep-ml.com, an alternative to LeetCode but for machine learning. ![[Pasted image 20251214122627.png]]
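Since I could not write it from memory, here is the classic bottom-up DP solution as a reference for future me. It is a minimal sketch using unit costs for insertion, deletion, and substitution; the Jurafsky & Martin worked example charges 2 for substitution, which is a one-line change.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via dynamic programming."""
    n, m = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all i chars of the source prefix
    for j in range(m + 1):
        dp[0][j] = j  # insert all j chars of the target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,       # deletion
                dp[i][j - 1] + 1,       # insertion
                dp[i - 1][j - 1] + sub, # substitution or match
            )
    return dp[n][m]

assert min_edit_distance("intention", "execution") == 5
```

It works on Arabic out of the box because Python strings are Unicode: `min_edit_distance("كتاب", "كتب")` returns 1. Keeping back-pointers in each cell would recover the alignment itself, which is the part I want to practice next.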
### Build a Keyword Checker in Arabic

While learning about alignment, I got a nice idea: I could build my own keyword spell checker for my language that is smarter than the current keyboards, with the following features:

- Recommend technical meanings
    - If I type a term like "embedding model", it can suggest => نماذج التضمين (embedding models)
- Recommend keywords related to my persona (engineer, doctor, etc.)
- A federated-learning keyboard, like Gboard
- Arabic grammar correction
- Automatic learning based on user interactions

I thought it would be nice to even implement this in Rust :) 🦀🦀 A toy sketch of the suggestion loop (in Python for now) is below.
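As a first step toward that keyboard, here is a sketch of the core suggestion loop using nothing but the standard library: rank lexicon entries by string similarity to what the user typed. The persona lexicon and its Arabic entries are hypothetical placeholders; a real version would use the edit-distance alignment from the chapter plus persona-specific frequencies instead of difflib's built-in ratio.

```python
import difflib

# Hypothetical persona lexicon (engineer flavor); a real one would be
# learned from the user's own typing history.
PERSONA_LEXICON = ["نماذج", "التضمين", "خوارزمية", "برمجة", "مترجم"]

def suggest(typed: str, k: int = 3) -> list[str]:
    """Return up to k lexicon words closest to the typed string.

    difflib ranks by a similarity ratio; swapping in the minimum edit
    distance above, weighted by persona frequency, is the obvious upgrade.
    """
    return difflib.get_close_matches(typed, PERSONA_LEXICON, n=k, cutoff=0.5)

print(suggest("التضمن"))  # a typo of "التضمين" -> ['التضمين']
```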
### References

- The NLP book: Daniel Jurafsky and James H. Martin, *Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models*, 3rd edition draft.