Revisit the Basics of Arabic NLP

blogging
til
Let’s get back into studying Arabic NLP from the basics
Author

kareem

Published

December 13, 2025

  1. Lexical Embedding
  2. miniCOIL from Qdrant
  3. StaticEmbedding (minishlab and sbert)

I understand the newer advances built on BERT and Transformers, but the basics of sparse embeddings are totally missing for me: how n-grams are used to subsample a large feature space, how an MLP is then trained to mimic a dense embedding so that the model runs faster, what the parameters of this architecture are, how to optimize it, and so on. All this knowledge is not well connected, so I want to revisit these concepts in a roadmap.
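As a first step on that roadmap, here is a minimal sketch of the simplest sparse lexical representation I can think of: hashed character n-grams (the hashing trick). It is not the method behind any of the models listed above, only a baseline for seeing what "n-grams in a large space" can mean. The function names, the trigram size, and the bucket count are all my own assumptions.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with a space padded on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hashed_sparse_vector(text, n=3, num_buckets=2**18):
    """Count n-grams per hash bucket; the result is a sparse {bucket_id: count} dict.

    num_buckets is an arbitrary choice for this sketch: larger means fewer
    collisions between unrelated n-grams, at the cost of a larger index space.
    """
    counts = Counter()
    for gram in char_ngrams(text, n):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
        counts[bucket] += 1
    return dict(counts)


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(key, 0) for key, value in a.items())


book = hashed_sparse_vector("كتاب")
writing = hashed_sparse_vector("كتابة")
print(sparse_dot(book, writing))  # overlap score from the shared trigrams
```

Words that share character n-grams land in the same buckets, so their dot product is larger. A dense model distilled from this kind of input would sit on top of these bucket ids, which is the part I still need to study.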

Chapter Summary and what to do next

  • The chapter was an introduction to words, tokens, BPE and tokenization, plus some linguistics and how Unicode and encoding work for different languages, which was a nice intro.
  • How to use regular expressions to do some NLP analysis.
  • What I really liked most was Minimum Edit Distance and the alignment of words. The problem is that implementing the algorithm yourself needs some problem solving, and I haven't practiced problem solving for a long time, so my skills are not there yet. I tried for 30 minutes to implement minimum edit distance but failed. So I opened neetcode, a website for LeetCode problems that is more organized and minimal, and solved the first 11 problems: 3 were very easy and the others were medium. The good thing is that I came up with solutions that are not in the solutions section, and my focus on using Python collections such as Counter is good. The code solves the problems, but it is not the fastest. Still, I feel good that my patience and thinking are getting better.
    ![[Screenshot_2025-12-14-10-35-27-07_e4424258c8b8649f6e67d283a50a2cbc.jpg]]
  • My plan for the month is to finish all 150 problems and do a lot of dynamic programming (DP), because minimum edit distance and Viterbi at the end of the first chapter are implemented with DP, and other things I know of also need DP.
  • After that I will start practicing on deep-ml.com, an alternative to LeetCode but for machine learning.
    ![[Pasted image 20251214122627.png]]

### Build a Keyword checker in Arabic

While learning about alignment I got a nice idea: I could build my own keyword spell checker for my language that is smarter than the current keyboards, with the following features:
  1. Recommend technical meanings
    1. If I type a term like "embedding model", it can suggest => نماذج التضمين
  2. Recommend keywords related to my persona (engineer, doctor, etc.)
  3. Federated learning keyboard, like Gboard
  4. Arabic grammar correction
  5. Automatic learning based on user interactions

I thought it would be nice to even implement this in Rust :) 🦀🦀

### References

  1. The NLP book
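Coming back to the minimum edit distance exercise from the chapter summary, and to the spell checker idea: below is a minimal sketch of the standard dynamic programming table for edit distance, plus a tiny suggestion helper built on it. The names `min_edit_distance` and `suggest`, the `max_distance` threshold, and the toy vocabulary are assumptions I made for illustration, not part of any real keyboard.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Edit distance via dynamic programming.

    Insertions and deletions cost 1. sub_cost=1 gives plain Levenshtein
    distance; sub_cost=2 treats a substitution as a delete plus an insert.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Base cases: turning a prefix into the empty string (and the reverse).
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                             # deletion
                dist[i][j - 1] + 1,                             # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost)  # match or substitution
            )
    return dist[-1][-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance of word, closest first."""
    scored = [(min_edit_distance(word, candidate), candidate) for candidate in vocabulary]
    return [candidate for distance, candidate in sorted(scored) if distance <= max_distance]


print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
print(suggest("مدرسه", ["مدرسة", "مدرس", "كتاب"]))
```

The table costs O(len(source) × len(target)) time and memory, which is fine for single words. A real keyboard would need a smarter candidate search than scanning the whole vocabulary.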