<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Abdelkareem Elkhateb</title>
<link>https://kareemai.com/blog/posts/nlp/embedding_world/</link>
<atom:link href="https://kareemai.com/blog/posts/nlp/embedding_world/index.xml" rel="self" type="application/rss+xml"/>
<description>Deep dives into text embeddings, vector databases, sparse retrieval, and semantic search. A curated collection of articles on building production-ready search systems.</description>
<image>
<url>https://kareemai.com/kareem.jpg</url>
<title>Abdelkareem Elkhateb</title>
<link>https://kareemai.com/blog/posts/nlp/embedding_world/</link>
</image>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 17 Apr 2026 22:00:00 GMT</lastBuildDate>
<item>
  <title>Vector Databases - O’reilly By Nitin Borwankar</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/vector_database_book.html</link>
  <description><![CDATA[ 





<section id="vector-databases-book" class="level2">
<h2 class="anchored" data-anchor-id="vector-databases-book">Vector Databases Book</h2>
<p>It’s a small book 292 pages category : Intermediate to Advanced.</p>
<p>I just finished reading it today it take me 2 days to finish reading it’s with my whole respect to the author.</p>
<p>A very bad book from multiple points i will share.</p>
<p>This is not related to the author knowledge or anything just my thought about the book.</p>
<section id="the-topics" class="level3">
<h3 class="anchored" data-anchor-id="the-topics">The topics</h3>
<ol type="1">
<li>Intro to databases</li>
<li>Embeddings</li>
<li>FAISS</li>
<li>SQLite3</li>
<li>POstgresSQL pygvector</li>
<li>SQLite and Ollama</li>
<li>Complete RAG system app</li>
<li>Vector Query Language.</li>
</ol>
<p>shiny and nice topics.</p>
</section>
</section>
<section id="vector-database-book-review." class="level2">
<h2 class="anchored" data-anchor-id="vector-database-book-review.">Vector database Book Review.</h2>
<p>When you read the book you would feel it’s not connected with each other in a nice way. it’s like listing some information from GitHub pages without the sense of teaching.</p>
<p>a lot of chapters listing info that is specific to Project which if i need it i will read GitHub page.</p>
<p>The Charts are not that helpful they don’t add more value just an overview it’s more like AI generated one. It maybe not AI but not much helpful.</p>
<p>You feel there is a gab between the depth of knowledge and the titles. shiny titles.</p>
<p>I hoped it was more detailed more to concepts with nice explaining that works with any vector database.</p>
<ul>
<li>Example the FAISS chapter :
<ul>
<li>it gives all the indexing types in FAISS and quick overview about quantization that is bad then benefit and trade offs this is style of book!</li>
</ul></li>
</ul>
</section>
<section id="my-recommendation." class="level2">
<h2 class="anchored" data-anchor-id="my-recommendation.">My Recommendation.</h2>
<p>Any Vector database Articles are far helpful and much better like :</p>
<p><a href="https://qdrant.tech/articles/">Quarto Blogs.</a></p>
<p><a href="https://weaviate.io/blog">Weaviate</a></p>
<section id="who-am-i" class="level3">
<h3 class="anchored" data-anchor-id="who-am-i">Who am I ?</h3>
<p>I am just 2 years experience in AI and RAG but it’s obvious it’s not good book.</p>
<p>This review is written by : <a href="https://kareemai.com/">Kareem Elkhateb AI Engineer</a></p>


</section>
</section>

 ]]></description>
  <category>blogging</category>
  <category>til</category>
  <category>blog/review/book</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/vector_database_book.html</guid>
  <pubDate>Fri, 17 Apr 2026 22:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/images/vector_database_book.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>HyperRun + ColGrep: A Self-Hosted Alternative to RunLLM</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/hyperrun.html</link>
  <description><![CDATA[ 





<p>Documentation sites are where developers go for answers, but finding what you need in a large codebase can be frustrating.</p>
<p>RunLLM solved this by adding an “Ask AI” chat widget to docs, letting users ask questions in natural language and get answers grounded in the actual code.</p>
<p>But RunLLM is closed-source and hosted.</p>
<p>If you want control over your data, your models, and your costs — you’re out of luck.</p>
<ul>
<li>maybe you want to let the user use its own AI model (BYOK)
<ul>
<li>It’s better for the user and no cost for you</li>
</ul></li>
<li>Way faster than the RunLLM</li>
</ul>
<div class="callout callout-note" data-callout="note">
<div class="callout-title-container">
<p><span class="callout-icon-container"><span class="callout-icon"></span></span><span class="callout-title">RunLLM is more than this</span></p>
</div>
<div class="callout-body">
<p>RunLLM is much more than just a chat with your docs. I first found it in the Documentation of DSPy, I liked it and i use it multiple times so i am trying to replicate this part only with other features as opensource</p>
</div>
</div>
<p>HyperRun is my attempt to build an open-source, self-hosted alternative.</p>
<p>It combines ColGrep’s semantic code search with LLM chat, and can be embedded in any docs site — Quarto, nbdev, MkDocs, or plain GitHub Pages — with a single line of code.</p>
<section id="what-is-hyperrun" class="level2">
<h2 class="anchored" data-anchor-id="what-is-hyperrun">What is HyperRun ?</h2>
<p>HyperRun lets developers add a semantic code search chat widget to their docs.</p>
<p>Built on <a href="https://github.com/lightonai/colgrep">ColGrep</a> for indexing, <a href="https://lisette.answer.ai">Lisette</a> for LLM orchestration, and <a href="https://fastht.ml">FastHTML</a> for the UI.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://kareemai.com/blog/posts/nlp/embedding_world/images/pylate_hyperun.jpg" class="img-fluid figure-img"></p>
<figcaption>pylate_hyperun</figcaption>
</figure>
</div>
<section id="hyperrun-features" class="level3">
<h3 class="anchored" data-anchor-id="hyperrun-features">HyperRun Features</h3>
<ul>
<li><p><strong>Semantic code search</strong> — powered by ColGrep’s late-interaction retrieval</p></li>
<li><p><strong>Streaming chat</strong> — SSE-based responses via HTMX</p></li>
<li><p><strong>Multi-provider</strong> — supports Anthropic, OpenAI, Google and more via LiteLLM</p></li>
<li><p><strong>BYOK</strong> — users bring their own API key</p></li>
<li><p><strong>Chat history</strong> — persistent conversations per session</p></li>
<li><p><strong>Cost tracking</strong> — per-session token cost display</p></li>
<li><p><strong>DaisyUI styled</strong> — floating widget with chat bubbles and markdown rendering</p></li>
<li><p><strong>Embeddable</strong> — drop into any static site (Quarto, MkDocs, GitHub Pages) with one snippet</p></li>
</ul>
</section>
</section>
<section id="why-colgrep" class="level2">
<h2 class="anchored" data-anchor-id="why-colgrep">Why ColGrep</h2>
<p>Most coding agents still use <code>grep</code> to search codebases.</p>
<p>It works — but it’s pure pattern matching.</p>
<p>If you don’t know the exact function name, you’re stuck guessing.</p>
<p>Semantic search (RAG) solves this but introduces problems:</p>
<ul>
<li><p>Requires remote storage of your code (security concern)</p></li>
<li><p>Needs a separate vector DB service running</p></li>
<li><p>Keeping the index in sync with a fast-moving codebase is hard</p></li>
</ul>
<p>ColGrep takes a completely different approach. It’s a Rust CLI tool built by LightOn that:</p>
<ul>
<li><p><strong>Mirrors the grep interface</strong> — agents already know how to use it</p></li>
<li><p><strong>Runs entirely locally</strong> — no remote storage, no separate API service</p></li>
<li><p><strong>Uses late-interaction retrieval</strong> (ColBERT-style) via LateOn-Code models — not traditional embeddings</p>
<ul>
<li>We are using their specialized coding retrieval model 17M</li>
</ul></li>
<li><p><strong>Supports hybrid queries</strong> — regex filtering first, then semantic ranking on the filtered results</p></li>
<li><p><strong>Incremental index updates</strong> — only re-indexes changed files, not the whole repo</p>
<ul>
<li>They are using a nice hashing trick 😎</li>
</ul></li>
<li><p><strong>Tree-sitter parsing</strong> — extracts functions, classes, signatures, call graphs — not just raw text chunks</p></li>
</ul>
<p>The results speak for themselves:</p>
<ul>
<li><p>Won <strong>70% of head-to-head comparisons</strong> against vanilla grep with Claude Code</p></li>
<li><p>Cut token usage by <strong>15.7%</strong> on average</p></li>
<li><p><strong>56% fewer search operations</strong> needed to find the right code</p></li>
<li><p>Hard conceptual questions (where you describe behavior, not function names) benefit the most</p></li>
</ul>
<div class="callout callout-note" data-callout="note">
<div class="callout-title-container">
<p><span class="callout-icon-container"><span class="callout-icon"></span></span><span class="callout-title">Late-interaction vs traditional embeddings</span></p>
</div>
<div class="callout-body">

</div>
</div>
<blockquote class="blockquote">
<p>Traditional embeddings compress an entire code block into a single vector — losing detail.</p>
</blockquote>
<blockquote class="blockquote">
<p>Late-interaction models (like ColBERT) keep per-token vectors, so they can do soft matching between query terms and code tokens.</p>
</blockquote>
<blockquote class="blockquote">
<p>This is why ColGrep handles “find the function that inserts articles into the database” better than a single-vector approach.</p>
</blockquote>
<p>For HyperRun, ColGrep is the retrieval layer — it finds the relevant code semantically, then Lisette sends it to the LLM for a conversational answer.</p>
</section>
<section id="how-it-works" class="level2">
<h2 class="anchored" data-anchor-id="how-it-works">How It Works</h2>
<p>HyperRun has two sides — the <strong>Doc Author</strong> who sets it up, and the <strong>End User</strong> who asks questions.</p>
<section id="doc-author-setup" class="level3">
<h3 class="anchored" data-anchor-id="doc-author-setup">Doc Author Setup</h3>
<ol type="1">
<li>Point HyperRun at your codebase via <code>.env</code></li>
<li>On startup, ColGrep indexes the repo automatically — <code>colgrep init</code> runs if no index exists</li>
<li>Add one line to your docs:</li>
</ol>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode javascript code-with-copy"><code class="sourceCode javascript"><span id="cb1-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>script src<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://your-server/embed"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&lt;/</span>script<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p>That’s it. Works with Quarto, nbdev, MkDocs, GitHub Pages — anything that serves HTML.</p>
</section>
<section id="what-happens-when-a-user-asks-a-question" class="level3">
<h3 class="anchored" data-anchor-id="what-happens-when-a-user-asks-a-question">What Happens When a User Asks a Question</h3>
<ol type="1">
<li>User clicks “💬 Ask AI” → floating chat panel opens</li>
<li>User types a question → sent to the server via HTMX</li>
<li>Server calls <code>colgrep</code> with the query → gets the top matching code snippets</li>
<li>Lisette’s <code>AsyncChat</code> sends the snippets + question to the LLM as a tool call</li>
<li>LLM response streams back via SSE → rendered as markdown in the chat bubble</li>
</ol>
<p><strong>The LLM never sees your whole codebase — only the relevant snippets ColGrep finds.</strong> This keeps context small, responses fast, and costs low.</p>
</section>
<section id="byok" class="level3">
<h3 class="anchored" data-anchor-id="byok">BYOK</h3>
<p>End users can click the ⚙️ settings icon, pick a provider (Anthropic, OpenAI, Google, etc.), enter their own API key, and choose a model — all dynamically populated from LiteLLM.</p>
</section>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s Next</h2>
<p>HyperRun is still early — here’s what I’m working on:</p>
<ul>
<li><p><strong>Citations</strong> — show which code snippets the LLM used to answer, so users can verify</p></li>
<li><p><strong>GitHub Actions</strong> — auto re-index on push, auto deploy the server — zero manual steps</p></li>
<li><p><strong>Better nbdev support</strong> — ColGrep struggles with <code>.ipynb</code> files, so we index the exported <code>.py</code> package for now</p></li>
<li><p><strong>Secure BYOK</strong> — move API keys from session cookies to server-side encrypted storage</p></li>
<li><p><strong>Switching embedding models</strong> — let doc authors choose which ColGrep model to use</p></li>
</ul>
<p>For a deep dive into the streaming UI architecture, see <a href="hyperun_fasthtml.html">HyperRun Deep Dive: FastHTML, HTMX, and SSE</a>.</p>
<p>If you want to try it or contribute: <a href="https://github.com/abdelkareemkobo/hyperrun">github.com/abdelkareemkobo/hyperrun</a></p>


</section>

 ]]></description>
  <category>blogging</category>
  <category>til</category>
  <category>blog/build/project</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/hyperrun.html</guid>
  <pubDate>Wed, 18 Mar 2026 22:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/images/colgrep.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>HyperRun Deep Dive: Streaming Chat Architecture with FastHTML, HTMX, and SSE</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/hyperun_fasthtml.html</link>
  <description><![CDATA[ 





<section id="ai-widget-with-htmx-vs-react-javascript" class="level2">
<h2 class="anchored" data-anchor-id="ai-widget-with-htmx-vs-react-javascript">AI Widget with HTMX vs React JavaScript</h2>
<p>Every AI chat widget I’ve seen is built the same way:</p>
<p>React frontend + WebSocket connection + custom state management + a build step.</p>
<p>For something that’s basically “send text, get text back”, that’s a lot of machinery.</p>
<p>FastHTML + HTMX gives you a different deal:</p>
<ul>
<li><p>Server renders the HTML — no client-side framework</p></li>
<li><p>HTMX handles interactions — no custom JS for fetching/swapping</p></li>
<li><p>SSE handles streaming — simpler than WebSockets for one-directional data</p></li>
<li><p>Python all the way down — the UI is just functions returning FT components</p></li>
</ul>
<p>The entire HyperRun UI is one Python file.</p>
<p><strong>No <code>package.json</code>, no <code>node_modules</code>, no build step.</strong></p>
<hr>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb1-1"></span>
<span id="cb1-2">      ┌─────────────────────────────────────────────────────────┐</span>
<span id="cb1-3">      │                    Doc Author's Site                    │</span>
<span id="cb1-4">      │  (Quarto / nbdev / MkDocs / GitHub Pages)               │</span>
<span id="cb1-5">      │                                                         │</span>
<span id="cb1-6">      │  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">script</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;"> src</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://server/embed"</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">script</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span>           │</span>
<span id="cb1-7">      │         │                                               │</span>
<span id="cb1-8">      │         ▼                                               │</span>
<span id="cb1-9">      │  ┌─────────────┐                                        │</span>
<span id="cb1-10">      │  │ 💬 Ask AI   │ ◄── floating button (injected by JS)   │</span>
<span id="cb1-11">      │  └──────┬──────┘                                        │</span>
<span id="cb1-12">      │         │ click                                         │</span>
<span id="cb1-13">      │         ▼                                               │</span>
<span id="cb1-14">      │  ┌─────────────────────────┐                            │</span>
<span id="cb1-15">      │  │  iframe                 │                            │</span>
<span id="cb1-16">      │  │  /chat-standalone       │──────────────────┐         │</span>
<span id="cb1-17">      │  │                         │                  │         │</span>
<span id="cb1-18">      │  └─────────────────────────┘                  │         │</span>
<span id="cb1-19">      └───────────────────────────────────────────────┼─────────┘</span>
<span id="cb1-20">                                                      │</span>
<span id="cb1-21">                                                      ▼</span>
<span id="cb1-22">                                        ┌────────────────────┐</span>
<span id="cb1-23">                                        │  HyperRun Server   │</span>
<span id="cb1-24">                                        │  (FastHTML + HTMX) │</span>
<span id="cb1-25">                                        └────────────────────┘</span></code></pre></div></div>
</section>
<section id="fasthtml-app-setup" class="level2">
<h2 class="anchored" data-anchor-id="fasthtml-app-setup">FastHTML App Setup</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb2-1"></span>
<span id="cb2-2">  User types "what does insert_article do?"</span>
<span id="cb2-3">        │</span>
<span id="cb2-4">        ▼</span>
<span id="cb2-5">  ┌──────────┐  POST /ask   ┌──────────────┐</span>
<span id="cb2-6">  │  Browser │────────────►│  FastHTML     │</span>
<span id="cb2-7">  │  (HTMX)  │             │  /ask route   │</span>
<span id="cb2-8">  └──────────┘              └──────┬───────┘</span>
<span id="cb2-9">        ▲                          │</span>
<span id="cb2-10">        │                          ▼</span>
<span id="cb2-11">        │                   ┌──────────────┐</span>
<span id="cb2-12">        │                   │ Returns:     │</span>
<span id="cb2-13">        │                   │ • user bubble│</span>
<span id="cb2-14">        │                   │ • ai bubble  │</span>
<span id="cb2-15">        │                   │   (with SSE) │</span>
<span id="cb2-16">        │                   │ • new input  │</span>
<span id="cb2-17">        │                   │   (OOB swap) │</span>
<span id="cb2-18">        │                   └──────┬───────┘</span>
<span id="cb2-19">        │                          │</span>
<span id="cb2-20">        │  SSE connects to         │</span>
<span id="cb2-21">        │  /stream?query=...       ▼</span>
<span id="cb2-22">        │                   ┌──────────────┐</span>
<span id="cb2-23">        │  sse: message     │  /stream     │</span>
<span id="cb2-24">        │◄──────────────────│  async gen   │</span>
<span id="cb2-25">        │  sse: message     │      │       │</span>
<span id="cb2-26">        │◄──────────────────│      ▼       │</span>
<span id="cb2-27">        │  sse: message     │  ┌────────┐  │</span>
<span id="cb2-28">        │◄──────────────────│  │ColGrep │  │</span>
<span id="cb2-29">        │  sse: close       │  │search  │  │</span>
<span id="cb2-30">        │◄──────────────────│  └───┬────┘  │</span>
<span id="cb2-31">        │                   │      │       │</span>
<span id="cb2-32">        ▼                   │      ▼       │</span>
<span id="cb2-33">    marked.parse()          │  ┌────────┐  │</span>
<span id="cb2-34">    renders markdown        │  │Lisette │  │</span>
<span id="cb2-35">                            │  │AsyncChat│ │</span>
<span id="cb2-36">                            │  │+ tools │  │</span>
<span id="cb2-37">                            │  └────────┘  │</span>
<span id="cb2-38">                            └──────────────┘</span></code></pre></div></div>
<p>Setting up the app means picking your CSS framework and loading the right headers:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fasthtml.common <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fhdaisy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> fastlucide <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SvgSprites, SvgStyle</span>
<span id="cb3-4"></span>
<span id="cb3-5">icons <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SvgSprites()</span>
<span id="cb3-6"></span>
<span id="cb3-7">app, rt <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> fast_app(</span>
<span id="cb3-8">    pico<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb3-9">    hdrs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(</span>
<span id="cb3-10">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>daisy_hdrs,</span>
<span id="cb3-11">        SvgStyle(),</span>
<span id="cb3-12">        CHAT_CSS,</span>
<span id="cb3-13">        Script(src<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://unpkg.com/htmx-ext-sse@2.2.3/sse.js"</span>),</span>
<span id="cb3-14">        Script(src<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://cdn.jsdelivr.net/npm/marked/marked.min.js"</span>),</span>
<span id="cb3-15">    ))</span></code></pre></div></div>
<p>A few things to note:</p>
<ul>
<li><code>pico=False</code> — FastHTML includes PicoCSS by default, we swap it for DaisyUI</li>
<li><code>daisy_hdrs</code> — from <code>fhdaisy</code>, loads Tailwind + DaisyUI via CDN</li>
<li><code>SvgSprites()</code> — from <code>fastlucide</code>, gives us Lucide icons as inline SVGs</li>
<li>SSE extension and <code>marked.js</code> are the only external JS we load</li>
</ul>
<p>That’s the whole frontend stack.</p>
<p>No bundler needed.</p>
</section>
<section id="the-chat-widget-architecture" class="level2">
<h2 class="anchored" data-anchor-id="the-chat-widget-architecture">The Chat Widget Architecture</h2>
<section id="modal-with-checkbox-toggle" class="level3">
<h3 class="anchored" data-anchor-id="modal-with-checkbox-toggle">Modal with Checkbox Toggle</h3>
<p>DaisyUI has a modal component that opens/closes with a hidden checkbox — pure CSS, no JS:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> chat_widget():</span>
<span id="cb4-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Div(</span>
<span id="cb4-3">        icons,</span>
<span id="cb4-4">        Label(</span>
<span id="cb4-5">            Span(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inline-flex items-center gap-2"</span>)(</span>
<span id="cb4-6">                icons(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"message-circle"</span>, sz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>),</span>
<span id="cb4-7">                Span(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Ask AI"</span>, cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"font-semibold tracking-tight"</span>)),</span>
<span id="cb4-8">            fr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat-modal"</span>,</span>
<span id="cb4-9">            cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"btn btn-primary rounded-full fixed bottom-6 right-6 z-50"</span>),</span>
<span id="cb4-10">        Input(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"checkbox"</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat-modal"</span>, cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal-toggle"</span>),</span>
<span id="cb4-11">        Div(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal"</span>)(</span>
<span id="cb4-12">            chat_content(),</span>
<span id="cb4-13">            Label(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal-backdrop"</span>, fr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat-modal"</span>)))</span></code></pre></div></div>
<p>Click the <code>Label</code> → toggles the checkbox → CSS shows/hides the modal.</p>
<p>Click the backdrop → same checkbox → closes it. Zero JavaScript.</p>
<blockquote class="blockquote">
<p><strong>Gotcha:</strong> The <code>Input(modal-toggle)</code> must be immediately before the <code>Div(modal)</code> — DaisyUI uses a CSS sibling selector.</p>
<p>We learned this the hard way.</p>
</blockquote>
</section>
<section id="reusable-component" class="level3">
<h3 class="anchored" data-anchor-id="reusable-component">Reusable Component</h3>
<p>The same <code>chat_content()</code> function powers both the modal widget and the standalone page:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> chat_content(standalone<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb5-2">    box_cls <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w-full h-screen p-0 bg-base-100"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> standalone <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal-box w-11/12 max-w-4xl p-0 ..."</span></span>
<span id="cb5-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Div(</span>
<span id="cb5-4">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># header, messages, input form</span></span>
<span id="cb5-5">        cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>box_cls)</span></code></pre></div></div>
<p>One component, two contexts — no duplication.</p>
</section>
<section id="chat-bubbles" class="level3">
<h3 class="anchored" data-anchor-id="chat-bubbles">Chat Bubbles</h3>
<p>DaisyUI’s <code>chat</code> component gives us styled message bubbles for free:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> user_bubble(msg):</span>
<span id="cb6-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Div(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat chat-end"</span>)(</span>
<span id="cb6-3">        Div(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"You"</span>, cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat-header"</span>),</span>
<span id="cb6-4">        Div(msg, cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat-bubble chat-bubble-primary"</span>))</span></code></pre></div></div>
<p><code>chat-end</code> = right-aligned (user).</p>
<p><code>chat-start</code> = left-aligned (AI).</p>
<p>DaisyUI handles the speech bubble shapes, spacing, everything.</p>
</section>
</section>
<section id="streaming-with-sse-htmx" class="level2">
<h2 class="anchored" data-anchor-id="streaming-with-sse-htmx">Streaming with SSE + HTMX</h2>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb7-1"></span>
<span id="cb7-2">**Streaming Life Cycle**</span>
<span id="cb7-3"></span>
<span id="cb7-4">  Browser (HTMX)                          Server (FastHTML)</span>
<span id="cb7-5">       │                                        │</span>
<span id="cb7-6">       │  ── SSE connect /stream ──────────►    │</span>
<span id="cb7-7">       │                                        │</span>
<span id="cb7-8">       │                              AsyncChat(query, stream=True)</span>
<span id="cb7-9">       │                                        │</span>
<span id="cb7-10">       │  ◄── sse: message (Span("I'll")) ──    │</span>
<span id="cb7-11">       │  ◄── sse: message (Span(" search"))──  │</span>
<span id="cb7-12">       │  ◄── sse: message (Span(" the")) ──    │</span>
<span id="cb7-13">       │  ◄── sse: message (Span(" code"))──    │</span>
<span id="cb7-14">       │           ...                          │</span>
<span id="cb7-15">       │  ◄── sse: message (Span("...")) ──     │</span>
<span id="cb7-16">       │  ◄── sse: close (Span("")) ────────    │</span>
<span id="cb7-17">       │                                        │</span>
<span id="cb7-18">       │  disconnect                            │</span>
<span id="cb7-19">       │                                        │</span>
<span id="cb7-20">       ▼                                        </span>
<span id="cb7-21">  hx_on__sse_close fires:</span>
<span id="cb7-22">  this.innerHTML = marked.parse(this.textContent)</span>
<span id="cb7-23">  </span>
<span id="cb7-24">  Raw text ──► Rendered markdown HTML</span></code></pre></div></div>
<p>Why SSE over WebSockets</p>
<p>Chat is one-directional — the server streams tokens to the client.</p>
<p>WebSockets are bidirectional, which is overkill here.</p>
<p>SSE is:</p>
<ul>
<li>Built into browsers natively</li>
<li>Supported by HTMX via a small extension</li>
<li>Dead simple in FastHTML — just return an EventStream</li>
</ul>
<section id="the-sse-lifecycle" class="level3">
<h3 class="anchored" data-anchor-id="the-sse-lifecycle">The SSE Lifecycle</h3>
<p>The AI response bubble sets up the SSE connection:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ai_bubble_stream(query):</span>
<span id="cb8-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Div(</span>
<span id="cb8-3">        Span(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loading loading-dots loading-sm"</span>),</span>
<span id="cb8-4">        hx_ext<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sse"</span>,</span>
<span id="cb8-5">        sse_connect<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stream_response.to(query<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>query),</span>
<span id="cb8-6">        sse_swap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"message"</span>,</span>
<span id="cb8-7">        sse_close<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"close"</span>,</span>
<span id="cb8-8">        hx_swap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"beforeend"</span>,</span>
<span id="cb8-9">        hx_on__sse_close<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"this.innerHTML=marked.parse(this.textContent)"</span>,</span>
<span id="cb8-10">        cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"chat-bubble bg-base-200"</span>)</span></code></pre></div></div>
<p>Here’s what each attribute does:</p>
<ul>
<li><p>hx_ext=“sse” — activates the SSE extension on this element</p></li>
<li><p>sse_connect — URL to connect to, built with .to() which handles URL encoding</p></li>
<li><p>sse_swap=“message” — swap in content when an event named message arrives</p></li>
<li><p>sse_close=“close” — disconnect when an event named close arrives</p></li>
<li><p>hx_swap=“beforeend” — append each chunk (don’t replace)</p></li>
<li><p>hx_on__sse_close — when stream ends, render the accumulated text as markdown</p></li>
</ul>
</section>
<section id="the-server-side" class="level3">
<h3 class="anchored" data-anchor-id="the-server-side">The Server Side</h3>
<p>FastHTML makes SSE trivial — an async generator that yields sse_message():</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@rt</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/stream"</span>)</span>
<span id="cb9-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> stream_response(query: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, sess):</span>
<span id="cb9-3">    chat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> chats[sid]</span>
<span id="cb9-4"></span>
<span id="cb9-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate():</span>
<span id="cb9-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> chunk <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">await</span> chat(query, stream<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>):</span>
<span id="cb9-7">            text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> chunk.choices[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].delta.content</span>
<span id="cb9-8">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> text:</span>
<span id="cb9-9">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> sse_message(Span(text))</span>
<span id="cb9-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> sse_message(Span(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>), event<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"close"</span>)</span>
<span id="cb9-11"></span>
<span id="cb9-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> EventStream(generate())</span></code></pre></div></div>
<p>Each sse_message(Span(text)) sends a chunk of HTML. HTMX appends it into the bubble.</p>
<p>When we’re done, event=“close” tells HTMX to disconnect.</p>
</section>
<section id="markdown-rendering" class="level3">
<h3 class="anchored" data-anchor-id="markdown-rendering">Markdown Rendering</h3>
<p>During streaming, the text arrives as plain chunks.</p>
<p>We can’t render markdown mid-stream because a <strong>bold</strong> might be split across two chunks.</p>
<p>The solution: accumulate raw text during streaming, then render it all at once when the stream closes.</p>
<p>marked.parse() on the client converts the accumulated textContent to HTML.</p>
<blockquote class="blockquote">
<p>Gotcha: SSE auto-reconnects by default. Without sse_close=“close”, the connection reopens and you get triple responses.</p>
</blockquote>
</section>
</section>
<section id="htmx-patterns-we-used" class="level2">
<h2 class="anchored" data-anchor-id="htmx-patterns-we-used">HTMX Patterns We Used</h2>
<section id="oob-swaps-clearing-the-input" class="level3">
<h3 class="anchored" data-anchor-id="oob-swaps-clearing-the-input">OOB Swaps — Clearing the Input</h3>
<p>After the user submits a question, the input should clear.</p>
<p>The HTMX-idiomatic way is an out-of-band swap — return a fresh empty input from the server alongside the response:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@rt</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/ask"</span>)</span>
<span id="cb10-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> ask(query: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, sess):</span>
<span id="cb10-3">    new_inp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Input(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query"</span>, placeholder<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Ask a follow-up question..."</span>,</span>
<span id="cb10-4">                    hx_swap_oob<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"true"</span>)</span>
<span id="cb10-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Div(user_bubble(query), ai_bubble_stream(query)), new_inp</span></code></pre></div></div>
<p>hx_swap_oob=“true” tells HTMX: “find the element with this id on the page and replace it.” No JavaScript needed.</p>
<p>The main response goes to #messages, the input replacement happens out-of-band.</p>
</section>
<section id="form-submit-free-enter-key" class="level3">
<h3 class="anchored" data-anchor-id="form-submit-free-enter-key">Form Submit — Free Enter Key</h3>
<p>Instead of putting hx_get on the button and wiring up keyboard events, wrap everything in a Form:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">Form(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"join w-full"</span>, hx_post<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ask, hx_target<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#messages"</span>, hx_swap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"beforeend"</span>)(</span>
<span id="cb11-2">    Input(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query"</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query"</span>),</span>
<span id="cb11-3">    Button(icons(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"send"</span>, sz<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>)))</span></code></pre></div></div>
<p>Forms submit on Enter automatically. One less thing to build.</p>
</section>
<section id="to-for-safe-urls" class="level3">
<h3 class="anchored" data-anchor-id="to-for-safe-urls">.to() for Safe URLs</h3>
<p>FastHTML route functions have a .to() method that generates URL-encoded paths:</p>
<p>sse_connect=stream_response.to(query=query, msg_id=msg_id)</p>
<p>No manual urllib.parse.quote(). It handles spaces, special characters, everything.</p>
</section>
<section id="session-based-state" class="level3">
<h3 class="anchored" data-anchor-id="session-based-state">Session-Based State</h3>
<p>FastHTML’s session (signed cookie) stores per-user state — selected model, API key, cost tracking:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@rt</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/stream"</span>)</span>
<span id="cb12-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">async</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> stream_response(query: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, sess):</span>
<span id="cb12-3">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sess.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"claude-sonnet-4-20250514"</span>)</span>
<span id="cb12-4">    api_key <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sess.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"api_key"</span>, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span></code></pre></div></div>
<p>Chat history lives in a server-side dict keyed by session ID:</p>
<p>chats = {} # session_id → AsyncChat instance</p>
<p>The AsyncChat from Lisette maintains conversation history internally — just reuse the same instance and it remembers everything.</p>
</section>
</section>
<section id="the-embed-trick" class="level2">
<h2 class="anchored" data-anchor-id="the-embed-trick">The Embed Trick</h2>
<section id="dynamic-embed-route" class="level3">
<h3 class="anchored" data-anchor-id="dynamic-embed-route">Dynamic /embed Route</h3>
<p>The hardest part of embedding a widget on external sites is the server URL — you don’t want the doc author manually editing URLs. So HyperRun serves its own embed script:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@rt</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/embed"</span>)</span>
<span id="cb13-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> embed_js(req):</span>
<span id="cb13-3">    server <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>req<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>url<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>scheme<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">://</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>req<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>url<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>netloc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb13-4">    js <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"""document.addEventListener('DOMContentLoaded',function()</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">{{</span></span>
<span id="cb13-5"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">      // creates floating button + iframe pointing to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>server<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">/chat-standalone</span></span>
<span id="cb13-6"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">    </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">}}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">);"""</span></span>
<span id="cb13-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Response(js, media_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'application/javascript; charset=utf-8'</span>)</span></code></pre></div></div>
<p>The key: req.url.scheme and req.url.netloc give us the server’s own URL.</p>
<p>Deploy to <code>https://myapp.fly.dev</code> and the generated JS automatically uses that URL.</p>
<p>Zero configuration for the doc author.</p>
</section>
<section id="chat-standalone-full-page-for-iframe" class="level3">
<h3 class="anchored" data-anchor-id="chat-standalone-full-page-for-iframe">/chat-standalone — Full Page for iframe</h3>
<p>The same chat_content() component gets wrapped in a complete HTML page with its own HTMX and DaisyUI headers:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@rt</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/chat-standalone"</span>)</span>
<span id="cb14-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> chat_standalone():</span>
<span id="cb14-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> Html(</span>
<span id="cb14-4">        Head(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>daisy_hdrs, SvgStyle(), CHAT_CSS,</span>
<span id="cb14-5">             Script(src<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;https://unpkg.com/htmx.org&gt;"</span>),</span>
<span id="cb14-6">             Script(src<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;https://unpkg.com/htmx-ext-sse@2.2.3/sse.js&gt;"</span>),</span>
<span id="cb14-7">             Script(src<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;https://cdn.jsdelivr.net/npm/marked/marked.min.js&gt;"</span>)),</span>
<span id="cb14-8">        Body(icons, chat_content(standalone<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)))</span></code></pre></div></div>
<p>The iframe is fully self-contained — it doesn’t depend on anything from the host page.</p>
</section>
<section id="cross-origin-close-button" class="level3">
<h3 class="anchored" data-anchor-id="cross-origin-close-button">Cross-Origin Close Button</h3>
<p>In the modal version, the close button is a Label that toggles a checkbox.</p>
<p>In the iframe, there’s no checkbox.</p>
<p>So the standalone close button sends a postMessage to the parent page:</p>
<p>Button(icons(“x”), onclick=“window.parent.postMessage(‘hyperrun-close’,’*’)“)</p>
<p>And the /embed script listens for it:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">window.addEventListener(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'message'</span>, function(e) {</span>
<span id="cb15-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (e.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">===</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hyperrun-close'</span>)</span>
<span id="cb15-3">        document.getElementById(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hr-frame'</span>).style.display <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'none'</span></span>
<span id="cb15-4">})<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div></div>
</section>
<section id="one-line-to-embed-anywhere" class="level3">
<h3 class="anchored" data-anchor-id="one-line-to-embed-anywhere">One Line to Embed Anywhere</h3>
<p>The doc author’s entire setup:</p>
<section id="quartonbdev-_quarto.yml" class="level4">
<h4 class="anchored" data-anchor-id="quartonbdev-_quarto.yml">Quarto/nbdev (_quarto.yml)</h4>
<p>include-after-body:</p>
<ul>
<li>text: ’
<script src="https://your-server/embed"></script>
’</li>
</ul>
</section>
<section id="mkdocs-mkdocs.yml" class="level4">
<h4 class="anchored" data-anchor-id="mkdocs-mkdocs.yml">MkDocs (mkdocs.yml)</h4>
<p>extra_javascript:</p>
<ul>
<li><a href="https://your-server/embed" class="uri">https://your-server/embed</a></li>
</ul>
</section>
<section id="any-html-page" class="level4">
<h4 class="anchored" data-anchor-id="any-html-page">Any HTML page</h4>
<script src="https://your-server/embed"></script>
<p>One line. Works everywhere.</p>
</section>
</section>
</section>
<section id="lessons-learned-using-streaming-htmx" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned-using-streaming-htmx">Lessons Learned Using Streaming HTMX</h2>
<p>Building HyperRun was mostly smooth — but a few things bit us.</p>
<p>Here’s what to watch for.</p>
<section id="sse-auto-reconnect-triple-responses" class="level3">
<h3 class="anchored" data-anchor-id="sse-auto-reconnect-triple-responses">SSE Auto-Reconnect = Triple Responses</h3>
<p>The HTMX SSE extension reconnects automatically when the connection closes.</p>
<p>If you don’t explicitly tell it to stop, it reconnects and the LLM response streams three times.</p>
<p>The fix: <code>sse_close="close"</code> on the element, and send <code>event="close"</code> as the last SSE message:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> sse_message(Span(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>), event<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"close"</span>)</span></code></pre></div></div>
<p>Simple, but took a while to figure out why every answer appeared three times.</p>
</section>
<section id="daisyui-modal-checkbox-ordering" class="level3">
<h3 class="anchored" data-anchor-id="daisyui-modal-checkbox-ordering">DaisyUI Modal Checkbox Ordering</h3>
<p>DaisyUI’s checkbox modal uses a CSS sibling selector — the <code>modal-toggle</code> input must be <strong>immediately before</strong> the <code>modal</code> div.</p>
<p>Put anything between them and the modal silently breaks.</p>
<p>This doesn’t work:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1">Input(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal-toggle"</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ✓ checkbox</span></span>
<span id="cb17-2">Label(...)                  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ✗ this breaks the sibling selector</span></span>
<span id="cb17-3">Div(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal"</span>)(...)       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># modal never opens</span></span></code></pre></div></div>
<p>This works:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1">Label(...)                  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># button can go anywhere before</span></span>
<span id="cb18-2">Input(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal-toggle"</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ✓ immediately before modal</span></span>
<span id="cb18-3">Div(cls<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modal"</span>)(...)       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ✓ works</span></span></code></pre></div></div>
<p>No error, no warning. Just a modal that doesn’t open.</p>
</section>
<section id="markdown-rendering-timing" class="level3">
<h3 class="anchored" data-anchor-id="markdown-rendering-timing">Markdown Rendering Timing</h3>
<p>You can’t render markdown per-chunk during streaming — a <code>**bold**</code> might arrive as <code>**bo</code> in one chunk and <code>ld**</code> in the next.</p>
<p>We tried server-side rendering with mistletoe, client-side with <code>marked.js</code>, and various event-driven approaches.</p>
<p>The simplest winner: accumulate plain text during streaming, then <code>marked.parse(this.textContent)</code> on <code>sse_close</code>. One line.</p>
</section>
<section id="returns-are-falsy-in-fasthtml" class="level3">
<h3 class="anchored" data-anchor-id="returns-are-falsy-in-fasthtml"><code>""</code> Returns Are Falsy in FastHTML</h3>
<p>When we needed a route to clear content (like closing the chat panel), returning <code>""</code> didn’t work — FastHTML treats empty strings as falsy.</p>
<p>Returning <code>Div()</code> instead gives you a proper empty element that HTMX swaps in correctly.</p>
</section>
</section>
<section id="final-thoughts" class="level2">
<h2 class="anchored" data-anchor-id="final-thoughts">Final Thoughts</h2>
<p>The stack — FastHTML + HTMX + SSE + DaisyUI — turned out to be surprisingly productive for building a real-time chat widget. No build step, no React, no client-side state management. Just Python functions returning HTML.</p>
<p>You can read more about other related blogs here:</p>
<ol type="1">
<li><p><a href="https://kareemai.com/blog/posts/nlp/embedding_world/hyperrun.html">HyperRun</a></p></li>
<li><p><a href="https://kareemai.com/blog/posts/seo/seo_rat_journey.html">SeoRat</a></p></li>
<li><p><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_from_scratch.html">BM25 Search</a></p></li>
<li><p><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_arabic_qdrant.html">BM25 Explained</a></p></li>
<li><p><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_benchmark_full.html">BM25 Benchmark</a></p></li>
</ol>


</section>

 ]]></description>
  <category>blogging</category>
  <category>til</category>
  <category>blog/build/project</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/hyperun_fasthtml.html</guid>
  <pubDate>Wed, 18 Mar 2026 22:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/images/colgrep.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>BM25 Part 3: Benchmarking Python BM25 Libraries</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_benchmark_full.html</link>
  <description><![CDATA[ 





<div id="Hbol" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> marimo <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> mo</span></code></pre></div></div>
</div>
<div id="MJUe" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> math <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> log</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.sparse <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> csr_matrix</span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb2-5"></span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> tokenize(text):</span>
<span id="cb2-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> text.lower().split()</span></code></pre></div></div>
</div>
<section id="dataset-benchmark" class="level3">
<h3 class="anchored" data-anchor-id="dataset-benchmark">Dataset Benchmark</h3>
<p>I will continue exploring the bm25 and how to use with a huggingface dataset with multiple implementations and compare their speed</p>
<p>The methods i will use:</p>
<ol type="1">
<li>Normal Python Calculations</li>
<li>Spicy sparse matrix with python loops</li>
<li>Spicy sparse matrix with Numpy Vectorization</li>
<li>Rank_bm25</li>
<li>bm25-sparse</li>
<li>Polars optimization</li>
</ol>
<p>If there is any mistake please tell me!</p>
<p>it’s first time for me to create a benchmark for anything</p>
<div id="bkHC" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> load_dataset</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> time</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> random</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load dataset</span></span>
<span id="cb3-6">dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_dataset(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rotten_tomatoes"</span>, split<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"train"</span>)</span>
<span id="cb3-7">corpus <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [item[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dataset]</span>
<span id="cb3-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Corpus size: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> reviews"</span>)</span>
<span id="cb3-9"></span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> generate_random_queries(corpus, num_queries<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>):</span>
<span id="cb3-12">    queries <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb3-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_queries):</span>
<span id="cb3-14">        doc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random.choice(corpus)</span>
<span id="cb3-15">        words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(doc)</span>
<span id="cb3-16">        num_words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb3-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(words) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> num_words:</span>
<span id="cb3-18">            query_words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random.sample(words, num_words)</span>
<span id="cb3-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb3-20">            query_words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> words</span>
<span id="cb3-21">        queries.append(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>.join(query_words))</span>
<span id="cb3-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> queries</span>
<span id="cb3-23"></span>
<span id="cb3-24"></span>
<span id="cb3-25">test_queries <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> generate_random_queries(corpus, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>)</span>
<span id="cb3-26"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Generated </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> test queries"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Corpus size: 8530 reviews
Generated 1000 test queries</code></pre>
</div>
</div>
<div id="lEQa" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> benchmark_python_loops():</span>
<span id="cb5-2">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=== 1. NORMAL PYTHON (LOOPS) ==="</span>)</span>
<span id="cb5-3">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb5-4"></span>
<span id="cb5-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tokenize all documents</span></span>
<span id="cb5-6">    all_docs_terms_py <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb5-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corpus:</span>
<span id="cb5-8">        tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(doc)</span>
<span id="cb5-9">        all_docs_terms_py.append(Counter(tokens))</span>
<span id="cb5-10"></span>
<span id="cb5-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute DF</span></span>
<span id="cb5-12">    df_py <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb5-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> all_docs_terms_py:</span>
<span id="cb5-14">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> term <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc_tf.keys():</span>
<span id="cb5-15">            df_py[term] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_py.get(term, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb5-16"></span>
<span id="cb5-17">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Average doc length</span></span>
<span id="cb5-18">    total_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(doc_tf.values()) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> all_docs_terms_py)</span>
<span id="cb5-19">    avg_doc_len_py <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> total_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus)</span>
<span id="cb5-20"></span>
<span id="cb5-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># IDF function</span></span>
<span id="cb5-22">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> compute_idf_py(term, df_dict, N):</span>
<span id="cb5-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> term <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> df_dict:</span>
<span id="cb5-24">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb5-25">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> np.log((N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> df_dict[term] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (df_dict[term] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span>
<span id="cb5-26"></span>
<span id="cb5-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># BM25 function</span></span>
<span id="cb5-28">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> bm25_score_py(term, doc_tf, idf_val, doc_len, avg_len, k1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span>, b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>):</span>
<span id="cb5-29">        tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> doc_tf.get(term, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb5-30">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb5-31">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb5-32">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> idf_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (doc_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> avg_len)))</span>
<span id="cb5-33"></span>
<span id="cb5-34">    index_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb5-35"></span>
<span id="cb5-36">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Query</span></span>
<span id="cb5-37">    start_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb5-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_queries:</span>
<span id="cb5-39">        query_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(query)</span>
<span id="cb5-40">        doc_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus)</span>
<span id="cb5-41">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> token <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> query_tokens:</span>
<span id="cb5-42">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> token <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> df_py:</span>
<span id="cb5-43">                idf_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_idf_py(token, df_py, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus))</span>
<span id="cb5-44">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_idx, doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_docs_terms_py):</span>
<span id="cb5-45">                    doc_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(doc_tf.values())</span>
<span id="cb5-46">                    doc_scores[doc_idx] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> bm25_score_py(</span>
<span id="cb5-47">                        token, doc_tf, idf_val, doc_len, avg_doc_len_py</span>
<span id="cb5-48">                    )</span>
<span id="cb5-49">        top_k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sorted</span>(</span>
<span id="cb5-50">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(doc_scores)), key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> i: doc_scores[i], reverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb5-51">        )[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>]</span>
<span id="cb5-52"></span>
<span id="cb5-53">    query_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_query</span>
<span id="cb5-54">    qps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> query_time</span>
<span id="cb5-55"></span>
<span id="cb5-56">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Indexing: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>index_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb5-57">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb5-58">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"QPS: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>qps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-59"></span>
<span id="cb5-60">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> index_time, query_time, qps</span>
<span id="cb5-61"></span>
<span id="cb5-62"></span>
<span id="cb5-63">benchmark_python_loops()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>=== 1. NORMAL PYTHON (LOOPS) ===
Indexing: 0.0801s
Query time: 23.6460s
QPS: 42.29
</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>(0.08005261421203613, 23.645965099334717, 42.29051323551751)</code></pre>
</div>
</div>
</section>
<section id="why-sparse-matrices" class="level2">
<h2 class="anchored" data-anchor-id="why-sparse-matrices">Why Sparse Matrices?</h2>
<p>Right now, to search for “python programming”, you’d need to: 1. Loop through each document 2. Calculate BM25 score for “python” 3. Calculate BM25 score for “programming” 4. Add them up</p>
<p>That’s slow for large datasets!</p>
<p><strong>The key insight:</strong> We can pre-compute ALL BM25 scores for ALL words in ALL documents and store them in a matrix. Then searching becomes just looking up rows and adding them.</p>
<p>Here’s what the matrix looks like conceptually:</p>
<pre><code>               Doc0   Doc1   Doc2
python         -0.51  -0.56   0.00
programming    -1.95  -2.14  -1.79
java            0.00   0.00   2.20</code></pre>
<div id="Xref" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> benchmark_sparse_python_loops():</span>
<span id="cb9-2">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=== 2. SPARSE MATRIX WITH PYTHON LOOPS ==="</span>)</span>
<span id="cb9-3">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb9-4"></span>
<span id="cb9-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tokenize</span></span>
<span id="cb9-6">    all_docs_terms_sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb9-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corpus:</span>
<span id="cb9-8">        tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(doc)</span>
<span id="cb9-9">        all_docs_terms_sp.append(Counter(tokens))</span>
<span id="cb9-10"></span>
<span id="cb9-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get all words and create word_to_idx</span></span>
<span id="cb9-12">    all_words_sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(word <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> all_docs_terms_sp <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc))</span>
<span id="cb9-13">    word_to_idx_sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {word: idx <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx, word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_words_sp)}</span>
<span id="cb9-14"></span>
<span id="cb9-15">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute DF</span></span>
<span id="cb9-16">    df_sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb9-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> all_docs_terms_sp:</span>
<span id="cb9-18">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> term <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc_tf.keys():</span>
<span id="cb9-19">            df_sp[term] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_sp.get(term, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb9-20"></span>
<span id="cb9-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Average doc length</span></span>
<span id="cb9-22">    avg_doc_len_sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(doc_tf.values()) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> all_docs_terms_sp) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(</span>
<span id="cb9-23">        corpus</span>
<span id="cb9-24">    )</span>
<span id="cb9-25"></span>
<span id="cb9-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Build sparse matrix</span></span>
<span id="cb9-27">    rows_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb9-28">    cols_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb9-29">    data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb9-30"></span>
<span id="cb9-31">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word_idx, word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_words_sp):</span>
<span id="cb9-32">        idf_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.log((<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> df_sp[word] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (df_sp[word] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span>
<span id="cb9-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_idx, doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_docs_terms_sp):</span>
<span id="cb9-34">            doc_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(doc_tf.values())</span>
<span id="cb9-35">            tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> doc_tf.get(word, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-36">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb9-37">                score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb9-38">                    idf_val</span>
<span id="cb9-39">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tf</span>
<span id="cb9-40">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.5</span></span>
<span id="cb9-41">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (doc_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> avg_doc_len_sp)))</span>
<span id="cb9-42">                )</span>
<span id="cb9-43">                rows_idx.append(word_idx)</span>
<span id="cb9-44">                cols_idx.append(doc_idx)</span>
<span id="cb9-45">                data.append(score)</span>
<span id="cb9-46"></span>
<span id="cb9-47">    sparse_matrix_sp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> csr_matrix(</span>
<span id="cb9-48">        (data, (rows_idx, cols_idx)), shape<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(all_words_sp), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus))</span>
<span id="cb9-49">    )</span>
<span id="cb9-50"></span>
<span id="cb9-51">    index_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb9-52"></span>
<span id="cb9-53">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Query</span></span>
<span id="cb9-54">    start_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb9-55">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_queries:</span>
<span id="cb9-56">        query_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb9-57">            word_to_idx_sp[word] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> tokenize(query) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> word_to_idx_sp</span>
<span id="cb9-58">        ]</span>
<span id="cb9-59">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> query_indices:</span>
<span id="cb9-60">            doc_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(</span>
<span id="cb9-61">                sparse_matrix_sp[query_indices, :].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-62">            ).flatten()</span>
<span id="cb9-63">            top_k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.argsort(doc_scores)[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>:][::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb9-64"></span>
<span id="cb9-65">    query_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_query</span>
<span id="cb9-66">    qps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> query_time</span>
<span id="cb9-67"></span>
<span id="cb9-68">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Indexing: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>index_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb9-69">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb9-70">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"QPS: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>qps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb9-71"></span>
<span id="cb9-72">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> index_time, query_time, qps</span>
<span id="cb9-73"></span>
<span id="cb9-74"></span>
<span id="cb9-75">benchmark_sparse_python_loops()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>=== 2. SPARSE MATRIX WITH PYTHON LOOPS ===
Indexing: 77.9146s
Query time: 0.4709s
QPS: 2123.64
</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="5">
<pre><code>(77.91458225250244, 0.4708900451660156, 2123.6380133019015)</code></pre>
</div>
</div>
</section>
<section id="performance-optimization-with-numpy" class="level2">
<h2 class="anchored" data-anchor-id="performance-optimization-with-numpy">Performance Optimization with Numpy</h2>
<p>My code has python loops everywhere. Python loops are very slow. The goal here it to use Numpy Vectorization: doing operations on entiry arrays at once instead of looping :</p>
<pre><code>for word_1 in all_words:
for doc_idx, doc_1 in enumerate(all_docs_terms):
    idf_python_1 = compute_idf(word_1, 3, df)
    doc_len_1 = sum(all_docs_terms[doc_idx].values())
    score_1 = bm25_score(word_1, doc_1, idf_python_1, doc_len_1, avg_doc_len)</code></pre>
<p>This has nested loops =&gt; very slow for large datasets. Key Insight: Instead of computing one BM25 score at a time, we can compute all scores at once using matrix opertions.</p>
<div id="BYtC" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> benchmark_sparse_numpy():</span>
<span id="cb13-2">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=== 3. SPARSE MATRIX WITH NUMPY VECTORIZATION ==="</span>)</span>
<span id="cb13-3">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb13-4"></span>
<span id="cb13-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tokenize</span></span>
<span id="cb13-6">    all_docs_terms_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb13-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corpus:</span>
<span id="cb13-8">        tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(doc)</span>
<span id="cb13-9">        all_docs_terms_np.append(Counter(tokens))</span>
<span id="cb13-10"></span>
<span id="cb13-11">    all_words_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(word <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> all_docs_terms_np <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc))</span>
<span id="cb13-12">    word_to_idx_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {word: idx <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx, word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_words_np)}</span>
<span id="cb13-13"></span>
<span id="cb13-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Build TF matrix</span></span>
<span id="cb13-15">    rows <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb13-16">    cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb13-17">    tf_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb13-18"></span>
<span id="cb13-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_idx, doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_docs_terms_np):</span>
<span id="cb13-20">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word, tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc_tf.items():</span>
<span id="cb13-21">            rows.append(word_to_idx_np[word])</span>
<span id="cb13-22">            cols.append(doc_idx)</span>
<span id="cb13-23">            tf_data.append(tf)</span>
<span id="cb13-24"></span>
<span id="cb13-25">    tf_matrix_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> csr_matrix(</span>
<span id="cb13-26">        (tf_data, (rows, cols)), shape<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(all_words_np), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus))</span>
<span id="cb13-27">    )</span>
<span id="cb13-28"></span>
<span id="cb13-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute DF array</span></span>
<span id="cb13-30">    df_array_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array((tf_matrix_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)).flatten()</span>
<span id="cb13-31"></span>
<span id="cb13-32">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Document lengths</span></span>
<span id="cb13-33">    doc_lengths_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(tf_matrix_np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)).flatten()</span>
<span id="cb13-34">    avg_doc_len_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> doc_lengths_np.mean()</span>
<span id="cb13-35"></span>
<span id="cb13-36">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute IDF array</span></span>
<span id="cb13-37">    N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus)</span>
<span id="cb13-38">    idf_array_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.log((N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> df_array_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (df_array_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span>
<span id="cb13-39"></span>
<span id="cb13-40">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Vectorized BM25</span></span>
<span id="cb13-41">    k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span></span>
<span id="cb13-42">    b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span></span>
<span id="cb13-43">    tf_dense <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tf_matrix_np.toarray()</span>
<span id="cb13-44"></span>
<span id="cb13-45">    numerator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tf_dense <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb13-46">    length_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (doc_lengths_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> avg_doc_len_np)</span>
<span id="cb13-47">    denominator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tf_dense <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> length_norm</span>
<span id="cb13-48"></span>
<span id="cb13-49">    idf_column <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idf_array_np.reshape(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb13-50">    bm25_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idf_column <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (numerator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (denominator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-10</span>))</span>
<span id="cb13-51"></span>
<span id="cb13-52">    bm25_matrix_np <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> csr_matrix(bm25_scores)</span>
<span id="cb13-53"></span>
<span id="cb13-54">    index_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb13-55"></span>
<span id="cb13-56">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Query</span></span>
<span id="cb13-57">    start_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb13-58">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_queries:</span>
<span id="cb13-59">        query_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb13-60">            word_to_idx_np[word] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> tokenize(query) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> word_to_idx_np</span>
<span id="cb13-61">        ]</span>
<span id="cb13-62">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> query_indices:</span>
<span id="cb13-63">            doc_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(</span>
<span id="cb13-64">                bm25_matrix_np[query_indices, :].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb13-65">            ).flatten()</span>
<span id="cb13-66">            top_k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.argsort(doc_scores)[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>:][::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb13-67"></span>
<span id="cb13-68">    query_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_query</span>
<span id="cb13-69">    qps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> query_time</span>
<span id="cb13-70"></span>
<span id="cb13-71">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Indexing: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>index_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb13-72">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb13-73">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"QPS: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>qps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-74"></span>
<span id="cb13-75">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> index_time, query_time, qps</span>
<span id="cb13-76"></span>
<span id="cb13-77"></span>
<span id="cb13-78">benchmark_sparse_numpy()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>=== 3. SPARSE MATRIX WITH NUMPY VECTORIZATION ===
Indexing: 6.9885s
Query time: 0.5250s
QPS: 1904.79
</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="6">
<pre><code>(6.988537311553955, 0.5249927043914795, 1904.7883744577036)</code></pre>
</div>
</div>
<div id="RGSE" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> benchmark_rank_bm25():</span>
<span id="cb16-2">    <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> rank_bm25 <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BM25Okapi</span>
<span id="cb16-3"></span>
<span id="cb16-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=== 4. RANK-BM25 ==="</span>)</span>
<span id="cb16-5"></span>
<span id="cb16-6">    tokenized_corpus_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [tokenize(doc) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> corpus]</span>
<span id="cb16-7"></span>
<span id="cb16-8">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb16-9">    bm25_rank <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BM25Okapi(tokenized_corpus_rank)</span>
<span id="cb16-10">    index_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb16-11"></span>
<span id="cb16-12">    start_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb16-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_queries:</span>
<span id="cb16-14">        query_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(query)</span>
<span id="cb16-15">        scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bm25_rank.get_scores(query_tokens)</span>
<span id="cb16-16">        top_k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.argsort(scores)[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>:][::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb16-17"></span>
<span id="cb16-18">    query_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_query</span>
<span id="cb16-19">    qps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> query_time</span>
<span id="cb16-20"></span>
<span id="cb16-21">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Indexing: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>index_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb16-22">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb16-23">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"QPS: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>qps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb16-24"></span>
<span id="cb16-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> index_time, query_time, qps</span>
<span id="cb16-26"></span>
<span id="cb16-27"></span>
<span id="cb16-28">benchmark_rank_bm25()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>=== 4. RANK-BM25 ===
Indexing: 0.0813s
Query time: 5.0695s
QPS: 197.26
</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="7">
<pre><code>(0.08132147789001465, 5.069519281387329, 197.2573619892297)</code></pre>
</div>
</div>
<div id="Kclp" class="cell" data-execution_count="12">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> benchmark_bm25s():</span>
<span id="cb19-2">    <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> bm25s</span>
<span id="cb19-3"></span>
<span id="cb19-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=== 5. BM25S ==="</span>)</span>
<span id="cb19-5"></span>
<span id="cb19-6">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb19-7">    retriever_bm25s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bm25s.BM25()</span>
<span id="cb19-8">    corpus_tokens_bm25s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bm25s.tokenize(corpus, show_progress<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb19-9">    retriever_bm25s.index(corpus_tokens_bm25s, show_progress<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb19-10">    index_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb19-11"></span>
<span id="cb19-12">    start_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb19-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_queries:</span>
<span id="cb19-14">        query_tokens_bm25s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bm25s.tokenize(query, show_progress<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb19-15">        results, scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> retriever_bm25s.retrieve(</span>
<span id="cb19-16">            query_tokens_bm25s, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, show_progress<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb19-17">        )</span>
<span id="cb19-18"></span>
<span id="cb19-19">    query_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_query</span>
<span id="cb19-20">    qps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> query_time</span>
<span id="cb19-21"></span>
<span id="cb19-22">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Indexing: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>index_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb19-23">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb19-24">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"QPS: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>qps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb19-25"></span>
<span id="cb19-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> index_time, query_time, qps</span>
<span id="cb19-27"></span>
<span id="cb19-28"></span>
<span id="cb19-29">benchmark_bm25s()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>=== 5. BM25S ===
Indexing: 0.5329s
Query time: 0.7049s
QPS: 1418.74
</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="12">
<pre><code>(0.5328564643859863, 0.7048521041870117, 1418.7373408687158)</code></pre>
</div>
</div>
<div id="emfo" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> benchmark_polars():</span>
<span id="cb22-2">    <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> polars <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pl</span>
<span id="cb22-3"></span>
<span id="cb22-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"=== 6. POLARS OPTIMIZATION ==="</span>)</span>
<span id="cb22-5">    start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb22-6"></span>
<span id="cb22-7">    k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span></span>
<span id="cb22-8">    b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span></span>
<span id="cb22-9">    N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(corpus)</span>
<span id="cb22-10"></span>
<span id="cb22-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create LazyFrame</span></span>
<span id="cb22-12">    lf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pl.LazyFrame({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(N), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>: corpus})</span>
<span id="cb22-13"></span>
<span id="cb22-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tokenize and explode</span></span>
<span id="cb22-15">    lf_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb22-16">        lf.with_columns(</span>
<span id="cb22-17">            pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>.to_lowercase().<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>.split(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>).alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tokens"</span>)</span>
<span id="cb22-18">        )</span>
<span id="cb22-19">        .explode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tokens"</span>)</span>
<span id="cb22-20">        .<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tokens"</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>.len_chars() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb22-21">    )</span>
<span id="cb22-22"></span>
<span id="cb22-23">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Build vocab</span></span>
<span id="cb22-24">    vocab_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb22-25">        lf_tokens.select(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tokens"</span>).unique()).collect().with_row_index(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>)</span>
<span id="cb22-26">    )</span>
<span id="cb22-27">    vocab_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(vocab_df)</span>
<span id="cb22-28"></span>
<span id="cb22-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Join to map tokens to indices</span></span>
<span id="cb22-30">    lf_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lf_tokens.join(vocab_df.lazy(), on<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tokens"</span>, how<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"left"</span>)</span>
<span id="cb22-31"></span>
<span id="cb22-32">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute TF</span></span>
<span id="cb22-33">    lf_tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lf_tokens.group_by([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>]).agg(pl.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>().alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tf"</span>))</span>
<span id="cb22-34"></span>
<span id="cb22-35">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute document lengths</span></span>
<span id="cb22-36">    lf_doc_lens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lf_tf.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>).agg(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tf"</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>().alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_len"</span>))</span>
<span id="cb22-37">    avg_doc_len_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> lf_doc_lens.select(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_len"</span>).mean()).collect().item()</span>
<span id="cb22-38"></span>
<span id="cb22-39">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute DF and IDF</span></span>
<span id="cb22-40">    lf_idf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb22-41">        lf_tf.group_by(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>)</span>
<span id="cb22-42">        .agg(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>).n_unique().alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"df"</span>))</span>
<span id="cb22-43">        .with_columns(</span>
<span id="cb22-44">            ((pl.lit(N) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"df"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"df"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)).log().alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"idf"</span>)</span>
<span id="cb22-45">        )</span>
<span id="cb22-46">    )</span>
<span id="cb22-47"></span>
<span id="cb22-48">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Calculate BM25</span></span>
<span id="cb22-49">    df_bm25_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb22-50">        lf_tf.join(lf_doc_lens, on<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>, how<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"left"</span>)</span>
<span id="cb22-51">        .join(lf_idf, on<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>, how<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"left"</span>)</span>
<span id="cb22-52">        .with_columns(</span>
<span id="cb22-53">            (</span>
<span id="cb22-54">                pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"idf"</span>)</span>
<span id="cb22-55">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tf"</span>)</span>
<span id="cb22-56">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb22-57">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (</span>
<span id="cb22-58">                    pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tf"</span>)</span>
<span id="cb22-59">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_len"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> avg_doc_len_pl))</span>
<span id="cb22-60">                )</span>
<span id="cb22-61">            ).alias(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bm25_score"</span>)</span>
<span id="cb22-62">        )</span>
<span id="cb22-63">        .select([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bm25_score"</span>])</span>
<span id="cb22-64">        .collect()</span>
<span id="cb22-65">    )</span>
<span id="cb22-66"></span>
<span id="cb22-67">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Build sparse matrix</span></span>
<span id="cb22-68">    rows_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_bm25_pl[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>].to_numpy()</span>
<span id="cb22-69">    cols_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_bm25_pl[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"doc_id"</span>].to_numpy()</span>
<span id="cb22-70">    data_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_bm25_pl[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bm25_score"</span>].to_numpy()</span>
<span id="cb22-71">    word_to_idx_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(vocab_df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tokens"</span>], vocab_df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"word_idx"</span>]))</span>
<span id="cb22-72">    bm25_matrix_pl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> csr_matrix((data_pl, (rows_pl, cols_pl)), shape<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(vocab_size, N))</span>
<span id="cb22-73"></span>
<span id="cb22-74">    index_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start</span>
<span id="cb22-75"></span>
<span id="cb22-76">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Query</span></span>
<span id="cb22-77">    start_query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time()</span>
<span id="cb22-78">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_queries:</span>
<span id="cb22-79">        query_indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb22-80">            word_to_idx_pl[word] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> tokenize(query) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> word_to_idx_pl</span>
<span id="cb22-81">        ]</span>
<span id="cb22-82">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> query_indices:</span>
<span id="cb22-83">            doc_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(</span>
<span id="cb22-84">                bm25_matrix_pl[query_indices, :].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb22-85">            ).flatten()</span>
<span id="cb22-86">            top_k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.argsort(doc_scores)[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>:][::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb22-87"></span>
<span id="cb22-88">    query_time <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> time.time() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> start_query</span>
<span id="cb22-89">    qps <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(test_queries) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> query_time</span>
<span id="cb22-90"></span>
<span id="cb22-91">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Indexing: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>index_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb22-92">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query_time<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">s"</span>)</span>
<span id="cb22-93">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"QPS: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>qps<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb22-94"></span>
<span id="cb22-95">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> index_time, query_time, qps</span>
<span id="cb22-96"></span>
<span id="cb22-97"></span>
<span id="cb22-98">benchmark_polars()</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>=== 6. POLARS OPTIMIZATION ===
Indexing: 0.1580s
Query time: 0.6143s
QPS: 1627.94
</code></pre>
</div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>(0.1579906940460205, 0.6142714023590088, 1627.9449053946898)</code></pre>
</div>
</div>
</section>
<section id="bm25-benchmark-results-8530-documents-1000-queries" class="level2">
<h2 class="anchored" data-anchor-id="bm25-benchmark-results-8530-documents-1000-queries">🏆 BM25 Benchmark Results (8,530 documents, 1000 queries)</h2>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Implementation</th>
<th>Indexing (s)</th>
<th>Query (s)</th>
<th>QPS</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>1. Python Loops</strong></td>
<td>0.0801</td>
<td>23.6460</td>
<td>42.29</td>
</tr>
<tr class="even">
<td><strong>2. Sparse + Python Loops</strong></td>
<td>77.9146</td>
<td>0.4709</td>
<td>2,123.64</td>
</tr>
<tr class="odd">
<td><strong>3. Sparse + NumPy</strong></td>
<td>6.9885</td>
<td>0.5250</td>
<td>1,904.79</td>
</tr>
<tr class="even">
<td><strong>4. rank-bm25</strong></td>
<td>0.0813</td>
<td>5.0695</td>
<td>197.26</td>
</tr>
<tr class="odd">
<td><strong>5. bm25s</strong></td>
<td>0.5329</td>
<td>0.7049</td>
<td>1,418.74</td>
</tr>
<tr class="even">
<td><strong>6. Polars</strong></td>
<td>0.1580</td>
<td>0.6143</td>
<td>1,627.94</td>
</tr>
</tbody>
</table>
<hr>
<p><strong>Key Findings:</strong></p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Category</th>
<th>Winner</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Fastest Indexing</strong></td>
<td>Python Loops</td>
<td>0.0801s</td>
</tr>
<tr class="even">
<td><strong>Fastest Query</strong></td>
<td>Sparse + Python Loops</td>
<td>0.4709s</td>
</tr>
<tr class="odd">
<td><strong>Highest QPS</strong></td>
<td>Sparse + Python Loops</td>
<td>2,123.64</td>
</tr>
<tr class="even">
<td><strong>Best Balance</strong></td>
<td><strong>Polars</strong></td>
<td>0.1580s index, 1,627.94 QPS</td>
</tr>
</tbody>
</table>
<hr>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<ol type="1">
<li><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_from_scratch.html">bm25 part 1</a></li>
<li><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_arabic_qdrant.html">bm25 part 2</a></li>
</ol>


</section>
</section>

 ]]></description>
  <category>blogging</category>
  <category>embedding</category>
  <category>qdrant</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_benchmark_full.html</guid>
  <pubDate>Fri, 19 Dec 2025 22:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/images/bm25.png" medium="image" type="image/png" height="98" width="144"/>
</item>
<item>
  <title>BM25 Explained Part 2 | Qdrant hybrid search for real estate</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_arabic_qdrant.html</link>
  <description><![CDATA[ 





<section id="building-hybrid-search-for-real-estate-with-qdrant-and-bm25" class="level2">
<h2 class="anchored" data-anchor-id="building-hybrid-search-for-real-estate-with-qdrant-and-bm25">Building Hybrid Search for Real Estate with Qdrant and BM25</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/images/xbites_chat.png" class="img-fluid figure-img"></p>
<figcaption>xbites mena real esate ai</figcaption>
</figure>
</div>
<section id="the-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-problem">The Problem</h3>
<p>When building a real estate search system, I ran into a frustrating issue: dense vector search kept returning the wrong locations. A query for “6th Settlement Apartment” would return properties in “5th Settlement” or “New Cairo” instead.</p>
<p>I was using Gemini’s embedding model, which performs well for both English and Arabic. But even strong embeddings struggle with domain-specific data where names are nearly identical:</p>
<ul>
<li>El Patio Vera</li>
<li>El Patio Solo</li>
<li>El Patio Casa</li>
</ul>
<p>These project names fall outside the embedding model’s training distribution.</p>
<p>The semantic similarity between them is too high for dense search to distinguish.</p>
<p>My property chunks contain full descriptions like “Villa in El Patio Vera, 3 bedrooms, 1,000,000 EGP”. Semantic search handles most queries well, but some searches are fundamentally lexical. I needed both approaches working together.</p>
<hr>
</section>
</section>
<section id="when-dense-search-fails" class="level2">
<h2 class="anchored" data-anchor-id="when-dense-search-fails">When Dense Search Fails</h2>
<p>My dataset contained 12,877 real estate properties across Egypt with location names like:</p>
<ul>
<li>6th Settlement, 5th Settlement</li>
<li>6th of October, 5th of October</li>
<li>Sheikh Zayed, New Zayed</li>
<li>El Alamein, North Coast</li>
</ul>
<p>When a user searched for “6th Settlement Apartment less than 10 million”, the dense vector results looked like this:</p>
<pre><code>5th Settlement, New Cairo - 9.5M
5th Settlement, New Cairo - 9.3M
El Alamein, North Coast - 5.6M
5th Settlement, New Cairo - 9.4M
El Alamein, North Coast - 5.7M</code></pre>
<p>Zero results from 6th Settlement. The embedding model treated “6th” and “5th” as semantically similar because they are both ordinal numbers. The word “Settlement” matched in both cases, pushing the semantic similarity even higher. This is exactly where lexical matching would help.</p>
<hr>
</section>
<section id="why-hybrid-search-works" class="level2">
<h2 class="anchored" data-anchor-id="why-hybrid-search-works">Why Hybrid Search Works</h2>
<p>Dense vectors and BM25 have complementary strengths.</p>
<p>Dense vectors understand meaning. They know that “apartment”, “flat”, and “unit” refer to similar things. They handle typos and variations gracefully. But they struggle with exact matches where surface-level differences matter, like distinguishing “6th” from “5th”.</p>
<p>BM25 sparse vectors match tokens exactly. The token “6th” will never match “5th”. This works across languages without additional configuration. The downside is that BM25 has no semantic understanding. It cannot recognize that “apartment” and “flat” mean the same thing.</p>
<p>By combining both approaches, you get semantic understanding when you need it and exact matching when that matters more.</p>
<p>The architecture looks like this:</p>
<pre><code>User Query: "6th Settlement Apartment"
                    |
         +----------+----------+
         |                     |
   Dense Search          BM25 Search
   (100 candidates)      (30 candidates)
         |                     |
         +----------+----------+
                    |
              RRF Fusion
                    |
             Final Results</code></pre>
<p>Reciprocal Rank Fusion combines the rankings from both approaches, giving weight to results that appear highly ranked in either or both lists.</p>
<hr>
</section>
<section id="challenge-1-reducing-token-noise" class="level2">
<h2 class="anchored" data-anchor-id="challenge-1-reducing-token-noise">Challenge 1: Reducing Token Noise</h2>
<p>The first problem I encountered was tokenization noise. Raw property chunks contained everything: URLs, metadata IDs, field names, and numeric values.</p>
<p>A typical chunk looked like:</p>
<pre><code>"Palm Hills, Palm Hills New Cairo, Apartment, 15.3M EGP,
154 sqm, metadata_id_123, https://example.com/file.pdf, ..."</code></pre>
<p>Tokenizing this produced over 13,000 unique tokens across the corpus. Most of these were useless for search: URL fragments, random IDs, and field names that would never appear in user queries.</p>
<p>The solution was to extract only the fields that users would actually search for:</p>
<ul>
<li>Developer name</li>
<li>Project name</li>
<li>Unit type</li>
<li>Location and sublocation</li>
</ul>
<p>After filtering, a chunk became:</p>
<pre><code>"Palm Hills Palm Hills New Cairo Apartment 5th Settlement New Cairo"</code></pre>
<p>This reduced the vocabulary from 13,000 tokens to 1,941, an 86% reduction. The BM25 index became faster and more accurate because every remaining token was meaningful.</p>
<hr>
</section>
<section id="challenge-2-handling-stopwords" class="level2">
<h2 class="anchored" data-anchor-id="challenge-2-handling-stopwords">Challenge 2: Handling Stopwords</h2>
<p>Common words like “of”, “in”, and “the” added noise to the BM25 scores. Since this was a bilingual system supporting Arabic and English, I needed stopwords for both languages:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">STOPWORDS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb5-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># English</span></span>
<span id="cb5-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'a'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'an'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'of'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'in'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'on'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'at'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'to'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'for'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'and'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'or'</span>,</span>
<span id="cb5-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Arabic</span></span>
<span id="cb5-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'في'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'من'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'إلى'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'على'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'عن'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'مع'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'هذا'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'هذه'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ذلك'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'التي'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'الذي'</span></span>
<span id="cb5-6">}</span></code></pre></div></div>
<p>Removing these ensured that queries like “apartment in 6th Settlement” focused on the meaningful tokens rather than matching every document containing “in”.</p>
<hr>
</section>
<section id="challenge-3-n-gram-tokenization" class="level2">
<h2 class="anchored" data-anchor-id="challenge-3-n-gram-tokenization">Challenge 3: N-gram Tokenization</h2>
<p>With basic tokenization, similar location names still overlapped too much:</p>
<pre><code>"6th Settlement" → ['6th', 'settlement']
"5th Settlement" → ['5th', 'settlement']</code></pre>
<p>Both share the token “settlement”, so BM25 would give partial credit to 5th Settlement results even when the user explicitly asked for 6th Settlement.</p>
<p>The solution was to add bigrams, two-word phrases joined with an underscore:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> tokenize_ngram(text: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> List[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>]:</span>
<span id="cb7-2">    words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [w <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> w <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> text.split() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> w <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> STOPWORDS]</span>
<span id="cb7-3">    tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> words.copy()</span>
<span id="cb7-4">    </span>
<span id="cb7-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(words) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb7-6">        tokens.append(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>words[i]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>words[i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb7-7">    </span>
<span id="cb7-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> tokens</span></code></pre></div></div>
<p>Now the tokenization produces:</p>
<pre><code>"6th Settlement" → ['6th', 'settlement', '6th_settlement']
"5th Settlement" → ['5th', 'settlement', '5th_settlement']</code></pre>
<p>The bigram “6th_settlement” is unique and will only match documents containing that exact phrase. This dramatically improved precision for location-specific queries.</p>
<hr>
</section>
<section id="challenge-4-data-imbalance" class="level2">
<h2 class="anchored" data-anchor-id="challenge-4-data-imbalance">Challenge 4: Data Imbalance</h2>
<p>Even with proper tokenization, some brands dominated the results unfairly. My dataset had:</p>
<ul>
<li>Palm Hills: 718 properties</li>
<li>El Patio: 49 properties</li>
</ul>
<p>When searching for “El Patio apartment”, BM25 would often return Palm Hills results first because the sheer frequency of Palm Hills in the corpus gave it higher term frequency scores.</p>
<p>The solution was to adjust the balance between dense and sparse search in the hybrid approach. By retrieving more candidates from dense search (100) and fewer from BM25 (30), I gave semantic understanding more influence in the final ranking. This allowed the dense embeddings, which correctly understood “El Patio” as a distinct entity, to override BM25’s frequency bias.</p>
</section>
<section id="rrf-reciprocal-rank-fusion-explained" class="level2">
<h2 class="anchored" data-anchor-id="rrf-reciprocal-rank-fusion-explained">RRF (Reciprocal Rank Fusion) Explained</h2>
<p>RRF is a method for combining rankings from multiple search systems.</p>
<p>c How It Works</p>
<p>Instead of averaging scores (which can be misleading when different systems use different scales), RRF uses <strong>rank positions</strong>.</p>
<p><strong>Formula:</strong></p>
<pre><code>RRF_score = Σ (1 / (k + rank))</code></pre>
<p>Where: - <code>k</code> = constant (usually 60) - <code>rank</code> = position in that ranking (1st, 2nd, 3rd…)</p>
<section id="example" class="level3">
<h3 class="anchored" data-anchor-id="example">Example</h3>
<p>Query: “6th Settlement Apartment”</p>
<p><strong>Dense Search Rankings:</strong> 1. Result A (6th Settlement) → score = 1/(60+1) = 0.0164 2. Result B (New Cairo) → score = 1/(60+2) = 0.0161 3. Result C (5th Settlement) → score = 1/(60+3) = 0.0159</p>
<p><strong>BM25 Rankings:</strong> 1. Result A (6th Settlement) → score = 1/(60+1) = 0.0164 2. Result D (6th October) → score = 1/(60+2) = 0.0161 3. Result C (5th Settlement) → score = 1/(60+3) = 0.0159</p>
<p><strong>Combined RRF Scores:</strong> - Result A: 0.0164 + 0.0164 = <strong>0.0328</strong> (appears in both, ranked 1st) - Result C: 0.0159 + 0.0159 = <strong>0.0318</strong> (appears in both) - Result B: 0.0161 (only in dense) - Result D: 0.0161 (only in BM25)</p>
<p>Result A wins because it ranked highly in <strong>both</strong> systems!</p>
<hr>
</section>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>After implementing hybrid search with these optimizations, the same query that previously failed now worked correctly.</p>
<p>Query: “6th Settlement Apartment less than 10 million”</p>
<p>Before (Dense Only):</p>
<pre><code>5th Settlement - 9.5M
5th Settlement - 9.3M
El Alamein - 5.6M</code></pre>
<p>After (Hybrid with BM25):</p>
<pre><code>6th Settlement - 8.0M
6th Settlement - 9.5M
6th Settlement - 7.2M</code></pre>
<hr>
</section>
<section id="key-takeaways" class="level2">
<h2 class="anchored" data-anchor-id="key-takeaways">Key Takeaways</h2>
<ol type="1">
<li><p>Filter your tokens aggressively. Remove everything that users would never search for.</p></li>
<li><p>Use n-grams for multi-word entities. Bigrams turn “6th Settlement” into a unique, matchable token.</p></li>
<li><p>Handle stopwords in all supported languages. A bilingual system needs bilingual stopword lists.</p></li>
<li><p>Balance dense and sparse weights based on your data. If one brand dominates your corpus, give more weight to semantic search.</p></li>
<li><p>Hybrid search is not always necessary. If pure semantic search works for your use case, the added complexity of BM25 may not be worth it. Use hybrid when exact matches matter and when you have domain-specific vocabulary that embeddings handle poorly.</p></li>
</ol>
<div id="e0b03459" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 1: Install dependencies</span></span>
<span id="cb12-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># pip install qdrant-client rank-bm25</span></span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> re</span>
<span id="cb12-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> typing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> List</span>
<span id="cb12-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pickle</span>
<span id="cb12-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> rank_bm25 <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BM25Okapi</span>
<span id="cb12-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> qdrant_client <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> QdrantClient, models</span>
<span id="cb12-9"></span>
<span id="cb12-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 2: Define tokenizer with stopwords and bigrams</span></span>
<span id="cb12-11">STOPWORDS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb12-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"the"</span>,</span>
<span id="cb12-13">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"a"</span>,</span>
<span id="cb12-14">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"an"</span>,</span>
<span id="cb12-15">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"of"</span>,</span>
<span id="cb12-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"in"</span>,</span>
<span id="cb12-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"on"</span>,</span>
<span id="cb12-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"at"</span>,</span>
<span id="cb12-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"to"</span>,</span>
<span id="cb12-20">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"for"</span>,</span>
<span id="cb12-21">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"and"</span>,</span>
<span id="cb12-22">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"or"</span>,</span>
<span id="cb12-23">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"في"</span>,</span>
<span id="cb12-24">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"من"</span>,</span>
<span id="cb12-25">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"إلى"</span>,</span>
<span id="cb12-26">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"على"</span>,</span>
<span id="cb12-27">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"عن"</span>,</span>
<span id="cb12-28">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"مع"</span>,</span>
<span id="cb12-29">}</span>
<span id="cb12-30"></span>
<span id="cb12-31"></span>
<span id="cb12-32"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> tokenize_ngram(text: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> List[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>]:</span>
<span id="cb12-33">    text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> text.lower()</span>
<span id="cb12-34">    text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> re.sub(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">\w\s</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>, text)</span>
<span id="cb12-35">    words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [w <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> w <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> text.split() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> w <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> STOPWORDS]</span>
<span id="cb12-36">    tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> words.copy()</span>
<span id="cb12-37">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(words) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb12-38">        tokens.append(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>words[i]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">_</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>words[i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb12-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> tokens</span></code></pre></div></div>
</div>
<div id="8181bd16" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 3: Build BM25 index from your filtered chunks</span></span>
<span id="cb13-2">filtered_chunks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"palm hills apartment 5th settlement new cairo"</span>, ...]  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Your data</span></span>
<span id="cb13-3">tokenized_corpus <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [tokenize_ngram(chunk) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> chunk <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> filtered_chunks]</span>
<span id="cb13-4">bm25 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BM25Okapi(tokenized_corpus)</span>
<span id="cb13-5"></span>
<span id="cb13-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save for later use</span></span>
<span id="cb13-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bm25_index.pkl"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"wb"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb13-8">    pickle.dump(bm25, f)</span>
<span id="cb13-9"></span>
<span id="cb13-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 4: Create hybrid collection in Qdrant</span></span>
<span id="cb13-11">client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> QdrantClient(url<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-url"</span>, api_key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-key"</span>)</span>
<span id="cb13-12"></span>
<span id="cb13-13">client.create_collection(</span>
<span id="cb13-14">    collection_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hybrid_collection"</span>,</span>
<span id="cb13-15">    vectors_config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb13-16">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dense"</span>: models.VectorParams(size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3072</span>, distance<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>models.Distance.COSINE)</span>
<span id="cb13-17">    },</span>
<span id="cb13-18">    sparse_vectors_config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text-sparse"</span>: models.SparseVectorParams()},</span>
<span id="cb13-19">)</span></code></pre></div></div>
</div>
<div id="d51cef0c" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 5: Convert text to sparse vector using BM25 scores</span></span>
<span id="cb14-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> text_to_sparse_vector(text: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, bm25_index) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> models.SparseVector:</span>
<span id="cb14-3">    tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize_ngram(text)</span>
<span id="cb14-4">    scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bm25_index.get_scores(tokens)</span>
<span id="cb14-5"></span>
<span id="cb14-6">    indices <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb14-7">    values <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb14-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx, score <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(scores):</span>
<span id="cb14-9">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb14-10">            indices.append(idx)</span>
<span id="cb14-11">            values.append(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>(score))</span>
<span id="cb14-12"></span>
<span id="cb14-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> models.SparseVector(indices<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>indices, values<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>values)</span>
<span id="cb14-14"></span>
<span id="cb14-15"></span>
<span id="cb14-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 6: Upload documents with both vectors</span></span>
<span id="cb14-17"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, (chunk, dense_embedding, metadata) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(</span>
<span id="cb14-18">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(filtered_chunks, embeddings, metadata_list)</span>
<span id="cb14-19">):</span>
<span id="cb14-20">    client.upsert(</span>
<span id="cb14-21">        collection_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hybrid_collection"</span>,</span>
<span id="cb14-22">        points<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb14-23">            models.PointStruct(</span>
<span id="cb14-24">                <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>i,</span>
<span id="cb14-25">                payload<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>metadata,</span>
<span id="cb14-26">                vector<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb14-27">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dense"</span>: dense_embedding,</span>
<span id="cb14-28">                    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text-sparse"</span>: text_to_sparse_vector(chunk, bm25),</span>
<span id="cb14-29">                },</span>
<span id="cb14-30">            )</span>
<span id="cb14-31">        ],</span>
<span id="cb14-32">    )</span></code></pre></div></div>
</div>
<div id="6257b62b" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 7: Hybrid search with RRF fusion</span></span>
<span id="cb15-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> hybrid_search(query_text: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, query_dense_vector: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>, limit: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>):</span>
<span id="cb15-3">    query_sparse <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> text_to_sparse_vector(query_text, bm25)</span>
<span id="cb15-4"></span>
<span id="cb15-5">    results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> client.query_points(</span>
<span id="cb15-6">        collection_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hybrid_collection"</span>,</span>
<span id="cb15-7">        prefetch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb15-8">            models.Prefetch(query<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>query_dense_vector, using<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dense"</span>, limit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>),</span>
<span id="cb15-9">            models.Prefetch(query<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>query_sparse, using<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text-sparse"</span>, limit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>),</span>
<span id="cb15-10">        ],</span>
<span id="cb15-11">        query<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>models.FusionQuery(fusion<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>models.Fusion.RRF),</span>
<span id="cb15-12">        limit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>limit,</span>
<span id="cb15-13">    )</span>
<span id="cb15-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> results.points</span>
<span id="cb15-15"></span>
<span id="cb15-16"></span>
<span id="cb15-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage</span></span>
<span id="cb15-18">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hybrid_search(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"6th Settlement Apartment"</span>, your_query_embedding)</span></code></pre></div></div>
</div>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<ol type="1">
<li><a href="https://www.xbites.io/">Xbites Real Estate AI</a></li>
<li><a href="https://qdrant.tech/articles/hybrid-search/">Qdrand Hybrid search</a></li>
<li><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_from_scratch.html">Bm25 part 1</a></li>
<li><a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_benchmark_full.html">Bm25 part 3</a></li>
</ol>


</section>
</section>

 ]]></description>
  <category>blogging</category>
  <category>embedding</category>
  <category>qdrant</category>
  <category>sparse</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_arabic_qdrant.html</guid>
  <pubDate>Thu, 18 Dec 2025 22:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/images/xbites_chat.png" medium="image" type="image/png" height="153" width="144"/>
</item>
<item>
  <title>BM25 Search Algorithm: Python Implementation from Scratch</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_from_scratch.html</link>
  <description><![CDATA[ 





<section id="bm25-explained" class="level2">
<h2 class="anchored" data-anchor-id="bm25-explained">BM25 Explained</h2>
<p>Implementing BM25 Search Algorithm from Scratch</p>
<p>BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query.</p>
<p>It’s an improvement over TF-IDF that handles term frequency saturation and document length normalization.</p>
<hr>
<p><strong>you will find a marimo version in the references that will help you understand the equation better.</strong></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/images/bm_25_marimo.png" class="img-fluid figure-img"></p>
<figcaption>marimo bm25</figcaption>
</figure>
</div>
</section>
<section id="sample-documents" class="level2">
<h2 class="anchored" data-anchor-id="sample-documents">1. Sample Documents</h2>
<p>Let’s start with a small collection of documents to search through.</p>
<div id="a0eab7dd" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1">documents <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Python is a programming language"</span>,</span>
<span id="cb1-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I love Python programming"</span>,</span>
<span id="cb1-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Java is also a programming language"</span>,</span>
<span id="cb1-5">]</span></code></pre></div></div>
</div>
</section>
<section id="tokenization" class="level2">
<h2 class="anchored" data-anchor-id="tokenization">2. Tokenization</h2>
<p>First, we need to break text into individual words (tokens). We’ll convert to lowercase for case-insensitive matching.</p>
<div id="61d0f48d" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> tokenize(text):</span>
<span id="cb2-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Convert text to lowercase and split into words."""</span></span>
<span id="cb2-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> text.lower().split()</span>
<span id="cb2-4"></span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Test tokenization</span></span>
<span id="cb2-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Example:"</span>, tokenize(documents[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Example: ['python', 'is', 'a', 'programming', 'language']</code></pre>
</div>
</div>
</section>
<section id="term-frequency-tf" class="level2">
<h2 class="anchored" data-anchor-id="term-frequency-tf">3. Term Frequency (TF)</h2>
<p>For each document, we count how many times each word appears.</p>
<p>We’ll use Python’s <code>Counter</code> for this.</p>
<div id="c37f85d8" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="cb4-2"></span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> compute_term_frequencies(documents):</span>
<span id="cb4-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Compute term frequency for each document."""</span></span>
<span id="cb4-6">    doc_term_freqs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb4-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> documents:</span>
<span id="cb4-8">        tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(doc)</span>
<span id="cb4-9">        term_freq <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Counter(tokens)</span>
<span id="cb4-10">        doc_term_freqs.append(term_freq)</span>
<span id="cb4-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> doc_term_freqs</span>
<span id="cb4-12"></span>
<span id="cb4-13"></span>
<span id="cb4-14">all_docs_terms <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_term_frequencies(documents)</span>
<span id="cb4-15"></span>
<span id="cb4-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display term frequencies for each document</span></span>
<span id="cb4-17"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(all_docs_terms):</span>
<span id="cb4-18">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Document </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>(tf)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Document 0: {'python': 1, 'is': 1, 'a': 1, 'programming': 1, 'language': 1}
Document 1: {'i': 1, 'love': 1, 'python': 1, 'programming': 1}
Document 2: {'java': 1, 'is': 1, 'also': 1, 'a': 1, 'programming': 1, 'language': 1}</code></pre>
</div>
</div>
</section>
<section id="document-frequency-df" class="level2">
<h2 class="anchored" data-anchor-id="document-frequency-df">4. Document Frequency (DF)</h2>
<p>Document frequency counts in how many documents each term appears. This helps identify common vs.&nbsp;rare words.</p>
<div id="5ea190db" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> compute_document_frequency(doc_term_freqs):</span>
<span id="cb6-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Count how many documents each term appears in."""</span></span>
<span id="cb6-3">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb6-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc_term_freqs:</span>
<span id="cb6-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> term <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc_tf.keys():</span>
<span id="cb6-6">            df[term] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.get(term, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb6-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> df</span>
<span id="cb6-8"></span>
<span id="cb6-9"></span>
<span id="cb6-10">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_document_frequency(all_docs_terms)</span>
<span id="cb6-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Document Frequencies:"</span>, df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Document Frequencies: {'python': 2, 'is': 2, 'a': 2, 'programming': 3, 'language': 2, 'i': 1, 'love': 1, 'java': 1, 'also': 1}</code></pre>
</div>
</div>
</section>
<section id="inverse-document-frequency-idf" class="level2">
<h2 class="anchored" data-anchor-id="inverse-document-frequency-idf">5. Inverse Document Frequency (IDF)</h2>
<p>IDF measures how rare or common a word is across all documents. Rare words get higher scores.</p>
<p>The formula is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BIDF%7D(w)%20=%20%5Clog%5Cleft(%5Cfrac%7BN%20-%20%5Ctext%7BDF%7D(w)%20+%200.5%7D%7B%5Ctext%7BDF%7D(w)%20+%200.5%7D%5Cright)"></p>
<p>where:</p>
<ul>
<li><p><img src="https://latex.codecogs.com/png.latex?N"> = total number of documents</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BDF%7D(w)"> = document frequency of word <img src="https://latex.codecogs.com/png.latex?w"></p></li>
</ul>
<section id="why-use-logarthim-in-idf" class="level3">
<h3 class="anchored" data-anchor-id="why-use-logarthim-in-idf">Why use Logarthim in IDF?</h3>
<p>The logarithm <strong>compresses the scale</strong> of scores. Without it:</p>
<ul>
<li><p>A word appearing in 1 out of 10,000 documents would dominate everything</p></li>
<li><p>Rare words would have scores thousands of times higher than slightly less rare words</p></li>
</ul>
<p>The log smooths this out so differences are more reasonable.</p>
<p>The <code>+0.5</code> terms are a <strong>smoothing trick</strong> to handle edge cases (like when a word appears in all or no documents).</p>
<hr>
<div id="b5dd5487" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> math</span>
<span id="cb8-2"></span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> compute_idf(term, df, num_docs):</span>
<span id="cb8-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Calculate IDF score for a term."""</span></span>
<span id="cb8-6">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> term <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> df:</span>
<span id="cb8-7">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb8-8">    df_term <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[term]</span>
<span id="cb8-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> math.log((num_docs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> df_term <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (df_term <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span>
<span id="cb8-10"></span>
<span id="cb8-11"></span>
<span id="cb8-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Test IDF scores</span></span>
<span id="cb8-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"IDF('python'): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>compute_idf(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'python'</span>, df, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(documents))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb8-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"IDF('java'): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>compute_idf(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'java'</span>, df, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(documents))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb8-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"IDF('programming'): </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>compute_idf(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'programming'</span>, df, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(documents))<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>IDF('python'): -0.5108
IDF('java'): 0.5108
IDF('programming'): -1.9459</code></pre>
</div>
</div>
</section>
<section id="understanding-negative-vs-positive-idf" class="level3">
<h3 class="anchored" data-anchor-id="understanding-negative-vs-positive-idf">Understanding Negative vs Positive IDF</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Word</th>
<th>Appears in</th>
<th>IDF</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>“java”</td>
<td>1/3 docs</td>
<td>+0.51</td>
<td>Rare → discriminating → useful</td>
</tr>
<tr class="even">
<td>“python”</td>
<td>2/3 docs</td>
<td>-0.51</td>
<td>Common → less useful</td>
</tr>
<tr class="odd">
<td>“programming”</td>
<td>3/3 docs</td>
<td>-1.10</td>
<td>Very common → penalized</td>
</tr>
</tbody>
</table>
<p><strong>Key insight:</strong> Words that appear in more than half the documents get negative IDF scores.</p>
<p>They hurt relevance because they don’t help distinguish documents!</p>
</section>
<section id="why-do-some-scores-equal-zero" class="level3">
<h3 class="anchored" data-anchor-id="why-do-some-scores-equal-zero">Why Do Some Scores Equal Zero?</h3>
<p>When query terms have opposite IDF values, they can cancel out:</p>
<p>For query <code>"love python"</code> in Document 1:</p>
<ul>
<li><p>“love” IDF = +0.51 (rare, appears in 1 doc)</p></li>
<li><p>“python” IDF = -0.51 (common, appears in 2 docs)</p></li>
<li><p>Total ≈ 0</p></li>
</ul>
<p>This is a limitation of small document collections.</p>
<p>With thousands of documents, rare words would have much higher positive scores and wouldn’t be canceled out.</p>
</section>
<section id="why-does-java-score-better-than-python" class="level3">
<h3 class="anchored" data-anchor-id="why-does-java-score-better-than-python">Why Does “Java” Score Better Than “Python”?</h3>
<p>For query <code>"java programming"</code>:</p>
<ul>
<li><p>“java” has positive IDF (+0.51) because it’s rare</p></li>
<li><p>“programming” has negative IDF (-1.10) because it’s everywhere</p></li>
</ul>
<p>For query <code>"python programming"</code>:</p>
<ul>
<li><p>“python” has negative IDF (-0.51)</p></li>
<li><p>“programming” has negative IDF (-1.10)</p></li>
<li><p>Both terms are negative → all documents score poorly!</p></li>
</ul>
<p><strong>Takeaway:</strong> BM25 rewards queries containing rare, discriminating terms.</p>
</section>
</section>
<section id="average-document-length" class="level2">
<h2 class="anchored" data-anchor-id="average-document-length">6. Average Document Length</h2>
<p>BM25 normalizes scores by document length to avoid bias toward longer documents.</p>
<div id="a77c0e20" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> compute_avg_doc_length(doc_term_freqs):</span>
<span id="cb10-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Calculate average document length (in terms)."""</span></span>
<span id="cb10-3">    total_length <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(doc_tf) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc_term_freqs)</span>
<span id="cb10-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> total_length <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(doc_term_freqs)</span>
<span id="cb10-5"></span>
<span id="cb10-6"></span>
<span id="cb10-7">avg_doc_len <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_avg_doc_length(all_docs_terms)</span>
<span id="cb10-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Average document length: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>avg_doc_len<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> unique terms"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Average document length: 5.00 unique terms</code></pre>
</div>
</div>
</section>
<section id="bm25-score-for-a-single-term" class="level2">
<h2 class="anchored" data-anchor-id="bm25-score-for-a-single-term">7. BM25 Score for a Single Term</h2>
<p>The BM25 score for a single term in a document is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BBM25%7D(w,%20d)%20=%20%5Ctext%7BIDF%7D(w)%20%5Ccdot%20%5Cfrac%7B%5Ctext%7BTF%7D(w,%20d)%20%5Ccdot%20(k_1%20+%201)%7D%7B%5Ctext%7BTF%7D(w,%20d)%20+%20k_1%20%5Ccdot%20%5Cleft(1%20-%20b%20+%20b%20%5Ccdot%20%5Cfrac%7B%7Cd%7C%7D%7B%5Ctext%7Bavgdl%7D%7D%5Cright)%7D"></p>
<p>where:</p>
<ul>
<li><p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BTF%7D(w,%20d)"> = term frequency of word <img src="https://latex.codecogs.com/png.latex?w"> in document <img src="https://latex.codecogs.com/png.latex?d"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?%7Cd%7C"> = length of document <img src="https://latex.codecogs.com/png.latex?d"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bavgdl%7D"> = average document length</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?k_1"> = term frequency saturation parameter (typically 1.2-2.0)</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?b"> = length normalization parameter (typically 0.75)</p></li>
</ul>
<section id="the-role-of-parameters-k_1-and-b" class="level3">
<h3 class="anchored" data-anchor-id="the-role-of-parameters-k_1-and-b">The Role of Parameters <img src="https://latex.codecogs.com/png.latex?k_1"> and <img src="https://latex.codecogs.com/png.latex?b"></h3>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 22%">
<col style="width: 25%">
<col style="width: 27%">
</colgroup>
<thead>
<tr class="header">
<th>Parameter</th>
<th>Controls</th>
<th>Low Value</th>
<th>High Value</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?k_1"> (1.2-2.0)</td>
<td>Term frequency saturation</td>
<td>TF matters less</td>
<td>TF matters more</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?b"> (0-1)</td>
<td>Length normalization</td>
<td>Ignore length</td>
<td>Penalize long docs</td>
</tr>
</tbody>
</table>
<ul>
<li><p><img src="https://latex.codecogs.com/png.latex?k_1%20=%200">: Only IDF matters, TF is ignored</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?b%20=%200">: Document length is ignored</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?b%20=%201">: Full length normalization</p></li>
</ul>
<div id="a4186497" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> bm25_term_score(term, doc_tf, idf_score, doc_length, avg_doc_len, k1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span>, b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>):</span>
<span id="cb12-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Calculate BM25 score for a single term in a document."""</span></span>
<span id="cb12-3">    tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> doc_tf.get(term, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb12-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb12-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb12-6"></span>
<span id="cb12-7">    numerator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idf_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb12-8">    denominator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (doc_length <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> avg_doc_len))</span>
<span id="cb12-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> numerator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> denominator</span></code></pre></div></div>
</div>
</section>
</section>
<section id="full-bm25-search" class="level2">
<h2 class="anchored" data-anchor-id="full-bm25-search">8. Full BM25 Search</h2>
<p>To score a query against all documents:</p>
<ol type="1">
<li><p>Tokenize the query</p></li>
<li><p>For each document, sum the BM25 scores of all query terms</p></li>
<li><p>Return scores for all documents</p></li>
</ol>
<div id="f63dbf6f" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> bm25_search(query, documents, doc_term_freqs, df, avg_doc_len, k1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span>, b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>):</span>
<span id="cb13-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb13-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Score all documents for a given query using BM25.</span></span>
<span id="cb13-4"></span>
<span id="cb13-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Returns a list of (document_index, score) tuples sorted by relevance.</span></span>
<span id="cb13-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb13-7">    query_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tokenize(query)</span>
<span id="cb13-8">    num_docs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(documents)</span>
<span id="cb13-9">    scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb13-10"></span>
<span id="cb13-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc_idx, doc_tf <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(doc_term_freqs):</span>
<span id="cb13-12">        doc_length <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(doc_tf.values())  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Total terms in document</span></span>
<span id="cb13-13">        doc_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb13-14"></span>
<span id="cb13-15">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> term <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> query_tokens:</span>
<span id="cb13-16">            idf_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> compute_idf(term, df, num_docs)</span>
<span id="cb13-17">            doc_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> bm25_term_score(</span>
<span id="cb13-18">                term, doc_tf, idf_score, doc_length, avg_doc_len, k1, b</span>
<span id="cb13-19">            )</span>
<span id="cb13-20"></span>
<span id="cb13-21">        scores.append((doc_idx, doc_score))</span>
<span id="cb13-22"></span>
<span id="cb13-23">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sort by score (descending)</span></span>
<span id="cb13-24">    scores.sort(key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], reverse<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb13-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> scores</span></code></pre></div></div>
</div>
</section>
<section id="test-the-search-engine" class="level2">
<h2 class="anchored" data-anchor-id="test-the-search-engine">9. Test the Search Engine</h2>
<div id="991f2ba4" class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> display_results(query, results, documents):</span>
<span id="cb14-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Pretty print search results."""</span></span>
<span id="cb14-3">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Query: '</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'"</span>)</span>
<span id="cb14-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>)</span>
<span id="cb14-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> rank, (doc_idx, score) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(results, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>):</span>
<span id="cb14-6">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>rank<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">. [Score: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>score<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:6.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">] </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>documents[doc_idx]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb14-7"></span>
<span id="cb14-8"></span>
<span id="cb14-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Test different queries</span></span>
<span id="cb14-10">queries <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"python programming"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"java programming"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"love python"</span>]</span>
<span id="cb14-11"></span>
<span id="cb14-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> query <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> queries:</span>
<span id="cb14-13">    results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bm25_search(query, documents, all_docs_terms, df, avg_doc_len)</span>
<span id="cb14-14">    display_results(query, results, documents)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Query: 'python programming'
------------------------------------------------------------
1. [Score: -1.785] Java is also a programming language
2. [Score: -2.457] Python is a programming language
3. [Score: -2.700] I love Python programming

Query: 'java programming'
------------------------------------------------------------
1. [Score: -1.317] Java is also a programming language
2. [Score: -1.946] Python is a programming language
3. [Score: -2.138] I love Python programming

Query: 'love python'
------------------------------------------------------------
1. [Score:  0.000] I love Python programming
2. [Score:  0.000] Java is also a programming language
3. [Score: -0.511] Python is a programming language</code></pre>
</div>
</div>
</section>
<section id="understanding-the-results" class="level2">
<h2 class="anchored" data-anchor-id="understanding-the-results">10. Understanding the Results</h2>
<ul>
<li><strong>Negative scores</strong> can occur when query terms appear in most documents (high DF)</li>
<li><strong>Higher scores</strong> indicate better relevance</li>
<li>With small document collections, common words dominate; BM25 works best with larger collections</li>
<li>The parameters <img src="https://latex.codecogs.com/png.latex?k_1"> and <img src="https://latex.codecogs.com/png.latex?b"> can be tuned for different applications</li>
</ul>
</section>
<section id="the-three-ideas-behind-bm25" class="level2">
<h2 class="anchored" data-anchor-id="the-three-ideas-behind-bm25">The Three Ideas Behind BM25</h2>
<ol type="1">
<li><strong>Term Frequency:</strong> Words appearing more often in a doc → more relevant
<ul>
<li>BUT with diminishing returns (5 mentions isn’t 5x better than 1)</li>
</ul></li>
<li><strong>Inverse Document Frequency:</strong> Rare words matter more
<ul>
<li>“quantum” is more useful than “the”</li>
</ul></li>
<li><strong>Document Length Normalization:</strong> Shorter docs are often more focused
<ul>
<li>A 100-word doc mentioning “python” once may be more relevant than a 10,000-word doc mentioning it once</li>
</ul></li>
</ol>
<p>Here is <a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_arabic_qdrant.html">Part 2: BM25 with Qdrant</a>: A real use case for using BM25 with Gemini embeddings to improve search results for real estate.</p>
<p>Also the <a href="https://molab.marimo.io/notebooks/nb_qmRNbuWUz4fvLdfDt8Un7E/app">Marimo BM25 Explained</a></p>
<p>A benchmark: <a href="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_benchmark_full.html">BM25 Benchmark</a></p>


</section>

 ]]></description>
  <category>blogging</category>
  <category>embedding</category>
  <category>qdrant</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/bm25_from_scratch.html</guid>
  <pubDate>Thu, 18 Dec 2025 22:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/sparse_embedding/images/bm25.png" medium="image" type="image/png" height="98" width="144"/>
</item>
<item>
  <title>Late Interaction &amp; ColPali: Efficient Semantic Search</title>
  <dc:creator>kareem </dc:creator>
  <link>https://kareemai.com/blog/posts/nlp/embedding_world/late_interaction.html</link>
  <description><![CDATA[ 





<section id="beyond-bi-encoders-the-rise-of-late-interaction" class="level2">
<h2 class="anchored" data-anchor-id="beyond-bi-encoders-the-rise-of-late-interaction">Beyond Bi-Encoders: The Rise of Late Interaction</h2>
<p>In the world of Information Retrieval (IR), we usually face a trade-off between speed and accuracy.</p>
<section id="bi-encoders-the-speed-kings" class="level3">
<h3 class="anchored" data-anchor-id="bi-encoders-the-speed-kings">1. Bi-Encoders (The Speed Kings)</h3>
<p>Bi-encoders (like standard BERT embeddings) encode the query and the document independently into a single vector. Search is just a cosine similarity between these two points. It’s incredibly fast (sub-millisecond) but loses fine-grained details because the entire document is compressed into one fixed-size vector.</p>
</section>
<section id="cross-encoders-the-accuracy-masters" class="level3">
<h3 class="anchored" data-anchor-id="cross-encoders-the-accuracy-masters">2. Cross-Encoders (The Accuracy Masters)</h3>
<p>Cross-encoders feed both the query and the document into the model simultaneously (Early Interaction). The model can attend to every word in the query relative to every word in the document. This is highly accurate but computationally expensive because you must run the model for every single query-document pair. You can’t pre-compute embeddings.</p>
</section>
<section id="late-interaction-the-best-of-both-worlds" class="level3">
<h3 class="anchored" data-anchor-id="late-interaction-the-best-of-both-worlds">3. Late Interaction: The Best of Both Worlds</h3>
<p>Late Interaction models, pioneered by <strong>ColBERT</strong>, bridge this gap. Instead of one vector per document, they store a vector for <strong>every single token</strong> in the document.</p>
<p>When a query comes in: 1. The query is encoded into token-level embeddings. 2. A <strong>MaxSim</strong> (Maximum Similarity) operation is performed: for each query token, we find the document token that matches it best. 3. We sum these maximum similarities to get the final score.</p>
<p>This allows the model to perform fine-grained matching (like a cross-encoder) while still allowing document embeddings to be pre-computed (like a bi-encoder).</p>
</section>
<section id="colpali-retrieval-without-ocr" class="level3">
<h3 class="anchored" data-anchor-id="colpali-retrieval-without-ocr">ColPali: Retrieval Without OCR</h3>
<p>One of the most exciting recent developments is <strong>ColPali</strong>. Traditional PDF retrieval requires a complex pipeline: OCR the text, chunk it, and then embed it. This often fails on tables, charts, and complex layouts.</p>
<p>ColPali applies the Late Interaction principle to vision models (PaliGemma). It treats image patches of a PDF page as “tokens.” Instead of reading text, it “looks” at the page and matches query tokens directly to visual features.</p>
<p><strong>Key Benefits of ColPali:</strong> - <strong>Layout Aware:</strong> It understands that a caption belongs to a specific image. - <strong>OCR-Free:</strong> No more messy text extraction from scanned documents. - <strong>Superior Retrieval:</strong> It outperforms traditional text-based RAG on visually rich documents.</p>
<hr>
</section>
</section>
<section id="ecosystem-and-tools" class="level2">
<h2 class="anchored" data-anchor-id="ecosystem-and-tools">Ecosystem and Tools</h2>
<p>If you want to implement Late Interaction today, these are the projects to watch: - <strong>ColBERTv2:</strong> The optimized version of the original late interaction model. - <strong>PyLate:</strong> A flexible Python library for training and using late interaction models. - <strong>PLAID:</strong> An extremely fast engine for searching ColBERT vectors. - <strong>Model2Vec:</strong> While focused on static embeddings, it shows the trend towards more efficient representation learning.</p>
<p>Late interaction is transforming how we think about retrieval, moving us away from “one vector fits all” towards a more nuanced, token-aware future.</p>
<hr>
<section id="internal-resources" class="level3">
<h3 class="anchored" data-anchor-id="internal-resources">Internal Resources</h3>
<p>If you’re interested in more technical deep dives or information on my research, check out these sections:</p>
<ul>
<li><a href="../../../papers.html">My Research Papers</a></li>
<li><a href="../../../oss/opensource.html">Open Source Contributions</a></li>
<li><a href="../../../til/index.html">Today I Learned: AI Engineering Notes</a></li>
<li><a href="../../feed.html">Arabic NLP Blog Posts</a></li>
</ul>
</section>
</section>
<section id="practical-example-using-colbert-with-pylate" class="level2">
<h2 class="anchored" data-anchor-id="practical-example-using-colbert-with-pylate">Practical Example: Using ColBERT with PyLate</h2>
<p>Here’s how to use a late interaction model for retrieval in Python:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pylate <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ColBERT, Indexes, retrieve</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load a pre-trained ColBERT model</span></span>
<span id="cb1-4">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ColBERT(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightonai/colbertv2"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Encode documents into token-level embeddings</span></span>
<span id="cb1-7">documents <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb1-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Late interaction models store a vector per token instead of one per document."</span>,</span>
<span id="cb1-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bi-encoders compress documents into a single vector for fast retrieval."</span>,</span>
<span id="cb1-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cross-encoders process query and document together for higher accuracy."</span></span>
<span id="cb1-11">]</span>
<span id="cb1-12"></span>
<span id="cb1-13">document_embeddings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.encode(documents, convert_to_tensor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-14"></span>
<span id="cb1-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Build an index for fast retrieval</span></span>
<span id="cb1-16">index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Indexes.FlatIndex()</span>
<span id="cb1-17">index.add_documents(document_embeddings, documents<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>documents)</span>
<span id="cb1-18"></span>
<span id="cb1-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Encode the query</span></span>
<span id="cb1-20">query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"What is the difference between bi-encoders and cross-encoders?"</span></span>
<span id="cb1-21">query_embedding <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.encode([query], convert_to_tensor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-22"></span>
<span id="cb1-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Retrieve top-k results</span></span>
<span id="cb1-24">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> index.search(query_embedding, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb1-25"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc, score <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> results:</span>
<span id="cb1-26">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>score<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>doc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<section id="performance-comparison" class="level3">
<h3 class="anchored" data-anchor-id="performance-comparison">Performance Comparison</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Model Type</th>
<th>Speed</th>
<th>Accuracy</th>
<th>Use Case</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Bi-Encoder</td>
<td>Fastest (~1ms/query)</td>
<td>Good</td>
<td>Initial retrieval, large-scale search</td>
</tr>
<tr class="even">
<td>Cross-Encoder</td>
<td>Slowest (~100ms/pair)</td>
<td>Best</td>
<td>Re-ranking top results</td>
</tr>
<tr class="odd">
<td>Late Interaction</td>
<td>Fast (~10ms/query)</td>
<td>Very Good</td>
<td>High-accuracy retrieval at scale</td>
</tr>
</tbody>
</table>
<p>Late interaction gives you 90% of cross-encoder accuracy with 10% of the computational cost.</p>
</section>
</section>
<section id="faq-late-interaction-models" class="level2">
<h2 class="anchored" data-anchor-id="faq-late-interaction-models">FAQ: Late Interaction Models</h2>
<section id="what-is-late-interaction-in-nlp" class="level3">
<h3 class="anchored" data-anchor-id="what-is-late-interaction-in-nlp">What is late interaction in NLP?</h3>
<p>Late interaction is a retrieval technique where query and document tokens interact at query time rather than during encoding. Models like ColBERT store token-level embeddings for documents and compute similarity using MaxSim operations when a query arrives.</p>
</section>
<section id="is-colbert-better-than-standard-embeddings" class="level3">
<h3 class="anchored" data-anchor-id="is-colbert-better-than-standard-embeddings">Is ColBERT better than standard embeddings?</h3>
<p>For retrieval tasks requiring fine-grained matching, yes. ColBERT outperforms bi-encoders on most benchmarks while being much faster than cross-encoders. However, it requires more storage since you store one vector per token instead of one per document.</p>
</section>
<section id="what-is-colpali" class="level3">
<h3 class="anchored" data-anchor-id="what-is-colpali">What is ColPali?</h3>
<p>ColPali applies late interaction to vision-language models for document retrieval. Instead of extracting text from PDFs via OCR, it processes page images directly and matches query tokens to visual patches. This handles tables, charts, and complex layouts better than text-based retrieval.</p>
</section>
<section id="how-much-storage-do-late-interaction-models-need" class="level3">
<h3 class="anchored" data-anchor-id="how-much-storage-do-late-interaction-models-need">How much storage do late interaction models need?</h3>
<p>Storage is higher than bi-encoders because you store vectors for every token. A 512-token document needs 512 vectors instead of 1. Compression techniques like PLAID indexing and quantization reduce this overhead significantly.</p>
</section>
<section id="when-should-i-use-late-interaction-vs.-bi-encoders" class="level3">
<h3 class="anchored" data-anchor-id="when-should-i-use-late-interaction-vs.-bi-encoders">When should I use late interaction vs.&nbsp;bi-encoders?</h3>
<p>Use late interaction when retrieval quality is critical and you can afford the extra storage. Use bi-encoders for very large-scale search where speed and storage efficiency matter most. A common hybrid approach: bi-encoder for initial retrieval, late interaction for re-ranking.</p>
</section>
</section>
<section id="related-posts" class="level2">
<h2 class="anchored" data-anchor-id="related-posts">Related Posts</h2>
<ul>
<li><a href="../../../../blog/posts/nlp/embedding_world/sparse_embedding/bm25_benchmark_full.html">BM25 Benchmarking</a> — Comparing sparse retrieval methods</li>
<li><a href="../../../../blog/posts/mteb_encoding/MTEB_massive_text_embedding_benchmark.html">MTEB Benchmark Explained</a> — How embedding models are evaluated</li>
<li><a href="../../../../blog/posts/mteb_encoding/tiny-gte_transformer_model.html">Tiny-GTE Transformer Model</a> — Efficient transformer architecture for embeddings</li>
</ul>


</section>

 ]]></description>
  <category>blogging</category>
  <category>embedding</category>
  <category>minishlab</category>
  <category>model2vec</category>
  <category>arabic</category>
  <guid>https://kareemai.com/blog/posts/nlp/embedding_world/late_interaction.html</guid>
  <pubDate>Wed, 14 May 2025 21:00:00 GMT</pubDate>
  <media:content url="https://kareemai.com/blog/posts/nlp/embedding_world/images/minishlab.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
