<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Binesh's Data Sense Lab]]></title><description><![CDATA[My journey to make sense of the multimodal data I meet in life, through research notes, tech experiments, and book takeaways]]></description><link>https://thedatasense.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 08:05:27 GMT</lastBuildDate><atom:link href="https://thedatasense.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Circuit Tracing: Finding Medical Features in Gemma 3]]></title><description><![CDATA[Language models can answer medical questions with surprising accuracy. But do they actually encode medical knowledge in identifiable, interpretable ways? Or is it all just statistical soup?
Using Neuronpedia, we ran a simple experiment to find out. W...]]></description><link>https://thedatasense.com/circuit-tracing-finding-medical-features-in-gemma-3</link><guid isPermaLink="true">https://thedatasense.com/circuit-tracing-finding-medical-features-in-gemma-3</guid><category><![CDATA[#medgemma; #mechanistic interpretability]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 04 Feb 2026 22:07:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242854571/eda21f99-b646-4424-a866-888fb80e9491.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Language models can answer medical questions with surprising accuracy. But do they actually encode medical knowledge in identifiable, interpretable ways? Or is it all just statistical soup?</p>
<p>Using <a target="_blank" href="https://www.neuronpedia.org">Neuronpedia</a>, we ran a simple experiment to find out. We searched for features related to angina (cardiac chest pain) inside Gemma 3 1B IT, then tested whether those features light up when the model processes related medical prompts. The short answer: they do, and the results are pretty clean.</p>
<h2 id="heading-what-are-features-and-why-should-you-care">What Are Features, and Why Should You Care?</h2>
<p>Sparse autoencoders (SAEs) decompose a model's internal activations into interpretable directions, often called "features." Each feature corresponds to a concept the model has learned. Neuronpedia hosts pretrained SAEs for several open models, including Google's Gemma family, and lets you search, inspect, and test these features through a browser interface.</p>
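<p>To make the mechanics concrete, here is a minimal sketch of the encode/decode step, assuming the common ReLU SAE layout. All sizes and weights below are illustrative stand-ins, not the actual GemmaScope parameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 1152, 16384   # illustrative sizes for a 16K-feature residual-stream SAE

# Stand-in SAE parameters (random here; real ones are learned during training)
W_enc = rng.normal(0, 0.02, (n_features, d_model))
b_enc = np.full(n_features, -1.0)   # negative bias keeps most features switched off
W_dec = rng.normal(0, 0.02, (d_model, n_features))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Map a dense activation to sparse feature activations (ReLU zeroes most of them)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def sae_decode(f):
    """Reconstruct the dense activation from the sparse features."""
    return W_dec @ f + b_dec

x = rng.normal(0, 1, d_model)       # stand-in for a residual-stream activation
f = sae_encode(x)
print(f"{(f > 0).mean():.1%} of features active")
```

Even with random weights, the negative encoder bias produces the characteristic sparsity: only a small fraction of the 16K features are nonzero for any given input.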
<p>If we can find features that reliably correspond to specific medical concepts, that tells us something about how the model organizes its knowledge. It also opens the door to monitoring, steering, or auditing model behavior at a mechanistic level.</p>
<h2 id="heading-the-experiment">The Experiment</h2>
<h3 id="heading-step-1-search-for-angina">Step 1: Search for "Angina"</h3>
<p>We used Neuronpedia's "Search via Inference" tool with GEMMASCOPE-2-RES-16K (Residual Stream, 16K features) across all layers. The search surfaced several candidate features. One stood out: <strong>"cardiac and blood flow"</strong> (feature 2224 at layer 17).</p>
<p>Its top activations included phrases like "Individuals with Existing Heart Conditions," "coronary artery disease, heart failure," and "Reduced Blood Pressure." The positive logits pointed to tokens like "Heart," "cardiac," and "cardiovascular." So far, this looks like a genuine cardiac concept feature.</p>
<h3 id="heading-step-2-test-it-on-a-medical-prompt">Step 2: Test It on a Medical Prompt</h3>
<p>Here's where it gets interesting. We used Neuronpedia's TopK feature analysis to see which features activate most strongly at the final token when the model processes:</p>
<blockquote>
<p>"Chest pain is frequently linked to"</p>
</blockquote>
<p>This is the exact position where the model predicts the next token. If the cardiac feature actually encodes what we think it does, it should activate here.</p>
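<p>The ranking step behind the TopK view is simple. Below is a sketch with made-up activation values (only feature 2224's activation of 636.0 comes from the experiment); in the real run the vector comes from passing the prompt through Gemma 3 1B IT and the layer-17 SAE:</p>

```python
import numpy as np

# Hypothetical SAE feature activations at the final token position
feature_acts = np.zeros(16384)
feature_acts[2224] = 636.0    # "cardiac and blood flow" (the value reported in the post)
feature_acts[901] = 412.5     # made-up competing features for illustration
feature_acts[15003] = 88.1

def topk_features(acts, k=5):
    """Rank features by activation strength, strongest first."""
    idx = np.argsort(acts)[::-1][:k]
    return [(int(i), float(acts[i])) for i in idx if acts[i] > 0]

print(topk_features(feature_acts, k=3))
```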
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770240297440/5d534a4d-b660-4785-87c9-3f795ea49a86.png" alt /></p>
<p><strong>Result:</strong> The "cardiac and blood flow" feature ranked <strong>#1</strong> at the final token position, with an activation of 636.00. Not buried in the top 50. Not somewhere in the middle. Number one.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242584412/f3c37d81-13b8-4735-9308-4515cdcba0fa.png" alt /></p>
<h3 id="heading-step-3-replicate-with-a-different-medical-domain">Step 3: Replicate with a Different Medical Domain</h3>
<p>We repeated the experiment for respiratory features.</p>
<p>Searching for "pneumonia" surfaced a feature called <strong>"respiratory and lung conditions"</strong> (feature 3791 at layer 22). Its positive logits included "respiratory," "lungs," "airflow," "breathing," "airways," and "coughing." The top activations contained clinical text about chronic cough, wheezing, and respiratory problems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242705242/d3447ef7-924c-45e4-87a3-c1ab8718ff9f.png" alt /></p>
<p>We then tested this feature against the prompt:</p>
<blockquote>
<p>"Shortness of breath can be a symptom of"</p>
</blockquote>
<p>The TopK analysis at the final "of" token showed the respiratory feature at <strong>708.00</strong>, landing in the top 5. The top feature was "Medical conditions and disorders" at 1992.00, which also makes sense since shortness of breath can be a symptom of many things beyond just lung conditions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242773573/87d27287-2c6d-4557-9cf0-c089b7d96504.png" alt /></p>
<h2 id="heading-what-this-tells-us">What This Tells Us</h2>
<p>Both experiments follow the same pattern: features discovered through symptom-related searches activate strongly when the model processes related medical prompts. The cardiac feature found via "angina" fires at position one when the model encounters "chest pain." The respiratory feature found via "pneumonia" fires in the top five when the model encounters "shortness of breath."</p>
<p>This isn't proof that the model "understands" medicine in any deep sense. But it does show that Gemma 3 1B IT organizes medical knowledge into identifiable, interpretable features that activate in contextually appropriate ways. The model isn't just pattern-matching surface tokens. It has learned something about the semantic relationships between symptoms and conditions.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>A few caveats are worth noting.</p>
<p>This experiment only tests two medical domains (cardiac and respiratory). A broader study would need to cover many more domains to make strong claims about generalizability. We also only tested one prompt per domain. More diverse prompts, including edge cases and adversarial examples, would strengthen the findings.</p>
<p>The activation values themselves are hard to interpret in absolute terms. Is 636.00 "high"? Relative to what? The ranking (first place) is more meaningful than the raw number.</p>
<p>Finally, this was done on Gemma 3 1B IT, a relatively small model. Larger models may organize their features differently.</p>
<h2 id="heading-try-it-yourself">Try It Yourself</h2>
<p>The whole experiment is reproducible through Neuronpedia's web interface. No code required.</p>
<ul>
<li><p><a target="_blank" href="https://www.neuronpedia.org/gemma-3-1b-it/17-gemmascope-2-res-16k/2224">Cardiac feature (layer 17, index 2224)</a></p>
</li>
<li><p><a target="_blank" href="https://www.neuronpedia.org/gemma-3-1b-it/22-gemmascope-2-res-16k/3791">Respiratory feature (layer 22, index 3791)</a></p>
</li>
</ul>
<p>If you're interested in mechanistic interpretability for medical AI, this is a good starting point. Search for a medical concept, find its features, then test whether they activate on related prompts. It takes about five minutes, and the results can be surprisingly informative.</p>
]]></content:encoded></item><item><title><![CDATA[What Does Medical VLM Actually See? Experiments with MedGemma and Sparse Autoencoders]]></title><description><![CDATA[When a medical Vision Language Model (VLM) looks at a chest X-ray and says "cardiomegaly present," what's actually happening inside the model? It's a black box. Billions of parameters. Dense activation vectors where every dimension encodes a tangled m...]]></description><link>https://thedatasense.com/what-does-medical-vlm-actually-see-experiments-with-medgemma-and-sparse-autoencoders</link><guid isPermaLink="true">https://thedatasense.com/what-does-medical-vlm-actually-see-experiments-with-medgemma-and-sparse-autoencoders</guid><category><![CDATA[gemmascope]]></category><category><![CDATA[Gemma AI]]></category><category><![CDATA[interpretability]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sat, 24 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769392373719/e37f19b8-e374-4775-8ec8-52c7e18bcc86.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When a medical Vision Language Model (VLM) looks at a chest X-ray and says "cardiomegaly present," what's actually happening inside the model? It's a black box. Billions of parameters. Dense activation vectors where every dimension encodes a tangled mix of concepts. This post documents my experiments using Sparse Autoencoders (SAEs) to interpret MedGemma, Google's vision-language model for medical imaging. Thanks to the DeepMind team for the open release of <a target="_blank" href="https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/">Gemma Scope</a>.</p>
<h2 id="heading-blackbox-neural-networks-are-opaque">Black Box: Neural Networks Are Opaque</h2>
<p>MedGemma has a hidden dimension of 2,560. When you feed it a chest X-ray and ask about cardiomegaly, the model produces an activation vector that looks something like this:</p>
<pre><code class="lang-markdown">[0.23, -0.15, 0.42, 0.08, -0.31, ...]
</code></pre>
<p>Every single one of those 2,560 numbers represents a mixture of many concepts. This is called superposition: the model packs more ideas than it has dimensions by encoding them as overlapping patterns. It is next to impossible to figure out what any single value means.</p>
<h2 id="heading-saes-provide-a-way-in">SAEs Provide a Way In</h2>
<p>SAEs learn to decompose those dense vectors into a much larger set of sparse features. Instead of 2,560 tangled dimensions, we get 65,000+ features where:</p>
<ul>
<li><p>Each feature tends to represent a single concept</p>
</li>
<li><p>Most features are zero for any given input</p>
</li>
<li><p>We can focus on the handful that actually matter</p>
</li>
</ul>
<p>My mental model: a dense vector is like hearing an entire orchestra playing at once. The SAE separates out each instrument so you can listen to the violin, the cello, and the flute individually.</p>
<pre><code class="lang-markdown">Dense activation → SAE Encoder → Sparse features (65k dims)
[0.23, -0.15, ...]  →  [0, 0, 0, 142, 0, 0, 89, 0, 0, ...]
<span class="hljs-code">                              ↑           ↑
                         Feature 3818  Feature 7241
                         "formal tone" "lung region"</span>
</code></pre>
<h2 id="heading-using-gemmascope-2-on-medgemma">Using GemmaScope 2 on MedGemma</h2>
<p>Training SAEs from scratch takes serious compute. Fortunately, Google released <a target="_blank" href="https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/">GemmaScope 2</a>, a suite of pre-trained SAEs for the Gemma model family. One caveat is that GemmaScope 2 was trained on general Gemma 3 activations. MedGemma was fine-tuned for medical tasks. Would the SAE even work?</p>
<p>Based on my analysis, it appears so. I measured reconstruction quality using <a target="_blank" href="https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained">Fraction of Variance Unexplained (FVU)</a>. Lower is better. Here's what I found across different layers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Layer</td><td>FVU</td><td>Variance Explained</td></tr>
</thead>
<tbody>
<tr>
<td>9</td><td>0.006</td><td>99.4%</td></tr>
<tr>
<td>17</td><td>0.020</td><td>98.0%</td></tr>
<tr>
<td>22</td><td>0.013</td><td>98.7%</td></tr>
<tr>
<td>29</td><td>0.053</td><td>94.7%</td></tr>
</tbody>
</table>
</div><p>Medical fine-tuning didn't dramatically shift the activation distributions. The SAE explains at least 94.7% of the variance at every layer tested. That's enough for interpretability work. A few caveats, though. Later layers show more drift. And the feature meanings might shift: a "formal language" feature in general Gemma 3 might fire on clinical terminology in MedGemma.</p>
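<p>For reference, FVU is straightforward to compute. This sketch uses random stand-in activations rather than real MedGemma ones:</p>

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: 0 = perfect reconstruction,
    1 = no better than predicting the per-dimension mean."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return residual / total

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (100, 2560))      # a batch of stand-in dense activations
x_hat = x + rng.normal(0, 0.1, x.shape)  # stand-in for SAE reconstructions

print(f"FVU: {fvu(x, x_hat):.3f}, variance explained: {1 - fvu(x, x_hat):.1%}")
```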
<h2 id="heading-experiment-1-different-questions-different-circuits">Experiment 1: Different Questions, Different Circuits</h2>
<p>I loaded a sample chest X-ray and asked MedGemma four different clinical questions:</p>
<ul>
<li><p>Is there cardiomegaly?</p>
</li>
<li><p>Is there pneumonia?</p>
</li>
<li><p>Is there a pleural effusion?</p>
</li>
<li><p>Is this a normal chest X-ray?</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769391194298/870f23f1-656c-43b6-b993-d5fe99db0ab3.png" alt="Sample chest X-ray image showing the ribs, spine, heart, and lungs. Source: https://openi.nlm.nih.gov" class="image--center mx-auto" /></p>
<p>Same image. Different questions. What happened in the feature space?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769391520404/c7849d2c-68d6-4b2b-8cd9-e826eb6b1287.png" alt class="image--center mx-auto" /></p>
<p>Each question activated a distinct pattern of features. The "cardiomegaly" question lit up 77 features. The "pleural effusion" question lit up 90. The overlap was substantial (cosine similarity above 0.9), but each pathology had its own signature. I think this makes sense. The model is routing different clinical concepts through different internal circuits.</p>
<h2 id="heading-experiment-2-the-phrasing-effect">Experiment 2: The Phrasing Effect</h2>
<p>Here's where things got interesting. I asked the same clinical question two different ways:</p>
<p><strong>Formal:</strong> "Is there radiographic evidence of cardiomegaly?"</p>
<p><strong>Casual:</strong> "Does this show a big heart?"</p>
<p>Both questions mean the same thing. But the model's internal features looked completely different.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769391918589/68596a4d-6e2b-4d85-a131-dd1f75e2e3e4.png" alt class="image--center mx-auto" /></p>
<p>The cosine similarity between the feature vectors was 0.973. High, but not identical. And when I looked at the biggest differences:</p>
<p>Features more active for the formal phrasing:</p>
<ul>
<li><p>Feature 13749: 207 vs 0</p>
</li>
<li><p>Feature 4442: 203 vs 0</p>
</li>
<li><p>Feature 91: 180 vs 0</p>
</li>
</ul>
<p>Features more active for the casual phrasing:</p>
<ul>
<li><p>Feature 15587: 163 vs 0</p>
</li>
<li><p>Feature 5984: 152 vs 0</p>
</li>
<li><p>Feature 998: 151 vs 0</p>
</li>
</ul>
<p>Some features only fire for formal clinical language. Others only fire for casual phrasing. The model has learned distinct representations for how questions are asked, not just what they're asking.</p>
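<p>The comparison itself boils down to a cosine similarity plus a ranked difference. This sketch uses synthetic feature vectors; only the two feature indices and their activation values come from the experiment:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_features = 16384

# Random stand-ins for the features both phrasings share
shared = rng.choice(n_features, size=60, replace=False)
shared = shared[(shared != 13749) & (shared != 15587)]

formal = np.zeros(n_features)
casual = np.zeros(n_features)
formal[shared] = rng.uniform(50, 250, size=shared.size)
casual[shared] = formal[shared] * rng.uniform(0.8, 1.2, size=shared.size)
formal[13749] = 207.0   # fires only for the formal phrasing (value from the post)
casual[15587] = 163.0   # fires only for the casual phrasing (value from the post)

cos = formal @ casual / (np.linalg.norm(formal) * np.linalg.norm(casual))

diff = formal - casual
top_formal = np.argsort(diff)[::-1][:3]   # features leaning formal
top_casual = np.argsort(diff)[:3]         # features leaning casual

print(f"cosine similarity: {cos:.3f}")
print("formal-leaning:", top_formal, "casual-leaning:", top_casual)
```

High overall similarity with a handful of sharply style-specific features is exactly the pattern described above.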
<h2 id="heading-experiment-3-tracking-features-across-a-phrasing-spectrum">Experiment 3: Tracking Features Across a Phrasing Spectrum</h2>
<p>In this experiment, I created a gradient of phrasings from very formal to very casual:</p>
<ol>
<li><p>"Is there radiographic evidence of cardiac enlargement?"</p>
</li>
<li><p>"Does the imaging demonstrate cardiomegaly?"</p>
</li>
<li><p>"Is there cardiomegaly?"</p>
</li>
<li><p>"Is the heart enlarged?"</p>
</li>
<li><p>"Does this show a big heart?"</p>
</li>
<li><p>"Is the heart too big?"</p>
</li>
<li><p>"Big heart?"</p>
</li>
</ol>
<p>Then I tracked two key features across this spectrum: Feature 13749 (most active for formal phrasing) and Feature 15587 (most active for casual phrasing).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769392210518/d09b3880-2288-4b77-a016-75ca56b1681c.png" alt class="image--center mx-auto" /></p>
<p>The pattern was clear. As the phrasing got more casual, Feature 13749 dropped and Feature 15587 rose. The crossover happened right around "Is the heart enlarged?" in the middle of the spectrum. This has real implications for medical VLM safety. A radiologist using formal clinical terminology might get a different answer than a patient asking the same question casually. The underlying clinical content is identical. But the model routes it through different internal circuits.</p>
<p>We already know medical VLMs can be sensitive to question phrasing. Now we can see why. The model isn't just processing the semantic content of your question. It's also encoding the style, the register, the formality. And those encodings influence what happens downstream.</p>
<p>For clinical deployment, this suggests we need:</p>
<ul>
<li><p>Standardized prompting protocols</p>
</li>
<li><p>Robustness testing across phrasing variations</p>
</li>
<li><p>Feature-level monitoring for production systems</p>
</li>
</ul>
<p>SAEs give us a lens into what's happening inside these models. We can see which features drive a diagnosis, identify when features misfire, and understand why subtle changes in input lead to different outputs. The code is available as a Colab notebook. It runs on a free T4 GPU in about 5 minutes. If you're working with medical VLMs, I'd encourage you to try it. See what features light up for your specific use cases. You might be surprised by what you find.</p>
<p><em>For more technical details, check out Anthropic's original work on monosemanticity and the GemmaScope paper from Google. The SAEs are available on HuggingFace at</em> <code>google/gemma-scope-2-4b-it</code>.</p>
<p>Also check out the interactive demo on Neuronpedia for the <a target="_blank" href="https://www.neuronpedia.org/jackl-circuits-runs-1-4-sofa-v3_0/graph?slug=medical-diagnosis-heart">Haiku circuit tracer</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Fluent But Wrong: LLM and Healthcare]]></title><description><![CDATA[Lately, much of my research time goes into studying why medical AI systems fail. Not the obvious failures where the model outputs gibberish, but the subtle ones where the output looks clinically appropriate, follows proper documentation structure, use...]]></description><link>https://thedatasense.com/fluent-but-wrong-llm-and-healthcare</link><guid isPermaLink="true">https://thedatasense.com/fluent-but-wrong-llm-and-healthcare</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[GPT-2]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Tue, 20 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768977780582/949cf682-0c65-4c34-acdf-1e5ff009575e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately, much of my research time goes into studying why medical AI systems fail. Not the obvious failures where the model outputs gibberish, but the subtle ones where the output looks clinically appropriate, follows proper documentation structure, uses correct terminology, and is still wrong.</p>
<p>To illustrate this problem, I trained a small GPT-2 model from scratch on clinical notes. The goal was not to build something useful. The goal was to demonstrate how easily language models learn to mimic clinical language without learning anything about clinical reasoning.</p>
<p>The results should concern anyone deploying LLMs in healthcare settings.</p>
<h2 id="heading-the-setup">The Setup</h2>
<p>I built a 7.7 million parameter transformer and trained it on two publicly available datasets: MEDIQA-Chat (67 doctor-patient dialogues with paired clinical notes) and MTSamples (approximately 5,000 medical transcriptions across 40 specialties). Total training data was around 200,000 tokens.</p>
<p>For context, GPT-2 was trained on 10 billion tokens. My model saw 0.002% of that amount. It trained for about 15 minutes on a single GPU.</p>
<p>Open this experiment in Google Colab and run it for free <a target="_blank" href="https://colab.research.google.com/drive/1_fqGVmGnSqbD5VxGLiEYJ8jeNnYcg9nG?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open this experiment in Google Colab" /></a></p>
<h2 id="heading-how-the-model-works-a-brief-primer">How the Model Works: A Brief Primer</h2>
<p>Before showing you what the model produced, it helps to understand what it actually does. This is a simplified explanation for readers without a machine learning background.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768974711275/ef410d5a-3c75-478b-81ad-0514a4062211.png" alt class="image--center mx-auto" /></p>
<p><em>Architecture of our clinical GPT-2 model. The same fundamental design powers ChatGPT, just with more parameters.</em></p>
<h3 id="heading-the-core-idea-predicting-the-next-word">The Core Idea: Predicting the Next Word</h3>
<p>Language models do one thing: predict what word comes next given the words that came before.</p>
<p>If I give the model "The patient presents with chest," it calculates a probability distribution over all possible next words. "Pain" might get 40% probability. "Discomfort" might get 15%. "X-ray" might get 2%. The model samples from this distribution to pick the next word, then repeats the process.</p>
<p>That is all it does. There is no reasoning module. No medical knowledge base. No fact-checking system. Just: given these words, what word is statistically likely to come next?</p>
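<p>Sketched in code, generation really is just repeated sampling. The distribution below is made up to echo the numbers above:</p>

```python
import random

# Toy next-token distribution for the context "The patient presents with chest"
# (probabilities are illustrative, mirroring the percentages in the text)
next_token_probs = {
    "pain": 0.40,
    "discomfort": 0.15,
    "tightness": 0.10,
    "x-ray": 0.02,
    "<other>": 0.33,
}

def sample_next(probs):
    """Sample one token from the distribution. This step, repeated, is all generation is."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

context = "The patient presents with chest"
print(context, sample_next(next_token_probs))
```

Nothing in this loop checks whether the continuation is true, only whether it is likely.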
<h3 id="heading-the-architecture">The Architecture</h3>
<p>The model has three main components:</p>
<p><strong>1. Embedding Layer</strong></p>
<p>Words enter the model as numbers. Each word in the vocabulary (50,257 possible tokens) gets converted to a 128-dimensional vector. Think of this as translating words into a numerical language the model can process.</p>
<p>Position matters too. "Patient has pain" means something different than "Pain has patient." So we add position embeddings that encode where each word sits in the sequence.</p>
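<p>As a sketch, using this model's dimensions but random stand-in weights and hypothetical token ids:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 50257, 256, 128   # this post's model config

# In the real model both tables are learned; random values stand in here
tok_emb = rng.normal(0, 0.02, (vocab_size, d_model))
pos_emb = rng.normal(0, 0.02, (max_len, d_model))

# Hypothetical ids for "The patient presents with chest"
token_ids = np.array([464, 5827, 10969, 351, 7721])

# Each token's vector is its word identity plus its position
x = tok_emb[token_ids] + pos_emb[np.arange(len(token_ids))]
print(x.shape)  # one 128-dimensional vector per token
```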
<p><strong>2. Transformer Blocks (The Core)</strong></p>
<p>This is where the computation happens. Our model stacks 6 identical transformer blocks, each containing two operations:</p>
<p><em>Self-Attention:</em> Each word looks at every other word in the sequence and decides how much to pay attention to it. When processing "chest" in "The patient presents with chest pain," the attention mechanism might learn to focus heavily on "patient" and "presents" to understand the clinical context. This is done through 4 parallel "attention heads," each learning different patterns.</p>
<p><em>Feed-Forward Network:</em> After attention, each word passes through a small neural network that transforms its representation. This is where the model builds up abstract features.</p>
<p>The key insight: attention lets the model connect related words regardless of distance. "The patient, a 55-year-old male with a history of hypertension who was recently started on lisinopril, presents with" can connect "patient" to "presents" despite 15 words between them.</p>
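<p>A minimal single-head version of this attention step, with random stand-in weights, looks like this (the real model runs 4 such heads in parallel per block):</p>

```python
import numpy as np

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # causal mask: each token may only attend to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    # softmax over each row gives the attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 128, 32   # 128 / 4 heads = 32 dims per head
x = rng.normal(0, 1, (seq_len, d_model))
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_head)) for _ in range(3))

out, w = attention(x, W_q, W_k, W_v)
print(out.shape)  # the first row of w shows token 1 attending only to itself
```

The weight matrix <code>w</code> is where "how much does word i look at word j" lives, regardless of how far apart the two words sit.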
<p><strong>3. Output Layer</strong></p>
<p>After passing through all 6 transformer blocks, the model converts the final representation back into a probability distribution over words. The word with the highest probability (or a sample from the distribution) becomes the output.</p>
<h3 id="heading-the-architecture-in-numbers">The Architecture in Numbers</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>This Model</td><td>GPT-3</td><td>GPT-5 (estimated)</td></tr>
</thead>
<tbody>
<tr>
<td>Parameters</td><td>7.7 million</td><td>175 billion</td><td>Several trillion</td></tr>
<tr>
<td>Transformer Blocks</td><td>6</td><td>96</td><td>Unknown</td></tr>
<tr>
<td>Attention Heads</td><td>4</td><td>96</td><td>Unknown</td></tr>
<tr>
<td>Embedding Dimension</td><td>128</td><td>12,288</td><td>Unknown</td></tr>
<tr>
<td>Context Window</td><td>256 tokens</td><td>2,048 tokens</td><td>400,000 tokens</td></tr>
</tbody>
</table>
</div><p>Our model is roughly 22,000 times smaller than GPT-3 and 220,000 times smaller than GPT-4. But the fundamental architecture is identical. More parameters mean more capacity to learn patterns, but the mechanism remains the same: predict the next token based on statistical patterns in training data.</p>
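<p>As a sanity check, the 7.7 million figure can be roughly reproduced from the table, assuming a standard GPT-2 layout with tied input/output embeddings and a 4x feed-forward expansion (both are my assumptions, not stated in the post):</p>

```python
# Rough parameter count for the table's config
vocab, ctx, d, n_layers, d_ff = 50257, 256, 128, 6, 4 * 128

embeddings = vocab * d + ctx * d                 # token + position embeddings
per_block = (
    4 * (d * d + d)                              # Q, K, V, output projections (+ biases)
    + (d * d_ff + d_ff) + (d_ff * d + d)         # feed-forward up/down (+ biases)
    + 2 * 2 * d                                  # two layer norms (scale + shift)
)
# final layer norm; output layer assumed tied to the token embeddings
total = embeddings + n_layers * per_block + 2 * d

print(f"{total:,} parameters")                   # about 7.7 million
```

The token embedding table alone accounts for most of the parameters; the six transformer blocks contribute only about 1.2 million.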
<h3 id="heading-what-the-architecture-cannot-do">What the Architecture Cannot Do</h3>
<p>Notice what is missing from this design:</p>
<p><strong>No verification mechanism.</strong> The model has no way to check if its output is true. It predicts likely tokens, not accurate tokens.</p>
<p><strong>No world model.</strong> The model does not understand that patients are physical beings, that medications have effects, or that vital signs reflect physiological states. It understands that certain words tend to appear near other words.</p>
<p><strong>No reasoning module.</strong> There is no component that evaluates whether "a 25-year-old with a 40-year history" is logically possible. The model processes tokens, not concepts.</p>
<p><strong>No uncertainty quantification.</strong> The model generates text with uniform confidence whether it is stating established medical fact or complete fabrication.</p>
<p>This architecture is remarkably good at learning statistical patterns in text. It is not designed to understand, verify, or reason about what it generates.</p>
<p>With that context, let me show you what the model produced.</p>
<h2 id="heading-when-the-model-sounds-medical">When the Model Sounds Medical</h2>
<p>First, outputs that look reasonable to a non-clinician. These are the dangerous ones.</p>
<h3 id="heading-prompt-chief-complaint">Prompt: Chief Complaint</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768975731252/0c4a4851-afba-4354-b34a-1bad2fb64444.png" alt class="image--center mx-auto" /></p>
<p>This looks professional. The format is correct. The terminology is appropriate. The workup makes sense.</p>
<p>But remember what the model actually does: it predicts which tokens are likely to follow other tokens. It learned that "chest pain" frequently appears near "substernal," "radiates to left arm," and "EKG and troponins." It has no idea why these concepts relate to each other.</p>
<p>How do I know? Because when I give the model prompts that should be obviously wrong, it responds with the same confidence.</p>
<h2 id="heading-when-the-model-reveals-it-understands-nothing">When the Model Reveals It Understands Nothing</h2>
<p>These next outputs require no medical background to evaluate. The failures are obvious to everyone.</p>
<h3 id="heading-prompt-impossible-patient-history">Prompt: Impossible Patient History</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768975848287/5f974f6d-2991-42fc-a2e3-ac5da485d323.png" alt class="image--center mx-auto" /></p>
<p><strong>What went wrong:</strong></p>
<p>A 25-year-old cannot have a 40-year history of anything. He would have developed hypertension at negative 15 years old.</p>
<p>The model did not notice. It saw "X-year history of" and predicted the tokens that typically follow that phrase: chronic conditions like hypertension and kidney disease. It has no concept of time, age, or basic arithmetic.</p>
<p>Also notice the text degrades: "which was found to be a nonreassuring it" is not a coherent phrase. "A cystoscopy in the right ureteral stent" makes no anatomical sense. The model generates medical-sounding word sequences without any understanding of what they mean.</p>
<p>Looking back at the architecture, this makes sense. The self-attention mechanism connects "25-year-old" to "40-year history" but has no way to evaluate whether that connection is logically valid. There is no arithmetic unit. There is no constraint that catches contradictions.</p>
<p>A human clinician would stop at the first sentence and say "this doesn't make sense." The model cannot do that. It only predicts the next likely token.</p>
<h3 id="heading-prompt-made-up-medication">Prompt: Made-Up Medication</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768975962226/46b9afb2-a395-48f0-a6ac-e0a0f50a2286.png" alt class="image--center mx-auto" /></p>
<p><strong>What went wrong:</strong></p>
<p>Flurbinox does not exist. I made it up. The model accepted it without hesitation and documented that the patient has been taking it for a month.</p>
<p>Then it listed "Hypertension" as medication number 2. Hypertension is a diagnosis, not a medication. You cannot take hypertension twice daily.</p>
<p>This is what token prediction looks like. The model saw a medication list format and generated things that statistically appear in medication lists. Sometimes those are medications. Sometimes those are diagnoses. The model does not know the difference because it has no concept of categories. It only knows token co-occurrence patterns.</p>
<hr />
<h3 id="heading-prompt-vital-signs-incompatible-with-life">Prompt: Vital Signs Incompatible with Life</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768976228816/871a5f41-2774-42be-98e6-f811b0c163e2.png" alt class="image--center mx-auto" /></p>
<p><strong>What went wrong:</strong></p>
<p>Let me explain these vital signs for non-medical readers:</p>
<ul>
<li><p><strong>Blood pressure 40/20:</strong> Normal is around 120/80. A BP of 40/20 means the heart is barely generating enough pressure to perfuse organs. This patient is in severe shock and likely dying.</p>
</li>
<li><p><strong>Heart rate 300:</strong> Normal is 60-100. A heart rate of 300 is not sustainable. The heart cannot fill with blood fast enough. This is a lethal arrhythmia.</p>
</li>
<li><p><strong>Temperature 85°F:</strong> Normal is 98.6°F. A body temperature of 85°F is severe hypothermia. The patient would be unconscious or dead.</p>
</li>
</ul>
<p>These vital signs are incompatible with life. A real clinician seeing this would be calling a code and starting resuscitation.</p>
<p>The model's assessment? "The patient appears to be able to walk on the left side, and to be in full range of motion."</p>
<p>The patient would not be walking anywhere. The patient would be in cardiac arrest.</p>
<p>Then the model loses coherence entirely. It switches to talking about a 5-year-old (the prompt said nothing about a child) with foot problems (the prompt was about vital signs) and prescribes Vicodin twice ("Vicodin and Vicodin for pain control").</p>
<p>This is what happens when you push a language model outside its training distribution. It has no mechanism to recognize that the input is physiologically impossible. The architecture has no world model, no physiological constraints, no sanity checks. It just keeps predicting tokens.</p>
<h2 id="heading-why-this-matters-beyond-my-tiny-model">Why This Matters Beyond My Tiny Model</h2>
<p>My model is tiny. 7.7 million parameters. Trained in 15 minutes. The failures are obvious.</p>
<p>GPT-4 has roughly 220,000 times more parameters. It trained on vastly more data with months of alignment work. It would not make errors this crude.</p>
<p>But the underlying architecture is the same. The fundamental mechanism is identical. GPT-4 predicts tokens based on statistical patterns in training data. It does not verify claims against reality. It does not understand physiology. It does not know when something is impossible.</p>
<p>The difference is that GPT-4's errors are subtle enough to fool people. A 25-year-old with a 40-year history is obviously wrong. A patient with a slightly inappropriate medication choice, or a missed drug interaction, or an assessment that sounds reasonable but does not fit the clinical picture? Those errors are much harder to catch.</p>
<p>More parameters mean more sophisticated pattern matching. They do not mean understanding.</p>
<h2 id="heading-what-would-actually-help">What Would Actually Help</h2>
<p><strong>More parameters and more compute</strong> improve benchmark performance. But scale does not solve the core problem. A model that hallucinates 5% of the time instead of 20% is still unsafe if we cannot identify which 5% is wrong.</p>
<p><strong>Better alignment</strong> reduces harmful outputs but is not the same as verification. A well-aligned model that confidently gives wrong medical advice is more dangerous than an obviously broken model, because users trust it.</p>
<p><strong>Retrieval-augmented generation</strong> can ground outputs in verified sources. But RAG introduces new failure modes: retrieval errors, outdated sources, incorrect synthesis. It reduces hallucination without eliminating it.</p>
<h2 id="heading-what-we-actually-need">What We Actually Need</h2>
<p><strong>Human verification</strong> of AI-generated clinical content before it affects patient care. Not as a temporary measure while technology improves. As a permanent architectural requirement.</p>
<p><strong>Output attribution</strong> that traces claims to verifiable sources. If a model recommends a medication, the evidence should be identifiable.</p>
<p><strong>Calibrated uncertainty.</strong> The model should know when it does not know. "I am not confident" must be a valid output.</p>
<p><strong>Adversarial testing</strong> before deployment. Stress test for edge cases, contradictions, and impossible inputs. Find the failure modes before patients do.</p>
<p>When my 7.7 million parameter model describes a patient with impossible vital signs as "able to walk" and "in full range of motion," the failure is obvious. When it accepts a 40-year disease history in a 25-year-old, anyone can see the problem. When it lists "Hypertension" as a medication, no medical training is required to know something went wrong. Larger models make the same category of errors. They are just better at making those errors sound reasonable.</p>
<p>The architecture diagram at the top of this post shows the entire system. Token embedding, attention, feed-forward networks, output projection. Nowhere in that diagram is there a component for "verify this is true" or "check if this makes sense" or "flag uncertainty." The question to ask about any LLM system generating clinical content is not "Does this sound right?" The question is "How would I know if this were wrong?"</p>
]]></content:encoded></item><item><title><![CDATA[Opening the Black Box: How to See What Your Vision Language Model is Actually Looking At]]></title><description><![CDATA[When a doctor examines a chest X-ray and says "I see signs of pneumonia in the lower right lung," you can ask them to point at exactly what they're seeing. They can circle the cloudy region, explain why it looks abnormal, and walk you through their r...]]></description><link>https://thedatasense.com/opening-the-black-box-how-to-see-what-your-vision-language-model-is-actually-looking-at</link><guid isPermaLink="true">https://thedatasense.com/opening-the-black-box-how-to-see-what-your-vision-language-model-is-actually-looking-at</guid><category><![CDATA[#multimodalai]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Fri, 16 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768787448462/0adba53e-836a-4ade-8264-878c8ebb1162.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When a doctor examines a chest X-ray and says "I see signs of pneumonia in the lower right lung," you can ask them to point at exactly what they're seeing. They can circle the cloudy region, explain why it looks abnormal, and walk you through their reasoning. But when an AI system analyzes the same X-ray and reaches the same conclusion, what is it actually looking at? Is it focusing on the lung tissue, or has it learned some spurious shortcut, like the font used for the patient's name?</p>
<p>This question sits at the heart of AI safety in medicine. If we're going to trust AI systems to help with diagnostic decisions, we need to peer inside them and verify they're looking at the right things for the right reasons.</p>
<p>In this post, I'll explain a method called "Generic Attention-model Explainability" developed by Chefer, Gur, and Wolf that lets us generate visual explanations for what transformer-based AI models are paying attention to. We'll build up the intuition piece by piece, starting from the basics and working toward the full algorithm. I've also implemented this method for Google's MedGemma medical vision-language model, and I'll share results showing the technique in action on real medical images.</p>
<p><strong>My implementation:</strong> <a target="_blank" href="https://github.com/thedatasense/medgemma-explainer"><strong>github.com/thedatasense/medgemma-explainer</strong></a></p>
<p>You can also open a Colab notebook that walks through the concepts and includes a demo:</p>
<p><a target="_blank" href="https://colab.research.google.com/github/thedatasense/medgemma-explainer/blob/master/tutorial_optimized.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>
<p>By the end, you'll understand not just what the method does, but why it works.</p>
<hr />
<h2 id="heading-part-1-the-problem-with-asking-what-did-you-see">Part 1: The Problem with Asking "What Did You See?"</h2>
<p>Imagine you're teaching a child to identify birds. You show them pictures, and they learn to say "that's a robin" or "that's a crow." They get pretty good at it. But one day you notice something strange: they're identifying robins correctly even in photos where the bird is tiny and blurry. How?</p>
<p>You investigate and discover they've learned a shortcut. Robins often appear in photos with green lawns in the background, while crows appear against gray skies. The child isn't identifying birds at all. They're identifying backgrounds.</p>
<p>This is exactly what can happen with AI systems. A famous example from medical imaging: researchers found that an AI trained to detect COVID-19 from chest X-rays had learned to recognize the font used by certain hospitals, which happened to correlate with COVID cases during the training period. The model worked great on test data from those hospitals, but it wasn't actually learning anything about lungs.</p>
<p>The scary part? Without a way to see what the model is looking at, you'd never know. The accuracy numbers would look great right up until the model failed catastrophically on patients from a different hospital.</p>
<p>This is why explainability matters. We need to open up these models and see where their attention is directed.</p>
<hr />
<h2 id="heading-part-2-how-transformers-pay-attention">Part 2: How Transformers Pay Attention</h2>
<p>Before we can explain what a model is looking at, we need to understand how modern Vision Language models "look" at things in the first place. The key mechanism is called attention.</p>
<h3 id="heading-the-cocktail-party">The Cocktail Party</h3>
<p>Imagine you're at a crowded party. Dozens of conversations are happening simultaneously, creating a wall of noise. Yet somehow, when someone across the room says your name, you hear it. Your brain has learned to selectively attend to relevant information while filtering out the rest.</p>
<p>Transformer models do something similar. When processing an image and a question like "Is there a fracture in this X-ray?", the model doesn't treat every pixel and every word as equally important. It learns to focus its computational resources on the parts that matter for answering the question.</p>
<h3 id="heading-attention-as-a-spotlight">Attention as a Spotlight</h3>
<p>Think of attention as a spotlight that the model can shine on different parts of its input. When reading the word "fracture" in the question, the model might shine its spotlight on certain regions of the X-ray. When it encounters the word "bone" in its internal processing, the spotlight might shift to highlight skeletal structures.</p>
<p>Technically, attention works through a learned matching process. Each piece of input (called a "token") asks a question: "What should I pay attention to?" This is called a query. Every other token offers up a description of itself, called a key. The attention mechanism computes how well each query matches each key, producing a set of attention weights that sum to one. High weights mean "pay close attention to this," while low weights mean "mostly ignore this."</p>
<p>Here's the crucial insight: these attention weights create a map of relationships. If token A has high attention weight on token B, it means A is gathering information from B. By examining these weights, we can see what's influencing what.</p>
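<p>In code, this matching process is just a scaled dot product followed by a softmax. Here is a minimal NumPy sketch (the shapes and names are illustrative, not the actual model implementation):</p>
<pre><code class="lang-python">import numpy as np

def attention_weights(Q, K):
    """Q, K: (num_tokens, head_dim) queries and keys.
    Returns a (num_tokens, num_tokens) matrix whose rows sum to one."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # how well each query matches each key
    scores -= scores.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
A = attention_weights(Q, K)
</code></pre>
<p>Row <em>i</em> of <code>A</code> is exactly the "spotlight" for token <em>i</em>: a distribution over every other token saying how much to attend to each.</p>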
<h3 id="heading-multiple-heads-multiple-perspectives">Multiple Heads, Multiple Perspectives</h3>
<p>Modern transformers don't use just one spotlight. They use many, called "attention heads." Each head can focus on different aspects of the input. One head might track syntactic relationships (subject-verb connections in text), another might track semantic similarity (words with related meanings), and another might track positional relationships (things that are close together).</p>
<p>It's like having a team of detectives investigating a case. One looks for physical evidence, another interviews witnesses, a third analyzes financial records. Each brings a different perspective, and the final conclusion synthesizes all their findings.</p>
<h3 id="heading-layers-upon-layers">Layers Upon Layers</h3>
<p>Transformers also stack multiple layers of attention. The first layer might capture simple relationships: "this word relates to that word." But higher layers can capture more complex, abstract patterns: "this concept connects to that concept in this particular way."</p>
<p>Think of it like the visual system in your brain. Early layers detect edges and colors. Middle layers combine those into shapes and textures. Higher layers recognize objects, faces, and scenes. Each layer builds on the representations from the layer below.</p>
<hr />
<h2 id="heading-part-3-the-challenge-of-multi-layer-attribution">Part 3: The Challenge of Multi-Layer Attribution</h2>
<p>Now we arrive at the core problem that Chefer et al. set out to solve.</p>
<p>If we want to know what the model looked at to produce its output, we can't just examine the attention weights from a single layer. The information has been transformed, combined, and re-routed through dozens of layers. The final output is influenced by patterns that were established early and propagated forward, modified at each step.</p>
<h3 id="heading-the-river-delta">The River Delta</h3>
<p>Imagine tracing where a drop of water in the ocean came from. You find it at the river's mouth, but that river was fed by dozens of tributaries, each of which was fed by smaller streams, each of which collected from countless tiny sources across a vast watershed.</p>
<p>The water at the mouth contains contributions from all those sources, but the contributions aren't equal. A large tributary contributes more than a tiny stream. And some sources might have their water diverted or absorbed before it reaches the ocean.</p>
<p>This is exactly our situation with attention. The final output token is like the water at the river's mouth. It contains information that flowed from all the input tokens (the sources), but that information passed through many intermediate stages (the tributaries), being combined and filtered at each step.</p>
<p>To understand where the output came from, we need to trace these flows backward through the entire network.</p>
<h3 id="heading-why-raw-attention-fails">Why Raw Attention Fails</h3>
<p>A naive approach is to just look at the attention weights in the final layer. After all, that's the last step before the output, so shouldn't it tell us what the model was looking at?</p>
<p>Unfortunately, no. The final layer's attention operates on highly processed representations, not the original input. By that point, information from many different input tokens has been mixed together. When the final layer attends to position 47, it's not attending to whatever was originally at position 47. It's attending to a rich mixture of information that has accumulated at that position through all the previous layers.</p>
<p>It's like asking "where did this river water come from?" and answering "from right there, just upstream." Technically true, but it misses the entire watershed that actually supplied the water.</p>
<h3 id="heading-the-rollout-approach-and-its-limitations">The Rollout Approach and Its Limitations</h3>
<p>One early solution was called "attention rollout." The idea is to multiply attention matrices from consecutive layers together, tracing how attention flows through the network.</p>
<p>If layer 1 says "token A attends to token B" and layer 2 says "token B attends to token C," then we can infer that token A indirectly attends to token C through the path A→B→C. By multiplying attention matrices, we can compute these indirect attention relationships.</p>
<p>This is a step in the right direction, but it has a fundamental flaw: it treats all attention equally, whether positive or negative. In reality, some attention connections amplify information while others suppress it. When we multiply matrices together without considering these signs, positive and negative contributions can cancel out in misleading ways.</p>
<p>Imagine tracking money flow through a company. Some transfers add money to departments, others subtract it. If you just add up all the transfers without considering direction, you'll get a very wrong picture of where resources actually ended up.</p>
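<p>For concreteness, here is a minimal sketch of rollout (following the common formulation that adds an identity term for residual connections; the function name and shapes are mine):</p>
<pre><code class="lang-python">import numpy as np

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of (n, n) head-averaged attention matrices, layer 0 first."""
    n = attn_per_layer[0].shape[0]
    R = np.eye(n)
    for A in attn_per_layer:
        A_res = 0.5 * A + 0.5 * np.eye(n)           # account for the residual connection
        A_res /= A_res.sum(axis=-1, keepdims=True)  # re-normalize each row
        R = A_res @ R                               # compose indirect paths (A to B to C)
    return R
</code></pre>
<p>Notice that every entry is weighted by raw attention alone. There is no signal anywhere in this computation saying whether a connection amplified or suppressed the output, which is precisely the flaw described above.</p>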
<hr />
<h2 id="heading-part-4-the-chefer-method-step-by-step">Part 4: The Chefer Method, Step by Step</h2>
<p>Now we're ready to understand the solution that Chefer et al. developed. Their method addresses the limitations we've discussed by carefully tracking how relevance propagates through the network while respecting the gradient information that tells us whether connections are helpful or harmful.</p>
<h3 id="heading-the-core-insight-gradients-tell-us-what-matters">The Core Insight: Gradients Tell Us What Matters</h3>
<p>Here's a key insight: not all attention is created equal. When the model is deciding whether to output "yes" or "no" for "Is there a fracture?", some attention connections are crucial to that decision while others are incidental.</p>
<p>How can we tell which is which? Gradients.</p>
<p>When we train neural networks, we compute gradients that tell us how changing each parameter would affect the output. But we can also compute gradients for intermediate values like attention weights. If changing an attention weight would significantly change the output, that weight has high gradient magnitude. If changing it would barely matter, the gradient is small.</p>
<p>By multiplying attention weights by their gradients, we can identify which connections actually matter for the specific output we're trying to explain.</p>
<h3 id="heading-the-recipe">The Recipe</h3>
<p>Let me walk through the algorithm step by step, building intuition as we go.</p>
<p><strong>Step 1: Initialize with Identity</strong></p>
<p>We start by creating a "relevance matrix" R that's initially an identity matrix. An identity matrix has ones on the diagonal and zeros everywhere else. This represents our starting assumption: before any attention happens, each token is relevant only to itself.</p>
<p>Think of it as the starting state before the cocktail party begins. Everyone is self-contained, not yet influenced by anyone else.</p>
<p><strong>Step 2: Process Each Layer</strong></p>
<p>For each attention layer in the network, we update R to account for the new connections being made. The attention matrix A tells us how tokens attended to each other at this layer.</p>
<p>But we don't use A directly. First, we weight it by the gradient to identify which connections matter:</p>
<pre><code class="lang-markdown">Ā = average across heads of (gradient × attention)⁺
</code></pre>
<p>The × means element-wise multiplication. The ⁺ means we keep only positive values, setting negatives to zero. The average combines information from all the attention heads.</p>
<p>Why remove negatives? Because we're tracking positive relevance, contributions that support the output. Negative gradients indicate connections that would hurt the output if strengthened, and we don't want those polluting our relevance map.</p>
<p><strong>Step 3: Accumulate Through Residual Connections</strong></p>
<p>Modern transformers have "residual connections" that allow information to skip layers. This means the output of a layer is the sum of the attention output plus the original input, passed through unchanged.</p>
<p>To account for this, we add the new relevance to the existing relevance rather than replacing it:</p>
<pre><code class="lang-markdown">R = R + Ā × R
</code></pre>
<p>The matrix multiplication Ā × R is the key operation. It says: "The relevance of token i to token j is the sum over all intermediate tokens k of how much i attends to k times how relevant k was to j."</p>
<p>This is exactly the tributary logic. To know how much water source i contributes to outlet j, you sum over all intermediate points: how much flows from i to each intermediate point, times how much that point contributes to j.</p>
<p><strong>Step 4: Extract the Explanation</strong></p>
<p>After processing all layers, R contains the accumulated relevance. To explain a particular output token, we look at the row of R corresponding to that token. This row tells us how relevant each input token is to that output.</p>
<p>For image-text models, we can split this relevance vector into the image tokens and text tokens, giving us separate explanations for what visual regions and what words influenced the prediction.</p>
<h3 id="heading-a-worked-example">A Worked Example</h3>
<p>Let's trace through a tiny example to make this concrete. Imagine a three-token sequence and a two-layer transformer.</p>
<p>We start with:</p>
<pre><code class="lang-markdown">R = [1 0 0]
    [0 1 0]
    [0 0 1]
</code></pre>
<p>Each token is relevant only to itself.</p>
<p>Layer 1 has gradient-weighted attention:</p>
<pre><code class="lang-markdown">Ā₁ = [0.1 0.3 0.2]
     [0.2 0.1 0.4]
     [0.3 0.2 0.1]
</code></pre>
<p>Token 0 attends mostly to token 1 (weight 0.3). Token 1 attends mostly to token 2 (weight 0.4). Token 2 attends mostly to token 0 (weight 0.3).</p>
<p>We update R:</p>
<pre><code class="lang-markdown">R = R + Ā₁ × R
R = I + Ā₁ × I
R = I + Ā₁
R = [1.1 0.3 0.2]
    [0.2 1.1 0.4]
    [0.3 0.2 1.1]
</code></pre>
<p>Now token 0 has picked up some relevance from tokens 1 and 2. The diagonal values increased slightly because of self-attention.</p>
<p>Layer 2 has gradient-weighted attention:</p>
<pre><code class="lang-markdown">Ā₂ = [0.2 0.4 0.1]
     [0.1 0.2 0.5]
     [0.4 0.1 0.2]
</code></pre>
<p>We update R again:</p>
<pre><code class="lang-markdown">R = R + Ā₂ × R
</code></pre>
<p>I'll spare you the matrix arithmetic, but the result is that R now captures both direct attention (token i attended to token j at some layer) and indirect attention (token i attended to token k, which had previously gathered information from token j).</p>
<p>If we want to explain what influenced token 2, we look at row 2 of the final R. If we want to explain token 0, we look at row 0.</p>
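<p>If you would rather not take the arithmetic on faith, the whole worked example is a few lines of NumPy:</p>
<pre><code class="lang-python">import numpy as np

A1 = np.array([[0.1, 0.3, 0.2],
               [0.2, 0.1, 0.4],
               [0.3, 0.2, 0.1]])
A2 = np.array([[0.2, 0.4, 0.1],
               [0.1, 0.2, 0.5],
               [0.4, 0.1, 0.2]])

R = np.eye(3)      # identity init
R = R + A1 @ R     # after layer 1: R = I + A1
R = R + A2 @ R     # after layer 2
print(np.round(R, 2))
</code></pre>
<p>The final matrix is [[1.43, 0.82, 0.51], [0.50, 1.45, 1.05], [0.82, 0.47, 1.44]], so row 2, [0.82, 0.47, 1.44], is the relevance profile for output token 2.</p>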
<hr />
<h2 id="heading-part-5-applying-this-to-vision-language-models">Part 5: Applying This to Vision-Language Models</h2>
<p>The method we've described works for any transformer. But applying it to vision-language models like MedGemma requires understanding how these models are structured.</p>
<h3 id="heading-how-images-become-tokens">How Images Become Tokens</h3>
<p>Vision-language models convert images into sequences of tokens that can be processed alongside text. The typical approach uses a vision encoder that divides the image into patches (small rectangular regions) and produces one token per patch.</p>
<p>For MedGemma, an 896×896 pixel image is divided into 14×14 pixel patches, producing a 64×64 grid of patches. These are then pooled down to a 16×16 grid, yielding 256 image tokens. These 256 tokens capture the visual content of the image in a form the language model can process.</p>
<p>When you ask MedGemma "Is there a fracture in this X-ray?", the model receives a sequence that looks like:</p>
<pre><code class="lang-markdown">[img_0, img_1, ..., img_255, "Is", "there", "a", "fracture", "in", "this", "X", "-", "ray", "?"]
</code></pre>
<p>The first 256 positions are image tokens. The rest are text tokens. The model's attention operates over this combined sequence, allowing image tokens to attend to text and vice versa.</p>
<h3 id="heading-generating-the-explanation">Generating the Explanation</h3>
<p>When we apply the Chefer method to this combined sequence, we get a relevance vector that tells us how much each position influenced the output. The first 256 values correspond to image regions. We can reshape these into a 16×16 grid and overlay it on the original image as a heatmap.</p>
<p>High values indicate "the model looked here when generating its answer." Low values indicate "this region didn't much matter."</p>
<p>For the text tokens, we get relevance values that tell us which words in the question were most important. If the question was about fractures, we'd expect "fracture" and "bone" to have higher relevance than "is" or "there."</p>
<hr />
<h2 id="heading-part-6-the-method-in-action-my-medgemma-results">Part 6: The Method in Action — My MedGemma Results</h2>
<p>Theory is one thing. Seeing it work is another. I implemented the Chefer method for Google's MedGemma 1.5 4B, a vision-language model specifically trained for medical image understanding.</p>
<p><strong>The full implementation is available at:</strong> <a target="_blank" href="https://github.com/thedatasense/medgemma-explainer"><strong>github.com/thedatasense/medgemma-explainer</strong></a></p>
<p>Let me walk through two examples that demonstrate the method's power.</p>
<h3 id="heading-example-1-finding-the-remote-control">Example 1: Finding the Remote Control</h3>
<p>Before tackling medical images, let's start with a simpler test case. Here's an image of a cat sitting on a couch with a remote control visible at the bottom of the frame.</p>
<p>When I ask MedGemma "Where is the remote?" and explain the "remote" token in its response, the relevancy map shows exactly what we'd hope to see: the highest attention is concentrated at the bottom-center of the image, precisely where the remote control is located.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768782707631/dbb40985-5da3-43be-8131-3e13076e4b77.png" alt /></p>
<p><em>Figure 1: When explaining the "remote" token, the model's attention is correctly focused on the bottom-center region where the remote control is located. The bar chart quantifies relevancy by region, with bottom-center scoring 0.226 compared to just 0.051 for top-left.</em></p>
<p>The bar chart on the right quantifies this. The bottom-center region (where the remote actually is) has a mean relevancy of 0.226, more than four times higher than the top-left region at 0.051. The model isn't just producing a vague, diffuse attention pattern. It's looking at exactly the right place.</p>
<p>This is the kind of sanity check that builds confidence. If the method highlighted the cat instead of the remote when explaining the word "remote," we'd know something was wrong with either the model or our explainability implementation.</p>
<h3 id="heading-example-2-chest-x-ray-pneumonia-detection">Example 2: Chest X-ray Pneumonia Detection</h3>
<p>Now for a clinically meaningful example. Here's a chest X-ray from a patient with right middle lobe pneumonia. A critical detail to understand: in a standard PA (posterior-anterior) chest X-ray, the image is oriented as if you're facing the patient. This means the left side of the image corresponds to the patient's RIGHT side.</p>
<p>When I ask MedGemma "Is there evidence of pneumonia?" the model generates a response mentioning consolidation in the right lung. Using the Chefer method, I can explain individual tokens in that response.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768782726493/7bed9263-2075-43b2-a506-ede67f85efa7.png" alt /></p>
<p><em>Figure 2: Chest X-ray analysis showing token-specific explanations. Top row: original image (with anatomical labels), whole answer explanation, and "pneumonia" token explanation. Bottom row: "consolidation", "opacity", and "right" token explanations. Each shows attention correctly focused on the patient's right lung (left side of image) where the pathology is located.</em></p>
<p>Look at the "pneumonia" token explanation (top right). The relevancy map shows concentrated attention on the left side of the image, which is the patient's right lung, exactly where the pathology is located. The quantitative scores confirm this: patient right lung relevancy is 0.140, nearly four times higher than the patient left lung at 0.037.</p>
<p>Even more striking is the "right" token explanation (bottom right of the figure). When the model generates the word "right" (as in "right lung"), the attention is strongly focused on the patient's anatomical right side. The model isn't just pattern-matching words; it's correctly grounding the anatomical term to the corresponding image region.</p>
<p>The "consolidation" and "opacity" tokens show similar patterns, highlighting the area of increased density that characterizes pneumonic infiltration.</p>
<hr />
<h2 id="heading-part-7-a-critical-implementation-detail">Part 7: A Critical Implementation Detail</h2>
<p>While implementing this method, I discovered a subtle but crucial detail that isn't obvious from the original paper. Getting this wrong produces meaningless results. Getting it right makes everything work.</p>
<h3 id="heading-the-backpropagation-target-problem">The Backpropagation Target Problem</h3>
<p>In causal language models like MedGemma, the logit at position i predicts the token at position i+1. This offset matters enormously for explainability.</p>
<p>If you want to explain why the model generated a specific token at position p, you must backpropagate from the logit at position p-1, not position p. And you should use the actual token ID that was generated, not the argmax of the logits.</p>
<p>Here's the wrong approach that I see in many implementations:</p>
<pre><code class="lang-python"><span class="hljs-comment"># WRONG - This explains "what comes after the last token"</span>
target_logit = logits[<span class="hljs-number">0</span>, <span class="hljs-number">-1</span>, logits[<span class="hljs-number">0</span>, <span class="hljs-number">-1</span>].argmax()]
</code></pre>
<p>Here's the correct approach:</p>
<pre><code class="lang-python"><span class="hljs-comment"># CORRECT - This explains why the token at position p was generated</span>
logit_position = target_token_position - <span class="hljs-number">1</span>
target_token_id = input_ids[<span class="hljs-number">0</span>, target_token_position]  <span class="hljs-comment"># The actual token</span>
target_logit = logits[<span class="hljs-number">0</span>, logit_position, target_token_id]
</code></pre>
<p>This distinction might seem pedantic, but it's the difference between coherent, focused explanations and noisy, meaningless heatmaps. When I fixed this in my implementation, the results went from confusing to crisp.</p>
<h3 id="heading-medgemmas-architecture">MedGemma's Architecture</h3>
<p>For those interested in the technical details, MedGemma 1.5 4B has some architectural features that required careful handling.</p>
<p>The model uses 34 transformer layers with grouped-query attention, where 8 query heads share 4 key-value heads. It also employs a 5:1 ratio of local to global attention layers, where local layers only attend within a 1024-token window. Global attention layers (at positions 5, 11, 17, 23, and 29) can attend to the full sequence.</p>
<p>Images are processed by a SigLIP vision encoder that produces 256 image tokens arranged in a 16×16 grid. These tokens occupy positions 6 through 261 in the input sequence, with text tokens following after.</p>
<p>Understanding this token structure is essential for correctly extracting and visualizing image relevancy. When you pull out the first 256 values from the relevancy vector and reshape them into a 16×16 grid, you get a spatial map that can be overlaid on the original image.</p>
<h3 id="heading-other-implementation-notes">Other Implementation Notes</h3>
<p>A few other details that matter:</p>
<ul>
<li><p><strong>Use eager attention</strong>: MedGemma's default SDPA (Scaled Dot Product Attention) implementation doesn't support <code>output_attentions=True</code>. You need to load the model with <code>attn_implementation="eager"</code>.</p>
</li>
<li><p><strong>Keep the model in eval mode</strong>: Use <code>torch.enable_grad()</code> context instead of calling <code>model.train()</code>. This preserves the inference behavior while allowing gradient computation.</p>
</li>
<li><p><strong>Convert to float32</strong>: Attention tensors come out as bfloat16. Convert them to float32 for stable gradient computation.</p>
</li>
<li><p><strong>Retain gradients</strong>: Call <code>attn.requires_grad_(True)</code> and <code>attn.retain_grad()</code> on the attention tensors before the backward pass.</p>
</li>
</ul>
<hr />
<h2 id="heading-part-8-implications-for-medical-ai-safety">Part 8: Implications for Medical AI Safety</h2>
<p>Let me return to where we started: the challenge of trusting AI in medicine.</p>
<p>Medical decisions carry enormous stakes. A false negative might mean a missed cancer. A false positive might mean unnecessary surgery. We can't simply trust AI systems because they score well on benchmarks. We need to verify they're reasoning correctly.</p>
<p>The Chefer method gives us a tool for this verification. When a model says "this X-ray shows signs of pneumonia," we can ask "show me what you're looking at." If the heatmap highlights the lung region with the suspicious opacity, our confidence increases. If it highlights the patient's ID number or the machine manufacturer's logo, we know something is wrong.</p>
<p>This isn't just about catching errors. It's about building appropriate trust. Explainability lets us calibrate our reliance on AI to match its actual capabilities. We might trust the model more in situations where its attention patterns look sensible, and trust it less when its reasoning seems confused.</p>
<h3 id="heading-the-limitation-to-remember">The Limitation to Remember</h3>
<p>One important caveat: attention-based explanations show us what the model looked at, not necessarily why. Two models might look at the same region but interpret it differently. One might correctly identify an abnormality, while another might misclassify it.</p>
<p>Think of it this way: if two doctors are examining the same X-ray, knowing they're both looking at the lower right lung is useful, but it doesn't guarantee they'll reach the same conclusion. The attention map is the "where," not the "what" or "why."</p>
<p>This means explainability methods are one tool among many. They're most powerful when combined with other approaches like testing on diverse datasets, comparing to expert annotations, and conducting systematic error analysis.</p>
<hr />
<h2 id="heading-conclusion-opening-doors-not-just-black-boxes">Conclusion: Opening Doors, Not Just Black Boxes</h2>
<p>We've covered a lot of ground in this post. We started with the problem of understanding what AI models are looking at, built up an understanding of how attention works in transformers, and walked through a method that traces relevance through multi-layer networks.</p>
<p>The Chefer method is elegant because it respects the actual computational structure of transformer models. Rather than treating the network as an inscrutable black box, it uses the model's own attention patterns and gradients to surface meaningful explanations.</p>
<p>For those working with medical AI, methods like this are essential. They transform the question "can we trust this model?" from philosophical hand-waving into concrete investigation. We can look at what the model sees, compare it to clinical expectations, and make informed decisions about deployment.</p>
<hr />
<h2 id="heading-try-it-yourself">Try It Yourself</h2>
<p>The complete implementation is available on GitHub:</p>
<p><a target="_blank" href="https://github.com/thedatasense/medgemma-explainer"><strong>github.com/thedatasense/medgemma-explainer</strong></a></p>
<p>The repository includes:</p>
<ul>
<li><p>Full source code for the explainability method</p>
</li>
<li><p>Jupyter notebook tutorials</p>
</li>
<li><p>Example scripts for medical image analysis</p>
</li>
<li><p>Visualization utilities</p>
</li>
</ul>
<p>Feel free to use it, extend it, and let me know what you discover.</p>
<hr />
<h2 id="heading-further-reading">Further Reading</h2>
<p>If you want to dive deeper into the technical details:</p>
<p><strong>The original paper</strong>: Chefer, H., Gur, S., &amp; Wolf, L. (2021). Transformer Interpretability Beyond Attention Visualization. CVPR 2021. <a target="_blank" href="https://arxiv.org/abs/2012.09838">arXiv:2012.09838</a></p>
<p><strong>Generic Attention Explainability paper</strong>: Chefer, H., Gur, S., &amp; Wolf, L. (2021). Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. <a target="_blank" href="https://arxiv.org/abs/2103.15679">arXiv:2103.15679</a></p>
<p><strong>The authors' code</strong>: <a target="_blank" href="https://github.com/hila-chefer/Transformer-MM-Explainability">github.com/hila-chefer/Transformer-MM-Explainability</a></p>
<p><strong>MedGemma</strong>: <a target="_blank" href="https://huggingface.co/google/medgemma-1.5-4b-it">huggingface.co/google/medgemma-1.5-4b-it</a></p>
<p><strong>Attention mechanisms</strong>: Vaswani, A., et al. (2017). Attention Is All You Need. <a target="_blank" href="https://arxiv.org/abs/1706.03762">arXiv:1706.03762</a></p>
<p><strong>AI explainability in healthcare</strong>: Ghassemi, M., et al. (2021). The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health.</p>
<hr />
<p><em>This post is part of ongoing research into clinically robust vision-language models. If you're working on similar problems or have questions about the implementation, feel free to reach out or open an issue on GitHub.</em></p>
]]></content:encoded></item><item><title><![CDATA[Data Generating Process]]></title><description><![CDATA[Data does not just appear. Something creates it. A coin flip. A measurement device. A biological process. A human decision. Understanding that something, the mechanism that generates observations, is the key to understanding uncertainty.
This mechani...]]></description><link>https://thedatasense.com/data-generating-process</link><guid isPermaLink="true">https://thedatasense.com/data-generating-process</guid><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 14 Jan 2026 05:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Data does not just appear. Something creates it. A coin flip. A measurement device. A biological process. A human decision. Understanding that something, the mechanism that generates observations, is the key to understanding uncertainty.</p>
<p>This mechanism has a name: the Data Generating Process, or DGP.</p>
<p>You can run these experiments in a free Google Colab environment: <a target="_blank" href="https://colab.research.google.com/drive/1_DKAG4dXC66WrIPlCpKy6mfToCa9QxsO?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>
<h2 id="heading-what-is-a-data-generating-process">What Is a Data Generating Process?</h2>
<p>A DGP is the real world system that produces the numbers you eventually analyze. It includes everything: the true underlying signal, the noise, the measurement error, the selection bias, the sampling method.</p>
<p>When you flip a coin 100 times and count heads, the DGP is the physics of the coin, the force of your thumb, the air resistance, and everything else that determines whether each flip lands heads or tails. In practice, we model this as a simple probability: each flip has some chance p of landing heads, independent of other flips.</p>
<p>When a hospital records patient outcomes, the DGP includes the disease biology, treatment effects, patient compliance, measurement protocols, and which patients showed up in the first place.</p>
<p>The data you see is just one possible output from the DGP. Run the process again and you get different numbers. This is where uncertainty comes from.</p>
<h2 id="heading-an-example">An Example</h2>
<p>Suppose I want to know whether a new drug lowers blood pressure. I run a trial with 50 patients. Half get the drug, half get placebo. I measure the difference in blood pressure between groups.</p>
<p>The traditional approach: calculate a t-statistic, look up a p-value, declare significance or not.</p>
<p>The DGP approach: first, write down what you think is generating the data.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_trial_data</span>(<span class="hljs-params">n_per_group, true_effect, noise_sd</span>):</span>
    <span class="hljs-comment"># Placebo group: just noise around baseline</span>
    placebo = np.random.normal(<span class="hljs-number">0</span>, noise_sd, n_per_group)

    <span class="hljs-comment"># Treatment group: true effect plus noise</span>
    treatment = np.random.normal(true_effect, noise_sd, n_per_group)

    <span class="hljs-keyword">return</span> placebo, treatment
</code></pre>
<p>This function is a DGP. It specifies exactly how the data comes into existence. The true effect is a parameter I control. The noise standard deviation is another parameter. When I call this function, I get one possible trial result.</p>
<p>Here is the insight: I can call it many times.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_simulation</span>(<span class="hljs-params">n_per_group, true_effect, noise_sd, n_simulations</span>):</span>
    observed_differences = []

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_simulations):
        placebo, treatment = generate_trial_data(n_per_group, true_effect, noise_sd)
        diff = treatment.mean() - placebo.mean()
        observed_differences.append(diff)

    <span class="hljs-keyword">return</span> np.array(observed_differences)
</code></pre>
<p>The figure below shows what happens when we do this. On the left is the DGP itself, just a box with parameters. In the middle, we run it five times and see five different trial results. On the right, we run it 10,000 times and see the full distribution of possible outcomes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768864552874/fa00af11-0414-4b47-8e18-b1ab131d5594.png" alt /></p>
<p>That distribution on the right is uncertainty made visible. The true effect is 5 mmHg, but any single trial might show anywhere from -5 to +15 just due to noise. This is why we need statistics: to separate signal from noise.</p>
<h2 id="heading-the-null-distribution-and-p-values">The Null Distribution and P-values</h2>
<p>If I set <code>true_effect=0</code> and run 10,000 simulations, I get the distribution of differences I would see if the drug does nothing. This is the null distribution. I built it by simulating the null world.</p>
<p>If my actual trial shows a difference of 8.5 mmHg, I can see where that falls in the null distribution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768864582121/97f2c136-7617-41f4-a91e-afd5b092e4e0.png" alt /></p>
<p>The red lines mark my observed value and its mirror. The p-value is just the fraction of the null distribution that falls beyond those lines. In this case, about 0.3% of the null simulations produced results as extreme as what I observed.</p>
<p>This is what "statistically significant" means. My result is unlikely to have come from the null world.</p>
<p>No formulas. No t-tables. Just direct simulation of what would happen if the drug did nothing.</p>
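<p>The whole calculation fits in a few lines. A minimal sketch, assuming 25 patients per group and a noise standard deviation of 10 mmHg (illustrative numbers, not taken from the trial above):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_group, noise_sd, n_sims = 25, 10.0, 10_000

# Simulate the null world: the drug does nothing, so both groups
# are just noise around the same baseline.
placebo_means = rng.normal(0, noise_sd, (n_sims, n_per_group)).mean(axis=1)
treatment_means = rng.normal(0, noise_sd, (n_sims, n_per_group)).mean(axis=1)
null_diffs = treatment_means - placebo_means

observed = 8.5  # the difference seen in the actual trial
p_value = np.mean(np.abs(null_diffs) >= observed)
print(f"p = {p_value:.4f}")
```

<p>The two-sided p-value is literally the fraction of simulated null differences at least as extreme as the observed one.</p>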
<h2 id="heading-why-this-changes-everything">Why This Changes Everything</h2>
<p>When you write the DGP, you confront your assumptions explicitly.</p>
<p>Look at my function again. I assumed both groups have the same noise level. I assumed the noise is normally distributed. I assumed each patient's outcome is independent of others. These assumptions are now visible in the code, not hidden in the derivation of a test statistic.</p>
<p>What if those assumptions are wrong?</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_trial_data_realistic</span>(<span class="hljs-params">n_per_group, true_effect, noise_sd</span>):</span>
    <span class="hljs-comment"># Some patients respond strongly, others barely respond</span>
    responder_fraction = <span class="hljs-number">0.3</span>

    placebo = np.random.normal(<span class="hljs-number">0</span>, noise_sd, n_per_group)

    treatment = []
    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_per_group):
        <span class="hljs-keyword">if</span> np.random.random() &lt; responder_fraction:
            <span class="hljs-comment"># Responder: large effect</span>
            treatment.append(np.random.normal(true_effect * <span class="hljs-number">2</span>, noise_sd))
        <span class="hljs-keyword">else</span>:
            <span class="hljs-comment"># Non-responder: small effect</span>
            treatment.append(np.random.normal(true_effect * <span class="hljs-number">0.2</span>, noise_sd))

    <span class="hljs-keyword">return</span> placebo, np.array(treatment)
</code></pre>
<p>Now I have a bimodal response. Some patients are responders, others are not. The average effect might be the same, but the distribution looks different. Does my statistical test still work? I can find out by running simulations with this new DGP and checking whether my test maintains its false positive rate.</p>
<p>This is the power of thinking generatively. You can stress test your methods against any scenario you can imagine and code.</p>
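<p>Here is one such stress test, sketched end to end. All parameters and the |t| &gt; 2 rejection threshold are my illustrative choices, and the pooled t-statistic is hand-rolled rather than taken from a library. Note that with <code>true_effect=0</code> the bimodal DGP collapses to the simple one, so this sketch instead violates the other assumption flagged above: equal noise in both groups.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def t_statistic(a, b):
    # Classic pooled-variance t-statistic (assumes equal noise in both groups)
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (b.mean() - a.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

def false_positive_rate(sd_a, sd_b, na, nb, n_sims=2000, threshold=2.0):
    # How often does the test reject when the true effect is zero?
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, sd_a, na)
        b = rng.normal(0, sd_b, nb)
        rejections += abs(t_statistic(a, b)) > threshold
    return rejections / n_sims

fpr_ok = false_positive_rate(10, 10, 25, 25)   # assumptions hold: near 5%
fpr_bad = false_positive_rate(10, 30, 40, 10)  # unequal noise + sizes: inflated
print(fpr_ok, fpr_bad)
```

<p>When the equal-noise assumption holds, the test rejects about 5% of null datasets, as advertised. Break the assumption and the false positive rate inflates well past its nominal level, something you would never see by staring at the formula alone.</p>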
<h2 id="heading-the-sampling-distribution-demystified">The Sampling Distribution Demystified</h2>
<p>The central mystery in introductory statistics is the sampling distribution. Students learn that if you take many samples and compute the mean of each, those means form a distribution that is approximately normal with standard deviation \(\sigma/\sqrt{n}\).</p>
<p>This is true. But why?</p>
<p>The DGP approach lets you see it happen.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">demonstrate_sampling_distribution</span>(<span class="hljs-params">population, sample_size, n_samples</span>):</span>
    sample_means = []

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_samples):
        sample = np.random.choice(population, size=sample_size, replace=<span class="hljs-literal">True</span>)
        sample_means.append(sample.mean())

    <span class="hljs-keyword">return</span> np.array(sample_means)

<span class="hljs-comment"># Create a weird, non-normal population</span>
population = np.concatenate([
    np.random.exponential(<span class="hljs-number">2</span>, <span class="hljs-number">5000</span>),
    np.random.normal(<span class="hljs-number">10</span>, <span class="hljs-number">1</span>, <span class="hljs-number">5000</span>)
])
</code></pre>
<p>Look at what happens:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768864603757/9edd8660-1039-4053-a21a-e0772bb3cf43.png" alt /></p>
<p>The top left panel shows the population. It is not normal at all. Two peaks, a long tail, nothing like a bell curve.</p>
<p>But watch what happens as we draw samples and compute means. At n=5, still pretty weird. At n=30, looking more normal. At n=100, almost perfectly bell-shaped.</p>
<p>The Central Limit Theorem is not a formula to memorize. It is something you can watch happen. The averaging process smooths out the weirdness. Larger samples smooth more. The standard deviation of the sampling distribution shrinks from 2.35 to 0.95 to 0.52, roughly following \(1/\sqrt{n}\).</p>
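<p>You can reproduce the shrinking spread yourself with a compact vectorized version of the function above (the population here is rebuilt the same way; exact numbers depend on your draw):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of weird, bimodal population as above
population = np.concatenate([
    rng.exponential(2, 5000),
    rng.normal(10, 1, 5000),
])

stds = {}
for n in (5, 30, 100):
    # 10,000 samples of size n, one sample per row, one mean per row
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    stds[n] = sample_means.std()
    print(f"n={n:>3}: sd of sample means = {stds[n]:.2f}, "
          f"sigma/sqrt(n) = {population.std() / np.sqrt(n):.2f}")
```

<p>The two columns track each other closely at every n, which is the \(\sigma/\sqrt{n}\) claim made visible.</p>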
<h2 id="heading-bootstrap-when-you-only-have-one-sample">Bootstrap: When You Only Have One Sample</h2>
<p>In real life, you run one trial. You collect one dataset. You cannot go back and sample the population again.</p>
<p>The bootstrap solves this by treating your sample as if it were the population.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">bootstrap_confidence_interval</span>(<span class="hljs-params">data, n_bootstrap, confidence=<span class="hljs-number">0.95</span></span>):</span>
    bootstrap_means = []
    n = len(data)

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_bootstrap):
        <span class="hljs-comment"># Resample with replacement from your actual data</span>
        resample = np.random.choice(data, size=n, replace=<span class="hljs-literal">True</span>)
        bootstrap_means.append(resample.mean())

    <span class="hljs-comment"># Find percentiles</span>
    lower = np.percentile(bootstrap_means, (<span class="hljs-number">1</span> - confidence) / <span class="hljs-number">2</span> * <span class="hljs-number">100</span>)
    upper = np.percentile(bootstrap_means, (<span class="hljs-number">1</span> + confidence) / <span class="hljs-number">2</span> * <span class="hljs-number">100</span>)

    <span class="hljs-keyword">return</span> lower, upper
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768865346070/a21070df-b2ef-4452-b335-245c7b05e991.png" alt /></p>
<p>On the left is your one sample of 30 observations. This is all you have. On the right is what happens when you resample from it 10,000 times. The spread of those bootstrap means gives you the confidence interval directly. The middle 95% spans from 94.5 to 106.8.</p>
<p>No formulas involving t-distributions. No assumptions about normality. Just simulation.</p>
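<p>Running the idea end to end looks like this, as a vectorized equivalent of the loop above. The sample itself is synthetic here, drawn from a normal with mean 100 and sd 15 purely for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Your one sample of 30 observations -- all you have
data = rng.normal(100, 15, 30)

# 10,000 bootstrap resamples, one per row, means down the rows
boot_means = rng.choice(data, size=(10_000, len(data)), replace=True).mean(axis=1)
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

<p>The percentile interval of the bootstrap means is the confidence interval, no distributional formula required.</p>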
<h2 id="heading-permutation-tests-simulating-the-null-world">Permutation Tests: Simulating the Null World</h2>
<p>Back to the drug trial. I want a p-value. How unlikely is my observed difference if the drug does nothing?</p>
<p>The permutation test answers this by explicitly constructing the null world.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">permutation_test</span>(<span class="hljs-params">group1, group2, n_permutations</span>):</span>
    observed_diff = group2.mean() - group1.mean()
    combined = np.concatenate([group1, group2])
    n1 = len(group1)

    null_diffs = []
    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_permutations):
        <span class="hljs-comment"># Shuffle all observations</span>
        np.random.shuffle(combined)
        <span class="hljs-comment"># Split into fake groups</span>
        fake_group1 = combined[:n1]
        fake_group2 = combined[n1:]
        null_diffs.append(fake_group2.mean() - fake_group1.mean())

    <span class="hljs-comment"># P-value: fraction of null differences as extreme as observed</span>
    null_diffs = np.array(null_diffs)
    p_value = np.mean(np.abs(null_diffs) &gt;= np.abs(observed_diff))

    <span class="hljs-keyword">return</span> p_value, null_diffs
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768865387610/a9b53e5a-2f07-46b0-b553-e4be1dbfc989.png" alt /></p>
<p>Panel 1 shows the original data. Control group in gray, treatment in blue. The observed difference is 3.8.</p>
<p>Panel 2 shows what happens when we shuffle the labels. If the drug truly does nothing, it should not matter which patients got which label. After shuffling, the difference is -2.3.</p>
<p>Panel 3 shows the null distribution from 10,000 shuffles. Most differences cluster around zero. My observed value of 3.8 is marked by the red line. The red bars show all permuted differences as extreme or more extreme than mine. That fraction is the p-value: 0.31.</p>
<p>In this case, a p-value of 0.31 means my result is not unusual under the null hypothesis. I cannot reject the possibility that the drug does nothing. The permutation test made that clear by showing me exactly what "nothing" looks like.</p>
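<p>For contrast, here is the same shuffling procedure on synthetic data where the treatment really works. This is a compact, self-contained restatement of <code>permutation_test</code> above; the group sizes and the 5-unit effect are my illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic trial: the treatment truly shifts the outcome by 5 units
control = rng.normal(0, 2, 20)
treatment = rng.normal(5, 2, 20)

observed = treatment.mean() - control.mean()
combined = np.concatenate([control, treatment])

null_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(combined)  # relabel patients at random
    null_diffs.append(shuffled[20:].mean() - shuffled[:20].mean())
null_diffs = np.array(null_diffs)

p_value = np.mean(np.abs(null_diffs) >= np.abs(observed))
print(f"observed diff = {observed:.2f}, p = {p_value:.4f}")
```

<p>With a real effect this large, essentially no shuffled relabeling produces a difference as extreme as the observed one, so the p-value lands near zero.</p>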
<h2 id="heading-from-consumer-to-architect">From Consumer to Architect</h2>
<p>The shift from traditional statistics to simulation based thinking is a shift in identity.</p>
<p>In the traditional approach, you are a consumer of methods. Someone else derived the test. You apply it to your data. The uncertainty is someone else's problem, already solved, packaged into a formula.</p>
<p>In the DGP approach, you are an architect of models. You decide what mechanism generates your data. You write it down. You simulate it. You test your methods against it. The uncertainty becomes visible, manipulable, yours to explore.</p>
<p>This takes more work. You have to write code. You have to think carefully about what assumptions you are making and whether they match reality.</p>
<p>But the payoff is understanding. Not just knowing that a p-value below 0.05 means something. Knowing what it means because you built the null world yourself and watched where your data landed in it.</p>
<p>Data does not analyze itself. Something creates it. Learn to think like the creator.</p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: The Final Challenge]]></title><description><![CDATA[The Challenge That Ties Everything Together
After four modules of learning multimodal techniques, NVIDIA's certification throws you into the deep end with a beautifully designed assessment. The problem sounds almost paradoxical at first:
You have a c...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-the-final-challenge</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-the-final-challenge</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[#lidarsensor]]></category><category><![CDATA[NVIDIA]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sun, 11 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/qDgTQOYk6B8/upload/8b41e212edf58813764910931957f54b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-challenge-that-ties-everything-together">The Challenge That Ties Everything Together</h2>
<p>After four modules of learning multimodal techniques, NVIDIA's certification throws you into the deep end with a beautifully designed assessment. The problem sounds almost paradoxical at first:</p>
<p><strong>You have a classifier that works perfectly with LiDAR data. Make it work with RGB images instead, without retraining it on RGB labels.</strong></p>
<p>Wait, what? How do you make a model trained on depth data suddenly understand colors?</p>
<p>This is where everything you've learned comes together: contrastive learning, cross-modal projection, and embedding alignment. Let me walk you through my journey of solving this puzzle.</p>
<hr />
<h3 id="heading-understanding-the-problem-cubes-and-spheres">Understanding the Problem: Cubes and Spheres</h3>
<p>The scenario is elegant in its simplicity. You have a dataset of 3D scenes containing either cubes or spheres. Each scene is captured two ways:</p>
<ol>
<li><strong>RGB Images</strong>: Color photographs showing red, green, or blue objects</li>
<li><strong>LiDAR Depth Maps</strong>: Point cloud data showing the 3D shape</li>
</ol>
<p>Here's the catch:</p>
<ul>
<li>The pre-trained classifier only understands LiDAR data</li>
<li>At inference time, you only have RGB images</li>
<li>You cannot retrain the classifier on RGB labels</li>
</ul>
<p><strong>The Analogy</strong>: Imagine you have an expert sculpture appraiser who identifies shapes by touch alone (LiDAR). Now you need them to identify shapes from photographs (RGB) without teaching them what photographs are. Instead, you'll build a translator that converts photographs into "touch descriptions" the expert already understands.</p>
<hr />
<h3 id="heading-the-three-part-solution">The Three-Part Solution</h3>
<p>The assessment breaks down into three interconnected challenges. Each builds on the previous, and skipping steps or misunderstanding the flow will leave you stuck.</p>
<h4 id="heading-mental-model-the-translation-pipeline">Mental Model: The Translation Pipeline</h4>
<pre><code>What you have:     RGB Image of a cube
What you need:     "cube" prediction
What you can use:  A LiDAR classifier that's already perfect

The bridge:        RGB → [Something Magic] → LiDAR-like representation → Classifier
</code></pre><p>The "something magic" is what you'll build: a contrastive pre-training system plus a projector network.</p>
<hr />
<h3 id="heading-part-1-teaching-two-modalities-to-speak-the-same-language">Part 1: Teaching Two Modalities to Speak the Same Language</h3>
<p><strong>The Goal</strong>: Create embedders that place RGB and LiDAR representations of the same scene close together in embedding space.</p>
<p><strong>The Analogy</strong>: Imagine training two translators. One reads English books and creates summaries. The other reads French books and creates summaries. Your goal is to train them so that when they read the same story (one in English, one in French), their summaries are nearly identical.</p>
<h4 id="heading-the-architecture-i-built">The Architecture I Built</h4>
<p>Two separate CNN encoders:</p>
<ul>
<li><strong>Image Embedder</strong>: Takes 4-channel RGB input, outputs a compact embedding</li>
<li><strong>LiDAR Embedder</strong>: Takes 1-channel depth input, outputs an embedding of the same size</li>
</ul>
<p>The key insight is that both embedders output vectors of identical dimensions. This is crucial because you'll be comparing them directly.</p>
<h4 id="heading-the-training-objective">The Training Objective</h4>
<p>For each batch:</p>
<ol>
<li>Pass RGB images through the image embedder</li>
<li>Pass corresponding LiDAR data through the LiDAR embedder</li>
<li>Normalize both sets of embeddings (this is critical and easy to forget)</li>
<li>Calculate similarity between every RGB embedding and every LiDAR embedding</li>
<li>The diagonal of this similarity matrix should be high (matching pairs)</li>
<li>Off-diagonal entries should be low (non-matching pairs)</li>
</ol>
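<p>The six steps above can be sketched in NumPy. The synthetic embeddings (LiDAR as noisy copies of the image vectors, so the pairs are aligned by construction), the batch size, and the 0.07 temperature are all my assumptions for illustration; only the 200-dim embedding size comes from the assessment:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 200  # batch size, embedding dimension

# Stand-ins for the two embedders' outputs
img_emb = rng.normal(size=(N, D))
lidar_emb = img_emb + 0.1 * rng.normal(size=(N, D))

# Step 3: normalize both sets (easy to forget, critical)
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
lidar_emb /= np.linalg.norm(lidar_emb, axis=1, keepdims=True)

# Step 4: N x N similarity matrix -- entry (i, j) compares image i with LiDAR j
sim = img_emb @ lidar_emb.T

# Steps 5-6 via cross-entropy: the "correct class" for image i is index i
logits = sim / 0.07  # temperature scaling, assumed
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(N), np.arange(N)].mean()
print("diagonal wins every row:", bool((sim.argmax(axis=1) == np.arange(N)).all()))
print(f"contrastive loss = {loss:.4f}")
```

<p>Because the pairs are aligned by construction, the diagonal dominates each row and the loss is near zero; at the start of real training the matrix looks like noise and the loss is large.</p>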
<h4 id="heading-where-i-got-stuck-and-how-i-fixed-it">Where I Got Stuck (And How I Fixed It)</h4>
<p><strong>Problem 1: The Similarity Matrix</strong></p>
<p>My first attempt produced garbage results. The issue? I was calculating similarity wrong.</p>
<p>When you have a batch of N image embeddings and N LiDAR embeddings, you need an N×N matrix where entry (i,j) represents the similarity between image i and LiDAR j.</p>
<p>The trick is creating all pairwise combinations efficiently:</p>
<ul>
<li>Take your image embeddings and repeat each one N times</li>
<li>Take your LiDAR embeddings and tile the entire batch N times</li>
<li>Now you have N² pairs that you can compare</li>
</ul>
<p>I initially confused <code>repeat</code> with <code>repeat_interleave</code>. These do very different things:</p>
<ul>
<li><code>repeat_interleave</code>: [A, B, C] with repeats=2 → [A, A, B, B, C, C]</li>
<li><code>repeat</code>: [A, B, C] with repeats=2 → [A, B, C, A, B, C]</li>
</ul>
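<p>NumPy has the same split, which makes the distinction easy to poke at interactively: <code>np.repeat</code> behaves like PyTorch's <code>repeat_interleave</code>, while <code>np.tile</code> behaves like <code>Tensor.repeat</code> on a 1-D tensor:</p>

```python
import numpy as np

a = np.array([1, 2, 3])
interleaved = np.repeat(a, 2)  # like torch.repeat_interleave
tiled = np.tile(a, 2)          # like torch Tensor.repeat

print(interleaved)  # [1 1 2 2 3 3]
print(tiled)        # [1 2 3 1 2 3]
```

<p>For the similarity matrix you need one of each: interleave the image embeddings and tile the LiDAR embeddings, giving every (image, LiDAR) pair exactly once.</p>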
<p>Getting this wrong meant my similarity matrix had the wrong structure, and the model couldn't learn meaningful alignments.</p>
<p><strong>Problem 2: Cosine Similarity Dimensions</strong></p>
<p>Another subtle bug: when using cosine similarity on batched pairwise comparisons, you need to specify the correct dimension. The embedding dimension (not the batch dimension) is where the dot product happens.</p>
<p><strong>Problem 3: Loss Function Setup</strong></p>
<p>The contrastive loss treats this as a classification problem. For each image, the "correct class" is the index of its matching LiDAR pair. With proper normalization and similarity calculation, cross-entropy loss does the heavy lifting.</p>
<h4 id="heading-the-aha-moment">The "Aha" Moment</h4>
<p>Once I fixed the similarity matrix construction, training loss dropped dramatically. Watching the validation loss decrease below the threshold was satisfying, but the real test was visualizing the embeddings.</p>
<p>After training, RGB images of cubes clustered near LiDAR scans of cubes. Spheres clustered with spheres. The two modalities had learned a shared language.</p>
<hr />
<h3 id="heading-part-2-building-the-bridge-between-worlds">Part 2: Building the Bridge Between Worlds</h3>
<p><strong>The Goal</strong>: Project RGB embeddings into the space where the LiDAR classifier operates.</p>
<p>Here's a subtlety that tripped me up: the CILP embedders produce 200-dimensional vectors, but the pre-trained LiDAR classifier expects 3200-dimensional inputs (from its internal <code>get_embs()</code> method).</p>
<p><strong>The Analogy</strong>: You've taught two translators to write similar summaries. But the expert appraiser doesn't read summaries. They read detailed technical reports in a specific format. Now you need a "report writer" that converts summaries into the format the expert expects.</p>
<h4 id="heading-the-architecture">The Architecture</h4>
<p>A simple multi-layer perceptron (MLP) that:</p>
<ul>
<li>Takes 200-dim input (CILP image embeddings)</li>
<li>Outputs 3200-dim vectors (matching the LiDAR classifier's embedding space)</li>
</ul>
<h4 id="heading-the-training-strategy">The Training Strategy</h4>
<p>This is where the two-stage training approach from the course pays off:</p>
<ol>
<li><strong>Freeze the CILP embedders</strong>: They've already learned good representations</li>
<li><strong>Generate embedding pairs</strong>: For each training sample, get both the RGB embedding (from CILP) and the LiDAR embedding (from the pre-trained classifier's internal method)</li>
<li><strong>Train the projector</strong>: Minimize the MSE between projected RGB embeddings and actual LiDAR embeddings</li>
</ol>
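<p>Shape-wise, the projector stage can be sketched as a tiny two-layer MLP in NumPy. The 512 hidden width and the random stand-in embeddings are my assumptions; the 200 and 3200 dimensions come from the assessment:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer MLP: 200 (CILP) -> 512 hidden -> 3200 (classifier space)
W1, b1 = rng.normal(0, 0.05, (200, 512)), np.zeros(512)
W2, b2 = rng.normal(0, 0.05, (512, 3200)), np.zeros(3200)

def project(x):
    hidden = np.maximum(x @ W1 + b1, 0)  # ReLU non-linearity
    return hidden @ W2 + b2

# Stand-ins for one batch of frozen-embedder outputs
cilp_img_emb = rng.normal(size=(16, 200))   # frozen CILP image embedder output
lidar_target = rng.normal(size=(16, 3200))  # frozen lidar_cnn.get_embs() output

projected = project(cilp_img_emb)
mse = np.mean((projected - lidar_target) ** 2)
print(projected.shape, f"MSE = {mse:.3f}")
```

<p>Training then just minimizes that MSE with respect to the projector weights, leaving both frozen networks untouched.</p>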
<h4 id="heading-where-i-got-stuck-and-how-i-fixed-it-1">Where I Got Stuck (And How I Fixed It)</h4>
<p><strong>Problem: Dimension Mismatch</strong></p>
<p>My first projector architecture was too shallow. A single linear layer from 200 to 3200 dimensions struggled to capture the complex mapping. Adding intermediate layers with non-linearities helped significantly.</p>
<p><strong>Problem: Not Using the Right LiDAR Embeddings</strong></p>
<p>Initially, I tried to project to the CILP LiDAR embeddings (200-dim). Wrong target! The goal is to project to where the <em>classifier</em> expects its input, which is the 3200-dim space from <code>lidar_cnn.get_embs()</code>.</p>
<p>This distinction is crucial: CILP learns alignment, but the projector bridges to the classifier's specific representation space.</p>
<hr />
<h3 id="heading-part-3-assembling-the-complete-pipeline">Part 3: Assembling the Complete Pipeline</h3>
<p><strong>The Goal</strong>: Chain everything together so RGB images flow through to correct predictions.</p>
<h4 id="heading-the-final-architecture">The Final Architecture</h4>
<pre><code>RGB Image
    │
    ▼
┌─────────────────────┐
│  CILP Image Embedder │  ← Frozen (from Part 1)
│     (4ch → 200-dim)  │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│      Projector       │  ← Trainable (from Part 2)
│   (200 → 3200-dim)   │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│   LiDAR Classifier   │  ← Frozen (pre-trained)
│  (3200-dim → class)  │
└─────────────────────┘
    │
    ▼
"cube" or "sphere"
</code></pre><h4 id="heading-the-final-training-loop">The Final Training Loop</h4>
<p>With the complete pipeline assembled:</p>
<ol>
<li>Pass RGB images through the frozen CILP image embedder</li>
<li>Project the embeddings to 3200 dimensions</li>
<li>Pass through the frozen LiDAR classifier</li>
<li>Compare predictions to ground truth labels</li>
<li>Backpropagate through the projector only</li>
</ol>
<h4 id="heading-the-moment-of-truth">The Moment of Truth</h4>
<p>Running validation on RGB images the model had never seen during training:</p>
<p><strong>Accuracy: 97.2%</strong></p>
<p>The model correctly classified cubes and spheres from color images, despite never being trained on RGB labels directly. All it learned was:</p>
<ol>
<li>How RGB and LiDAR representations relate (contrastive pre-training)</li>
<li>How to translate from CILP space to classifier space (projection)</li>
</ol>
<p>The classifier did what it always does. The magic was in the translation layers.</p>
<hr />
<h3 id="heading-key-insights-from-the-assessment">Key Insights from the Assessment</h3>
<h4 id="heading-1-contrastive-learning-creates-bridges-not-solutions">1. Contrastive Learning Creates Bridges, Not Solutions</h4>
<p>CILP doesn't solve the classification problem. It creates aligned representations that make downstream tasks possible. The embeddings have no inherent "cube-ness" or "sphere-ness." They only know that certain RGB patterns correspond to certain LiDAR patterns.</p>
<h4 id="heading-2-projection-is-surprisingly-simple">2. Projection is Surprisingly Simple</h4>
<p>I expected the projector to be complex. In reality, a few linear layers with activations suffice. The heavy lifting was done by CILP. The projector just needs to reshape the information.</p>
<h4 id="heading-3-freezing-is-your-friend">3. Freezing is Your Friend</h4>
<p>Trying to train everything end-to-end from scratch would be a nightmare. The staged approach (freeze CILP, train projector, freeze everything) provides stability and interpretability.</p>
<h4 id="heading-4-dimension-awareness-is-critical">4. Dimension Awareness is Critical</h4>
<p>Throughout the assessment, I had to track:</p>
<ul>
<li>RGB input channels: 4</li>
<li>LiDAR input channels: 1</li>
<li>CILP embedding dimension: 200</li>
<li>Classifier embedding dimension: 3200</li>
<li>Output classes: 2</li>
</ul>
<p>Mixing these up causes silent failures where the model trains but learns nothing useful.</p>
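<p>A lightweight guard against those silent failures is to assert shapes before wiring stages together. Here is a dependency-free sketch using the dimensions above; the variable names and the nested-list representation are mine, not the assessment's:</p>

```python
# Hypothetical dimensions tracked during the assessment.
BATCH, CILP_DIM, CLS_DIM = 16, 200, 3200

emb = [[0.0] * CILP_DIM for _ in range(BATCH)]   # CILP embeddings
W = [[0.0] * CLS_DIM for _ in range(CILP_DIM)]   # projector weights (single-layer sketch)

def shape(m):
    """Return (rows, cols) of a nested-list matrix."""
    return (len(m), len(m[0]))

# Fail fast on a dimension mismatch instead of training a model
# that runs but learns nothing useful.
assert shape(emb)[1] == shape(W)[0], (shape(emb), shape(W))
projected_shape = (shape(emb)[0], shape(W)[1])
print(projected_shape)  # (16, 3200)
```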
<h4 id="heading-5-the-similarity-matrix-is-the-heart-of-contrastive-learning">5. The Similarity Matrix is the Heart of Contrastive Learning</h4>
<p>If I could give one piece of advice: spend extra time understanding how the similarity matrix is constructed. Draw it out on paper. Trace through the tensor operations. This is where most bugs hide.</p>
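<p>To make that concrete, here is a minimal sketch of how a contrastive similarity matrix is built: normalize each embedding, then take pairwise dot products. The batch size, dimensionality, and random embeddings are illustrative, not taken from the assessment:</p>

```python
import math
import random

random.seed(0)

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical batch: 4 paired samples, 8-dimensional embeddings.
rgb = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(4)]
lidar = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(4)]

# sim[i][j] = cosine similarity between RGB sample i and LiDAR sample j.
# The contrastive loss pushes the diagonal (true pairs) toward 1
# and the off-diagonal (mismatched pairs) down.
sim = [[sum(a * b for a, b in zip(r, l)) for l in lidar] for r in rgb]

print(len(sim), len(sim[0]))  # 4 4
```

<p>Tracing this by hand — which entry compares which pair, and why the diagonal is special — is exactly the paper exercise I recommend above.</p>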
<hr />
<h3 id="heading-what-this-assessment-taught-me">What This Assessment Taught Me</h3>
<p>Beyond the technical implementation, this assessment crystallized why multimodal AI matters:</p>
<p><strong>You can transfer knowledge across modalities without paired labels.</strong></p>
<p>Think about the implications:</p>
<ul>
<li>Train a model on abundant labeled data in one modality</li>
<li>Transfer to a modality where labels are scarce or expensive</li>
<li>The bridge is learned from unlabeled paired data</li>
</ul>
<p>This is how modern AI systems handle:</p>
<ul>
<li>Medical imaging (transfer from annotated scans to new imaging techniques)</li>
<li>Robotics (transfer from simulation to real sensors)</li>
<li>Accessibility (convert between visual and audio representations)</li>
</ul>
<hr />
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>The CILP assessment is cleverly designed. It doesn't just test whether you can copy code from notebooks. It tests whether you understand:</p>
<ul>
<li>Why contrastive learning works</li>
<li>How embedding spaces relate</li>
<li>When to freeze and when to train</li>
<li>How information flows through multimodal pipelines</li>
</ul>
<p>If you're attempting this assessment, my advice:</p>
<ol>
<li>Draw the architecture before writing code</li>
<li>Print tensor shapes obsessively</li>
<li>Verify each component independently before combining</li>
<li>Trust the staged training approach</li>
</ol>
<p>The satisfaction of seeing 95%+ accuracy on a modality your classifier was never trained on is worth the debugging struggle.</p>
<hr />
<p><em>This post documents my experience completing the assessment for NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: Part 4]]></title><description><![CDATA[Video Understanding & Graph-RAG: AI That Watches, Remembers, and Reasons
This is Part 4 (Final) of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.

The Final Frontier: Understanding Video
We...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-4</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-4</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[graphrag]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sat, 10 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/2uwFEAGUm6E/upload/ba1dc2e22c6453fc22d4b1918c55a671.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-video-understanding-amp-graph-rag-ai-that-watches-remembers-and-reasons">Video Understanding &amp; Graph-RAG: AI That Watches, Remembers, and Reasons</h2>
<p><em>This is Part 4 (Final) of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.</em></p>
<hr />
<h3 id="heading-the-final-frontier-understanding-video">The Final Frontier: Understanding Video</h3>
<p>We've covered images, text, and documents. Now we tackle the most challenging modality: <strong>video</strong>.</p>
<p>Video isn't just a collection of images. It's a temporal sequence where:</p>
<ul>
<li>Actions unfold over time</li>
<li>Objects enter and exit scenes</li>
<li>Events have causes and effects</li>
<li>Context from the past informs the present</li>
</ul>
<p><strong>The Analogy</strong>: Imagine describing a movie to someone who hasn't seen it. You don't describe each frame. You summarize scenes, explain character motivations, and connect plot points. This requires understanding time, causality, and narrative structure.</p>
<p>Teaching AI to do this is the challenge of Video Search and Summarization (VSS).</p>
<hr />
<h3 id="heading-nvidias-video-search-and-summarization-pipeline">NVIDIA's Video Search and Summarization Pipeline</h3>
<p>NVIDIA provides a production-ready blueprint for video understanding. Let's break down how it works.</p>
<h4 id="heading-the-architecture-three-stage-processing">The Architecture: Three-Stage Processing</h4>
<pre><code>┌─────────────────────────────────────────────────────────────────────┐
│                        VIDEO INPUT                                  │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  STAGE <span class="hljs-number">1</span>: Dense Captioning                                         │
│  Video chunks ──&gt; VLM ──&gt; Detailed captions <span class="hljs-keyword">with</span> timestamps        │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  STAGE <span class="hljs-number">2</span>: Caption Aggregation                                      │
│  Overlapping captions ──&gt; LLM ──&gt; Condensed, coherent descriptions │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  STAGE <span class="hljs-number">3</span>: Summary Generation                                       │
│  All descriptions ──&gt; LLM ──&gt; Final coherent summary               │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
              Vector Database (Milvus) <span class="hljs-keyword">for</span> RAG queries
</code></pre><h4 id="heading-stage-1-dense-captioning-with-vision-language-models">Stage 1: Dense Captioning with Vision Language Models</h4>
<p><strong>What Happens</strong>: The video is split into chunks (e.g., 30-second segments with 5-second overlap). A Vision Language Model (VLM) watches each chunk and generates detailed captions.</p>
<p><strong>The Analogy</strong>: Like a court stenographer who watches a trial and creates detailed notes of everything that happens, with timestamps.</p>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><code>chunk_duration</code>: How long each segment is (trade-off between detail and processing time)</li>
<li><code>chunk_overlap_duration</code>: Overlap between segments to catch events at boundaries</li>
<li><code>prompt</code>: Instructions to the VLM on what to describe and how</li>
</ul>
<p><strong>Example VLM Output</strong>:</p>
<pre><code>[<span class="hljs-number">00</span>:<span class="hljs-number">00</span><span class="hljs-number">-00</span>:<span class="hljs-number">30</span>] A silver sedan approaches the intersection <span class="hljs-keyword">from</span> the north.
              The traffic light is green. Two pedestrians wait on the sidewalk.

[<span class="hljs-number">00</span>:<span class="hljs-number">25</span><span class="hljs-number">-00</span>:<span class="hljs-number">55</span>] The sedan enters the intersection. A red SUV approaches <span class="hljs-keyword">from</span>
              the east, running a yellow light. The pedestrians begin crossing.
</code></pre><p><strong>Why Overlap Matters</strong>: If a car crash happens exactly at second 30, without overlap, neither chunk fully captures it. The 5-second overlap ensures boundary events are seen by at least one chunk.</p>
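<p>The chunking logic itself is easy to sketch. Assuming the 30-second chunks and 5-second overlap from the example (the function name and signature are mine, not the blueprint's API):</p>

```python
def chunk_spans(duration, chunk_duration=30, overlap=5):
    """Return (start, end) second spans covering the video, overlapping
    so boundary events land fully inside at least one chunk."""
    spans, start = [], 0
    step = chunk_duration - overlap
    while start < duration:
        spans.append((start, min(start + chunk_duration, duration)))
        start += step
    return spans

print(chunk_spans(70))  # [(0, 30), (25, 55), (50, 70)]
```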
<h4 id="heading-stage-2-caption-aggregation">Stage 2: Caption Aggregation</h4>
<p><strong>The Problem</strong>: Overlapping chunks produce redundant descriptions. The same event might be described twice.</p>
<p><strong>The Solution</strong>: An LLM reads overlapping captions and condenses them, removing redundancy while preserving all unique information.</p>
<p><strong>Before Aggregation</strong>:</p>
<pre><code>Chunk <span class="hljs-number">1</span>: <span class="hljs-string">"A worker places a box on the shelf. The box appears heavy."</span>
Chunk <span class="hljs-number">2</span>: <span class="hljs-string">"A heavy box is placed on the shelf. It appears unstable."</span>
Chunk <span class="hljs-number">3</span>: <span class="hljs-string">"The unstable box falls from the shelf onto the floor."</span>
</code></pre><p><strong>After Aggregation</strong>:</p>
<pre><code><span class="hljs-string">"A worker places a heavy box on the shelf. The box appears unstable
and subsequently falls onto the floor."</span>
</code></pre><h4 id="heading-stage-3-summary-generation">Stage 3: Summary Generation</h4>
<p><strong>What Happens</strong>: All aggregated descriptions are combined into a final, coherent summary that reads like a narrative rather than a collection of observations.</p>
<p><strong>The Output</strong>: A comprehensive summary that can answer questions like:</p>
<ul>
<li>"What happened in this video?"</li>
<li>"Were there any safety violations?"</li>
<li>"Describe the sequence of events."</li>
</ul>
<hr />
<h3 id="heading-prompt-engineering-for-video-the-secret-sauce">Prompt Engineering for Video: The Secret Sauce</h3>
<p>The quality of video understanding depends heavily on prompt engineering. NVIDIA's training emphasizes three components of effective prompts:</p>
<h4 id="heading-1-persona">1. Persona</h4>
<p>Tell the VLM who it is and what expertise it has.</p>
<pre><code>You are a traffic safety analyst reviewing intersection footage.
You have expertise <span class="hljs-keyword">in</span> identifying traffic violations, near-misses,
and pedestrian safety concerns.
</code></pre><h4 id="heading-2-specific-details-to-capture">2. Specific Details to Capture</h4>
<p>List exactly what information you want extracted.</p>
<pre><code>For each scene, <span class="hljs-attr">note</span>:
- Vehicle types, colors, and directions <span class="hljs-keyword">of</span> travel
- Traffic signal states (red, yellow, green)
- Pedestrian positions and movements
- Any violations or concerning behaviors
- Timestamp <span class="hljs-keyword">of</span> each observation
</code></pre><h4 id="heading-3-output-format">3. Output Format</h4>
<p>Specify how results should be structured.</p>
<pre><code>Format your observations <span class="hljs-keyword">as</span>:
[TIMESTAMP] OBSERVATION
Include severity levels <span class="hljs-keyword">for</span> any safety concerns: LOW, MEDIUM, HIGH
</code></pre><p><strong>Why This Matters</strong>: Generic prompts like "Describe this video" produce generic results. Specific prompts produce actionable intelligence.</p>
<hr />
<h3 id="heading-from-summaries-to-qampa-vector-rag">From Summaries to Q&amp;A: Vector-RAG</h3>
<p>Once videos are processed, you can query them using Retrieval Augmented Generation.</p>
<p><strong>The Process</strong>:</p>
<ol>
<li>User asks: "Were there any safety violations in today's warehouse footage?"</li>
<li>Question is embedded as a vector</li>
<li>Similar caption segments are retrieved from Milvus</li>
<li>Retrieved context is fed to LLM with the question</li>
<li>LLM generates an answer grounded in the video content</li>
</ol>
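<p>Steps 2–4 can be sketched with a toy retriever. A real system embeds text with a model and searches Milvus; here, word overlap stands in for vector similarity, and the caption store is invented:</p>

```python
# Hypothetical caption store (stand-in for a Milvus collection).
captions = [
    "[00:02:15] A yellow forklift enters the warehouse from the loading dock.",
    "[00:02:45] The forklift operator picks up a pallet of boxes.",
    "[00:05:10] A worker stacks crates near the exit.",
]

def score(query, doc):
    # Word overlap as a crude proxy for embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

query = "forklift enters the warehouse"
best = max(captions, key=lambda c: score(query, c))
print(best)  # the 00:02:15 caption scores highest
```

<p>The retrieved caption is then passed to the LLM as grounding context alongside the question.</p>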
<p><strong>Example Query Flow</strong>:</p>
<pre><code>Question: <span class="hljs-string">"What time did the forklift enter the frame?"</span>

Retrieved Context:
[<span class="hljs-number">00</span>:<span class="hljs-number">02</span>:<span class="hljs-number">15</span>] A yellow forklift enters the warehouse <span class="hljs-keyword">from</span> the loading dock.
[<span class="hljs-number">00</span>:<span class="hljs-number">02</span>:<span class="hljs-number">45</span>] The forklift operator picks up a pallet <span class="hljs-keyword">of</span> boxes.

Answer: <span class="hljs-string">"The forklift entered the frame at approximately 2 minutes
and 15 seconds into the footage."</span>
</code></pre><hr />
<h3 id="heading-graph-rag-when-vector-search-isnt-enough">Graph-RAG: When Vector Search Isn't Enough</h3>
<p>Vector-RAG works great for simple queries. But what about complex reasoning?</p>
<p><strong>The Limitation of Vector Search</strong>:
Query: "What caused the accident?"</p>
<p>Vector search finds segments mentioning "accident" but may miss:</p>
<ul>
<li>The speeding vehicle 30 seconds before</li>
<li>The obscured stop sign 2 minutes earlier</li>
<li>The wet road conditions mentioned at the start</li>
</ul>
<p>These are causally related but semantically distant. Vector similarity misses the connection.</p>
<p><strong>The Analogy</strong>: Imagine a detective investigating a crime. They don't just search for clues similar to the crime scene. They build a web of relationships: who knew whom, who was where when, what events led to what. This web of connections is a <strong>knowledge graph</strong>.</p>
<hr />
<h3 id="heading-building-a-knowledge-graph-from-video">Building a Knowledge Graph from Video</h3>
<p>Graph-RAG extracts entities and relationships to build a queryable knowledge structure.</p>
<h4 id="heading-the-three-gs-of-graph-rag">The Three G's of Graph-RAG</h4>
<p><strong>1. G-Extraction (Building the Graph)</strong></p>
<p>An LLM analyzes video captions and extracts:</p>
<ul>
<li><strong>Entities</strong>: Objects, people, locations, events</li>
<li><strong>Relationships</strong>: How entities connect to each other</li>
<li><strong>Properties</strong>: Attributes of entities</li>
</ul>
<p><strong>Example Extraction</strong>:</p>
<pre><code>Caption: <span class="hljs-string">"A worker places a heavy box on the top shelf.
         The box falls due to improper placement."</span>

<span class="hljs-attr">Entities</span>:
- Worker (type: person)
- Box (type: object, <span class="hljs-attr">property</span>: heavy)
- Top Shelf (type: location)
- Fall Event (type: event)

<span class="hljs-attr">Relationships</span>:
- Worker PLACES Box
- Box ON Top Shelf
- Box FALLS_DUE_TO improper_placement
- improper_placement CAUSES Fall Event
</code></pre><p>This creates a graph structure:</p>
<pre><code>       [Worker]
          │
       PLACES
          │
          ▼
        [Box] ─── heavy
          │
          ON
          │
          ▼
     [Top Shelf]
          │
     FALLS_DUE_TO
          │
          ▼
  [improper_placement]
          │
       CAUSES
          │
          ▼
    [Fall Event]
</code></pre><p><strong>2. G-Retrieval (Querying the Graph)</strong></p>
<p>Instead of vector similarity, queries are converted to graph queries (Cypher for Neo4j):</p>
<pre><code class="lang-cypher">// Query: "What caused the box to fall?"
MATCH (b:Object {name: 'Box'})-[:FALLS_DUE_TO]-&gt;(cause)
RETURN cause

// Result: improper_placement
</code></pre>
<pre><code class="lang-cypher">// Query: "Show all safety incidents and their causes"
MATCH (event:Event)-[:CAUSED_BY]-&gt;(cause)
WHERE event.type = 'safety_incident'
RETURN event, cause
</code></pre>
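<p>If you don't have a Neo4j instance handy, the same retrieval idea can be sketched over in-memory (subject, relation, object) triples. The triples mirror the extraction example above; the helper function is mine:</p>

```python
# Triples as extracted from the caption: (subject, relation, object).
triples = [
    ("Worker", "PLACES", "Box"),
    ("Box", "ON", "Top Shelf"),
    ("Box", "FALLS_DUE_TO", "improper_placement"),
    ("improper_placement", "CAUSES", "Fall Event"),
]

def objects_of(subject, relation):
    """Follow one edge type out of a node, like a one-hop MATCH."""
    return [o for s, r, o in triples if s == subject and r == relation]

# "What caused the box to fall?"
print(objects_of("Box", "FALLS_DUE_TO"))  # ['improper_placement']
```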
<p><strong>3. G-Generation (Answering with Context)</strong></p>
<p>Retrieved graph data is fed to an LLM which synthesizes a natural language answer:</p>
<pre><code>Graph Data Retrieved:
- Box FALLS_DUE_TO improper_placement
- Worker PLACES Box
- improper_placement CAUSED_BY rushing

LLM Answer: <span class="hljs-string">"The box fell because of improper placement. The worker
placed the box hastily on the top shelf without ensuring stability.
This appears to be caused by rushing to meet a deadline."</span>
</code></pre><hr />
<h3 id="heading-vector-rag-vs-graph-rag-when-to-use-which">Vector-RAG vs. Graph-RAG: When to Use Which</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Vector-RAG</td><td>Graph-RAG</td></tr>
</thead>
<tbody>
<tr>
<td>Best For</td><td>Simple fact retrieval</td><td>Complex reasoning</td></tr>
<tr>
<td>Query Type</td><td>"What happened at 2pm?"</td><td>"What caused the failure?"</td></tr>
<tr>
<td>Speed</td><td>Faster</td><td>Slower (graph traversal)</td></tr>
<tr>
<td>Setup</td><td>Simpler</td><td>Requires graph construction</td></tr>
<tr>
<td>Reasoning</td><td>Shallow (similarity)</td><td>Deep (relationships)</td></tr>
<tr>
<td>Storage</td><td>Vector database</td><td>Graph database (Neo4j)</td></tr>
</tbody>
</table>
</div><p><strong>Use Vector-RAG When</strong>:</p>
<ul>
<li>Questions are about specific facts or timestamps</li>
<li>Real-time response is critical (live streaming)</li>
<li>Relationships between events are not important</li>
</ul>
<p><strong>Use Graph-RAG When</strong>:</p>
<ul>
<li>Questions involve causality or chains of events</li>
<li>You need to understand how things connect</li>
<li>Complex multi-hop reasoning is required</li>
</ul>
<hr />
<h3 id="heading-practical-applications">Practical Applications</h3>
<h4 id="heading-traffic-monitoring">Traffic Monitoring</h4>
<ul>
<li>Detect violations and near-misses</li>
<li>Analyze accident causes</li>
<li>Track traffic patterns over time</li>
</ul>
<h4 id="heading-warehouse-safety">Warehouse Safety</h4>
<ul>
<li>Monitor worker compliance</li>
<li>Track inventory movement</li>
<li>Identify safety hazards</li>
</ul>
<h4 id="heading-bridge-inspection">Bridge Inspection</h4>
<ul>
<li>Detect structural anomalies</li>
<li>Track changes over time</li>
<li>Prioritize maintenance needs</li>
</ul>
<h4 id="heading-security-surveillance">Security Surveillance</h4>
<ul>
<li>Track persons of interest</li>
<li>Detect unusual behavior</li>
<li>Generate incident reports</li>
</ul>
<hr />
<h3 id="heading-the-complete-multimodal-picture">The Complete Multimodal Picture</h3>
<p>Looking back at this 4-part series, we've covered the full spectrum:</p>
<p><strong>Part 1</strong>: How to combine different data types (fusion strategies)
<strong>Part 2</strong>: How to align different modalities (contrastive learning)
<strong>Part 3</strong>: How to extract intelligence from documents (OCR + RAG)
<strong>Part 4</strong>: How to understand temporal content (Video + Graph-RAG)</p>
<p>Together, these techniques enable AI systems that can:</p>
<ul>
<li>See images and video</li>
<li>Read documents and text</li>
<li>Understand depth and 3D structure</li>
<li>Connect concepts across modalities</li>
<li>Reason about relationships and causality</li>
</ul>
<hr />
<h3 id="heading-key-takeaways-from-the-complete-series">Key Takeaways from the Complete Series</h3>
<ol>
<li><p><strong>Multimodal AI is about combining strengths</strong>: Each modality has unique capabilities. Fusion multiplies them.</p>
</li>
<li><p><strong>Embeddings are the universal language</strong>: Converting everything to vectors enables cross-modal comparison.</p>
</li>
<li><p><strong>Contrastive learning aligns modalities</strong>: Push matching pairs together, pull non-matches apart.</p>
</li>
<li><p><strong>RAG grounds AI in your data</strong>: Retrieval prevents hallucination and enables factual answers.</p>
</li>
<li><p><strong>Graphs capture relationships</strong>: When causality matters, knowledge graphs outperform vector search.</p>
</li>
<li><p><strong>Prompt engineering is crucial</strong>: Specific, well-structured prompts dramatically improve results.</p>
</li>
<li><p><strong>Production systems need pipelines</strong>: Real applications require chunking, batching, and careful orchestration.</p>
</li>
</ol>
<hr />
<h3 id="heading-where-to-go-from-here">Where to Go From Here</h3>
<p>This certification provides a foundation. To deepen your expertise:</p>
<ul>
<li><strong>Experiment</strong>: Build your own multimodal pipelines with the techniques learned</li>
<li><strong>Explore NVIDIA NIMs</strong>: Pre-built microservices for production deployment</li>
<li><strong>Study Attention Mechanisms</strong>: Transformers power most modern multimodal models</li>
<li><strong>Follow Research</strong>: Multimodal AI is evolving rapidly with new architectures monthly</li>
</ul>
<p>The future of AI is multimodal. The ability to process and reason across data types will define the next generation of intelligent systems.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience with these techniques, consider enrolling in their official courses.</em></p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Understanding Random Variables: A Practical Guide for Engineers]]></title><description><![CDATA[Part 1: Discrete Random Variables
Discrete random variables represent countable outcomes—like the roll of a die, the number of users on a site, or binary classification labels.
Expected Value
The expected value (or mean) tells us the "center of mass"...]]></description><link>https://thedatasense.com/understanding-random-variables-a-practical-guide-for-engineers</link><guid isPermaLink="true">https://thedatasense.com/understanding-random-variables-a-practical-guide-for-engineers</guid><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Thu, 08 Jan 2026 18:59:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/XIIsv6AshJY/upload/ce5c09a0c18383f7eb5bf29a99811bee.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-part-1-discrete-random-variables">Part 1: Discrete Random Variables</h2>
<p>Discrete random variables represent countable outcomes—like the roll of a die, the number of users on a site, or binary classification labels.</p>
<h3 id="heading-expected-value">Expected Value</h3>
<p>The expected value (or mean) tells us the "center of mass" of a distribution.</p>
<p><strong>Analogy:</strong> Imagine playing a carnival game thousands of times. Sometimes you win $10, sometimes you lose $5. The expected value is your average profit per game in the long run. It is the "steady state" of your luck.</p>
<p>For a discrete random variable \( X \) with probability mass function (PMF) \( p_X(x) \) :</p>
<p>$$E[X] = \sum_x x \cdot p_X(x)$$</p><p>You multiply each outcome by its probability and sum them up. Heavily weighted outcomes pull the average closer to them.</p>
<h3 id="heading-the-expected-value-rule-lotus">The Expected Value Rule (LOTUS)</h3>
<p>When you apply a function \( g \) to a random variable \( X \) , creating \( Y = g(X) \) , you don't need to find the distribution of \( Y \) first. You can calculate the expected value directly using \( X \) .</p>
<p>$$E[Y] = E[g(X)] = \sum_x g(x) \cdot p_X(x)$$</p><p><strong>Analogy:</strong> If \( X \) is the number of hours you work and your pay is \( g(X) = 15X + 50 \) , you can calculate your expected pay directly from the distribution of your hours.</p>
<h3 id="heading-pmf-of-a-transformed-variable">PMF of a Transformed Variable</h3>
<p>If \( Y = g(X) \) , the probability of \( Y \) taking a value \( y \) is the sum of probabilities of all \( x \) values that map to \( y \) .</p>
<p>$$p_Y(y) = \sum_{x: g(x) = y} p_X(x)$$</p><p><strong>Analogy:</strong> Think of \( g \) as a sorting machine. If \( Y=X^2 \) , both \( x=-2 \) and \( x=2 \) fall into the " \( y=4 \) " bucket. You combine their probabilities to get the total probability of observing 4.</p>
<h3 id="heading-important-warning-jensens-inequality">⚠️ Important Warning: Jensen's Inequality</h3>
<p>In general, expectation does not commute with non-linear functions.</p>
<p>$$g(E[X]) \neq E[g(X)]$$</p><p><strong>Analogy:</strong> The average of squares is not the square of the average. If your test scores are 0 and 100, your average is 50 ( \( 50^2 = 2500 \) ). But the average of your squared scores ( \( 0 \) and \( 10,000 \) ) is 5,000.</p>
<h3 id="heading-variance-and-standard-deviation">Variance and Standard Deviation</h3>
<p>Variance measures the "spread" or "risk" in a distribution.</p>
<p><strong>Analogy:</strong> Two archers both hit the bullseye on average. Archer A clusters shots tightly (low variance). Archer B hits the outer rings on opposite sides (high variance).</p>
<p><strong>Variance:</strong> $$ \text{Var}(X) = E[(X - \mu)^2] = \sum_x (x - \mu)^2 \cdot p_X(x) $$</p>
<p><strong>Standard Deviation:</strong> $$ \sigma_X = \sqrt{\text{Var}(X)} $$</p>
<h3 id="heading-variance-properties">Variance Properties</h3>
<p>These rules are essential for manipulating uncertainty:</p>
<p>$$\text{Var}(aX) = a^2 \cdot \text{Var}(X)$$</p><p>$$\text{Var}(X + b) = \text{Var}(X)$$</p><p>$$\text{Var}(aX + b) = a^2 \cdot \text{Var}(X)$$</p><p><strong>Key Insight:</strong> Adding a constant ( \( +b \) ) shifts the distribution but doesn't change the spread. Multiplying by a constant ( \( a \) ) scales the spread, and since variance is squared units, the factor becomes \( a^2 \) .</p>
<h3 id="heading-conditioning-on-an-event">Conditioning on an Event</h3>
<p>When you learn that event \( A \) has occurred, the probability space shrinks. You eliminate impossible outcomes and "renormalize" the remaining ones so they sum to 1.</p>
<p>$$p_{X|A}(x) = \begin{cases} \frac{p_X(x)}{P(A)} &amp; \text{if } x \in A \\ 0 &amp; \text{otherwise} \end{cases}$$</p><h3 id="heading-total-expectation-theorem">Total Expectation Theorem</h3>
<p>This is a "divide and conquer" strategy. You can find the overall average by weighting the averages of subpopulations.</p>
<p>$$E[X] = \sum_i P(A_i) \cdot E[X|A_i]$$</p><h3 id="heading-multiple-random-variables">Multiple Random Variables</h3>
<p>When dealing with multiple variables (like Age and Income), we use the <strong>Joint PMF</strong>:</p>
<p>$$p_{X,Y}(x,y) = P(X=x, Y=y)$$</p><p><strong>Marginalization:</strong> To get back the distribution of just \( X \) , you sum over all possible values of \( Y \) :</p>
<p>$$p_X(x) = \sum_y p_{X,Y}(x, y)$$</p><p><strong>Linearity of Expectation:</strong> This is one of the most powerful properties in probability. It holds <strong>even if variables are dependent</strong>.</p>
<p>$$E[X + Y] = E[X] + E[Y]$$</p><hr />
<h2 id="heading-part-2-continuous-random-variables">Part 2: Continuous Random Variables</h2>
<p>For continuous variables (time, distance, temperature), the probability of being exactly equal to a specific number is 0. Instead, we measure probability over intervals using a <strong>Probability Density Function (PDF)</strong>, \( f_X(x) \) .</p>
<h3 id="heading-probability-as-area">Probability as Area</h3>
<p>$$P(a \le X \le b) = \int_a^b f_X(x) \, dx$$</p><h3 id="heading-expectation-continuous">Expectation (Continuous)</h3>
<p>$$E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx$$</p><h3 id="heading-common-distributions">Common Distributions</h3>
<p><strong>1. Uniform Distribution ( \( X \sim \text{Uni}(a,b) \) )</strong> Every interval of the same length is equally likely. $$ f_X(x) = \frac{1}{b-a} \quad \text{for } a &lt; x &lt; b $$</p>
<p><strong>2. Exponential Distribution ( \( X \sim \text{Exp}(\lambda) \) )</strong> Models waiting times (e.g., time until the next server request). It has the unique <strong>Memoryless Property</strong>: $$ P(X &gt; s + t | X &gt; t) = P(X &gt; s) $$ If you've waited 10 minutes, the probability of waiting another 5 is the same as if you just started waiting.</p>
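<p>The memoryless property is easy to verify by simulation. This sketch draws exponential waiting times and compares the two tail probabilities; the rate and thresholds are arbitrary choices:</p>

```python
import random

random.seed(42)
lam = 0.5
samples = [random.expovariate(lam) for _ in range(200_000)]

def tail(s):
    """Empirical P(X > s)."""
    return sum(x > s for x in samples) / len(samples)

s, t = 2.0, 3.0
survivors = [x for x in samples if x > t]
cond = sum(x > s + t for x in survivors) / len(survivors)  # P(X > s+t | X > t)

print(round(tail(s), 2), round(cond, 2))  # both ≈ exp(-lam*s) ≈ 0.37
```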
<p><strong>3. Normal (Gaussian) Distribution ( \( X \sim \mathcal{N}(\mu, \sigma^2) \) )</strong> The bell curve. $$ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$ <strong>Linear Transformation:</strong> If \( X \) is Normal, then \( aX+b \) is also Normal.</p>
<h3 id="heading-cumulative-distribution-function-cdf">Cumulative Distribution Function (CDF)</h3>
<p>The CDF is the integral of the PDF. It represents the probability that \( X \) is less than or equal to \( x \) .</p>
<p>$$F_X(x) = P(X \le x) = \int_{-\infty}^x f_X(t) \, dt$$</p><p><strong>Pro Tip:</strong> When transforming continuous variables ( \( Y=g(X) \) ), it is often safer to work with the CDF first and then differentiate to find the new PDF.</p>
<p>$$f_Y(y) = \frac{d}{dy} F_Y(y)$$</p><hr />
<h2 id="heading-part-3-bayes-rule">Part 3: Bayes' Rule</h2>
<p>Bayes' rule allows us to "flip" conditional probabilities. It is the foundation of inference.</p>
<p>$$p_{X|Y}(x|y) = \frac{p_X(x) \cdot p_{Y|X}(y|x)}{p_Y(y)}$$</p><p><strong>Analogy:</strong></p>
<ul>
<li><p><strong>Prior</strong> \( p_X(x) \) : What you believed before seeing data.</p>
</li>
<li><p><strong>Likelihood</strong> \( p_{Y|X}(y|x) \) : How likely the data is given your belief.</p>
</li>
<li><p><strong>Posterior</strong> \( p_{X|Y}(x|y) \) : Your updated belief after seeing the data.</p>
</li>
</ul>
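<p>A classic numeric instance (all numbers invented for illustration): a disease with 1% prevalence and a test with 95% sensitivity and a 5% false-positive rate:</p>

```python
prior = 0.01                 # P(disease): belief before the test
p_pos_given_disease = 0.95   # likelihood of a positive test if diseased
p_pos_given_healthy = 0.05   # false-positive rate

# Evidence: total probability of testing positive.
evidence = prior * p_pos_given_disease + (1 - prior) * p_pos_given_healthy

# Posterior: updated belief after seeing a positive test.
posterior = prior * p_pos_given_disease / evidence
print(round(posterior, 3))  # 0.161 — a positive test still means only ~16% odds
```

<p>The small prior dominates: even a fairly accurate test can't overcome a rare condition's base rate, which is exactly the prior-likelihood interplay above.</p>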
<hr />
<h2 id="heading-quick-reference-table">Quick Reference Table</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Concept</td><td>Discrete</td><td>Continuous</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Distribution</strong></td><td>PMF: \( p_X(x) \)</td><td>PDF: \( f_X(x) \)</td></tr>
<tr>
<td><strong>Expectation</strong></td><td>\( \sum x \cdot p_X(x) \)</td><td>\( \int x \cdot f_X(x) \, dx \)</td></tr>
<tr>
<td><strong>Variance</strong></td><td>\( \sum (x-\mu)^2 \cdot p_X(x) \)</td><td>\( \int (x-\mu)^2 \cdot f_X(x) \, dx \)</td></tr>
<tr>
<td><strong>Independence</strong></td><td>\( p_{X,Y} = p_X \cdot p_Y \)</td><td>\( f_{X,Y} = f_X \cdot f_Y \)</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-study-tips">Study Tips</h2>
<ol>
<li><p><strong>Linearity of Expectation</strong> is your best friend. It works regardless of independence.</p>
</li>
<li><p><strong>Variance of Sums</strong> ( \( \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) \) ) only works if \( X \) and \( Y \) are <strong>independent</strong>. If they are dependent, you must add the Covariance term.</p>
</li>
<li><p>For continuous transformations, <strong>always go through the CDF</strong> if you are unsure. It prevents mistakes with boundaries and derivatives.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: Part 3]]></title><description><![CDATA[Document Intelligence: Teaching AI to Read, Understand, and Remember PDFs
This is Part 3 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.

The Challenge: Documents Are Messy
Think about a ...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-3</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-3</guid><category><![CDATA[#multimodalai]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Thu, 08 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/RLw-UC03Gwc/upload/002a43623f7287b9f4f925baa11e37a4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-document-intelligence-teaching-ai-to-read-understand-and-remember-pdfs">Document Intelligence: Teaching AI to Read, Understand, and Remember PDFs</h2>
<p><em>This is Part 3 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.</em></p>
<hr />
<h3 id="heading-the-challenge-documents-are-messy">The Challenge: Documents Are Messy</h3>
<p>Think about a typical business document. It might have:</p>
<ul>
<li>Paragraphs of text in multiple columns</li>
<li>Tables with financial data</li>
<li>Charts and graphs</li>
<li>Images and diagrams</li>
<li>Headers, footers, and page numbers</li>
<li>Different fonts and formatting</li>
</ul>
<p>For humans, navigating this is intuitive. But for AI, a PDF is just a jumble of pixels or raw text blobs with no inherent structure. Teaching AI to extract meaningful information from documents is one of the most practical applications of multimodal AI.</p>
<p>This is where <strong>Optical Character Recognition (OCR)</strong> meets <strong>Retrieval Augmented Generation (RAG)</strong> to create intelligent document processing systems.</p>
<hr />
<h3 id="heading-ocr-from-pixels-to-text">OCR: From Pixels to Text</h3>
<p><strong>The Analogy</strong>: Imagine you're teaching a child to read. First, they learn to recognize individual letters. Then words. Then sentences. Eventually, they understand that text flows in certain directions and formats.</p>
<p><strong>Optical Character Recognition</strong> follows a similar journey:</p>
<ol>
<li><strong>Image Processing</strong>: Clean up the document image (remove noise, fix rotation)</li>
<li><strong>Layout Detection</strong>: Find regions of text, tables, images</li>
<li><strong>Character Recognition</strong>: Convert pixel patterns to characters</li>
<li><strong>Post-Processing</strong>: Apply language models to fix errors</li>
</ol>
<p>Modern OCR goes far beyond simple text extraction. It understands document structure.</p>
<hr />
<h3 id="heading-the-document-processing-pipeline">The Document Processing Pipeline</h3>
<p>NVIDIA's training demonstrates a comprehensive pipeline for extracting multimodal data from PDFs. Let's break it down.</p>
<h4 id="heading-step-1-document-partitioning">Step 1: Document Partitioning</h4>
<p>Before extracting content, you need to identify what's in the document.</p>
<p><strong>The Analogy</strong>: Before renovating a house, you walk through each room and catalog what's there. "Living room has a couch, TV, and bookshelf. Kitchen has appliances and a dining table."</p>
<p>Document partitioning creates an inventory of elements:</p>
<ul>
<li>Text blocks</li>
<li>Tables</li>
<li>Images</li>
<li>Charts</li>
<li>Headers and titles</li>
</ul>
<p>Tools like the <code>unstructured</code> library do this automatically, identifying element types and their locations.</p>
<h4 id="heading-step-2-smart-chunking">Step 2: Smart Chunking</h4>
<p>Once you have text, you need to break it into digestible pieces for the AI. But how you chunk matters enormously.</p>
<p><strong>Naive Chunking (Bad Approach)</strong>:
Split text every 500 characters regardless of content.</p>
<p><em>Problem</em>: You might split a sentence mid-thought, separate a header from its content, or break apart related concepts.</p>
<pre><code>Chunk <span class="hljs-number">1</span>: <span class="hljs-string">"The quarterly revenue reached $5.2 million, an increase of 23%"</span>
Chunk <span class="hljs-number">2</span>: <span class="hljs-string">"compared to the previous quarter. Key drivers included..."</span>
</code></pre><p><strong>Semantic Chunking (Better Approach)</strong>:
Split at natural boundaries like titles, section breaks, or paragraph endings.</p>
<pre><code>Chunk <span class="hljs-number">1</span>: [Header: Financial Results]
         <span class="hljs-string">"The quarterly revenue reached $5.2 million, an increase of 23%
          compared to the previous quarter."</span>

Chunk <span class="hljs-number">2</span>: [Header: Key Drivers]
         <span class="hljs-string">"Key drivers included expanded market presence and new product
          launches in the enterprise segment..."</span>
</code></pre><p>The semantic approach preserves meaning and context. When the AI retrieves this chunk later, it gets complete thoughts.</p>
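<p>As a minimal sketch of this idea (the function name and size budget are hypothetical, and real pipelines split on detected titles rather than blank lines), a chunker can split at paragraph boundaries and merge paragraphs until a size budget is reached:</p>

```python
import re

def semantic_chunks(text, max_chars=500):
    # split at blank lines (paragraph boundaries), then merge paragraphs
    # until adding the next one would exceed the size budget
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "Financial Results\n\nRevenue reached $5.2M.\n\nKey Drivers\n\nNew launches drove growth."
print(semantic_chunks(doc, max_chars=60))
```

<p>Because the split points are paragraph boundaries, no chunk ever begins mid-sentence.</p>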
<h4 id="heading-step-3-table-extraction">Step 3: Table Extraction</h4>
<p>Tables are notoriously tricky. They encode relationships through spatial position, not linear text.</p>
<p><strong>The Challenge</strong>:</p>
<pre><code>| Product | Q1 Sales | Q2 Sales |
|---------|----------|----------|
| Widget  | $<span class="hljs-number">50</span>,<span class="hljs-number">000</span>  | $<span class="hljs-number">65</span>,<span class="hljs-number">000</span>  |
| Gadget  | $<span class="hljs-number">30</span>,<span class="hljs-number">000</span>  | $<span class="hljs-number">45</span>,<span class="hljs-number">000</span>  |
</code></pre><p>If you just extract text left-to-right, you get: "Product Q1 Sales Q2 Sales Widget $50,000 $65,000..."</p>
<p>This loses all the relational information. Which number belongs to which product?</p>
<p><strong>The Solution</strong>: Use specialized table extraction models that understand grid structure. NVIDIA's pipeline uses models like Microsoft's Table Transformer to:</p>
<ol>
<li>Detect table regions in the document</li>
<li>Identify rows and columns</li>
<li>Extract cell contents with their positions</li>
<li>Convert to structured formats (HTML, JSON)</li>
</ol>
<p>The extracted HTML preserves structure:</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">table</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Product<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Q1 Sales<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Q2 Sales<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Widget<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>$50,000<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>$65,000<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">table</span>&gt;</span>
</code></pre>
<h4 id="heading-step-4-image-and-chart-extraction">Step 4: Image and Chart Extraction</h4>
<p>Documents often contain figures that carry critical information.</p>
<p><strong>The Approach</strong>:</p>
<ol>
<li><strong>Object Detection</strong>: Use models like YOLOX to find figures, charts, and diagrams</li>
<li><strong>Region Extraction</strong>: Crop these regions as separate images</li>
<li><strong>Metadata Preservation</strong>: Keep track of page number, position, and nearby text (captions)</li>
<li><strong>Visual Analysis</strong>: Optionally use Vision Language Models to describe the content</li>
</ol>
<p>This enables queries like "Show me all the architecture diagrams in this documentation."</p>
<hr />
<h3 id="heading-rag-retrieval-augmented-generation">RAG: Retrieval Augmented Generation</h3>
<p>Now you've extracted all this content. How do you make it useful?</p>
<p><strong>The Analogy</strong>: Imagine you're a researcher with a library of 10,000 books. When someone asks you a question, you don't read all 10,000 books. You:</p>
<ol>
<li>Search the catalog for relevant books</li>
<li>Pull those specific books off the shelf</li>
<li>Read the relevant sections</li>
<li>Synthesize an answer</li>
</ol>
<p>RAG does exactly this with AI.</p>
<h4 id="heading-the-rag-pipeline">The RAG Pipeline</h4>
<pre><code>User Question
     │
     ▼
┌─────────────┐
│  Embedding  │ ← Convert question to vector
└─────────────┘
     │
     ▼
┌─────────────┐
│  Retrieval  │ ← Find similar chunks <span class="hljs-keyword">in</span> vector database
└─────────────┘
     │
     ▼
┌─────────────┐
│   Context   │ ← Combine retrieved chunks
└─────────────┘
     │
     ▼
┌─────────────┐
│     LLM     │ ← Generate answer using context
└─────────────┘
     │
     ▼
   Answer
</code></pre><h4 id="heading-step-1-indexing-one-time-setup">Step 1: Indexing (One-Time Setup)</h4>
<p>Take all your extracted chunks and convert them to embeddings:</p>
<pre><code>Chunk <span class="hljs-number">1</span> ──&gt; [Encoder] ──&gt; [<span class="hljs-number">0.2</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.1</span>, ...]
Chunk <span class="hljs-number">2</span> ──&gt; [Encoder] ──&gt; [<span class="hljs-number">0.5</span>, <span class="hljs-number">0.3</span>, <span class="hljs-number">0.9</span>, ...]
Chunk <span class="hljs-number">3</span> ──&gt; [Encoder] ──&gt; [<span class="hljs-number">0.1</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.4</span>, ...]
...
</code></pre><p>Store these embeddings in a vector database like Milvus, Pinecone, or FAISS.</p>
<h4 id="heading-step-2-retrieval-at-query-time">Step 2: Retrieval (At Query Time)</h4>
<p>When a user asks a question:</p>
<ol>
<li>Convert the question to an embedding</li>
<li>Find the K most similar chunks (using cosine similarity)</li>
<li>Return those chunks as context</li>
</ol>
<pre><code class="lang-python">question = <span class="hljs-string">"What was Q2 revenue?"</span>
question_embedding = encoder.encode(question)
similar_chunks = vector_db.search(question_embedding, k=<span class="hljs-number">5</span>)
</code></pre>
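<p>The encoder and vector database above are placeholders. A self-contained toy version of the same search, with hand-made three-dimensional "embeddings" and brute-force cosine similarity (the chunk ids and vectors are invented), looks like this:</p>

```python
import math

def cosine(a, b):
    # normalized dot product between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def search(query_vec, index, k=2):
    # rank every stored chunk by similarity to the query (brute force;
    # a real vector database does this with approximate nearest neighbors)
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

index = {
    "q2-revenue-table": [0.9, 0.1, 0.0],
    "hiring-plan": [0.1, 0.9, 0.2],
    "q1-revenue-table": [0.8, 0.2, 0.1],
}
print(search([0.85, 0.15, 0.05], index))  # the two revenue chunks rank first
```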
<h4 id="heading-step-3-generation">Step 3: Generation</h4>
<p>Feed the retrieved context plus the question to an LLM:</p>
<pre><code>Context: [Retrieved chunks about Q2 revenue]
<span class="hljs-attr">Question</span>: What was Q2 revenue?

Answer: Based on the financial report, Q2 revenue was $<span class="hljs-number">65</span>,<span class="hljs-number">000</span> <span class="hljs-keyword">for</span>
the Widget product line and $<span class="hljs-number">45</span>,<span class="hljs-number">000</span> <span class="hljs-keyword">for</span> Gadgets, totaling $<span class="hljs-number">110</span>,<span class="hljs-number">000.</span>
</code></pre><p>The LLM generates an answer grounded in your actual documents, not its training data.</p>
<hr />
<h3 id="heading-object-detection-with-yolox">Object Detection with YOLOX</h3>
<p>For intelligent document analysis, you need to detect where different elements are located.</p>
<p><strong>The Model</strong>: NVIDIA provides specialized models like <code>nv-yolox-page-elements</code> trained specifically for document analysis.</p>
<p><strong>What It Detects</strong>:</p>
<ul>
<li>Tables</li>
<li>Charts and graphs</li>
<li>Titles and headers</li>
<li>Figures and images</li>
</ul>
<p><strong>How It Works</strong>:</p>
<ol>
<li>Process each page as an image</li>
<li>Model outputs bounding boxes with confidence scores</li>
<li>Use boxes to crop and extract specific regions</li>
</ol>
<pre><code>Page Image ──&gt; [YOLOX Model] ──&gt; Detected Regions:
  • Table at (<span class="hljs-number">100</span>, <span class="hljs-number">200</span>, <span class="hljs-number">500</span>, <span class="hljs-number">400</span>) - Confidence: <span class="hljs-number">0.95</span>
  • Chart at (<span class="hljs-number">100</span>, <span class="hljs-number">450</span>, <span class="hljs-number">500</span>, <span class="hljs-number">650</span>) - Confidence: <span class="hljs-number">0.89</span>
  • Title at (<span class="hljs-number">50</span>, <span class="hljs-number">50</span>, <span class="hljs-number">400</span>, <span class="hljs-number">80</span>) - Confidence: <span class="hljs-number">0.97</span>
</code></pre><p>This enables intelligent routing: text goes to OCR, tables go to table extractors, charts go to visual analysis models.</p>
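<p>A hypothetical sketch of that routing step, with invented labels, boxes, and handler names, is just a confidence filter plus a dispatch table:</p>

```python
# map each detected element type to a suitable downstream extractor
HANDLERS = {"table": "table-transformer", "chart": "vision-language-model", "title": "ocr"}

def route(regions, min_conf=0.8):
    # keep confident detections with a known handler; drop the rest
    plan = []
    for region in regions:
        if region["conf"] >= min_conf and region["label"] in HANDLERS:
            plan.append((region["label"], HANDLERS[region["label"]]))
    return plan

regions = [
    {"label": "table", "box": (100, 200, 500, 400), "conf": 0.95},
    {"label": "chart", "box": (100, 450, 500, 650), "conf": 0.89},
    {"label": "title", "box": (50, 50, 400, 80), "conf": 0.97},
    {"label": "chart", "box": (0, 0, 10, 10), "conf": 0.40},  # filtered out
]
print(route(regions))
```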
<hr />
<h3 id="heading-handling-large-documents">Handling Large Documents</h3>
<p>Real documents can be hundreds of pages. Processing all at once is impractical.</p>
<p><strong>The Solution</strong>: Batch processing with memory management.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Process in batches of 10 pages</span>
<span class="hljs-keyword">for</span> start_page <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, total_pages, <span class="hljs-number">10</span>):
    end_page = min(start_page + <span class="hljs-number">10</span>, total_pages)
    batch = extract_pages(document, start_page, end_page)
    process_batch(batch)
    save_results(batch)
    clear_memory()  <span class="hljs-comment"># Prevent memory overflow</span>
</code></pre>
<p>Each batch is processed independently, results are saved, and memory is cleared before the next batch.</p>
<hr />
<h3 id="heading-practical-example-processing-a-technical-datasheet">Practical Example: Processing a Technical Datasheet</h3>
<p>Let's walk through processing NVIDIA's Grace-Blackwell datasheet (a real example from the training):</p>
<p><strong>Input</strong>: 20-page PDF with specifications, architecture diagrams, and performance tables</p>
<p><strong>Processing Steps</strong>:</p>
<ol>
<li><strong>Partition</strong>: Identify 150+ elements across 20 pages</li>
<li><strong>Extract Text</strong>: Pull out 45 text blocks with semantic chunking</li>
<li><strong>Extract Tables</strong>: Identify 12 specification tables, convert to HTML</li>
<li><strong>Extract Figures</strong>: Locate 8 architecture diagrams</li>
<li><strong>Index</strong>: Embed all content into vector database</li>
<li><strong>Query</strong>: "What are the memory bandwidth specs?"</li>
</ol>
<p><strong>Result</strong>: System retrieves relevant table chunks and generates accurate answer with source citations.</p>
<hr />
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><p><strong>Document processing is inherently multimodal</strong>: Text, tables, images all carry information</p>
</li>
<li><p><strong>Smart chunking preserves meaning</strong>: Semantic boundaries beat arbitrary character limits</p>
</li>
<li><p><strong>Tables need special handling</strong>: Spatial structure encodes relationships that linear text loses</p>
</li>
<li><p><strong>Object detection enables routing</strong>: YOLOX identifies what's where so appropriate extractors can be used</p>
</li>
<li><p><strong>RAG grounds AI in your data</strong>: Retrieved context prevents hallucination and enables factual answers</p>
</li>
<li><p><strong>Batch processing handles scale</strong>: Process large documents in manageable chunks to control memory</p>
</li>
</ol>
<hr />
<h3 id="heading-when-to-use-document-rag">When to Use Document RAG</h3>
<ul>
<li><strong>Enterprise Knowledge Bases</strong>: Make company documentation searchable and queryable</li>
<li><strong>Legal Document Analysis</strong>: Extract clauses, find precedents, compare contracts</li>
<li><strong>Financial Analysis</strong>: Query annual reports, extract metrics from filings</li>
<li><strong>Technical Documentation</strong>: Create intelligent assistants for product manuals</li>
<li><strong>Research</strong>: Build queryable databases of academic papers</li>
</ul>
<hr />
<h3 id="heading-whats-next">What's Next?</h3>
<p>In Part 4, we'll explore the most exciting frontier: <strong>Video Understanding and Graph-RAG</strong>. You'll learn how AI can watch, understand, and answer questions about video content, and how knowledge graphs enable complex reasoning that simple vector search cannot achieve.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience, consider enrolling in their official courses.</em></p>
]]></content:encoded></item><item><title><![CDATA[OCR on Engineering Drawings with a 0.9B Vision-Language Model]]></title><description><![CDATA[Late last year, I started exploring how to extract metadata from product drawings. Part numbers, material specifications, revision history, manufacturing process notes. The kind of information that lives in title blocks and needs to end up in a PLM d...]]></description><link>https://thedatasense.com/ocr-on-engineering-drawings-with-a-09b-vision-language-model</link><guid isPermaLink="true">https://thedatasense.com/ocr-on-engineering-drawings-with-a-09b-vision-language-model</guid><category><![CDATA[Drawings]]></category><category><![CDATA[pdf]]></category><category><![CDATA[OCR ]]></category><category><![CDATA[paddlepaddle]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 07 Jan 2026 05:29:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767764396726/0f4d1dc4-5746-4201-9a13-10acb3eac70f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Late last year, I started exploring how to extract metadata from product drawings. Part numbers, material specifications, revision history, manufacturing process notes. The kind of information that lives in title blocks and needs to end up in a PLM database. I tried various OCR techniques, but with the tolerance callouts and dimensions it was a mess, and I stretched the limits of what regular expressions can do. Then I found <a target="_blank" href="https://ernie.baidu.com/blog/posts/paddleocr-vl/">PaddleOCR-VL</a>. It is a Vision Language Model (VLM) with a few preprocessors, fine-tuned for OCR tasks.</p>
<p>PaddleOCR-VL-0.9B integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model. It uses a two-stage approach. First, PP-DocLayoutV2 performs layout analysis, localizing semantic regions and predicting reading order. Then PaddleOCR-VL-0.9B recognizes the content. A post-processing module outputs structured Markdown and JSON. On OmniDocBench v1.5, it achieves an overall score of 92.56, surpassing MinerU2.5-1.2B (90.67) and general VLMs like Qwen2.5-VL-72B. A model 80 times smaller achieving higher accuracy.</p>
<p>For my use case, I used a two-stage pipeline:</p>
<pre><code class="lang-plaintext">PDF → Images → PaddleOCR-VL (OCR) → Qwen3-0.6B (Extraction) → Structured JSON
</code></pre>
<p>The input is the entire drawing as a PDF.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767763315772/2f1d9aba-64f3-4844-a9cc-1c442bb12dec.png" alt class="image--center mx-auto" /></p>
<p>PaddleOCR-VL handles the OCR. Then I pass the extracted text to Qwen3-0.6B, a 600M parameter LLM, for structured information extraction. No complex regex patterns. The LLM figures out which text corresponds to which field.</p>
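<p>A rough sketch of the extraction step is below. The prompt wording and the stubbed model reply are invented; only the field list and the JSON shape match my pipeline. The key idea is that the LLM returns JSON, which is then validated against the expected schema:</p>

```python
import json

FIELDS = ["part_number", "drawing_number", "material", "finish", "description", "product"]

def build_prompt(ocr_text):
    # hypothetical instruction for the extraction LLM
    return (
        "From the OCR text below, return JSON with exactly these keys: "
        + ", ".join(FIELDS) + ". Use null when a field is absent.\n\n" + ocr_text
    )

def parse_reply(reply):
    # validate the model's JSON reply against the expected schema
    data = json.loads(reply)
    return {key: data.get(key) for key in FIELDS}

# stand-in for the actual Qwen3-0.6B call
reply = '{"part_number": "3814200", "material": "Polycarbonate", "finish": "poli"}'
record = parse_reply(reply)
print(record["part_number"], record["product"])
```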
<p>The output looks like this:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"part_number"</span>: <span class="hljs-string">"3814200"</span>,
  <span class="hljs-attr">"drawing_number"</span>: <span class="hljs-string">"4095700.M00.027PI1/2"</span>,
  <span class="hljs-attr">"material"</span>: <span class="hljs-string">"Polycarbonate (makrolon cristal ref:2458)"</span>,
  <span class="hljs-attr">"finish"</span>: <span class="hljs-string">"poli"</span>,
  <span class="hljs-attr">"description"</span>: <span class="hljs-string">"CAPOT INTERRUPTEUR / SWITCH COVER"</span>,
  <span class="hljs-attr">"product"</span>: <span class="hljs-string">"LEGENDAIR XL2 US"</span>
}
</code></pre>
<p>The whole thing runs on a laptop with 16GB RAM. A GPU helps but is not required. Even after multiple waves of digital transformation, product manufacturers have accumulated vast archives of engineering drawings that contain the recipe: part numbers, material specifications, supplier references, revision histories. Cloud-based OCR means these documents leave your network, where they might be logged or used for training, which could lead to IP leaks.</p>
<p><strong><em>VLMs for OCR are promising, and a 0.9B parameter model changes the calculus: it runs locally on a machine without network access, so documents never leave your infrastructure. The Apache 2.0 license allows free commercial use.</em></strong></p>
<p>I have shared my extraction pipeline on GitHub: <a target="_blank" href="https://github.com/thedatasense/PaddleOCR_Engineering_Drawings">PaddleOCR_Engineering_Drawings</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: Part 2]]></title><description><![CDATA[Contrastive Learning: Teaching AI That a Picture is Worth a Thousand Words
This is Part 2 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.

The Big Question: How Do You Connect Pictures an...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-2</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-2</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[contrastive learning]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 07 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/5Q07sS54D0Q/upload/6f2d8d4b9c8674cf6a049c008e040e89.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-contrastive-learning-teaching-ai-that-a-picture-is-worth-a-thousand-words">Contrastive Learning: Teaching AI That a Picture is Worth a Thousand Words</h2>
<p><em>This is Part 2 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.</em></p>
<hr />
<h3 id="heading-the-big-question-how-do-you-connect-pictures-and-words">The Big Question: How Do You Connect Pictures and Words?</h3>
<p>Here's a puzzle: You have a photo of a golden retriever playing fetch, and you have the text "a happy dog catching a frisbee." To you, these obviously go together. But to a computer, an image is just a grid of numbers, and text is a sequence of characters. They're completely different data types.</p>
<p>How do we teach AI that these two things represent the same concept?</p>
<p>The answer is <strong>Contrastive Learning</strong>, and it's the secret sauce behind revolutionary models like OpenAI's CLIP and forms the foundation of modern image search, text-to-image generation, and visual question answering.</p>
<hr />
<h3 id="heading-the-embedding-space-a-universe-where-ideas-live">The Embedding Space: A Universe Where Ideas Live</h3>
<p>Before we dive into contrastive learning, we need to understand <strong>embeddings</strong>.</p>
<p><strong>The Analogy</strong>: Imagine a massive library where every book has a specific location. Similar books are shelved near each other. Mystery novels are in one section, cooking books in another, and within cooking, Italian cuisine is close to French cuisine.</p>
<p>An <strong>embedding</strong> is like giving every piece of data (an image, a sentence, a sound) coordinates in this library. The magic is that similar concepts get similar coordinates, regardless of their original format.</p>
<p>So when we "embed" an image of a dog and the word "dog," if done correctly, both should end up in the same neighborhood of this mathematical space.</p>
<pre><code>Image <span class="hljs-keyword">of</span> dog  ──&gt; [Image Encoder] ──&gt; [<span class="hljs-number">0.8</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>, ...] ──┐
                                                              ├──&gt; Close together!
Text <span class="hljs-string">"a dog"</span>  ──&gt; [Text Encoder]  ──&gt; [<span class="hljs-number">0.79</span>, <span class="hljs-number">0.21</span>, <span class="hljs-number">0.48</span>, ...] ┘
</code></pre><hr />
<h3 id="heading-contrastive-learning-learning-by-comparison">Contrastive Learning: Learning by Comparison</h3>
<p><strong>The Analogy</strong>: Imagine you're teaching a child to identify animals using flashcards. You show them two cards and ask: "Are these the same animal?"</p>
<ul>
<li>Show a dog photo and say "dog" → "Yes, same!"</li>
<li>Show a dog photo and say "cat" → "No, different!"</li>
</ul>
<p>Through thousands of these comparisons, the child learns what "dog" means without you ever explicitly defining it.</p>
<p>Contrastive learning works the same way. You don't tell the model what a dog is. Instead, you show it:</p>
<ul>
<li><strong>Positive pairs</strong>: Image of dog + text "dog" (these should be similar)</li>
<li><strong>Negative pairs</strong>: Image of dog + text "cat" (these should be different)</li>
</ul>
<p>The model learns to push positive pairs together and pull negative pairs apart in the embedding space.</p>
<hr />
<h3 id="heading-the-math-behind-the-magic-cosine-similarity">The Math Behind the Magic: Cosine Similarity</h3>
<p>How do we measure if two embeddings are "close"?</p>
<p><strong>The Analogy</strong>: Imagine two arrows pointing from the center of a room. If they point in the same direction, they're similar. If they point in opposite directions, they're different. The angle between them tells you how similar they are.</p>
<p><strong>Cosine Similarity</strong> measures exactly this. It calculates the angle between two vectors (embeddings):</p>
<ul>
<li><strong>Score of 1.0</strong>: Pointing in the exact same direction (identical meaning)</li>
<li><strong>Score of 0.0</strong>: Perpendicular (unrelated)</li>
<li><strong>Score of -1.0</strong>: Opposite directions (opposite meaning)</li>
</ul>
<p>The formula normalizes vectors to unit length (arrows of length 1) so we only care about direction, not magnitude.</p>
<pre><code>Similarity = (A · B) / (|A| × |B|)

<span class="hljs-attr">Where</span>:
- A · B is the dot product
- |A| and |B| are the magnitudes
</code></pre><hr />
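<p>The cosine similarity formula in executable form, on toy two-dimensional vectors:</p>

```python
import math

def cosine_similarity(a, b):
    # normalized dot product: measures direction match, ignores magnitude
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

print(cosine_similarity([1, 0], [2, 0]))   # same direction: 1.0
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite: -1.0
```

<p>Note that [1, 0] and [2, 0] score a perfect 1.0 despite different magnitudes; only direction matters.</p>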
<h3 id="heading-building-a-clip-style-model-step-by-step">Building a CLIP-Style Model: Step by Step</h3>
<p>Let's walk through how this works in practice, using a simplified example from NVIDIA's training.</p>
<h4 id="heading-step-1-create-two-encoder-networks">Step 1: Create Two Encoder Networks</h4>
<p>You need one encoder for each modality:</p>
<pre><code>Image Encoder: Takes images → Produces image embeddings
Text Encoder:  Takes text   → Produces text embeddings
</code></pre><p>These can be any architecture (CNNs for images, Transformers for text). The key is that both produce vectors of the same size.</p>
<h4 id="heading-step-2-normalize-the-embeddings">Step 2: Normalize the Embeddings</h4>
<p>Before comparing, we normalize all embeddings to unit vectors. This ensures we're comparing direction only.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Normalize to unit vectors</span>
image_embedding = F.normalize(image_embedding, dim=<span class="hljs-number">1</span>)
text_embedding = F.normalize(text_embedding, dim=<span class="hljs-number">1</span>)
</code></pre>
<h4 id="heading-step-3-calculate-the-similarity-matrix">Step 3: Calculate the Similarity Matrix</h4>
<p>For a batch of N image-text pairs:</p>
<ul>
<li>Row i contains similarities between image i and all N texts</li>
<li>Column j contains similarities between text j and all N images</li>
<li>The diagonal should be high (matching pairs)</li>
<li>Off-diagonal should be low (non-matching pairs)</li>
</ul>
<pre><code>              Text1   Text2   Text3   Text4
Image1      [ <span class="hljs-number">0.95</span>   <span class="hljs-number">0.10</span>    <span class="hljs-number">0.05</span>    <span class="hljs-number">0.12</span> ]  ← Image1 matches Text1
Image2      [ <span class="hljs-number">0.08</span>   <span class="hljs-number">0.92</span>    <span class="hljs-number">0.15</span>    <span class="hljs-number">0.20</span> ]  ← Image2 matches Text2
Image3      [ <span class="hljs-number">0.12</span>   <span class="hljs-number">0.18</span>    <span class="hljs-number">0.89</span>    <span class="hljs-number">0.10</span> ]  ← Image3 matches Text3
Image4      [ <span class="hljs-number">0.05</span>   <span class="hljs-number">0.22</span>    <span class="hljs-number">0.08</span>    <span class="hljs-number">0.91</span> ]  ← Image4 matches Text4
</code></pre><h4 id="heading-step-4-apply-cross-entropy-loss">Step 4: Apply Cross-Entropy Loss</h4>
<p>We treat this as a classification problem. For each image, the correct text is its "class." We use cross-entropy loss to:</p>
<ul>
<li>Maximize diagonal values (correct pairs)</li>
<li>Minimize off-diagonal values (wrong pairs)</li>
</ul>
<p>The loss is computed in both directions:</p>
<ol>
<li>Given image, predict correct text</li>
<li>Given text, predict correct image</li>
</ol>
<p>Final loss = Average of both directions</p>
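<p>A minimal pure-Python sketch of this symmetric loss follows; the similarity matrix is a toy one, and real implementations also scale the scores by a learned temperature before the softmax:</p>

```python
import math

def cross_entropy_rows(matrix):
    # softmax cross-entropy where the correct "class" for row i is column i
    total = 0.0
    for i, row in enumerate(matrix):
        exps = [math.exp(score) for score in row]
        total -= math.log(exps[i] / sum(exps))
    return total / len(matrix)

def contrastive_loss(sim):
    # image-to-text direction uses rows; text-to-image uses columns
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(sim_t))

sim = [[0.95, 0.10, 0.05],
       [0.08, 0.92, 0.15],
       [0.12, 0.18, 0.89]]
uniform = [[0.5, 0.5, 0.5]] * 3
print(contrastive_loss(sim) < contrastive_loss(uniform))  # True
```

<p>Training drives the diagonal up and the off-diagonal down, which is exactly what lowers this loss.</p>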
<hr />
<h3 id="heading-a-practical-example-fashion-item-search">A Practical Example: Fashion Item Search</h3>
<p>NVIDIA's training demonstrates this with the FashionMNIST dataset. The twist? Instead of pairing images with text, they pair original images with their edge-detected outlines (using Sobel filters).</p>
<p><strong>The Use Case</strong>: Build a system where you can sketch a rough outline of clothing, and the system finds matching products.</p>
<p><strong>How It Works</strong>:</p>
<ol>
<li>Take images of t-shirts, pants, shoes, etc.</li>
<li>Extract edge outlines using Sobel filters (simulating hand-drawn sketches)</li>
<li>Train contrastively: Original image ↔ Outline should be close</li>
<li>At inference: User draws a sketch → System finds images with similar embeddings</li>
</ol>
<p>This is the foundation of <strong>visual search systems</strong> used by e-commerce platforms.</p>
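<p>The Sobel step can be sketched in pure Python on a tiny grayscale grid, standing in for the filter that turns product photos into sketch-like outlines:</p>

```python
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # horizontal-gradient kernel
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # vertical-gradient kernel

def sobel(img):
    # gradient magnitude at each interior pixel (borders left at zero)
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(KX[i][j] * img[y - 1 + i][x - 1 + j] for i in range(3) for j in range(3))
            gy = sum(KY[i][j] * img[y - 1 + i][x - 1 + j] for i in range(3) for j in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

img = [[0, 0, 1, 1]] * 4   # dark left half, bright right half
edges = sobel(img)
print(edges[1])            # strongest response sits along the boundary
```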
<hr />
<h3 id="heading-cross-modal-projection-the-bridge-between-worlds">Cross-Modal Projection: The Bridge Between Worlds</h3>
<p>Contrastive learning creates aligned embeddings, but sometimes you need to go further. What if you have a model trained on LiDAR data, and you want to use RGB images instead?</p>
<p><strong>The Analogy</strong>: Imagine you have an expert translator who only speaks Japanese. You speak English. Instead of training a new expert, you hire an interpreter (a projector) who converts your English into Japanese.</p>
<p><strong>Cross-Modal Projection</strong> trains a simple neural network to convert embeddings from one modality space to another:</p>
<pre><code>RGB Embedding ──&gt; [Projector Network] ──&gt; LiDAR Embedding Space
</code></pre><p>The projector is typically just a few linear layers, trained using Mean Squared Error (MSE) loss to match the target embeddings.</p>
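<p>A toy version of that training loop, with all embeddings invented: learn a 2×2 linear map from "RGB" vectors to "LiDAR" vectors by gradient descent on MSE. A real projector would be a small PyTorch module, but the mechanics are the same:</p>

```python
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # source-space embeddings
tgt = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]   # targets follow y = diag(2, 3) x

W = [[0.0, 0.0], [0.0, 0.0]]                 # projector weights, start at zero
lr = 0.1
for _ in range(500):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(src, tgt):
        # forward pass: pred = W x, then accumulate the MSE gradient
        pred = [sum(W[i][j] * x[j] for j in range(2)) for i in range(2)]
        for i in range(2):
            for j in range(2):
                grad[i][j] += 2 * (pred[i] - y[i]) * x[j] / len(src)
    for i in range(2):
        for j in range(2):
            W[i][j] -= lr * grad[i][j]

print(W)  # W converges toward [[2, 0], [0, 3]], the map that generated tgt
```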
<h4 id="heading-why-this-matters">Why This Matters</h4>
<ol>
<li><p><strong>Reuse Expensive Models</strong>: LiDAR models are expensive to train. Projection lets you reuse them with cheaper RGB data.</p>
</li>
<li><p><strong>Missing Modality at Inference</strong>: Your training data has both RGB and depth, but your deployment camera only captures RGB. Project to fill the gap.</p>
</li>
<li><p><strong>Transfer Learning</strong>: Project from a modality where you have lots of data to one where you have less.</p>
</li>
</ol>
<hr />
<h3 id="heading-two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>For complex multimodal systems, NVIDIA recommends a two-stage approach:</p>
<p><strong>Stage 1: Alignment</strong>
Train the projector to align embeddings using frozen pre-trained encoders.</p>
<ul>
<li>Freeze: Both encoders</li>
<li>Train: Projector only</li>
<li>Loss: MSE between projected and target embeddings</li>
</ul>
<p><strong>Stage 2: Fine-tuning</strong>
Optionally unfreeze everything and fine-tune end-to-end for your specific task.</p>
<ul>
<li>Unfreeze: Everything (or selectively)</li>
<li>Train: Whole pipeline</li>
<li>Loss: Task-specific (classification, regression, etc.)</li>
</ul>
<p>This staged approach prevents catastrophic forgetting and ensures stable training.</p>
<hr />
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><p><strong>Embeddings are coordinates in meaning-space</strong>: Similar concepts cluster together regardless of original data type</p>
</li>
<li><p><strong>Contrastive learning teaches through comparison</strong>: Push matching pairs together, pull non-matching pairs apart</p>
</li>
<li><p><strong>Cosine similarity measures directional alignment</strong>: Normalized dot product tells you how "same direction" two vectors point</p>
</li>
<li><p><strong>Cross-modal projection bridges modality gaps</strong>: A simple network can translate between embedding spaces</p>
</li>
<li><p><strong>Two-stage training is more stable</strong>: First align embeddings, then fine-tune for your task</p>
</li>
</ol>
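<p>The cosine similarity takeaway fits in a few lines of plain Python:</p>

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms: direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 0], [1, 0])   # 1.0: same direction
cosine_similarity([1, 0], [0, 1])   # 0.0: orthogonal
cosine_similarity([2, 0], [5, 0])   # 1.0: scaling a vector changes nothing
```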
<hr />
<h3 id="heading-real-world-applications">Real-World Applications</h3>
<ul>
<li><strong>Image Search</strong>: Type "sunset over mountains" → Find matching photos (CLIP)</li>
<li><strong>Product Discovery</strong>: Upload a photo → Find similar products (Pinterest, Amazon)</li>
<li><strong>Content Moderation</strong>: Align images with violation categories for detection</li>
<li><strong>Accessibility</strong>: Connect images to audio descriptions for visually impaired users</li>
<li><strong>Robotics</strong>: Align camera views with depth sensors for navigation</li>
</ul>
<hr />
<h3 id="heading-whats-next">What's Next?</h3>
<p>In Part 3, we'll explore how to extract and process multimodal data from documents using <strong>OCR and RAG pipelines</strong>. You'll learn how AI can read PDFs, extract tables and images, and build searchable knowledge bases from unstructured documents.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience, consider enrolling in their official courses.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models : Part 1]]></title><description><![CDATA[Understanding How AI Learns to See, Hear, and Feel All at Once
Why Do We Need Multimodal AI?
Imagine you're trying to identify a fruit in complete darkness. You can feel its round shape, its smooth skin, and smell its citrusy aroma. Now imagine you c...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-1</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-1</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[NVIDIA]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Mon, 05 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768192457344/6f0d1749-9534-4f1b-b413-0e9e561ad856.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-understanding-how-ai-learns-to-see-hear-and-feel-all-at-once">Understanding How AI Learns to See, Hear, and Feel All at Once</h2>
<h3 id="heading-why-do-we-need-multimodal-ai">Why Do We Need Multimodal AI?</h3>
<p>Imagine you're trying to identify a fruit in complete darkness. You can feel its round shape, its smooth skin, and smell its citrusy aroma. Now imagine you can only see a photo of it but can't touch or smell it. In either case alone, you might confuse an orange with a tangerine. But combine all your senses together, and suddenly the identification becomes much easier.</p>
<p>This is exactly the challenge AI faces. Traditional AI models are like humans with only one sense. A camera sees colors but doesn't understand depth. A LiDAR sensor measures precise distances but sees the world in points, not colors. Neither alone tells the complete story.</p>
<h2 id="heading-multimodal-ai-is-about-teaching-machines-to-combine-multiple-senses-to-understand-the-world-more-completely"><strong>Multimodal AI is about teaching machines to combine multiple "senses" to understand the world more completely.</strong></h2>
<h3 id="heading-the-core-problem-different-data-types-dont-speak-the-same-language">The Core Problem: Different Data Types Don't Speak the Same Language</h3>
<p>Here's where it gets interesting. When you combine senses, your brain does it effortlessly. But for computers, mixing an image (a grid of pixels) with depth data (a cloud of 3D points) is like trying to add apples and equations together. They're fundamentally different.</p>
<p>Think of it like this:</p>
<ul>
<li><strong>RGB Image Data</strong>: A painting on a flat canvas with colors</li>
<li><strong>LiDAR Point Cloud</strong>: A 3D sculpture made of tiny dots</li>
<li><strong>Text</strong>: A story written in words</li>
<li><strong>Audio</strong>: Vibrations over time</li>
</ul>
<h2 id="heading-the-magic-of-multimodal-ai-lies-in-finding-smart-ways-to-combine-these-completely-different-data-formats">The magic of multimodal AI lies in finding smart ways to combine these completely different data formats.</h2>
<h3 id="heading-the-three-fusion-strategies-when-to-combine-your-ingredients">The Three Fusion Strategies: When to Combine Your Ingredients</h3>
<p>Just like cooking, the order in which you combine ingredients matters. NVIDIA's training introduces three fundamental approaches to fusion, each with its own strengths.</p>
<h4 id="heading-1-early-fusion-mix-everything-at-the-start">1. Early Fusion: Mix Everything at the Start</h4>
<p><strong>The Analogy</strong>: Making a smoothie. You throw all your fruits into the blender right at the beginning and blend them together.</p>
<p><strong>How It Works</strong>: Concatenate (stack) all your input data together before feeding it into a single neural network. If your image has 3 color channels (RGB) and your depth map has 1 channel, you create a 4-channel input.</p>
<p><strong>When to Use It</strong>:</p>
<ul>
<li>When your modalities capture complementary low-level features</li>
<li>When the raw data naturally aligns (same resolution, same timestamps)</li>
<li>When you want a simpler, more efficient architecture</li>
</ul>
<p><strong>The Trade-off</strong>: You're betting that the network can figure out how to use both data types from the very beginning. Sometimes this works beautifully. Other times, the model gets confused trying to learn two things at once.</p>
<pre><code>Input A ─┐
         ├──&gt; [Concatenate] ──&gt; [Single Neural Network] ──&gt; Output
Input B ─┘
</code></pre><h4 id="heading-2-late-fusion-let-experts-work-separately-then-vote">2. Late Fusion: Let Experts Work Separately, Then Vote</h4>
<p><strong>The Analogy</strong>: A panel of specialist doctors. The eye doctor examines vision, the hearing specialist checks audio, and at the end they meet to discuss and reach a combined diagnosis.</p>
<p><strong>How It Works</strong>: Train separate neural networks for each modality. Each network becomes an expert at its own data type. At the very end, combine their predictions (by averaging, voting, or concatenating).</p>
<p><strong>When to Use It</strong>:</p>
<ul>
<li>When each modality has unique patterns that require specialized learning</li>
<li>When you want modality-specific interpretability</li>
<li>When you have pre-trained models for individual modalities</li>
</ul>
<p><strong>The Trade-off</strong>: You need more parameters (two full networks instead of one). But each network can fully focus on mastering its own domain without interference.</p>
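<p>A tiny NumPy sketch of late fusion by probability averaging, with hypothetical logits from each single-modality "expert" for a three-class problem:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits from each expert for three classes.
logits_rgb = np.array([2.0, 0.5, 0.1])
logits_lidar = np.array([1.5, 1.4, 0.2])

# Late fusion: each expert predicts on its own, then probabilities are averaged.
p = (softmax(logits_rgb) + softmax(logits_lidar)) / 2
prediction = int(np.argmax(p))  # both experts lean toward class 0 here
```

<p>Voting and concatenating predictions into a small "combiner" head are the other common variants of this final step.</p>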
<pre><code>Input A ──&gt; [Network A] ──&gt; Prediction A ─┐
                                          ├──&gt; [Combine] ──&gt; Final Output
Input B ──&gt; [Network B] ──&gt; Prediction B ─┘
</code></pre><h4 id="heading-3-intermediate-fusion-meet-in-the-middle">3. Intermediate Fusion: Meet in the Middle</h4>
<p><strong>The Analogy</strong>: Jazz musicians improvising together. Each plays their own instrument, but at key moments they sync up, listen to each other, and let one musician's riff influence another's response.</p>
<p><strong>How It Works</strong>: Each modality has its own pathway that extracts features. At intermediate layers (not the beginning, not the end), these pathways exchange information. This exchange can happen through:</p>
<ul>
<li><strong>Concatenation</strong>: Stacking feature maps together at a middle layer</li>
<li><strong>Matrix Multiplication</strong>: Having features from one modality modulate or gate the other</li>
</ul>
<p><strong>When to Use It</strong>:</p>
<ul>
<li>When you want the best of both worlds</li>
<li>When modalities need some individual processing before they can meaningfully interact</li>
<li>When you need rich cross-modal interactions</li>
</ul>
<p><strong>The Trade-off</strong>: More complex to design. You need to decide where and how fusion happens.</p>
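<p>Both exchange mechanisms fit in a few NumPy lines, with random stand-ins for the mid-network feature maps:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for mid-network features (batch of 8, 16 features each).
feat_rgb = rng.random((8, 16))
feat_lidar = rng.random((8, 16))

# Fusion by concatenation: stack the feature maps at a middle layer.
fused_cat = np.concatenate([feat_rgb, feat_lidar], axis=1)  # shape (8, 32)

# Fusion by gating: one modality modulates the other elementwise.
gate = 1.0 / (1.0 + np.exp(-feat_lidar))  # sigmoid, values in (0, 1)
fused_gate = feat_rgb * gate              # shape (8, 16)
```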
<pre><code>Input A ──&gt; [Early Layers A] ──┐
                               ├──&gt; [Fusion Layer] ──&gt; [Later Layers] ──&gt; Output
Input B ──&gt; [Early Layers B] ──┘
</code></pre><hr />
<h3 id="heading-a-practical-example-colored-cubes-with-rgb-and-lidar">A Practical Example: Colored Cubes with RGB and LiDAR</h3>
<p>NVIDIA's training uses a brilliant example to demonstrate these concepts. Imagine a scene with three cubes: one red, one green, and one blue. Your task is to classify which cube is which.</p>
<p><strong>Challenge 1: RGB Camera Only</strong>
The camera sees colors perfectly. Red cube? Check. Green cube? Check. But wait, where exactly are they in 3D space? The camera flattens everything to 2D. If the cubes overlap visually, things get confusing.</p>
<p><strong>Challenge 2: LiDAR Only</strong>
The LiDAR sensor knows exact 3D positions. It can tell you precisely where each cube sits in space. But all cubes look the same because LiDAR doesn't see color.</p>
<p><strong>The Solution: Combine Both</strong>
With multimodal fusion, the model gets the best of both worlds. LiDAR provides spatial precision while RGB provides color identification. Together, they solve what neither could alone.</p>
<p>This is multimodal AI in action: combining complementary strengths to overcome individual weaknesses.</p>
<hr />
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><p><strong>Multimodal AI combines different data types</strong> (images, text, audio, depth) to create more robust understanding</p>
</li>
<li><p><strong>Fusion timing matters</strong>:</p>
<ul>
<li>Early fusion is simple but requires data compatibility</li>
<li>Late fusion allows specialization but needs more resources</li>
<li>Intermediate fusion offers flexibility but adds complexity</li>
</ul>
</li>
<li><p><strong>Choose your strategy based on your data</strong>: If modalities complement each other at a low level, go early. If they need expertise first, go late. If you need both, go intermediate.</p>
</li>
<li><p><strong>The goal is complementary strengths</strong>: Each modality should bring something unique to the table</p>
</li>
</ol>
<hr />
<h3 id="heading-whats-next">What's Next?</h3>
<p>In Part 2, we'll explore how AI learns to connect completely different modalities through a technique called <strong>Contrastive Learning</strong>. Imagine teaching a computer that a photo of a dog and the word "dog" should live close together in the AI's understanding. This is the foundation of models like CLIP that power modern image search and generation.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience, consider enrolling in their official courses.</em></p>
]]></content:encoded></item><item><title><![CDATA[Why a 0.9B VLM can be a serious OCR engine]]></title><description><![CDATA[In this post, I will discuss PaddleOCR-VL, focusing on what is important for OCR and document parsing: stable layout, high-resolution text capture, low error rates, and fast deployment.The paper’s main claim is simple but important: you can get state...]]></description><link>https://thedatasense.com/why-a-09b-vlm-can-be-a-serious-ocr-engine</link><guid isPermaLink="true">https://thedatasense.com/why-a-09b-vlm-can-be-a-serious-ocr-engine</guid><category><![CDATA[#paddleocr #ocr-vlm]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sat, 03 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/BvqmW7VGRRk/upload/becd1fd58c147e8d263fae168abbb227.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I will discuss <a target="_blank" href="https://arxiv.org/pdf/2510.14528"><strong>PaddleOCR-VL</strong></a>, focusing on what is important for OCR and document parsing: stable layout, high-resolution text capture, low error rates, and fast deployment.The paper’s main claim is simple but important: you can get state of the art document parsing with an ultra compact vision language model, if you design the system around the real constraints of OCR.</p>
<p>For a classic OCR, the stack consists on mechanisms to detect regions with text, recognize the text using algorithms like <a target="_blank" href="https://sid2697.github.io/Blog_Sid/algorithm/2019/10/19/CTC-Loss.html">CTC</a>, with bolt on rules for tables, figures and so on.A VLM changes the contract. Instead of predicting characters from a section, you probe “Given pixels, generate a sequence that encodes the content I want”. I was quick to build a parallel with image captioning tasks but the more we dive in, a captioning task can miss a few axis on the tick labels and still say chart with sales rising and be accurate. But for OCR, the expectation is lossless meaning if you miss a character in a number it can break retrieval, matching and QA.</p>
<h3 id="heading-the-core-architectural-idea-decouple-layout-from-recognition">The core architectural idea - decouple layout from recognition</h3>
<p>The system has three stages:</p>
<ol>
<li><p><strong>PP-DocLayoutV2</strong> for layout detection and reading order</p>
</li>
<li><p><strong>PaddleOCR-VL-0.9B</strong> for element level recognition</p>
</li>
<li><p>A lightweight post step builds Markdown and JSON outputs</p>
</li>
</ol>
<p>The paper’s position is: do not ask the VLM to solve layout implicitly through generation. Make layout explicit with a fast detector plus ordering network, then let the VLM do what it is best at: recognition.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767842813987/d856611f-9dba-49a9-9903-a15bc9eabbec.png" alt class="image--center mx-auto" /></p>
<p>Now let's dig into each of these stages.</p>
<h2 id="heading-stage-1-pp-doclayoutv2">Stage 1: PP-DocLayoutV2</h2>
<p>PP-DocLayoutV2 combines <strong>RT-DETR</strong>, which detects and classifies layout elements, with a lightweight <strong>pointer network</strong> of 6 transformer layers for <strong>reading order prediction</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767844616814/a205df0c-1976-473f-a4f5-fda56361af47.png" alt class="image--center mx-auto" /></p>
<p>The ordering part has details that matter:</p>
<ul>
<li><p>it embeds proposals with absolute 2D positional encodings and class label embeddings</p>
</li>
<li><p>it adds a geometric bias mechanism inspired by Relation-DETR to model pairwise geometry</p>
</li>
<li><p>it predicts an N by N pairwise ordering matrix</p>
</li>
<li><p>it recovers a consistent reading order with a deterministic “win accumulation” decoding algorithm</p>
</li>
</ul>
<p>This is the backbone of the system. If reading order is wrong, our OCR can be perfect and our parsed document is still unusable.</p>
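<p>The paper does not spell out the decoding rule in full, but one plausible reading of "win accumulation" looks like this toy sketch, where <code>M[i][j]</code> is the predicted probability that element <code>i</code> precedes element <code>j</code>:</p>

```python
# An illustration only; the paper's exact decoding rule may differ.
M = [
    [0.0, 0.9, 0.8],  # element 0 likely precedes both others
    [0.1, 0.0, 0.7],  # element 1 likely precedes element 2
    [0.2, 0.3, 0.0],
]

# Accumulate each element's "wins" and sort: more wins = read earlier.
wins = [sum(row) for row in M]
order = sorted(range(len(M)), key=lambda i: wins[i], reverse=True)
```

<p>The point of a deterministic decode like this is that it always yields a single consistent ordering, even when the pairwise predictions are mildly contradictory.</p>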
<h2 id="heading-stage-2-paddleocr-vl-09b">Stage 2: PaddleOCR-VL-0.9B</h2>
<p>PaddleOCR-VL-0.9B follows a LLaVA-inspired structure: vision encoder, projector, language model. Instead of fixed-resolution resizing or tiling, the paper uses <strong>native dynamic high-resolution preprocessing</strong> and a <strong>NaViT-style encoder</strong> initialized from Keye-VL, designed to support native-resolution inputs without distortion. The authors claim this yields fewer hallucinations and stronger performance on text-heavy tasks. That is a big deal for dense documents and drawings, where tiny glyph details decide correctness.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767983051214/5e2a570b-a840-42a0-869c-fe8378abb484.png" alt class="image--center mx-auto" /></p>
<p>The projector is a randomly initialized 2-layer MLP with GELU, using a merge size of 2 to bridge vision features into the language embedding space efficiently. In plain terms: reduce the token burden before the decoder pays attention to everything. Autoregressive decoding cost is tied to decoder size, so the paper explicitly chooses <strong>ERNIE-4.5-0.3B</strong> for inference efficiency and adds <strong>3D-RoPE</strong> for positional representation. The element recognizer is also built via post-adaptation from pretrained weights: Keye-VL for the vision side and ERNIE-4.5-0.3B for the language side.</p>
<h2 id="heading-stage-3-post-processing">Stage 3: Post processing</h2>
<p>After layout and element recognition, PaddleOCR-VL runs a <strong>lightweight post processing module</strong> that aggregates outputs from both stages and formats the final result into <strong>structured Markdown and JSON</strong>.</p>
<p>This is where the system becomes a document parser instead of a bag of OCR strings. What this stage effectively does, based on the paper’s description, is:</p>
<ul>
<li><p>follow the reading order predicted by PP-DocLayoutV2</p>
</li>
<li><p>place each recognized element back into a page level representation</p>
</li>
<li><p>serialize the page into Markdown for human readable output</p>
</li>
<li><p>serialize the same content into JSON for programmatic use</p>
</li>
</ul>
<p>One way to think about it is that Stage 2 gives you “<strong>content</strong>” while Stage 3 gives you “a <strong>document</strong>”. If you care about RAG, this stage is not optional. The paper describes document parsing as a foundation for retrieval and downstream LLM use, especially when combined with RAG systems.</p>
<h2 id="heading-training-approach">Training Approach</h2>
<p>The VLM training is two stage:</p>
<ul>
<li><p>Stage 1 alignment on <strong>29M</strong> image text pairs</p>
</li>
<li><p>Stage 2 instruction fine tuning on <strong>2.7M</strong> samples</p>
</li>
</ul>
<p>The paper also describes a large scale data construction pipeline: over 30M samples collected via public acquisition and synthesis, refined using prompt driven labeling with larger models, plus cleaning to remove low quality or hallucinated annotations.</p>
<h2 id="heading-inference">Inference</h2>
<p>PaddleOCR-VL is also designed to run fast end to end. The paper describes multi-threaded asynchronous execution split into three parallel stages:</p>
<ul>
<li><p>data loading</p>
</li>
<li><p>layout model processing</p>
</li>
<li><p>VLM inference</p>
</li>
</ul>
<p>Data flows through queues. VLM batching triggers when the queue hits a threshold or when items have waited too long, so blocks from different pages can be aggregated for better parallelism. On their end-to-end benchmark, they report that with FastDeploy the system achieves <strong>53.1 percent higher page throughput</strong> and <strong>50.9 percent higher token throughput</strong> than MinerU2.5. In my own experience, I got a throughput of about 45 seconds per page of engineering drawing.</p>
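<p>The batching trigger (the queue hits a threshold, or the oldest item has waited too long) is a common serving pattern. A small stdlib sketch of the idea, not the actual FastDeploy implementation:</p>

```python
import queue
import time

def collect_batch(q, max_batch=8, max_wait=0.05):
    """Pull items until the batch is full or the first item has waited max_wait."""
    batch = [q.get()]  # block until at least one item arrives
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for crop in ["title_block", "table", "note"]:  # crops waiting for VLM inference
    q.put(crop)
batch = collect_batch(q)  # all three, after waiting up to 50 ms for more
```

<p>Because the queue mixes crops from different pages, a batch can fill up even when any single page only contributes a few elements.</p>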
<p>I used PaddleOCR-VL to extract key manufacturing information from engineering drawings. In my analysis, the advantages of this design are that:</p>
<ul>
<li><p>Stage 1 can isolate title blocks, revision tables, notes, and callouts so Stage 2 never has to guess what region matters</p>
</li>
<li><p>Stage 2 can run at high resolution on tight crops, which is exactly what tiny labels need</p>
</li>
<li><p>Stage 3 can output clean Markdown for inspection and JSON for downstream matching</p>
</li>
</ul>
<p>If you want to see this model applied to engineering drawings, read my post <a target="_blank" href="https://hashnode.com/edit/cmk3kwptz000a02kz2fezaphi">OCR on Engineering Drawings with a 0.9B Vision-Language Model</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Multimodal LLMs in Healthcare: What's Actually Working]]></title><description><![CDATA[If you've tried asking ChatGPT to interpret a chest X-ray, you know the answer: it can't. Not because the technology doesn't exist, but because most general-purpose models weren't built for medical imaging.
That's changing fast. A new generation of v...]]></description><link>https://thedatasense.com/multimodal-llms-in-healthcare-whats-actually-working</link><guid isPermaLink="true">https://thedatasense.com/multimodal-llms-in-healthcare-whats-actually-working</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[healthcare]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Fri, 02 Jan 2026 01:33:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_9vLJxxHrBo/upload/af6af3544c2446b4478251907689f1ae.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've tried asking ChatGPT to interpret a chest X-ray, you know the answer: it can't. Not because the technology doesn't exist, but because most general-purpose models weren't built for medical imaging.</p>
<p>That's changing fast. A new generation of vision-language models can now look at scans, read clinical notes, and answer questions about both. Some of these models are already matching specialist performance on diagnostic benchmarks.</p>
<p>Here's what's actually working, what's still experimental, and what it takes to deploy these systems safely.</p>
<h2 id="heading-why-healthcare-needs-multimodal-ai">Why Healthcare Needs Multimodal AI</h2>
<p>Healthcare data doesn't fit neatly into text or images alone. A single patient encounter might include X-rays, MRI scans, lab results, vital signs over time, and pages of clinical notes. Traditionally, each data type required its own specialized model.</p>
<p>Multimodal models change this. A single architecture can detect subtle abnormalities in a scan, summarize a 20-page discharge summary, spot concerning trends in vital signs, and explain its reasoning in plain language. The potential is obvious: faster diagnoses, fewer missed findings, less cognitive load on clinicians.</p>
<p>But potential and reality are different things. Let's look at the models that are actually delivering results.</p>
<h2 id="heading-vision-language-models-that-work-on-medical-images">Vision-Language Models That Work on Medical Images</h2>
<p>These models take an image and a question, then return a text answer. The architecture typically combines a vision encoder (to "see" the image) with a language model (to understand questions and generate responses).</p>
<h3 id="heading-llava-med-15">LLaVA-Med 1.5</h3>
<p>LLaVA-Med pairs a CLIP vision encoder with Vicuna, a 13B parameter language model. The team trained it on 200,000 image-text pairs from PubMed Central, supplemented with synthetic instructions generated by GPT-4.</p>
<p>The results are solid. On radiology and pathology question-answering benchmarks, it matches or beats prior approaches. The architecture is straightforward: the vision encoder extracts image features, an MLP projects them into the language model's embedding space, and the language model handles the rest.</p>
<h3 id="heading-visual-med-alpaca">Visual Med-Alpaca</h3>
<p>This one takes a different approach. Instead of a single end-to-end model, Visual Med-Alpaca uses a routing system. A classifier first determines what type of input it's dealing with, then dispatches to specialized experts (Med-GIT for general medical images, DePlot for charts and graphs). The outputs feed into a LLaMA-7B core with LoRA adapters.</p>
<p>Training data came from 54,000 Q&amp;A pairs drawn from BigBIO and ROCO radiology datasets. The team used GPT-3.5 to generate additional prompts, then filtered them with human review.</p>
<p>One caveat: this is strictly research-use only, with no FDA approval.</p>
<h3 id="heading-chexagent">CheXagent</h3>
<p>CheXagent focuses specifically on chest X-rays. The image encoder (SigLIP-Large) processes 512×512 pixel images through 24 transformer layers. A projection MLP maps those features into a Phi-2.7B decoder trained on medical and scientific text.</p>
<p>The training corpus is impressive: over one million chest X-ray and report pairs, plus 2.7 billion tokens from clinical notes and research articles. The intended use cases include drafting radiology reports, flagging abnormalities, and explaining findings to patients.</p>
<h3 id="heading-medgemma-4b-it">MedGemma-4B-IT</h3>
<p>Google's entry into this space launched in July 2025. It's a decoder-only transformer with 4B parameters, built on the Gemma 3 base. The SigLIP image encoder was pre-trained on de-identified data spanning chest X-rays, dermatology, ophthalmology, and histopathology.</p>
<p>The context window is generous: 128K tokens of text plus images (each 896×896 image converts to 256 tokens). Here's how it compares to the base Gemma model:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Task</td><td>Base Gemma 3 4B</td><td>MedGemma 4B-IT</td></tr>
</thead>
<tbody>
<tr>
<td>MIMIC-CXR macro F1 (top 5)</td><td>81.2</td><td>88.9</td></tr>
<tr>
<td>CheXpert macro F1 (top 5)</td><td>32.6</td><td>48.1</td></tr>
<tr>
<td>CXR14 macro F1 (3 conditions)</td><td>32.0</td><td>50.1</td></tr>
<tr>
<td>SLAKE VQA token F1</td><td>40.2</td><td>72.3</td></tr>
<tr>
<td>PathMCQA histopathology accuracy</td><td>37.1</td><td>69.8</td></tr>
<tr>
<td>EyePACS fundus accuracy</td><td>14.4</td><td>64.9</td></tr>
</tbody>
</table>
</div><p>The improvements are substantial across the board. MedGemma is available on Hugging Face under the Health AI Developer Foundations license, with fine-tuning notebooks on GitHub.</p>
<h2 id="heading-language-models-for-clinical-text">Language Models for Clinical Text</h2>
<p>Not everything in healthcare is an image. Electronic health records contain millions of words: admission notes, progress updates, discharge summaries, lab interpretations. Models trained specifically on this text outperform general-purpose LLMs.</p>
<h3 id="heading-gatortron">GatorTron</h3>
<p>GatorTron comes in sizes from 110M to 8.9B parameters, all trained on 82 billion words of de-identified clinical text. The researchers tested it on concept extraction, relation extraction, clinical inference, and question answering. The finding won't surprise anyone who's followed scaling laws: bigger models and more data improve everything.</p>
<h3 id="heading-few-shot-health-learners">Few-Shot Health Learners</h3>
<p>This work from Google explores whether large language models can handle time-series health data with minimal examples. Starting from PaLM-24B (pre-trained on 780B tokens), the team fine-tuned on ECG waveforms and vital signs using just a handful of examples per task.</p>
<p>The results suggest that LLMs can ground numeric health data surprisingly well. Applications include arrhythmia detection, activity recognition, and estimating calorie expenditure or stress levels from sensor data.</p>
<h2 id="heading-how-do-you-test-these-models">How Do You Test These Models?</h2>
<p>Benchmarks matter. A model that aces one dataset might fail completely in a real clinical setting. Here are the validation sets researchers are using:</p>
<p><strong>NEJM Clinicopathologic Cases</strong> contains 143 diagnostic puzzles from 2021 to 2024, scored on a Bond Scale (0-5) and Likert scale (0-2). These are the kind of cases that stump experienced clinicians.</p>
<p><strong>NEJM Healer Series</strong> walks models through 20 complete patient encounters across four stages: triage, examination, testing, and management. Scoring uses the R-IDEA rubric (0-10).</p>
<p><strong>Grey Matters Management</strong> presents 5 complex scenarios scored on a 100-point rubric. Notably, this benchmark compares GPT-4 against physicians working with and without AI assistance.</p>
<p><strong>MIMIC-IV-Ext Clinical Decision Making</strong> draws from 2,400 emergency department visits for abdominal pain, testing whether models can distinguish appendicitis, cholecystitis, diverticulitis, and pancreatitis.</p>
<p><strong>Probabilistic Reasoning Challenges</strong> test whether models can perform Bayesian inference with lab results. This matters because clinical decision-making is fundamentally probabilistic, and models that give false confidence are dangerous.</p>
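<p>The Bayesian updating these challenges test is easy to state in code. A sketch with hypothetical numbers (10% pre-test probability, a test with 90% sensitivity and 80% specificity):</p>

```python
def post_test_probability(prior, sensitivity, specificity):
    """Bayes' rule for the probability of disease given a positive test."""
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical inputs: 10% pre-test probability, 90% sensitivity, 80% specificity.
p = post_test_probability(prior=0.10, sensitivity=0.90, specificity=0.80)
# The positive result raises the probability only to about 33%.
```

<p>A model that jumps from a 10% prior to near-certainty on a single positive test is showing exactly the false confidence this benchmark is designed to catch.</p>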
<h2 id="heading-what-it-takes-to-deploy-safely">What It Takes to Deploy Safely</h2>
<p>Research performance doesn't guarantee safe clinical use. Several factors separate a promising paper from a deployable system.</p>
<p><strong>Privacy</strong> is non-negotiable. Patient data must be de-identified and encrypted. Models trained on identifiable data face both legal liability and the risk of memorizing sensitive information.</p>
<p><strong>Generalization</strong> trips up many models. Performance on one hospital's data often doesn't transfer to another institution with different patient populations, imaging equipment, or documentation practices. Diverse testing is essential.</p>
<p><strong>Explainability</strong> helps clinicians trust (and appropriately distrust) model outputs. Attention maps, saliency scores, and counterfactual explanations all help, though none fully solve the interpretability problem.</p>
<p><strong>Regulation</strong> remains unsettled. The FDA and CE marking bodies are still working out how to evaluate AI that learns and updates. Liability questions are largely unresolved.</p>
<h2 id="heading-where-this-is-heading">Where This Is Heading</h2>
<p>The immediate future is clear: these models will get better, handle more modalities, and integrate more tightly into clinical workflows.</p>
<p>Longer term, expect models that incorporate genomic data, wearable sensor streams, and even environmental factors. Real-time decision support integrated directly into EHRs is coming. Personalization based on individual patient histories will follow.</p>
<p>The harder problems are institutional and regulatory. Who's liable when an AI-assisted diagnosis is wrong? How do you validate a model that keeps learning? What does informed consent look like when AI is involved in care decisions?</p>
<p>Multimodal LLMs will transform healthcare. The technology is nearly ready. The question is whether our institutions can adapt fast enough to deploy it safely.</p>
<hr />
<h2 id="heading-references">References</h2>
<p>Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., &amp; Gao, J. (2023). LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. <em>arXiv preprint arXiv:2306.00890</em>.</p>
<p>Han, T., Adams, L. C., Papaioannou, J. M., Grundmann, P., Oberhauser, T., Löser, A., Truhn, D., &amp; Bressem, K. K. (2023). MedAlpaca: An open-source collection of medical conversational AI models and training data. <em>arXiv preprint arXiv:2304.08247</em>.</p>
<p>Chen, Z., Diao, S., Wang, B., Wang, H., Liu, T., Hu, Z., &amp; Jiang, L. (2024). CheXagent: Towards a foundation model for chest X-ray interpretation. <em>arXiv preprint arXiv:2401.12208</em>.</p>
<p>Google (2025). MedGemma: Medical vision-language models. <em>Google Health AI Developer Foundations</em>. Retrieved from <a target="_blank" href="https://huggingface.co/google/medgemma">https://huggingface.co/google/medgemma</a></p>
<p>Yang, X., Chen, A., PourNejatian, N., Shin, H. C., Smith, K. E., Parisien, C., Compas, C., Martin, C., Costa, A. B., Flores, M. G., Zhang, Y., Magoc, T., Harle, C. A., Lipori, G., Mitchell, D. A., Hogan, W. R., Shenkman, E. A., Bian, J., &amp; Wu, Y. (2022). GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records. <em>arXiv preprint arXiv:2203.03540</em>.</p>
<p>Rasul, K., Ashok, A., Williams, A. R., Khorasani, M., Adamopoulos, G., Bhagwatkar, R., Biloš, M., Ghonia, H., Hassen, N. V., Anderson, D., Schneider, J., Nevmyvaka, Y., &amp; Rätsch, G. (2023). Medical time-series data generation using generative adversarial networks. <em>Proceedings of Machine Learning Research</em>, 182.</p>
]]></content:encoded></item><item><title><![CDATA[When AI Radiologists Get Confused: The Critical Challenge of VLM Robustness in Medical Diagnostics]]></title><description><![CDATA[Picture this: You’re in the emergency room with chest pain and shortness of breath. The doctor orders a chest X-ray, and while waiting for the radiologist, you pull out your phone. Could ChatGPT help interpret what’s wrong? You’ve used it for math pr...]]></description><link>https://thedatasense.com/when-ai-radiologists-get-confused-the-critical-challenge-of-vlm-robustness-in-medical-diagnostics</link><guid isPermaLink="true">https://thedatasense.com/when-ai-radiologists-get-confused-the-critical-challenge-of-vlm-robustness-in-medical-diagnostics</guid><category><![CDATA[vlms]]></category><category><![CDATA[healthcare]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 15 Oct 2025 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/NMZdj2Zu36M/upload/0ff90e28843bf67f86bbe3318b790a8d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Picture this: You’re in the emergency room with chest pain and shortness of breath. The doctor orders a chest X-ray, and while waiting for the radiologist, you pull out your phone. Could ChatGPT help interpret what’s wrong? You’ve used it for math problems and recipe suggestions. Surely it could read an X-ray?</p>
<p>This isn’t a hypothetical anymore. We’re already there. An Australian study found that 9.9% of adults had used ChatGPT for health questions in just six months, with nearly 40% of non-users considering it. When people get health advice from ChatGPT, nearly half simply follow it. No questions asked, no double-checking with their doctor.</p>
<p>Our recent research reveals that when these sophisticated AI systems move from answering text questions to interpreting medical images, they become dangerously brittle. We evaluated 125 chest X-ray interpretations using state-of-the-art vision language models, including Google’s MedGemma 4B and GPT-4V. Simply changing “vascular dilation” to “vascular congestion” in a question made the same AI system provide completely different diagnoses for the same X-ray image.</p>
<hr />
<h2 id="heading-the-promise-of-vlm-based-radiologists"><strong>The Promise of VLM-Based Radiologists</strong></h2>
<p>When we first started working with vision language models for medical imaging, the promise seemed clear. Unlike traditional AI that just spits out labels like “pneumonia: 87% probability,” these models could actually explain what they saw. They’d tell you why they thought something looked abnormal. You could ask follow-up questions. Feed them a patient’s history and watch them adjust their interpretation accordingly.</p>
<p>But here’s what we discovered matters just as much as accuracy: consistency. We call it robustness in the lab, but what it really means is whether the AI gives you the same answer when you ask the same question slightly differently. Think about it. A radiologist doesn’t suddenly see pneumonia just because you say “chest radiograph” instead of “X-ray.” They know “lung volumes” and “lung capacity” mean the same thing.</p>
<p>Yet that’s exactly what happens with today’s most advanced models. And we’re not talking about edge cases or trick questions. We tested basic medical synonyms, the kind any first-year resident would recognize as identical. The models fell apart.</p>
<hr />
<h2 id="heading-when-terminology-becomes-a-diagnostic-trap"><strong>When Terminology Becomes a Diagnostic Trap</strong></h2>
<p>Let’s walk through real examples from our evaluation that show how catastrophically these models can fail.</p>
<h3 id="heading-case-1-the-vascular-confusion"><strong>Case 1: The Vascular Confusion</strong></h3>
<p>We showed a chest X-ray to one of the most advanced vision language models available and asked about vascular findings. The model correctly identified pulmonary vascular dilation, which is exactly what we’d expect. It’s a widening of blood vessels that might indicate various conditions but isn’t immediately life-threatening.</p>
<p>Then we changed two words. Just two. “Vascular dilation” became “vascular congestion.”</p>
<p><img src="https://bineshkumar.me/assets/case-1.webp" alt /></p>
<p>Suddenly the model was talking about cardiac congestion. Possible heart failure. Recommending completely different follow-up procedures. Same image, nearly identical question, completely different medical pathway. The clinical implications hit us immediately. A patient might get rushed into unnecessary cardiac workup while their actual condition goes untreated. Or worse, someone might start urgent cardiac treatment for what’s actually a non-cardiac issue.</p>
<h3 id="heading-case-2-the-imaginary-pneumonia"><strong>Case 2: The Imaginary Pneumonia</strong></h3>
<p>This one still makes us shake our heads. We had an X-ray showing clear pleural effusion. That’s fluid around the lungs, often serious enough to need drainage. The model saw it correctly when we asked about lung findings.</p>
<p>But when we added the phrase “chest radiograph” to our question? The model suddenly “saw” pneumonia that wasn’t there.</p>
<p><img src="https://bineshkumar.me/assets/case-2.webp" alt /></p>
<p>It didn’t just add pneumonia to its diagnosis. It completely forgot about the pleural effusion and started recommending antibiotics. This isn’t just wrong. It’s actively harmful. A patient with fluid crushing their lungs needs drainage, not antibiotics for an infection that doesn’t exist.</p>
<h3 id="heading-case-3-the-vanishing-diagnosis"><strong>Case 3: The Vanishing Diagnosis</strong></h3>
<p>Perhaps most concerning was when changing “lung volumes” to “lung capacity” made critical findings disappear entirely. The model went from correctly identifying pleural effusion and potential cardiac issues to completely missing the effusion and focusing only on cardiac problems.</p>
<p><img src="https://bineshkumar.me/assets/case-3.webp" alt /></p>
<p>Pleural effusion can kill you if it’s not treated. It can lead to respiratory failure. Yet a simple synonym made the AI blind to its presence. The model confidently described other findings while missing the one thing that might send someone to the ICU.</p>
<p>What makes these failures so unsettling is their unpredictability. You can’t train staff to avoid certain phrases or create a list of “safe” terminology. The brittleness runs deeper than that.</p>
<hr />
<h2 id="heading-making-sense-of-the-brittleness-what-were-learning-in-the-lab"><strong>Making Sense of the Brittleness: What We’re Learning in the Lab</strong></h2>
<p>The brittleness we observed in chest X-ray VLMs sent us down a research rabbit hole. How could models that seem so sophisticated fail so spectacularly when we barely changed our words? We needed a systematic way to measure this vulnerability, which led us to develop VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) here at the SAIL Lab at University of New Haven.</p>
<h3 id="heading-vsf-med-our-systematic-approach"><strong>VSF-Med: Our Systematic Approach</strong></h3>
<p>VSF-Med isn’t just another benchmark. It’s our attempt to quantify exactly how and why these models break. We evaluated 68,478 attack scenarios across five models:</p>
<ul>
<li><p><strong>CheXagent 8B</strong>: Medical-specialized model</p>
</li>
<li><p><strong>Llama 3.2 11B Vision</strong>: General-purpose VLM</p>
</li>
<li><p><strong>GPT-4o</strong>: State-of-the-art multimodal model</p>
</li>
<li><p><strong>Google MedGemma 4B</strong>: Medical-focused model</p>
</li>
<li><p><strong>GPT-4V</strong>: Previous generation flagship</p>
</li>
</ul>
<p>The framework measures vulnerability across nine different attack vectors, giving us concrete numbers for what we’d been observing anecdotally.</p>
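<p>To make the style of check concrete, here is a minimal consistency probe of the kind the framework automates. Everything in it is illustrative, not VSF-Med's actual API: <code>ask</code> stands in for whatever VLM query function you have, and the toy model and phrasing variants are invented to mirror the dilation/congestion failure described above.</p>

```python
def consistency_rate(ask, image_id, question, variants):
    """Fraction of rephrasings that leave the model's answer unchanged.
    `ask(image_id, question)` is a stand-in for any VLM query function."""
    baseline = ask(image_id, question)
    stable = sum(ask(image_id, v) == baseline for v in variants)
    return stable / len(variants)

# Toy stand-in model that flips its answer on the word "congestion",
# mimicking the terminology trap described in Case 1:
def toy_vlm(image_id, question):
    return "cardiac congestion" if "congestion" in question else "vascular dilation"

question = "Describe the vascular dilation findings."
variants = [
    "Describe the vascular widening findings.",    # benign rephrasing
    "Describe the vascular congestion findings.",  # the trap phrase
]
rate = consistency_rate(toy_vlm, "cxr_001", question, variants)  # 0.5
```

<p>A robust model scores 1.0 on such a probe; every point below that is a phrasing that changes the diagnosis.</p>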
<h3 id="heading-the-sobering-results"><strong>The Sobering Results</strong></h3>
<p>What we found was concerning. Even CheXagent 8B, a model specifically trained for medical imaging, showed moderate vulnerability (z-score: 0.68) to our prompt injection attacks. That “dilation vs congestion” problem we showed you earlier? Not an isolated incident.</p>
<p><strong>Key findings</strong>:</p>
<ul>
<li><p>Performance scores ranged from <strong>7 to 97</strong> across different phrasings of identical X-ray questions</p>
</li>
<li><p>Medical-specialized models demonstrated only <strong>36% better resilience</strong> compared to general-purpose VLMs</p>
</li>
<li><p>Current best-in-class models still have vulnerability spreads exceeding 0.3 standard deviations</p>
</li>
</ul>
<p>Better than general models, yes. Good enough for clinical use? Not even close.</p>
<h3 id="heading-why-are-these-models-so-fragile"><strong>Why Are These Models So Fragile?</strong></h3>
<p>We have theories we’re exploring:</p>
<ol>
<li><p><strong>Contrastive Learning Issues</strong>: Many vision-language models use contrastive learning during training, where they learn to match images with text descriptions. This might create brittle associations between specific phrases and visual features.</p>
</li>
<li><p><strong>The Alignment Problem</strong>: These models are fine-tuned to be helpful and responsive, which might make them overeager to provide different answers when prompted differently. They’re trying so hard to be useful that they forget to be consistent.</p>
</li>
<li><p><strong>Medical Language Complexity</strong>: Radiological language is precise but full of synonyms. Models trained on general text might not grasp that “increased opacity” and “increased density” mean the same thing in a chest X-ray context.</p>
</li>
<li><p><strong>Architectural Limitations</strong>: The transformer architecture itself might contribute to this sensitivity. Attention mechanisms that work beautifully for general language tasks might amplify small prompt variations in high-stakes medical contexts.</p>
</li>
</ol>
<hr />
<h2 id="heading-what-this-means-for-medical-ai"><strong>What This Means for Medical AI</strong></h2>
<p>The momentum behind medical AI is undeniable. Just this week, Mayo Clinic Press highlighted how AI is already being used for stroke diagnosis, heart failure detection, and cancer screening. They describe an optimistic future where “AI has the potential to improve the work of human healthcare teams, making care more personal and effective.”</p>
<p>While this enthusiasm is understandable given AI’s promise, our research suggests we need to address fundamental robustness issues before these systems can truly deliver on that potential.</p>
<h3 id="heading-safety-implications"><strong>Safety Implications</strong></h3>
<ol>
<li><p><strong>Diagnostic Inconsistency</strong>: Same image, different terminology = different diagnoses</p>
</li>
<li><p><strong>Clinical Risk</strong>: Unreliable AI could misguide medical decisions</p>
</li>
<li><p><strong>Trust Issues</strong>: Healthcare providers need consistent, predictable AI behavior</p>
</li>
<li><p><strong>Patient Safety</strong>: When people follow AI medical advice without verification, inconsistency becomes dangerous</p>
</li>
</ol>
<h3 id="heading-the-real-world-context"><strong>The Real-World Context</strong></h3>
<p>Remember those statistics we opened with? People are already using these tools for health decisions. An Australian study found that people with limited health literacy use ChatGPT at nearly twice the rate of others (18.4% vs 9.4%). Those from non-English speaking backgrounds? Even higher at 29.2%. These are exactly the populations who might be most vulnerable to inconsistent AI responses.</p>
<hr />
<h2 id="heading-future-directions"><strong>Future Directions</strong></h2>
<p>Our research highlights the urgent need for:</p>
<ol>
<li><p><strong>Robust Training Methods</strong>: VLMs that maintain consistency across terminology variations</p>
</li>
<li><p><strong>Comprehensive Testing</strong>: Systematic evaluation of medical AI before clinical deployment</p>
</li>
<li><p><strong>Safety Frameworks</strong>: Guidelines for reliable medical AI implementation</p>
</li>
<li><p><strong>Architectural Innovations</strong>: New approaches that improve multimodal robustness</p>
</li>
</ol>
<h3 id="heading-open-science-commitment"><strong>Open Science Commitment</strong></h3>
<p>Our VSF-Med framework is completely open source because we believe this problem is too important for any single team to tackle alone. We’ve made it so researchers anywhere can benchmark their medical VLM with a single command, generating over 30,000 adversarial test cases automatically.</p>
<p>We’re diving deeper into architectural modifications that might improve robustness. We’re exploring whether different training objectives could create more stable image-text associations. And we’re working with clinicians to understand which types of brittleness pose the greatest real-world risks.</p>
<hr />
<h2 id="heading-the-path-forward"><strong>The Path Forward</strong></h2>
<p>Medical AI has immense potential to revolutionize healthcare, from reducing diagnostic errors to making expertise available in underserved areas. But as our research shows, we’re not there yet. The brittleness we’ve uncovered isn’t just a technical curiosity—it’s a fundamental barrier to safe clinical deployment.</p>
<p>Until we can get vulnerability spreads much, much lower, these systems remain too fragile for autonomous clinical use. This isn’t about being pessimistic about AI. It’s about being realistic about what needs to be fixed before we can responsibly deploy these powerful tools in life-and-death situations.</p>
<p>Because ultimately, this isn’t just an interesting technical puzzle. It’s about making sure AI tools genuinely help rather than harm when lives are on the line.</p>
]]></content:encoded></item><item><title><![CDATA[A guide to LLM evaluation metrics]]></title><description><![CDATA[No single metric reliably captures LLM output quality. But the right combination of metrics, carefully chosen for your task, gets surprisingly close to human judgment. This guide covers mathematical formulations, failure modes, and runnable code for ...]]></description><link>https://thedatasense.com/a-guide-to-llm-evaluation-metrics</link><guid isPermaLink="true">https://thedatasense.com/a-guide-to-llm-evaluation-metrics</guid><category><![CDATA[LLM's ]]></category><category><![CDATA[Evaluation]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 17 Sep 2025 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770512236838/51c54efa-7ace-4efe-91e8-dc10de27d416.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>No single metric reliably captures LLM output quality. But the right combination of metrics, carefully chosen for your task, gets surprisingly close to human judgment. This guide covers mathematical formulations, failure modes, and runnable code for every major evaluation metric, from classical perplexity through modern LLM-as-judge approaches.</p>
<p>The field has shifted fast since 2023. LLM-based judges now achieve over 80% agreement with human annotators. Meanwhile, n-gram metrics like BLEU persist largely through institutional inertia. Knowing when each metric works, and when it fails, is the difference between rigorous evaluation and self-deception.</p>
<p><a target="_blank" href="https://colab.research.google.com/drive/1pxS1oznBOaS23QAHGcMsZ7sr5gUcZhlb?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>
<p><em>Note: You can run this experiment using the free tier of Google Colab.</em></p>
<h2 id="heading-1-perplexity-and-bits-per-byte-the-intrinsic-baselines">1. Perplexity and bits-per-byte: the intrinsic baselines</h2>
<p>Perplexity remains the default intrinsic metric for language models. It's defined as the exponentiated average negative log-likelihood over a token sequence:</p>
<p>$$\text{PPL}(X) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{&lt;i})\right)$$</p><p>This equals <code>exp(cross-entropy loss)</code>, making it a direct readout of training loss. Lower perplexity means the model assigns higher probability to observed text. GPT-2 large scores about 16.4 PPL on WikiText-2 with sliding-window evaluation (stride=512), compared to 19.4 without overlap. That's a methodological detail that matters more than many researchers realize.</p>
<p>Here's the critical pitfall: <strong>tokenizer dependence</strong>. Perplexity normalizes per token, but different tokenizers produce different token counts for the same text. The Weighted Perplexity Benchmark (2025) showed tokenization differences affect measurements by up to 21.6% across 19 models on WikiText-2. Comparing Llama 2 (32K vocabulary) to Llama 3 (128K vocabulary) on perplexity is meaningless. Llama 3's per-token perplexity is higher simply because each token covers more underlying bytes.</p>
<p><strong>Bits-per-byte (BPB)</strong> solves this by normalizing total information content by UTF-8 bytes rather than tokens:</p>
<p>$$\text{BPB} = \frac{\text{total NLL in nats}}{\ln(2) \times \text{total bytes}}$$</p><p>Since byte count stays fixed regardless of tokenization, BPB enables fair cross-model comparison. Shannon estimated English entropy at about 1.0 to 1.3 bits per character. GPT-2 achieved 0.93 BPB on enwik8. Two models with identical predictive quality but different tokenizers can show perplexities of 20.09 vs 7.39, yet produce identical BPB of 1.08.</p>
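<p>The two normalizations are easy to compare side by side. This from-scratch sketch (standard library only) takes per-token negative log-likelihoods in nats, as an evaluation harness would extract them; the NLL values and segmentations are invented purely to show why per-token perplexity moves with the tokenizer while BPB does not:</p>

```python
import math

def ppl_and_bpb(token_nlls, text):
    """Perplexity = exp(mean NLL per token); BPB normalizes the same
    total NLL by UTF-8 bytes, so it ignores the tokenizer entirely."""
    total_nll = sum(token_nlls)                      # total information, in nats
    ppl = math.exp(total_nll / len(token_nlls))      # per-token normalization
    bpb = total_nll / (math.log(2) * len(text.encode("utf-8")))  # nats -> bits, per byte
    return ppl, bpb

text = "hello world"                    # 11 UTF-8 bytes either way
coarse = [2.0, 2.0]                     # hypothetical 2-token segmentation
fine = [1.0, 1.0, 1.0, 1.0]             # hypothetical 4-token segmentation
ppl_c, bpb_c = ppl_and_bpb(coarse, text)
ppl_f, bpb_f = ppl_and_bpb(fine, text)
# Same total NLL (4.0 nats): PPL differs (e^2 ≈ 7.39 vs e^1 ≈ 2.72),
# but BPB is identical (≈ 0.52) because bytes don't change.
```
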
<p>Recent work has exposed deeper problems. Fang et al. (ICLR 2025) showed that standard perplexity averages across all tokens equally, masking poor performance on "key tokens" that are essential for long-context understanding. Their LongPPL metric focuses on key tokens via a long-short context contrastive method and achieves −0.96 Pearson correlation with downstream benchmarks, versus near-zero for standard PPL. Kuribayashi et al. separately demonstrated that lower perplexity doesn't always correlate with more human-like text processing.</p>
<p><strong>When to use perplexity:</strong> comparing checkpoints within the same model family. <strong>When to use BPB:</strong> cross-model comparison. <strong>When to avoid both:</strong> measuring output quality, fluency, or task performance. They measure model fit to data, not generation quality.</p>
<h2 id="heading-2-n-gram-overlap-metrics-still-everywhere-often-wrong">2. N-gram overlap metrics: still everywhere, often wrong</h2>
<p>Despite well-documented limitations, BLEU and ROUGE remain the most-cited evaluation metrics in NLP. A 2025 analysis of 14,171 papers across four major NLP conferences found that 63.6% of papers using BLEU provide no implementation details. That's a reproducibility crisis hiding in plain sight.</p>
<h3 id="heading-bleu-precision-over-substance">BLEU: precision over substance</h3>
<p>BLEU computes a weighted geometric mean of modified n-gram precisions, multiplied by a brevity penalty:</p>
<p>$$\text{BLEU} = \text{BP} \times \exp\left(\sum w_n \times \log p_n\right)$$</p><p>where BP = exp(1 − r/c) if c ≤ r, else 1. Modified precision clips n-gram counts against maximum reference counts to prevent gaming through repetition. Standard BLEU-4 uses uniform weights (w₁ = w₂ = w₃ = w₄ = 0.25).</p>
<p>The original designers built BLEU for corpus-level machine translation. Applying it to single sentences causes the geometric mean to collapse to zero when any n-gram precision hits zero, which happens frequently for short sentences. The <code>sacrebleu</code> library exists specifically to fix reproducibility. It produces a version signature string (e.g., <code>BLEU|nrefs:1|case:mixed|tok:13a|smooth:exp|version:2.0.0</code>) that ensures exact reproducibility. Always use <code>sacrebleu</code> for paper-reportable scores. Never roll your own tokenization.</p>
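<p>The clipping and collapse mechanics are easier to see in code than in the formula. This is a deliberately minimal from-scratch sketch for intuition only; for any reportable score, use <code>sacrebleu</code> as noted above:</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions combined
    by a uniform-weight geometric mean, times the brevity penalty."""
    c, r = len(candidate), len(reference)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())  # clip vs reference
        if clipped == 0:
            return 0.0          # one zero precision collapses the geometric mean
        log_p += math.log(clipped / sum(cand.values())) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_p)

cand = "the cat sat on the mat".split()
perfect = bleu(cand, cand)                                  # identical strings -> 1.0
paraphrase = bleu("the feline rested upon the rug".split(), cand)
# bigram precision is already zero, so the whole score collapses to 0.0
```

<p>The paraphrase scoring exactly zero, despite being a reasonable rewording, is the failure mode discussed throughout this guide.</p>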
<h3 id="heading-rouge-recall-oriented-but-semantically-blind">ROUGE: recall-oriented but semantically blind</h3>
<p>ROUGE computes n-gram recall (plus precision and F1):</p>
<p>$$\text{ROUGE-N recall} = \frac{\sum \text{Count\_match}(\text{gram}_n)}{\sum \text{Count}(\text{gram}_n \text{ in reference})}$$</p><p>ROUGE-L uses the Longest Common Subsequence (LCS), which captures word ordering without requiring contiguity. ROUGE-Lsum splits on newlines for multi-sentence evaluation. State-of-the-art summarization models typically achieve ROUGE-1: 40-47%, ROUGE-2: 18-28% on news benchmarks.</p>
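<p>Both ROUGE-N recall and the LCS core of ROUGE-L fit in a few lines. This is an educational sketch of the formulas above (the <code>rouge_score</code> package adds stemming and F1 handling that this omits):</p>

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams over total reference n-grams."""
    grams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    return matched / max(sum(ref.values()), 1)

def lcs_len(a, b):
    """Longest common subsequence length (the core of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = "the cat sat on the mat".split()
cand = "the mat the cat sat on".split()
r1 = rouge_n_recall(cand, ref, 1)   # 1.0: every reference unigram is covered
l = lcs_len(cand, ref)              # 4: only "the cat sat on" survives in order
```

<p>The gap between the two numbers is the point: bag-of-words recall is perfect while the LCS penalizes the scrambled ordering.</p>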
<h3 id="heading-meteor-the-forgotten-improvement">METEOR: the forgotten improvement</h3>
<p>METEOR creates alignments through four stages: exact match, stemming, synonym (WordNet), and paraphrase. It then computes a recall-weighted harmonic mean with a fragmentation penalty:</p>
<p>$$\text{METEOR} = F_\text{mean} \times (1 - \gamma \times (\text{chunks}/\text{matched})^\beta)$$</p><p>It achieves Pearson correlation of 0.964 at corpus level (vs. BLEU's 0.817). Yet it remains underused due to WordNet dependency and version sensitivity, where scores can differ ±10 points between v1.0 and v1.5.</p>
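<p>Given alignment statistics, the scoring step above is simple to reproduce. This sketch takes the match and chunk counts as inputs rather than computing the four alignment stages; the parameter defaults (α = 0.9, β = 3, γ = 0.5) follow the commonly cited METEOR settings and should be treated as assumptions here:</p>

```python
def meteor_score(matches, chunks, cand_len, ref_len,
                 alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR scoring from alignment statistics: recall-weighted
    harmonic mean of P and R, discounted by a fragmentation penalty."""
    if matches == 0:
        return 0.0
    p = matches / cand_len
    r = matches / ref_len
    f_mean = p * r / (alpha * p + (1 - alpha) * r)    # weights recall ~9:1 over precision
    penalty = gamma * (chunks / matches) ** beta      # more chunks = more fragmentation
    return f_mean * (1 - penalty)

# Perfect alignment in a single contiguous chunk over 6 tokens:
s = meteor_score(matches=6, chunks=1, cand_len=6, ref_len=6)  # just under 1.0
```
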
<p>A new contender worth watching: the GEM metric (ICLR 2025), a reference-free approach based on mutual information, now outperforms BLEU, ROUGE-L, BERTScore, and BARTScore in correlation with human annotations, while also resisting manipulation.</p>
<h2 id="heading-3-embedding-based-metrics-semantics-at-a-cost">3. Embedding-based metrics: semantics at a cost</h2>
<h3 id="heading-bertscore-greedy-matching-in-embedding-space">BERTScore: greedy matching in embedding space</h3>
<p>BERTScore extracts contextual embeddings from a pre-trained model, then uses greedy cosine-similarity matching between candidate and reference tokens:</p>
<p>$$R_\text{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \cos(x_i, \hat{x}_j)$$</p><p>$$P_\text{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \cos(x_i, \hat{x}_j)$$</p><p>$$F_\text{BERT} = 2 \cdot P \cdot R / (P + R)$$</p><p>The default model is <code>roberta-large</code> (layer 17), but <code>microsoft/deberta-xlarge-mnli</code> achieves the highest Pearson correlation with human judgments. Without <strong>baseline rescaling</strong>, scores cluster in a narrow range (0.92 to 1.0 for RoBERTa), making interpretation hard. Rescaling maps a raw average around 0.93 onto a more readable 0.58.</p>
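<p>The greedy-matching step is the whole trick, and it works on any token embeddings. In this sketch the 2-d vectors are toy stand-ins for the contextual embeddings a model like <code>roberta-large</code> would produce:</p>

```python
def greedy_bertscore(cand_emb, ref_emb):
    """BERTScore-style greedy matching: each token grabs its best
    cosine match on the other side; recall averages over reference
    tokens, precision over candidate tokens."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))
    recall = sum(max(cos(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cos(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-d "embeddings": each candidate token nearly matches one reference token.
ref = [[1.0, 0.0], [0.0, 1.0]]
cand = [[0.9, 0.1], [0.1, 0.9]]
p, r, f1 = greedy_bertscore(cand, ref)   # all three land just below 1.0
```

<p>Note the matching is greedy and 1-to-many in principle: nothing stops two reference tokens from claiming the same candidate token, which is exactly what MoverScore's optimal transport formulation fixes.</p>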
<p>Three limits matter here. First, a 512-token maximum means longer texts get silently truncated. Second, Sun et al. (EMNLP 2022) demonstrated social bias across 6 sensitive attributes ("BERTScore is Unfair"). Third, changing the underlying model can flip rankings between systems.</p>
<h3 id="heading-moverscore-optimal-transport-over-embeddings">MoverScore: optimal transport over embeddings</h3>
<p>MoverScore formulates evaluation as an Earth Mover's Distance problem. Instead of BERTScore's greedy 1-to-1 matching, it uses globally optimal soft alignment:</p>
<p>$$\text{MoverScore}(x, \hat{x}) = 1 - \text{EMD}(x, \hat{x})$$</p><p>This allows many-to-one alignments, which matter when one concept gets expressed with multiple words. On WMT17, MoverScore achieved Pearson correlation of 0.743 vs BERTScore's 0.719. But the improvement is marginal, the <code>moverscore</code> PyPI package is inactive, and the O(n³) optimal transport computation runs substantially slower.</p>
<p><strong>When to use BERTScore:</strong> paraphrase detection and semantic similarity evaluation. <strong>When to avoid it:</strong> texts exceeding 512 tokens, fairness-sensitive applications, or when factual correctness (not semantic similarity) is the target.</p>
<h2 id="heading-4-llm-as-judge-the-new-standard-with-known-failure-modes">4. LLM-as-judge: the new standard, with known failure modes</h2>
<h3 id="heading-g-eval-structured-llm-scoring-with-probability-weighting">G-Eval: structured LLM scoring with probability weighting</h3>
<p>G-Eval (Liu et al., EMNLP 2023) achieves Spearman ρ = 0.514 on SummEval, the highest automated correlation with human judgment at its time. The algorithm works in three steps.</p>
<p>First, define evaluation criteria and generate Chain-of-Thought evaluation steps via the LLM. Second, present the text with these steps and ask for a 1-5 score. Third, and this is the key innovation, extract token logprobs for score tokens {1, 2, 3, 4, 5} and compute a probability-weighted score:</p>
<p>$$\text{score} = \frac{\sum(i \times P(i))}{\sum P(i)}, \quad i \in \{1,2,3,4,5\}$$</p><p>This produces continuous, fine-grained scores that avoid the tie problem plaguing direct integer scoring.</p>
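<p>The weighting step itself is a one-liner once you have logprobs. The dictionary below is a hypothetical judge response, shaped the way a logprobs-exposing API (OpenAI-compatible or vLLM) might return it:</p>

```python
import math

def probability_weighted_score(score_logprobs):
    """G-Eval style scoring: convert logprobs of the score tokens
    "1".."5" into a probability-weighted, continuous score."""
    probs = {int(tok): math.exp(lp) for tok, lp in score_logprobs.items()}
    return sum(i * p for i, p in probs.items()) / sum(probs.values())

# Hypothetical logprobs a judge model might assign to each score token:
logprobs = {"1": -5.0, "2": -3.0, "3": -0.7, "4": -0.9, "5": -2.3}
score = probability_weighted_score(logprobs)   # ≈ 3.5, finer than any integer vote
```

<p>Two outputs that would both round to "3" or "4" under direct integer scoring get distinguishable continuous scores here, which is why the logprob weighting avoids the tie problem.</p>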
<p>The <code>deepeval</code> library provides a production-ready G-Eval wrapper. For open-source implementation, serve models via vLLM (which supports logprobs natively) and use the same OpenAI-compatible client interface.</p>
<h3 id="heading-alpacaeval-20-and-the-length-control-breakthrough">AlpacaEval 2.0 and the length-control breakthrough</h3>
<p>AlpacaEval 2.0 (Dubois et al., COLM 2024) introduced Length-Controlled (LC) win rate, fitting a GLM to predict win probability conditioned on zero length difference. This increased Spearman correlation with Chatbot Arena from 0.94 to 0.98 and reduced gameability from 21% to 6%.</p>
<p>The numbers tell the story clearly. Without LC, GPT-4-1106's win rates fluctuate from 35.3% to 64.3% based purely on verbosity prompts. With LC, the range narrows to 41.9% to 51.6%.</p>
<h3 id="heading-mt-bench-and-arena-hard">MT-Bench and Arena-Hard</h3>
<p>MT-Bench evaluates 80 multi-turn questions across 8 categories (writing, roleplay, extraction, reasoning, math, coding, STEM, humanities) using GPT-4 as a 1-10 grader. Arena-Hard-Auto (2024) extends this with 500 challenging prompts, achieving 89.1% agreement with Chatbot Arena and 87.4% separability. That's far better than MT-Bench at distinguishing frontier models.</p>
<h3 id="heading-2024-2025-developments-worth-tracking">2024-2025 developments worth tracking</h3>
<p><strong>JudgeBench</strong> (ICLR 2025) is a sobering benchmark for evaluating judges themselves. The best model achieves only 64% accuracy (Claude-3.5-Sonnet), and fine-tuned judges often perform below random baseline.</p>
<p>The <strong>CALM framework</strong> (ICLR 2025) identified 12 distinct bias types in LLM judges: position, verbosity, fallacy oversight, sentiment, authority, beauty, self-enhancement, refinement, knowledge, format, cultural, and anchoring biases. That's a long list, and it explains why single-run LLM evaluations are unreliable.</p>
<p><strong>WildBench</strong> achieves Pearson 0.98 correlation with Chatbot Arena using real-world tasks with task-specific checklists and length penalties.</p>
<p>And the multi-agent trend is accelerating. Self-MoA (2025) samples a single top LLM multiple times and achieves 65.7% LC win rate on AlpacaEval 2.0, outperforming heterogeneous multi-model ensembles at 59.1%.</p>
<h2 id="heading-5-combining-metrics-practical-recommendations">5. Combining metrics: practical recommendations</h2>
<p>No single metric captures all quality dimensions. The LMSYS team found that triangulating relative model performance with MT-Bench and AlpacaEval provides the best benchmark. And Tang et al. (NAACL 2024) showed that simply diversifying references via LLM-generated paraphrases significantly improves the correlation of even classical metrics with human judgments.</p>
<p>Here's what works by task:</p>
<p><strong>Machine translation:</strong> sacrebleu + COMET (now dominant in WMT shared tasks) + chrF. Optionally add GEMBA-MQM for LLM-based quality estimation.</p>
<p><strong>Summarization:</strong> ROUGE-L + BERTScore + a factual consistency metric + G-Eval for coherence and fluency.</p>
<p><strong>Open-ended generation:</strong> LLM-as-judge with structured rubrics (G-Eval style) + MAUVE for distribution-level comparison + human spot-checks.</p>
<p><strong>Code generation:</strong> pass@k for functional correctness + CodeBLEU. SWE-Judge for more realistic scenarios.</p>
<p><strong>Instruction following:</strong> IFEval for verifiable constraints + MT-Bench for multi-turn quality.</p>
<p>One more thing. Anthropic's paper "Adding Error Bars to Evals" (Miller, Nov 2024) provides essential statistical guidance. Clustered standard errors can be 3× larger than naive standard errors when questions are grouped. Paired difference tests eliminate question-difficulty variance when comparing models. And power analysis determines required evaluation set sizes. Always report confidence intervals. A 2-point improvement is meaningless without knowing the standard error.</p>
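<p>A paired comparison with a bootstrap CI is easy to add to any harness. This is a generic sketch of the idea, not the paper's exact procedure, and the per-question scores below are invented for illustration:</p>

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap over per-question score differences. Pairing
    removes question-difficulty variance from the comparison; the
    bootstrap percentiles give an approximate 95% CI on the mean gap."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    boot_means = []
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in diffs]   # resample pairs, not models
        boot_means.append(sum(resample) / len(resample))
    boot_means.sort()
    mean = sum(diffs) / len(diffs)
    return mean, boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]

# Hypothetical per-question scores for two models on the same eval set:
a = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.75, 0.9]
b = [0.85, 0.8, 0.65, 0.9, 0.55, 0.8, 0.7, 0.88]
mean, ci_lo, ci_hi = paired_bootstrap(a, b)
# Report mean with (ci_lo, ci_hi); if the CI straddles 0, don't claim a win.
```
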
<h2 id="heading-6-what-the-comparison-reveals">6. What the comparison reveals</h2>
<p>The Colab experiment (see companion notebook) exposes predictable but instructive patterns.</p>
<p>The <strong>paraphrase example</strong> is the acid test. BLEU-4 drops near zero because there's no 4-gram overlap. BERTScore F1 stays high, correctly identifying semantic equivalence. This is exactly the kind of divergence that tells you something: the candidate is semantically correct but lexically different.</p>
<p>The <strong>verbose padding</strong> example shows ROUGE recall inflating (the reference content is all there) while ROUGE precision drops. BERTScore gives a moderate score. An LLM judge would likely penalize the filler text.</p>
<p>The <strong>hallucination</strong> case reveals the deepest limitation of surface metrics. ROUGE-1 can still score above zero on completely wrong content if individual words happen to overlap.</p>
<p>Three trends define where evaluation is heading. First, dynamic benchmarks like LiveBench and WildBench are replacing static test sets to combat contamination. The problem is so severe that Codeforces performance plummets after training cutoff dates. Second, the statistical rigor revolution means reporting scores without confidence intervals is increasingly unacceptable. Third, fine-tuned evaluation models continue to disappoint relative to general-purpose frontier LLMs as judges: on JudgeBench, the best fine-tuned judge hits only 57% accuracy while the best general model reaches 64%. This suggests evaluation capability scales with general capability, not with specialized training.</p>
<h2 id="heading-takeaway">Takeaway</h2>
<p>Use BPB (not perplexity) for intrinsic model comparison. Use sacrebleu + COMET for translation. Use ROUGE-L + BERTScore for summarization baselines. Use G-Eval or MT-Bench-style LLM judges as the primary quality signal for open-ended generation.</p>
<p>Always combine at least three metrics that measure different dimensions. Always report confidence intervals. And never trust a single number to capture text quality.</p>
<p>Metric disagreement is itself informative. When BLEU says a paraphrase is terrible but BERTScore says it's good, that gap tells you the candidate is semantically correct but lexically different. Building pipelines that surface these disagreements, rather than collapsing everything to a single score, produces evaluation systems that approximate the multi-dimensional judgments humans actually make.</p>
<p>The field is converging on LLM-as-judge as the primary evaluation approach. But the 12 identified bias types and 64% accuracy ceiling on challenging inputs mean we're far from a solved problem. Use frontier LLMs as judges, mitigate their known biases through position swapping, length control, and multi-run averaging, and maintain human spot-checking for high-stakes decisions.</p>
]]></content:encoded></item><item><title><![CDATA[Bayesian Optimization]]></title><description><![CDATA[Most of this is from my class notes for the session - DSCI 6653 - Bayesian Data Analysis at the University of New Haven.
Bayesian optimization is a strategy for global optimization of black-box functions. In simpler terms, it is a smart way to find t...]]></description><link>https://thedatasense.com/bayesian-optimization</link><guid isPermaLink="true">https://thedatasense.com/bayesian-optimization</guid><category><![CDATA[Bayesian optimization]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 13 Nov 2024 05:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Most of this is from my class notes for the session - DSCI 6653 - Bayesian Data Analysis at the University of New Haven.</p>
<p>Bayesian optimization is a strategy for global optimization of black-box functions. In simpler terms, it is a smart way to find the best settings for a complex system where checking each setting is costly or time-consuming.</p>
<p>Instead of guessing randomly or checking a grid of points, it builds a probabilistic model of the function (often called a "surrogate") to decide where to sample next.</p>
<p>This process relies on balancing two competing goals: <strong>Exploration</strong> (looking in places where we don't know much yet) and <strong>Exploitation</strong> (refining our knowledge in areas that already look promising).</p>
<h3 id="heading-step-1-the-intuition">Step 1: The Intuition</h3>
<p>Imagine you are a gold prospector on a vast, rugged piece of land. Your goal is to find the highest concentration of gold (the global maximum), but there is a catch:</p>
<ol>
<li><p><strong>Drilling is expensive:</strong> It costs a lot of money and time to set up a rig and drill a test hole. You can't just drill everywhere.</p>
</li>
<li><p><strong>Blind Search:</strong> You can't see the gold from the surface. You only know how much gold is there <em>after</em> you drill.</p>
</li>
</ol>
<p>This is exactly the problem Bayesian Optimization solves. It helps you decide <strong>where to drill next</strong> to get the best results with the fewest drills.</p>
<p>To make this decision, you use two main tools in your "mental map":</p>
<ul>
<li><p><strong>The Surrogate Model (The Map):</strong> After every drill, you update your sketch of the terrain. If you found gold in one spot, you guess there might be more nearby. If you found nothing, you assume that area is barren. This sketch gives you a <em>probability</em> of finding gold across the map.</p>
</li>
<li><p><strong>The Acquisition Function (The Strategy):</strong> This is the rule you use to pick the next spot. You have to balance two instincts:</p>
<ul>
<li><p><strong>Exploitation:</strong> Drilling near where you previously found gold. It's a safer bet, but you might get stuck finding only small nuggets (a local maximum).</p>
</li>
<li><p><strong>Exploration:</strong> Drilling in a completely empty part of the map. It's risky (you might find nothing), but it's the only way to find a massive vein of gold hidden in the unknown (the global maximum).</p>
</li>
</ul>
</li>
</ul>
<p>If you just stick to what you know (Pure Exploitation), you might be standing right next to a massive vein of gold (the global maximum) and never find it because you're too busy digging up small nuggets elsewhere (local maximum).</p>
<p>So, the "Golden Rule" of Bayesian Optimization is that we need a strategy to balance these two:</p>
<ul>
<li><p><strong>Exploration:</strong> Checking the unknown.</p>
</li>
<li><p><strong>Exploitation:</strong> Refining the known.</p>
</li>
</ul>
<h3 id="heading-step-2-the-mechanics">Step 2: The Mechanics</h3>
<p>Now that we have the intuition, let's look at the actual machinery that makes this work. In the math world, we don't have a physical map or a gut feeling. We have two specific components:</p>
<ol>
<li><p><strong>The Surrogate Model (Gaussian Process):</strong> This acts as our probability map. It estimates what the function looks like based on the points we've already checked. It gives us a mean (expected value) and uncertainty (variance) for every point.</p>
</li>
<li><p><strong>The Acquisition Function:</strong> This is the formula that decides where to sample next. It takes the "map" from the Surrogate Model and calculates a score for every possible point, balancing exploration and exploitation.</p>
</li>
</ol>
<p>Let's focus on the <strong>Surrogate Model</strong> first.</p>
<p>Imagine we have drilled two holes. One found a little gold, the other found none. We have no idea what is happening <em>between</em> those two holes.</p>
<p>If we want to build a model that guesses what the terrain looks like in the gaps, how confident should we be about our guess in the middle of those two distant points compared to a spot right next to a hole we already drilled? Less confident, right? The further we are from a drilled hole, the "fuzzier" our map becomes. We just don't know what's out there.</p>
<p>This is why the <strong>Gaussian Process (GP)</strong> is the standard tool here. For every single point on the map, it doesn't just give us a single guess; it gives us a probability distribution (a bell curve).</p>
<p>This gives us two key pieces of data for every coordinate:</p>
<ul>
<li><p><strong>The Mean \( \mu \):</strong> Our best guess for how much gold is there.</p>
</li>
<li><p><strong>The Standard Deviation \( \sigma \):</strong> Our uncertainty.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768326753552/01b0fef4-b26f-4308-84b8-056901cb538d.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><strong>Note:</strong> Notice in the diagram how the shaded region (uncertainty) gets "pinched" tight near the black dots (data points) and balloons out in the empty spaces? That ballooning is the math telling us, <em>"I have no idea what's happening here!"</em></p>
</blockquote>
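<p>The "pinching" behavior is easy to reproduce. Here is a minimal GP posterior in plain NumPy (an RBF kernel and made-up drill data; production code would use a library like scikit-learn or GPyTorch instead of a raw matrix inverse):</p>

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: similarity decays with distance."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at the query points."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_query, x_train)
    K_ss = rbf_kernel(x_query, x_query)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Two "drill holes": a little gold at x=1, none at x=4 (made-up data).
x_obs = np.array([1.0, 4.0])
y_obs = np.array([0.8, 0.0])
mu, sigma = gp_posterior(x_obs, y_obs, np.array([1.0, 2.5, 10.0]))
# sigma is pinched near the drilled holes and balloons far away.
```

<p>Querying at a drilled hole returns the observed value with near-zero uncertainty; at x = 10, far from any data, the posterior falls back to the zero-mean prior with full uncertainty.</p>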
<h3 id="heading-the-acquisition-function">The Acquisition Function</h3>
<p>Now we need a rule to look at that GP and say, <em>"Drill here next."</em> This rule is the <strong>Acquisition Function</strong>.</p>
<p>A very common one is called <strong>Upper Confidence Bound (UCB)</strong>. The formula looks roughly like this:</p>
<p>$$\text{Score} = \text{Mean} + (\kappa \times \text{Uncertainty})$$</p><ul>
<li><p><strong>Mean:</strong> High predicted value (<strong>Exploitation</strong>)</p>
</li>
<li><p><strong>Uncertainty:</strong> High potential to learn something new (<strong>Exploration</strong>)</p>
</li>
<li><p><strong>\( \kappa \) (Kappa):</strong> A number we choose to tune our strategy.</p>
</li>
</ul>
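<p>In code, the rule is one line. With made-up surrogate output at three candidate drill sites, watch the winner flip as \( \kappa \) grows:</p>

```python
import numpy as np

def ucb(mean, std, kappa):
    """Upper Confidence Bound: reward high predictions AND high uncertainty."""
    return mean + kappa * std

# Made-up surrogate output at three candidate drill sites.
mean = np.array([0.9, 0.5, 0.1])   # site 0 looks best on average
std  = np.array([0.05, 0.2, 0.8])  # but site 2 is almost unexplored

print(np.argmax(ucb(mean, std, kappa=0.5)))  # prints 0: exploit the safe bet
print(np.argmax(ucb(mean, std, kappa=2.0)))  # prints 2: explore the unknown
```
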
<p>If we set that \( \kappa \) (kappa) value to be very, very high, are we acting more like a safe, conservative miner or a risky, adventurous explorer? An adventurous explorer: a high \( \kappa \) boosts the "<strong>Uncertainty</strong>" part of the equation, so the algorithm is willing to ignore the safe bets (high Mean) to go check out the mysterious, unknown areas (high Uncertainty).</p>
<p>So, the full <strong>Mechanics</strong> cycle looks like this:</p>
<ol>
<li><p><strong>Update Model:</strong> The Gaussian Process looks at the data we have so far.</p>
</li>
<li><p><strong>Pick a Point:</strong> The Acquisition Function (like UCB) calculates the score for every point and picks the winner.</p>
</li>
<li><p><strong>Evaluate:</strong> We actually "drill" at that spot (calculate the real result).</p>
</li>
<li><p><strong>Repeat:</strong> We add that new data to our model and loop again.</p>
</li>
</ol>
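<p>Putting the cycle together, here is a toy end-to-end run in NumPy (the one-dimensional "gold" function, the candidate grid, and \( \kappa = 2 \) are all made-up choices for illustration):</p>

```python
import numpy as np

def kern(a, b, ls=1.0):
    """RBF kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def posterior(x, y, xq, noise=1e-4):
    """Zero-mean GP posterior mean and std at the query points xq."""
    K_inv = np.linalg.inv(kern(x, x) + noise * np.eye(len(x)))
    K_s = kern(xq, x)
    mu = K_s @ K_inv @ y
    var = 1.0 - np.sum((K_s @ K_inv) * K_s, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def objective(x):
    """The hidden 'gold concentration' -- expensive to query in real life."""
    return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-((x - 6.0) ** 2) / 4)

candidates = np.linspace(0.0, 8.0, 201)
x_obs = np.array([0.5, 7.5])          # two initial "drill holes"
y_obs = objective(x_obs)

for _ in range(10):                   # ten more expensive evaluations
    mu, sigma = posterior(x_obs, y_obs, candidates)    # 1. update model
    x_next = candidates[np.argmax(mu + 2.0 * sigma)]   # 2. pick a point (UCB)
    x_obs = np.append(x_obs, x_next)                   # 3. evaluate for real
    y_obs = np.append(y_obs, objective(x_next))        # 4. repeat

best_x = x_obs[np.argmax(y_obs)]      # lands near the true peak at x = 2
```

<p>With only a dozen evaluations the loop homes in on the global maximum; a grid fine enough to do the same would need far more. This is roughly the loop that libraries such as scikit-optimize and BoTorch run, with better-conditioned GP updates and smarter acquisition optimizers.</p>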
]]></content:encoded></item></channel></rss>