<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Binesh's Data Sense Lab]]></title><description><![CDATA[My journey to make sense of the multimodal data I meet in life, through research notes, tech experiments, and book takeaways]]></description><link>https://thedatasense.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 08:05:27 GMT</lastBuildDate><atom:link href="https://thedatasense.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Circuit Tracing: Finding Medical Features in Gemma 3]]></title><description><![CDATA[Language models can answer medical questions with surprising accuracy. But do they actually encode medical knowledge in identifiable, interpretable ways? Or is it all just statistical soup?
Using Neuronpedia, we ran a simple experiment to find out. W...]]></description><link>https://thedatasense.com/circuit-tracing-finding-medical-features-in-gemma-3</link><guid isPermaLink="true">https://thedatasense.com/circuit-tracing-finding-medical-features-in-gemma-3</guid><category><![CDATA[#medgemma; #mechanistic interpretability]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 04 Feb 2026 22:07:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242854571/eda21f99-b646-4424-a866-888fb80e9491.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Language models can answer medical questions with surprising accuracy. But do they actually encode medical knowledge in identifiable, interpretable ways? Or is it all just statistical soup?</p>
<p>Using <a target="_blank" href="https://www.neuronpedia.org">Neuronpedia</a>, we ran a simple experiment to find out. We searched for features related to angina (cardiac chest pain) inside Gemma 3 1B IT, then tested whether those features light up when the model processes related medical prompts. The short answer: they do, and the results are pretty clean.</p>
<h2 id="heading-what-are-features-and-why-should-you-care">What Are Features, and Why Should You Care?</h2>
<p>Sparse autoencoders (SAEs) decompose a model's internal activations into interpretable directions, often called "features." Each feature corresponds to a concept the model has learned. Neuronpedia hosts pretrained SAEs for several open models, including Google's Gemma family, and lets you search, inspect, and test these features through a browser interface.</p>
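<p>To make the mechanics concrete, here is a minimal sketch of the encode/decode step, assuming the common ReLU SAE layout. All sizes and weights below are illustrative stand-ins, not the actual GemmaScope parameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 1152, 16384   # illustrative sizes for a 16K-feature residual-stream SAE

# Stand-in SAE parameters (random here; real ones are learned during training)
W_enc = rng.normal(0, 0.02, (n_features, d_model))
b_enc = np.full(n_features, -1.0)   # negative bias keeps most features switched off
W_dec = rng.normal(0, 0.02, (d_model, n_features))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Map a dense activation to sparse feature activations (ReLU zeroes most of them)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def sae_decode(f):
    """Reconstruct the dense activation from the sparse features."""
    return W_dec @ f + b_dec

x = rng.normal(0, 1, d_model)       # stand-in for a residual-stream activation
f = sae_encode(x)
print(f"{(f > 0).mean():.1%} of features active")
```

Even with random weights, the negative encoder bias produces the characteristic sparsity: only a small fraction of the 16K features are nonzero for any given input.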
<p>If we can find features that reliably correspond to specific medical concepts, that tells us something about how the model organizes its knowledge. It also opens the door to monitoring, steering, or auditing model behavior at a mechanistic level.</p>
<h2 id="heading-the-experiment">The Experiment</h2>
<h3 id="heading-step-1-search-for-angina">Step 1: Search for "Angina"</h3>
<p>We used Neuronpedia's "Search via Inference" tool with GEMMASCOPE-2-RES-16K (Residual Stream, 16K features) across all layers. The search surfaced several candidate features. One stood out: <strong>"cardiac and blood flow"</strong> (feature 2224 at layer 17).</p>
<p>Its top activations included phrases like "Individuals with Existing Heart Conditions," "coronary artery disease, heart failure," and "Reduced Blood Pressure." The positive logits pointed to tokens like "Heart," "cardiac," and "cardiovascular." So far, this looks like a genuine cardiac concept feature.</p>
<h3 id="heading-step-2-test-it-on-a-medical-prompt">Step 2: Test It on a Medical Prompt</h3>
<p>Here's where it gets interesting. We used Neuronpedia's TopK feature analysis to see which features activate most strongly at the final token when the model processes:</p>
<blockquote>
<p>"Chest pain is frequently linked to"</p>
</blockquote>
<p>This is the exact position where the model predicts the next token. If the cardiac feature actually encodes what we think it does, it should activate here.</p>
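<p>The ranking step behind the TopK view is simple. Below is a sketch with made-up activation values (only feature 2224's activation of 636.0 comes from the experiment); in the real run the vector comes from passing the prompt through Gemma 3 1B IT and the layer-17 SAE:</p>

```python
import numpy as np

# Hypothetical SAE feature activations at the final token position
feature_acts = np.zeros(16384)
feature_acts[2224] = 636.0    # "cardiac and blood flow" (the value reported in the post)
feature_acts[901] = 412.5     # made-up competing features for illustration
feature_acts[15003] = 88.1

def topk_features(acts, k=5):
    """Rank features by activation strength, strongest first."""
    idx = np.argsort(acts)[::-1][:k]
    return [(int(i), float(acts[i])) for i in idx if acts[i] > 0]

print(topk_features(feature_acts, k=3))
```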
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770240297440/5d534a4d-b660-4785-87c9-3f795ea49a86.png" alt /></p>
<p><strong>Result:</strong> The "cardiac and blood flow" feature ranked <strong>#1</strong> at the final token position, with an activation of 636.00. Not buried in the top 50. Not somewhere in the middle. Number one.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242584412/f3c37d81-13b8-4735-9308-4515cdcba0fa.png" alt /></p>
<h3 id="heading-step-3-replicate-with-a-different-medical-domain">Step 3: Replicate with a Different Medical Domain</h3>
<p>We repeated the experiment for respiratory features.</p>
<p>Searching for "pneumonia" surfaced a feature called <strong>"respiratory and lung conditions"</strong> (feature 3791 at layer 22). Its positive logits included "respiratory," "lungs," "airflow," "breathing," "airways," and "coughing." The top activations contained clinical text about chronic cough, wheezing, and respiratory problems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242705242/d3447ef7-924c-45e4-87a3-c1ab8718ff9f.png" alt /></p>
<p>We then tested this feature against the prompt:</p>
<blockquote>
<p>"Shortness of breath can be a symptom of"</p>
</blockquote>
<p>The TopK analysis at the final "of" token showed the respiratory feature at <strong>708.00</strong>, landing in the top 5. The top feature was "Medical conditions and disorders" at 1992.00, which also makes sense since shortness of breath can be a symptom of many things beyond just lung conditions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770242773573/87d27287-2c6d-4557-9cf0-c089b7d96504.png" alt /></p>
<h2 id="heading-what-this-tells-us">What This Tells Us</h2>
<p>Both experiments follow the same pattern: features discovered through symptom-related searches activate strongly when the model processes related medical prompts. The cardiac feature found via "angina" fires at position one when the model encounters "chest pain." The respiratory feature found via "pneumonia" fires in the top five when the model encounters "shortness of breath."</p>
<p>This isn't proof that the model "understands" medicine in any deep sense. But it does show that Gemma 3 1B IT organizes medical knowledge into identifiable, interpretable features that activate in contextually appropriate ways. The model isn't just pattern-matching surface tokens. It has learned something about the semantic relationships between symptoms and conditions.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>A few caveats are worth noting.</p>
<p>This experiment only tests two medical domains (cardiac and respiratory). A broader study would need to cover many more domains to make strong claims about generalizability. We also only tested one prompt per domain. More diverse prompts, including edge cases and adversarial examples, would strengthen the findings.</p>
<p>The activation values themselves are hard to interpret in absolute terms. Is 636.00 "high"? Relative to what? The ranking (first place) is more meaningful than the raw number.</p>
<p>Finally, this was done on Gemma 3 1B IT, a relatively small model. Larger models may organize their features differently.</p>
<h2 id="heading-try-it-yourself">Try It Yourself</h2>
<p>The whole experiment is reproducible through Neuronpedia's web interface. No code required.</p>
<ul>
<li><p><a target="_blank" href="https://www.neuronpedia.org/gemma-3-1b-it/17-gemmascope-2-res-16k/2224">Cardiac feature (layer 17, index 2224)</a></p>
</li>
<li><p><a target="_blank" href="https://www.neuronpedia.org/gemma-3-1b-it/22-gemmascope-2-res-16k/3791">Respiratory feature (layer 22, index 3791)</a></p>
</li>
</ul>
<p>If you're interested in mechanistic interpretability for medical AI, this is a good starting point. Search for a medical concept, find its features, then test whether they activate on related prompts. It takes about five minutes, and the results can be surprisingly informative.</p>
]]></content:encoded></item><item><title><![CDATA[What Does Medical VLM Actually See? Experiments with MedGemma and Sparse Autoencoders]]></title><description><![CDATA[When a medical Vision Language Model (VLM) looks at a chest X-ray and says "cardiomegaly present," what's actually happening inside the model? It's a black box. Billions of parameters. Dense activation vectors where every dimension encodes a tangled m...]]></description><link>https://thedatasense.com/what-does-medical-vlm-actually-see-experiments-with-medgemma-and-sparse-autoencoders</link><guid isPermaLink="true">https://thedatasense.com/what-does-medical-vlm-actually-see-experiments-with-medgemma-and-sparse-autoencoders</guid><category><![CDATA[gemmascope]]></category><category><![CDATA[Gemma AI]]></category><category><![CDATA[interpretability]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sat, 24 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769392373719/e37f19b8-e374-4775-8ec8-52c7e18bcc86.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When a medical Vision Language Model (VLM) looks at a chest X-ray and says "cardiomegaly present," what's actually happening inside the model? It's a black box. Billions of parameters. Dense activation vectors where every dimension encodes a tangled mix of concepts. This post documents my experiments using Sparse Autoencoders (SAEs) to interpret MedGemma, Google's vision-language model for medical imaging. Thanks to the DeepMind team for the open release of <a target="_blank" href="https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/">Gemma Scope</a>.</p>
<h2 id="heading-blackbox-neural-networks-are-opaque">Black Box: Neural Networks Are Opaque</h2>
<p>MedGemma has a hidden dimension of 2,560. When you feed it a chest X-ray and ask about cardiomegaly, the model produces an activation vector that looks something like this:</p>
<pre><code class="lang-markdown">[0.23, -0.15, 0.42, 0.08, -0.31, ...]
</code></pre>
<p>Every single one of those 2,560 numbers represents a mixture of many concepts. This is called superposition: the model packs more ideas than it has dimensions by encoding them as overlapping patterns. It is next to impossible to figure out what any single value means.</p>
<h2 id="heading-saes-provide-a-way-in">SAEs Provide a Way In</h2>
<p>SAEs learn to decompose those dense vectors into a much larger set of sparse features. Instead of 2,560 tangled dimensions, we get 65,000+ features where:</p>
<ul>
<li><p>Each feature tends to represent a single concept</p>
</li>
<li><p>Most features are zero for any given input</p>
</li>
<li><p>We can focus on the handful that actually matter</p>
</li>
</ul>
<p>My mental model: a dense vector is like hearing an entire orchestra playing at once. The SAE separates out each instrument so you can listen to the violin, the cello, and the flute individually.</p>
<pre><code class="lang-markdown">Dense activation → SAE Encoder → Sparse features (65k dims)
[0.23, -0.15, ...]  →  [0, 0, 0, 142, 0, 0, 89, 0, 0, ...]
<span class="hljs-code">                              ↑           ↑
                         Feature 3818  Feature 7241
                         "formal tone" "lung region"</span>
</code></pre>
<h2 id="heading-using-gemmascope-2-on-medgemma">Using GemmaScope 2 on MedGemma</h2>
<p>Training SAEs from scratch takes serious compute. Fortunately, Google released <a target="_blank" href="https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/">GemmaScope 2</a>, a suite of pre-trained SAEs for the Gemma model family. One caveat is that GemmaScope 2 was trained on general Gemma 3 activations. MedGemma was fine-tuned for medical tasks. Would the SAE even work?</p>
<p>Based on my analysis, it appears so. I measured reconstruction quality using <a target="_blank" href="https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained">Fraction of Variance Unexplained (FVU)</a>. Lower is better. Here's what I found across different layers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Layer</td><td>FVU</td><td>Variance Explained</td></tr>
</thead>
<tbody>
<tr>
<td>9</td><td>0.006</td><td>99.4%</td></tr>
<tr>
<td>17</td><td>0.020</td><td>98.0%</td></tr>
<tr>
<td>22</td><td>0.013</td><td>98.7%</td></tr>
<tr>
<td>29</td><td>0.053</td><td>94.7%</td></tr>
</tbody>
</table>
</div><p>Medical fine-tuning didn't dramatically shift the activation distributions. The SAE explains at least 94.7% of the variance at every layer tested. That's enough for interpretability work. A few caveats, though. Later layers show more drift. And the feature meanings might shift: a "formal language" feature in general Gemma 3 might fire on clinical terminology in MedGemma.</p>
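<p>For reference, FVU is straightforward to compute. This sketch uses random stand-in activations rather than real MedGemma ones:</p>

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained: 0 = perfect reconstruction,
    1 = no better than predicting the per-dimension mean."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return residual / total

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (100, 2560))      # a batch of stand-in dense activations
x_hat = x + rng.normal(0, 0.1, x.shape)  # stand-in for SAE reconstructions

print(f"FVU: {fvu(x, x_hat):.3f}, variance explained: {1 - fvu(x, x_hat):.1%}")
```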
<h2 id="heading-experiment-1-different-questions-different-circuits">Experiment 1: Different Questions, Different Circuits</h2>
<p>I loaded a sample chest X-ray and asked MedGemma four different clinical questions:</p>
<ul>
<li><p>Is there cardiomegaly?</p>
</li>
<li><p>Is there pneumonia?</p>
</li>
<li><p>Is there a pleural effusion?</p>
</li>
<li><p>Is this a normal chest X-ray?</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769391194298/870f23f1-656c-43b6-b993-d5fe99db0ab3.png" alt="Sample chest X-ray image showing the ribs, spine, heart, and lungs. Source: https://openi.nlm.nih.gov" class="image--center mx-auto" /></p>
<p>Same image. Different questions. What happened in the feature space?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769391520404/c7849d2c-68d6-4b2b-8cd9-e826eb6b1287.png" alt class="image--center mx-auto" /></p>
<p>Each question activated a distinct pattern of features. The "cardiomegaly" question lit up 77 features. The "pleural effusion" question lit up 90. The overlap was substantial (cosine similarity above 0.9), but each pathology had its own signature. I think this makes sense. The model is routing different clinical concepts through different internal circuits.</p>
<h2 id="heading-experiment-2-the-phrasing-effect">Experiment 2: The Phrasing Effect</h2>
<p>Here's where things got interesting. I asked the same clinical question two different ways:</p>
<p><strong>Formal:</strong> "Is there radiographic evidence of cardiomegaly?"</p>
<p><strong>Casual:</strong> "Does this show a big heart?"</p>
<p>Both questions mean the same thing. But the model's internal features looked completely different.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769391918589/68596a4d-6e2b-4d85-a131-dd1f75e2e3e4.png" alt class="image--center mx-auto" /></p>
<p>The cosine similarity between the feature vectors was 0.973. High, but not identical. And when I looked at the biggest differences:</p>
<p>Features more active for the formal phrasing:</p>
<ul>
<li><p>Feature 13749: 207 vs 0</p>
</li>
<li><p>Feature 4442: 203 vs 0</p>
</li>
<li><p>Feature 91: 180 vs 0</p>
</li>
</ul>
<p>Features more active for the casual phrasing:</p>
<ul>
<li><p>Feature 15587: 163 vs 0</p>
</li>
<li><p>Feature 5984: 152 vs 0</p>
</li>
<li><p>Feature 998: 151 vs 0</p>
</li>
</ul>
<p>Some features only fire for formal clinical language. Others only fire for casual phrasing. The model has learned distinct representations for how questions are asked, not just what they're asking.</p>
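<p>The comparison itself boils down to a cosine similarity plus a ranked difference. This sketch uses synthetic feature vectors; only the two feature indices and their activation values come from the experiment:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_features = 16384

# Random stand-ins for the features both phrasings share
shared = rng.choice(n_features, size=60, replace=False)
shared = shared[(shared != 13749) & (shared != 15587)]

formal = np.zeros(n_features)
casual = np.zeros(n_features)
formal[shared] = rng.uniform(50, 250, size=shared.size)
casual[shared] = formal[shared] * rng.uniform(0.8, 1.2, size=shared.size)
formal[13749] = 207.0   # fires only for the formal phrasing (value from the post)
casual[15587] = 163.0   # fires only for the casual phrasing (value from the post)

cos = formal @ casual / (np.linalg.norm(formal) * np.linalg.norm(casual))

diff = formal - casual
top_formal = np.argsort(diff)[::-1][:3]   # features leaning formal
top_casual = np.argsort(diff)[:3]         # features leaning casual

print(f"cosine similarity: {cos:.3f}")
print("formal-leaning:", top_formal, "casual-leaning:", top_casual)
```

High overall similarity with a handful of sharply style-specific features is exactly the pattern described above.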
<h2 id="heading-experiment-3-tracking-features-across-a-phrasing-spectrum">Experiment 3: Tracking Features Across a Phrasing Spectrum</h2>
<p>In this experiment, I created a gradient of phrasings from very formal to very casual:</p>
<ol>
<li><p>"Is there radiographic evidence of cardiac enlargement?"</p>
</li>
<li><p>"Does the imaging demonstrate cardiomegaly?"</p>
</li>
<li><p>"Is there cardiomegaly?"</p>
</li>
<li><p>"Is the heart enlarged?"</p>
</li>
<li><p>"Does this show a big heart?"</p>
</li>
<li><p>"Is the heart too big?"</p>
</li>
<li><p>"Big heart?"</p>
</li>
</ol>
<p>Then I tracked two key features across this spectrum: Feature 13749 (most active for formal phrasing) and Feature 15587 (most active for casual phrasing).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769392210518/d09b3880-2288-4b77-a016-75ca56b1681c.png" alt class="image--center mx-auto" /></p>
<p>The pattern was clear. As the phrasing got more casual, Feature 13749 dropped and Feature 15587 rose. The crossover happened right around "Is the heart enlarged?" in the middle of the spectrum. This has real implications for medical VLM safety. A radiologist using formal clinical terminology might get a different answer than a patient asking the same question casually. The underlying clinical content is identical. But the model routes it through different internal circuits.</p>
<p>We already know medical VLMs can be sensitive to question phrasing. Now we can see why. The model isn't just processing the semantic content of your question. It's also encoding the style, the register, the formality. And those encodings influence what happens downstream.</p>
<p>For clinical deployment, this suggests we need:</p>
<ul>
<li><p>Standardized prompting protocols</p>
</li>
<li><p>Robustness testing across phrasing variations</p>
</li>
<li><p>Feature-level monitoring for production systems</p>
</li>
</ul>
<p>SAEs give us a lens into what's happening inside these models. We can see which features drive a diagnosis, identify when features misfire, and understand why subtle changes in input lead to different outputs. The code is available as a Colab notebook. It runs on a free T4 GPU in about 5 minutes. If you're working with medical VLMs, I'd encourage you to try it. See what features light up for your specific use cases. You might be surprised by what you find.</p>
<p><em>For more technical details, check out Anthropic's original work on monosemanticity and the GemmaScope paper from Google. The SAEs are available on HuggingFace at</em> <code>google/gemma-scope-2-4b-it</code>.</p>
<p>Also check out the interactive demo on Neuronpedia for the <a target="_blank" href="https://www.neuronpedia.org/jackl-circuits-runs-1-4-sofa-v3_0/graph?slug=medical-diagnosis-heart">Haiku circuit tracer</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Fluent But Wrong: LLM and Healthcare]]></title><description><![CDATA[Lately, much of my research time goes into studying why medical AI systems fail. Not the obvious failures where the model outputs gibberish, but the subtle ones where the output looks clinically appropriate, follows proper documentation structure, use...]]></description><link>https://thedatasense.com/fluent-but-wrong-llm-and-healthcare</link><guid isPermaLink="true">https://thedatasense.com/fluent-but-wrong-llm-and-healthcare</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[GPT-2]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Tue, 20 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768977780582/949cf682-0c65-4c34-acdf-1e5ff009575e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately, much of my research time goes into studying why medical AI systems fail. Not the obvious failures where the model outputs gibberish, but the subtle ones where the output looks clinically appropriate, follows proper documentation structure, uses correct terminology, and is still wrong.</p>
<p>To illustrate this problem, I trained a small GPT-2 model from scratch on clinical notes. The goal was not to build something useful. The goal was to demonstrate how easily language models learn to mimic clinical language without learning anything about clinical reasoning.</p>
<p>The results should concern anyone deploying LLMs in healthcare settings.</p>
<h2 id="heading-the-setup">The Setup</h2>
<p>I built a 7.7 million parameter transformer and trained it on two publicly available datasets: MEDIQA-Chat (67 doctor-patient dialogues with paired clinical notes) and MTSamples (approximately 5,000 medical transcriptions across 40 specialties). Total training data was around 200,000 tokens.</p>
<p>For context, GPT-2 was trained on 10 billion tokens. My model saw 0.002% of that amount. It trained for about 15 minutes on a single GPU.</p>
<p>Open this experiment in Google Colab and run it for free <a target="_blank" href="https://colab.research.google.com/drive/1_fqGVmGnSqbD5VxGLiEYJ8jeNnYcg9nG?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open this experiment in Google Colab" /></a></p>
<h2 id="heading-how-the-model-works-a-brief-primer">How the Model Works: A Brief Primer</h2>
<p>Before showing you what the model produced, it helps to understand what it actually does. This is a simplified explanation for readers without a machine learning background.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768974711275/ef410d5a-3c75-478b-81ad-0514a4062211.png" alt class="image--center mx-auto" /></p>
<p><em>Architecture of our clinical GPT-2 model. The same fundamental design powers ChatGPT, just with more parameters.</em></p>
<h3 id="heading-the-core-idea-predicting-the-next-word">The Core Idea: Predicting the Next Word</h3>
<p>Language models do one thing: predict what word comes next given the words that came before.</p>
<p>If I give the model "The patient presents with chest," it calculates a probability distribution over all possible next words. "Pain" might get 40% probability. "Discomfort" might get 15%. "X-ray" might get 2%. The model samples from this distribution to pick the next word, then repeats the process.</p>
<p>That is all it does. There is no reasoning module. No medical knowledge base. No fact-checking system. Just: given these words, what word is statistically likely to come next?</p>
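<p>Sketched in code, generation really is just repeated sampling. The distribution below is made up to echo the numbers above:</p>

```python
import random

# Toy next-token distribution for the context "The patient presents with chest"
# (probabilities are illustrative, mirroring the percentages in the text)
next_token_probs = {
    "pain": 0.40,
    "discomfort": 0.15,
    "tightness": 0.10,
    "x-ray": 0.02,
    "<other>": 0.33,
}

def sample_next(probs):
    """Sample one token from the distribution. This step, repeated, is all generation is."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

context = "The patient presents with chest"
print(context, sample_next(next_token_probs))
```

Nothing in this loop checks whether the continuation is true, only whether it is likely.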
<h3 id="heading-the-architecture">The Architecture</h3>
<p>The model has three main components:</p>
<p><strong>1. Embedding Layer</strong></p>
<p>Words enter the model as numbers. Each word in the vocabulary (50,257 possible tokens) gets converted to a 128-dimensional vector. Think of this as translating words into a numerical language the model can process.</p>
<p>Position matters too. "Patient has pain" means something different than "Pain has patient." So we add position embeddings that encode where each word sits in the sequence.</p>
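<p>As a sketch, using this model's dimensions but random stand-in weights and hypothetical token ids:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 50257, 256, 128   # this post's model config

# In the real model both tables are learned; random values stand in here
tok_emb = rng.normal(0, 0.02, (vocab_size, d_model))
pos_emb = rng.normal(0, 0.02, (max_len, d_model))

# Hypothetical ids for "The patient presents with chest"
token_ids = np.array([464, 5827, 10969, 351, 7721])

# Each token's vector is its word identity plus its position
x = tok_emb[token_ids] + pos_emb[np.arange(len(token_ids))]
print(x.shape)  # one 128-dimensional vector per token
```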
<p><strong>2. Transformer Blocks (The Core)</strong></p>
<p>This is where the computation happens. Our model stacks 6 identical transformer blocks, each containing two operations:</p>
<p><em>Self-Attention:</em> Each word looks at every other word in the sequence and decides how much to pay attention to it. When processing "chest" in "The patient presents with chest pain," the attention mechanism might learn to focus heavily on "patient" and "presents" to understand the clinical context. This is done through 4 parallel "attention heads," each learning different patterns.</p>
<p><em>Feed-Forward Network:</em> After attention, each word passes through a small neural network that transforms its representation. This is where the model builds up abstract features.</p>
<p>The key insight: attention lets the model connect related words regardless of distance. "The patient, a 55-year-old male with a history of hypertension who was recently started on lisinopril, presents with" can connect "patient" to "presents" despite 15 words between them.</p>
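<p>A minimal single-head version of this attention step, with random stand-in weights, looks like this (the real model runs 4 such heads in parallel per block):</p>

```python
import numpy as np

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # causal mask: each token may only attend to itself and earlier tokens
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    # softmax over each row gives the attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 128, 32   # 128 / 4 heads = 32 dims per head
x = rng.normal(0, 1, (seq_len, d_model))
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_head)) for _ in range(3))

out, w = attention(x, W_q, W_k, W_v)
print(out.shape)  # the first row of w shows token 1 attending only to itself
```

The weight matrix <code>w</code> is where "how much does word i look at word j" lives, regardless of how far apart the two words sit.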
<p><strong>3. Output Layer</strong></p>
<p>After passing through all 6 transformer blocks, the model converts the final representation back into a probability distribution over words. The word with the highest probability (or a sample from the distribution) becomes the output.</p>
<h3 id="heading-the-architecture-in-numbers">The Architecture in Numbers</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>This Model</td><td>GPT-3</td><td>GPT-5 (estimated)</td></tr>
</thead>
<tbody>
<tr>
<td>Parameters</td><td>7.7 million</td><td>175 billion</td><td>Several trillion</td></tr>
<tr>
<td>Transformer Blocks</td><td>6</td><td>96</td><td>Unknown</td></tr>
<tr>
<td>Attention Heads</td><td>4</td><td>96</td><td>Unknown</td></tr>
<tr>
<td>Embedding Dimension</td><td>128</td><td>12,288</td><td>Unknown</td></tr>
<tr>
<td>Context Window</td><td>256 tokens</td><td>2,048 tokens</td><td>400,000 tokens</td></tr>
</tbody>
</table>
</div><p>Our model is roughly 22,000 times smaller than GPT-3 and 220,000 times smaller than GPT-4. But the fundamental architecture is identical. More parameters mean more capacity to learn patterns, but the mechanism remains the same: predict the next token based on statistical patterns in training data.</p>
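<p>As a sanity check, the 7.7 million figure can be roughly reproduced from the table, assuming a standard GPT-2 layout with tied input/output embeddings and a 4x feed-forward expansion (both are my assumptions, not stated in the post):</p>

```python
# Rough parameter count for the table's config
vocab, ctx, d, n_layers, d_ff = 50257, 256, 128, 6, 4 * 128

embeddings = vocab * d + ctx * d                 # token + position embeddings
per_block = (
    4 * (d * d + d)                              # Q, K, V, output projections (+ biases)
    + (d * d_ff + d_ff) + (d_ff * d + d)         # feed-forward up/down (+ biases)
    + 2 * 2 * d                                  # two layer norms (scale + shift)
)
# final layer norm; output layer assumed tied to the token embeddings
total = embeddings + n_layers * per_block + 2 * d

print(f"{total:,} parameters")                   # about 7.7 million
```

The token embedding table alone accounts for most of the parameters; the six transformer blocks contribute only about 1.2 million.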
<h3 id="heading-what-the-architecture-cannot-do">What the Architecture Cannot Do</h3>
<p>Notice what is missing from this design:</p>
<p><strong>No verification mechanism.</strong> The model has no way to check if its output is true. It predicts likely tokens, not accurate tokens.</p>
<p><strong>No world model.</strong> The model does not understand that patients are physical beings, that medications have effects, or that vital signs reflect physiological states. It understands that certain words tend to appear near other words.</p>
<p><strong>No reasoning module.</strong> There is no component that evaluates whether "a 25-year-old with a 40-year history" is logically possible. The model processes tokens, not concepts.</p>
<p><strong>No uncertainty quantification.</strong> The model generates text with uniform confidence whether it is stating established medical fact or complete fabrication.</p>
<p>This architecture is remarkably good at learning statistical patterns in text. It is not designed to understand, verify, or reason about what it generates.</p>
<p>With that context, let me show you what the model produced.</p>
<h2 id="heading-when-the-model-sounds-medical">When the Model Sounds Medical</h2>
<p>First, outputs that look reasonable to a non-clinician. These are the dangerous ones.</p>
<h3 id="heading-prompt-chief-complaint">Prompt: Chief Complaint</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768975731252/0c4a4851-afba-4354-b34a-1bad2fb64444.png" alt class="image--center mx-auto" /></p>
<p>This looks professional. The format is correct. The terminology is appropriate. The workup makes sense.</p>
<p>But remember what the model actually does: it predicts which tokens are likely to follow other tokens. It learned that "chest pain" frequently appears near "substernal," "radiates to left arm," and "EKG and troponins." It has no idea why these concepts relate to each other.</p>
<p>How do I know? Because when I give the model prompts that should be obviously wrong, it responds with the same confidence.</p>
<h2 id="heading-when-the-model-reveals-it-understands-nothing">When the Model Reveals It Understands Nothing</h2>
<p>These next outputs require no medical background to evaluate. The failures are obvious to everyone.</p>
<h3 id="heading-prompt-impossible-patient-history">Prompt: Impossible Patient History</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768975848287/5f974f6d-2991-42fc-a2e3-ac5da485d323.png" alt class="image--center mx-auto" /></p>
<p><strong>What went wrong:</strong></p>
<p>A 25-year-old cannot have a 40-year history of anything. He would have developed hypertension at negative 15 years old.</p>
<p>The model did not notice. It saw "X-year history of" and predicted the tokens that typically follow that phrase: chronic conditions like hypertension and kidney disease. It has no concept of time, age, or basic arithmetic.</p>
<p>Also notice the text degrades: "which was found to be a nonreassuring it" is not a coherent phrase. "A cystoscopy in the right ureteral stent" makes no anatomical sense. The model generates medical-sounding word sequences without any understanding of what they mean.</p>
<p>Looking back at the architecture, this makes sense. The self-attention mechanism connects "25-year-old" to "40-year history" but has no way to evaluate whether that connection is logically valid. There is no arithmetic unit. There is no constraint that catches contradictions.</p>
<p>A human clinician would stop at the first sentence and say "this doesn't make sense." The model cannot do that. It only predicts the next likely token.</p>
<h3 id="heading-prompt-made-up-medication">Prompt: Made-Up Medication</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768975962226/46b9afb2-a395-48f0-a6ac-e0a0f50a2286.png" alt class="image--center mx-auto" /></p>
<p><strong>What went wrong:</strong></p>
<p>Flurbinox does not exist. I made it up. The model accepted it without hesitation and documented that the patient has been taking it for a month.</p>
<p>Then it listed "Hypertension" as medication number 2. Hypertension is a diagnosis, not a medication. You cannot take hypertension twice daily.</p>
<p>This is what token prediction looks like. The model saw a medication list format and generated things that statistically appear in medication lists. Sometimes those are medications. Sometimes those are diagnoses. The model does not know the difference because it has no concept of categories. It only knows token co-occurrence patterns.</p>
<hr />
<h3 id="heading-prompt-vital-signs-incompatible-with-life">Prompt: Vital Signs Incompatible with Life</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768976228816/871a5f41-2774-42be-98e6-f811b0c163e2.png" alt class="image--center mx-auto" /></p>
<p><strong>What went wrong:</strong></p>
<p>Let me explain these vital signs for non-medical readers:</p>
<ul>
<li><p><strong>Blood pressure 40/20:</strong> Normal is around 120/80. A BP of 40/20 means the heart is barely generating enough pressure to perfuse organs. This patient is in severe shock and likely dying.</p>
</li>
<li><p><strong>Heart rate 300:</strong> Normal is 60-100. A heart rate of 300 is not sustainable. The heart cannot fill with blood fast enough. This is a lethal arrhythmia.</p>
</li>
<li><p><strong>Temperature 85°F:</strong> Normal is 98.6°F. A body temperature of 85°F is severe hypothermia. The patient would be unconscious or dead.</p>
</li>
</ul>
<p>These vital signs are incompatible with life. A real clinician seeing this would be calling a code and starting resuscitation.</p>
<p>The model's assessment? "The patient appears to be able to walk on the left side, and to be in full range of motion."</p>
<p>The patient would not be walking anywhere. The patient would be in cardiac arrest.</p>
<p>Then the model loses coherence entirely. It switches to talking about a 5-year-old (the prompt said nothing about a child) with foot problems (the prompt was about vital signs) and prescribes Vicodin twice ("Vicodin and Vicodin for pain control").</p>
<p>This is what happens when you push a language model outside its training distribution. It has no mechanism to recognize that the input is physiologically impossible. The architecture has no world model, no physiological constraints, no sanity checks. It just keeps predicting tokens.</p>
<h2 id="heading-why-this-matters-beyond-my-tiny-model">Why This Matters Beyond My Tiny Model</h2>
<p>My model is tiny. 7.7 million parameters. Trained in 15 minutes. The failures are obvious.</p>
<p>GPT-4 has roughly 220,000 times more parameters. It trained on vastly more data with months of alignment work. It would not make errors this crude.</p>
<p>But the underlying architecture is the same. The fundamental mechanism is identical. GPT-4 predicts tokens based on statistical patterns in training data. It does not verify claims against reality. It does not understand physiology. It does not know when something is impossible.</p>
<p>The difference is that GPT-4's errors are subtle enough to fool people. A 25-year-old with a 40-year history is obviously wrong. A patient with a slightly inappropriate medication choice, or a missed drug interaction, or an assessment that sounds reasonable but does not fit the clinical picture? Those errors are much harder to catch.</p>
<p>More parameters mean more sophisticated pattern matching. They do not mean understanding.</p>
<h2 id="heading-what-would-actually-help">What Would Actually Help</h2>
<p><strong>More parameters and more compute</strong> improve benchmark performance. But scale does not solve the core problem. A model that hallucinates 5% of the time instead of 20% is still unsafe if we cannot identify which 5% is wrong.</p>
<p><strong>Better alignment</strong> reduces harmful outputs but is not the same as verification. A well-aligned model that confidently gives wrong medical advice is more dangerous than an obviously broken model, because users trust it.</p>
<p><strong>Retrieval-augmented generation</strong> can ground outputs in verified sources. But RAG introduces new failure modes: retrieval errors, outdated sources, incorrect synthesis. It reduces hallucination without eliminating it.</p>
<h2 id="heading-what-we-actually-need">What We Actually Need</h2>
<p><strong>Human verification</strong> of AI-generated clinical content before it affects patient care. Not as a temporary measure while technology improves. As a permanent architectural requirement.</p>
<p><strong>Output attribution</strong> that traces claims to verifiable sources. If a model recommends a medication, the evidence should be identifiable.</p>
<p><strong>Calibrated uncertainty.</strong> The model should know when it does not know. "I am not confident" must be a valid output.</p>
<p><strong>Adversarial testing</strong> before deployment. Stress test for edge cases, contradictions, and impossible inputs. Find the failure modes before patients do.</p>
<p>When my 7.7 million parameter model describes a patient with impossible vital signs as "able to walk" and "in full range of motion," the failure is obvious. When it accepts a 40-year disease history in a 25-year-old, anyone can see the problem. When it lists "Hypertension" as a medication, no medical training is required to know something went wrong. Larger models make the same category of errors. They are just better at making those errors sound reasonable.</p>
<p>The architecture diagram at the top of this post shows the entire system. Token embedding, attention, feed-forward networks, output projection. Nowhere in that diagram is there a component for "verify this is true" or "check if this makes sense" or "flag uncertainty." The question to ask about any LLM system generating clinical content is not "Does this sound right?" The question is "How would I know if this were wrong?"</p>
]]></content:encoded></item><item><title><![CDATA[Opening the Black Box: How to See What Your Vision Language Model is Actually Looking At]]></title><description><![CDATA[When a doctor examines a chest X-ray and says "I see signs of pneumonia in the lower right lung," you can ask them to point at exactly what they're seeing. They can circle the cloudy region, explain why it looks abnormal, and walk you through their r...]]></description><link>https://thedatasense.com/opening-the-black-box-how-to-see-what-your-vision-language-model-is-actually-looking-at</link><guid isPermaLink="true">https://thedatasense.com/opening-the-black-box-how-to-see-what-your-vision-language-model-is-actually-looking-at</guid><category><![CDATA[#multimodalai]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Fri, 16 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768787448462/0adba53e-836a-4ade-8264-878c8ebb1162.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When a doctor examines a chest X-ray and says "I see signs of pneumonia in the lower right lung," you can ask them to point at exactly what they're seeing. They can circle the cloudy region, explain why it looks abnormal, and walk you through their reasoning. But when an AI system analyzes the same X-ray and reaches the same conclusion, what is it actually looking at? Is it focusing on the lung tissue, or has it learned some spurious shortcut, like the font used for the patient's name?</p>
<p>This question sits at the heart of AI safety in medicine. If we're going to trust AI systems to help with diagnostic decisions, we need to peer inside them and verify they're looking at the right things for the right reasons.</p>
<p>In this post, I'll explain a method called "Generic Attention-model Explainability" developed by Chefer, Gur, and Wolf that lets us generate visual explanations for what transformer-based AI models are paying attention to. We'll build up the intuition piece by piece, starting from the basics and working toward the full algorithm. I've also implemented this method for Google's MedGemma medical vision-language model, and I'll share results showing the technique in action on real medical images.</p>
<p><strong>My implementation:</strong> <a target="_blank" href="https://github.com/thedatasense/medgemma-explainer"><strong>github.com/thedatasense/medgemma-explainer</strong></a></p>
<p>You can also open a Colab notebook that walks through the concepts and includes a demo:</p>
<p><a target="_blank" href="https://colab.research.google.com/github/thedatasense/medgemma-explainer/blob/master/tutorial_optimized.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>
<p>By the end, you'll understand not just what the method does, but why it works.</p>
<hr />
<h2 id="heading-part-1-the-problem-with-asking-what-did-you-see">Part 1: The Problem with Asking "What Did You See?"</h2>
<p>Imagine you're teaching a child to identify birds. You show them pictures, and they learn to say "that's a robin" or "that's a crow." They get pretty good at it. But one day you notice something strange: they're identifying robins correctly even in photos where the bird is tiny and blurry. How?</p>
<p>You investigate and discover they've learned a shortcut. Robins often appear in photos with green lawns in the background, while crows appear against gray skies. The child isn't identifying birds at all. They're identifying backgrounds.</p>
<p>This is exactly what can happen with AI systems. A famous example from medical imaging: researchers found that an AI trained to detect COVID-19 from chest X-rays had learned to recognize the font used by certain hospitals, which happened to correlate with COVID cases during the training period. The model worked great on test data from those hospitals, but it wasn't actually learning anything about lungs.</p>
<p>The scary part? Without a way to see what the model is looking at, you'd never know. The accuracy numbers would look great right up until the model failed catastrophically on patients from a different hospital.</p>
<p>This is why explainability matters. We need to open up these models and see where their attention is directed.</p>
<hr />
<h2 id="heading-part-2-how-transformers-pay-attention">Part 2: How Transformers Pay Attention</h2>
<p>Before we can explain what a model is looking at, we need to understand how modern Vision Language models "look" at things in the first place. The key mechanism is called attention.</p>
<h3 id="heading-the-cocktail-party">The Cocktail Party</h3>
<p>Imagine you're at a crowded party. Dozens of conversations are happening simultaneously, creating a wall of noise. Yet somehow, when someone across the room says your name, you hear it. Your brain has learned to selectively attend to relevant information while filtering out the rest.</p>
<p>Transformer models do something similar. When processing an image and a question like "Is there a fracture in this X-ray?", the model doesn't treat every pixel and every word as equally important. It learns to focus its computational resources on the parts that matter for answering the question.</p>
<h3 id="heading-attention-as-a-spotlight">Attention as a Spotlight</h3>
<p>Think of attention as a spotlight that the model can shine on different parts of its input. When reading the word "fracture" in the question, the model might shine its spotlight on certain regions of the X-ray. When it encounters the word "bone" in its internal processing, the spotlight might shift to highlight skeletal structures.</p>
<p>Technically, attention works through a learned matching process. Each piece of input (called a "token") asks a question: "What should I pay attention to?" This is called a query. Every other token offers up a description of itself, called a key. The attention mechanism computes how well each query matches each key, producing a set of attention weights that sum to one. High weights mean "pay close attention to this," while low weights mean "mostly ignore this."</p>
<p>Here's the crucial insight: these attention weights create a map of relationships. If token A has high attention weight on token B, it means A is gathering information from B. By examining these weights, we can see what's influencing what.</p>
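<p>In code, this matching process is just a scaled dot product followed by a softmax. Here is a minimal NumPy sketch (the shapes and names are illustrative, not the actual model implementation):</p>
<pre><code class="lang-python">import numpy as np

def attention_weights(Q, K):
    """Q, K: (num_tokens, head_dim) queries and keys.
    Returns a (num_tokens, num_tokens) matrix whose rows sum to one."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # how well each query matches each key
    scores -= scores.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
A = attention_weights(Q, K)
</code></pre>
<p>Row <em>i</em> of <code>A</code> is exactly the "spotlight" for token <em>i</em>: a distribution over every other token saying how much to attend to each.</p>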
<h3 id="heading-multiple-heads-multiple-perspectives">Multiple Heads, Multiple Perspectives</h3>
<p>Modern transformers don't use just one spotlight. They use many, called "attention heads." Each head can focus on different aspects of the input. One head might track syntactic relationships (subject-verb connections in text), another might track semantic similarity (words with related meanings), and another might track positional relationships (things that are close together).</p>
<p>It's like having a team of detectives investigating a case. One looks for physical evidence, another interviews witnesses, a third analyzes financial records. Each brings a different perspective, and the final conclusion synthesizes all their findings.</p>
<h3 id="heading-layers-upon-layers">Layers Upon Layers</h3>
<p>Transformers also stack multiple layers of attention. The first layer might capture simple relationships: "this word relates to that word." But higher layers can capture more complex, abstract patterns: "this concept connects to that concept in this particular way."</p>
<p>Think of it like the visual system in your brain. Early layers detect edges and colors. Middle layers combine those into shapes and textures. Higher layers recognize objects, faces, and scenes. Each layer builds on the representations from the layer below.</p>
<hr />
<h2 id="heading-part-3-the-challenge-of-multi-layer-attribution">Part 3: The Challenge of Multi-Layer Attribution</h2>
<p>Now we arrive at the core problem that Chefer et al. set out to solve.</p>
<p>If we want to know what the model looked at to produce its output, we can't just examine the attention weights from a single layer. The information has been transformed, combined, and re-routed through dozens of layers. The final output is influenced by patterns that were established early and propagated forward, modified at each step.</p>
<h3 id="heading-the-river-delta">The River Delta</h3>
<p>Imagine tracing where a drop of water in the ocean came from. You find it at the river's mouth, but that river was fed by dozens of tributaries, each of which was fed by smaller streams, each of which collected from countless tiny sources across a vast watershed.</p>
<p>The water at the mouth contains contributions from all those sources, but the contributions aren't equal. A large tributary contributes more than a tiny stream. And some sources might have their water diverted or absorbed before it reaches the ocean.</p>
<p>This is exactly our situation with attention. The final output token is like the water at the river's mouth. It contains information that flowed from all the input tokens (the sources), but that information passed through many intermediate stages (the tributaries), being combined and filtered at each step.</p>
<p>To understand where the output came from, we need to trace these flows backward through the entire network.</p>
<h3 id="heading-why-raw-attention-fails">Why Raw Attention Fails</h3>
<p>A naive approach is to just look at the attention weights in the final layer. After all, that's the last step before the output, so shouldn't it tell us what the model was looking at?</p>
<p>Unfortunately, no. The final layer's attention operates on highly processed representations, not the original input. By that point, information from many different input tokens has been mixed together. When the final layer attends to position 47, it's not attending to whatever was originally at position 47. It's attending to a rich mixture of information that has accumulated at that position through all the previous layers.</p>
<p>It's like asking "where did this river water come from?" and answering "from right there, just upstream." Technically true, but it misses the entire watershed that actually supplied the water.</p>
<h3 id="heading-the-rollout-approach-and-its-limitations">The Rollout Approach and Its Limitations</h3>
<p>One early solution was called "attention rollout." The idea is to multiply attention matrices from consecutive layers together, tracing how attention flows through the network.</p>
<p>If layer 1 says "token A attends to token B" and layer 2 says "token B attends to token C," then we can infer that token A indirectly attends to token C through the path A→B→C. By multiplying attention matrices, we can compute these indirect attention relationships.</p>
<p>This is a step in the right direction, but it has a fundamental flaw: it treats all attention equally, whether positive or negative. In reality, some attention connections amplify information while others suppress it. When we multiply matrices together without considering these signs, positive and negative contributions can cancel out in misleading ways.</p>
<p>Imagine tracking money flow through a company. Some transfers add money to departments, others subtract it. If you just add up all the transfers without considering direction, you'll get a very wrong picture of where resources actually ended up.</p>
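<p>For concreteness, here is a minimal sketch of rollout (following the common formulation that adds an identity term for residual connections; the function name and shapes are mine):</p>
<pre><code class="lang-python">import numpy as np

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of (n, n) head-averaged attention matrices, layer 0 first."""
    n = attn_per_layer[0].shape[0]
    R = np.eye(n)
    for A in attn_per_layer:
        A_res = 0.5 * A + 0.5 * np.eye(n)           # account for the residual connection
        A_res /= A_res.sum(axis=-1, keepdims=True)  # re-normalize each row
        R = A_res @ R                               # compose indirect paths (A to B to C)
    return R
</code></pre>
<p>Notice that every entry is weighted by raw attention alone. There is no signal anywhere in this computation saying whether a connection amplified or suppressed the output, which is precisely the flaw described above.</p>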
<hr />
<h2 id="heading-part-4-the-chefer-method-step-by-step">Part 4: The Chefer Method, Step by Step</h2>
<p>Now we're ready to understand the solution that Chefer et al. developed. Their method addresses the limitations we've discussed by carefully tracking how relevance propagates through the network while respecting the gradient information that tells us whether connections are helpful or harmful.</p>
<h3 id="heading-the-core-insight-gradients-tell-us-what-matters">The Core Insight: Gradients Tell Us What Matters</h3>
<p>Here's a key insight: not all attention is created equal. When the model is deciding whether to output "yes" or "no" for "Is there a fracture?", some attention connections are crucial to that decision while others are incidental.</p>
<p>How can we tell which is which? Gradients.</p>
<p>When we train neural networks, we compute gradients that tell us how changing each parameter would affect the output. But we can also compute gradients for intermediate values like attention weights. If changing an attention weight would significantly change the output, that weight has high gradient magnitude. If changing it would barely matter, the gradient is small.</p>
<p>By multiplying attention weights by their gradients, we can identify which connections actually matter for the specific output we're trying to explain.</p>
<h3 id="heading-the-recipe">The Recipe</h3>
<p>Let me walk through the algorithm step by step, building intuition as we go.</p>
<p><strong>Step 1: Initialize with Identity</strong></p>
<p>We start by creating a "relevance matrix" R that's initially an identity matrix. An identity matrix has ones on the diagonal and zeros everywhere else. This represents our starting assumption: before any attention happens, each token is relevant only to itself.</p>
<p>Think of it as the starting state before the cocktail party begins. Everyone is self-contained, not yet influenced by anyone else.</p>
<p><strong>Step 2: Process Each Layer</strong></p>
<p>For each attention layer in the network, we update R to account for the new connections being made. The attention matrix A tells us how tokens attended to each other at this layer.</p>
<p>But we don't use A directly. First, we weight it by the gradient to identify which connections matter:</p>
<pre><code class="lang-markdown">Ā = average across heads of (gradient × attention)⁺
</code></pre>
<p>The × means element-wise multiplication. The ⁺ means we keep only positive values, setting negatives to zero. The average combines information from all the attention heads.</p>
<p>Why remove negatives? Because we're tracking positive relevance, contributions that support the output. Negative gradients indicate connections that would hurt the output if strengthened, and we don't want those polluting our relevance map.</p>
<p><strong>Step 3: Accumulate Through Residual Connections</strong></p>
<p>Modern transformers have "residual connections" that allow information to skip layers. This means the output of a layer is the sum of the attention output plus the original input, passed through unchanged.</p>
<p>To account for this, we add the new relevance to the existing relevance rather than replacing it:</p>
<pre><code class="lang-markdown">R = R + Ā × R
</code></pre>
<p>The matrix multiplication Ā × R is the key operation. It says: "The relevance of token i to token j is the sum over all intermediate tokens k of how much i attends to k times how relevant k was to j."</p>
<p>This is exactly the tributary logic. To know how much water source i contributes to outlet j, you sum over all intermediate points: how much flows from i to each intermediate point, times how much that point contributes to j.</p>
<p><strong>Step 4: Extract the Explanation</strong></p>
<p>After processing all layers, R contains the accumulated relevance. To explain a particular output token, we look at the row of R corresponding to that token. This row tells us how relevant each input token is to that output.</p>
<p>For image-text models, we can split this relevance vector into the image tokens and text tokens, giving us separate explanations for what visual regions and what words influenced the prediction.</p>
<h3 id="heading-a-worked-example">A Worked Example</h3>
<p>Let's trace through a tiny example to make this concrete. Imagine a three-token sequence and a two-layer transformer.</p>
<p>We start with:</p>
<pre><code class="lang-markdown">R = [1 0 0]
    [0 1 0]
    [0 0 1]
</code></pre>
<p>Each token is relevant only to itself.</p>
<p>Layer 1 has gradient-weighted attention:</p>
<pre><code class="lang-markdown">Ā₁ = [0.1 0.3 0.2]
     [0.2 0.1 0.4]
     [0.3 0.2 0.1]
</code></pre>
<p>Token 0 attends mostly to token 1 (weight 0.3). Token 1 attends mostly to token 2 (weight 0.4). Token 2 attends mostly to token 0 (weight 0.3).</p>
<p>We update R:</p>
<pre><code class="lang-markdown">R = R + Ā₁ × R
R = I + Ā₁ × I
R = I + Ā₁
R = [1.1 0.3 0.2]
    [0.2 1.1 0.4]
    [0.3 0.2 1.1]
</code></pre>
<p>Now token 0 has picked up some relevance from tokens 1 and 2. The diagonal values increased slightly because of self-attention.</p>
<p>Layer 2 has gradient-weighted attention:</p>
<pre><code class="lang-markdown">Ā₂ = [0.2 0.4 0.1]
     [0.1 0.2 0.5]
     [0.4 0.1 0.2]
</code></pre>
<p>We update R again:</p>
<pre><code class="lang-markdown">R = R + Ā₂ × R
</code></pre>
<p>I'll spare you the matrix arithmetic, but the result is that R now captures both direct attention (token i attended to token j at some layer) and indirect attention (token i attended to token k, which had previously gathered information from token j).</p>
<p>If we want to explain what influenced token 2, we look at row 2 of the final R. If we want to explain token 0, we look at row 0.</p>
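<p>If you would rather not take the arithmetic on faith, the whole worked example is a few lines of NumPy:</p>
<pre><code class="lang-python">import numpy as np

A1 = np.array([[0.1, 0.3, 0.2],
               [0.2, 0.1, 0.4],
               [0.3, 0.2, 0.1]])
A2 = np.array([[0.2, 0.4, 0.1],
               [0.1, 0.2, 0.5],
               [0.4, 0.1, 0.2]])

R = np.eye(3)      # identity init
R = R + A1 @ R     # after layer 1: R = I + A1
R = R + A2 @ R     # after layer 2
print(np.round(R, 2))
</code></pre>
<p>The final matrix is [[1.43, 0.82, 0.51], [0.50, 1.45, 1.05], [0.82, 0.47, 1.44]], so row 2, [0.82, 0.47, 1.44], is the relevance profile for output token 2.</p>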
<hr />
<h2 id="heading-part-5-applying-this-to-vision-language-models">Part 5: Applying This to Vision-Language Models</h2>
<p>The method we've described works for any transformer. But applying it to vision-language models like MedGemma requires understanding how these models are structured.</p>
<h3 id="heading-how-images-become-tokens">How Images Become Tokens</h3>
<p>Vision-language models convert images into sequences of tokens that can be processed alongside text. The typical approach uses a vision encoder that divides the image into patches (small rectangular regions) and produces one token per patch.</p>
<p>For MedGemma, an 896×896 pixel image is divided into 14×14 pixel patches, producing a 64×64 grid of patches. These are then pooled down to a 16×16 grid, yielding 256 image tokens. These 256 tokens capture the visual content of the image in a form the language model can process.</p>
<p>When you ask MedGemma "Is there a fracture in this X-ray?", the model receives a sequence that looks like:</p>
<pre><code class="lang-markdown">[img_0, img_1, ..., img_255, "Is", "there", "a", "fracture", "in", "this", "X", "-", "ray", "?"]
</code></pre>
<p>The first 256 positions are image tokens. The rest are text tokens. The model's attention operates over this combined sequence, allowing image tokens to attend to text and vice versa.</p>
<h3 id="heading-generating-the-explanation">Generating the Explanation</h3>
<p>When we apply the Chefer method to this combined sequence, we get a relevance vector that tells us how much each position influenced the output. The first 256 values correspond to image regions. We can reshape these into a 16×16 grid and overlay it on the original image as a heatmap.</p>
<p>High values indicate "the model looked here when generating its answer." Low values indicate "this region didn't much matter."</p>
<p>For the text tokens, we get relevance values that tell us which words in the question were most important. If the question was about fractures, we'd expect "fracture" and "bone" to have higher relevance than "is" or "there."</p>
<hr />
<h2 id="heading-part-6-the-method-in-action-my-medgemma-results">Part 6: The Method in Action — My MedGemma Results</h2>
<p>Theory is one thing. Seeing it work is another. I implemented the Chefer method for Google's MedGemma 1.5 4B, a vision-language model specifically trained for medical image understanding.</p>
<p><strong>The full implementation is available at:</strong> <a target="_blank" href="https://github.com/thedatasense/medgemma-explainer"><strong>github.com/thedatasense/medgemma-explainer</strong></a></p>
<p>Let me walk through two examples that demonstrate the method's power.</p>
<h3 id="heading-example-1-finding-the-remote-control">Example 1: Finding the Remote Control</h3>
<p>Before tackling medical images, let's start with a simpler test case. Here's an image of a cat sitting on a couch with a remote control visible at the bottom of the frame.</p>
<p>When I ask MedGemma "Where is the remote?" and explain the "remote" token in its response, the relevancy map shows exactly what we'd hope to see: the highest attention is concentrated at the bottom-center of the image, precisely where the remote control is located.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768782707631/dbb40985-5da3-43be-8131-3e13076e4b77.png" alt /></p>
<p><em>Figure 1: When explaining the "remote" token, the model's attention is correctly focused on the bottom-center region where the remote control is located. The bar chart quantifies relevancy by region, with bottom-center scoring 0.226 compared to just 0.051 for top-left.</em></p>
<p>The bar chart on the right quantifies this. The bottom-center region (where the remote actually is) has a mean relevancy of 0.226, more than four times higher than the top-left region at 0.051. The model isn't just producing a vague, diffuse attention pattern. It's looking at exactly the right place.</p>
<p>This is the kind of sanity check that builds confidence. If the method highlighted the cat instead of the remote when explaining the word "remote," we'd know something was wrong with either the model or our explainability implementation.</p>
<h3 id="heading-example-2-chest-x-ray-pneumonia-detection">Example 2: Chest X-ray Pneumonia Detection</h3>
<p>Now for a clinically meaningful example. Here's a chest X-ray from a patient with right middle lobe pneumonia. A critical detail to understand: in a standard PA (posterior-anterior) chest X-ray, the image is oriented as if you're facing the patient. This means the left side of the image corresponds to the patient's RIGHT side.</p>
<p>When I ask MedGemma "Is there evidence of pneumonia?" the model generates a response mentioning consolidation in the right lung. Using the Chefer method, I can explain individual tokens in that response.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768782726493/7bed9263-2075-43b2-a506-ede67f85efa7.png" alt /></p>
<p><em>Figure 2: Chest X-ray analysis showing token-specific explanations. Top row: original image (with anatomical labels), whole answer explanation, and "pneumonia" token explanation. Bottom row: "consolidation", "opacity", and "right" token explanations. Each shows attention correctly focused on the patient's right lung (left side of image) where the pathology is located.</em></p>
<p>Look at the "pneumonia" token explanation (top right). The relevancy map shows concentrated attention on the left side of the image, which is the patient's right lung, exactly where the pathology is located. The quantitative scores confirm this: patient right lung relevancy is 0.140, nearly four times higher than the patient left lung at 0.037.</p>
<p>Even more striking is the "right" token explanation (bottom right of the figure). When the model generates the word "right" (as in "right lung"), the attention is strongly focused on the patient's anatomical right side. The model isn't just pattern-matching words; it's correctly grounding the anatomical term to the corresponding image region.</p>
<p>The "consolidation" and "opacity" tokens show similar patterns, highlighting the area of increased density that characterizes pneumonic infiltration.</p>
<hr />
<h2 id="heading-part-7-a-critical-implementation-detail">Part 7: A Critical Implementation Detail</h2>
<p>While implementing this method, I discovered a subtle but crucial detail that isn't obvious from the original paper. Getting this wrong produces meaningless results. Getting it right makes everything work.</p>
<h3 id="heading-the-backpropagation-target-problem">The Backpropagation Target Problem</h3>
<p>In causal language models like MedGemma, the logit at position i predicts the token at position i+1. This offset matters enormously for explainability.</p>
<p>If you want to explain why the model generated a specific token at position p, you must backpropagate from the logit at position p-1, not position p. And you should use the actual token ID that was generated, not the argmax of the logits.</p>
<p>Here's the wrong approach that I see in many implementations:</p>
<pre><code class="lang-python"><span class="hljs-comment"># WRONG - This explains "what comes after the last token"</span>
target_logit = logits[<span class="hljs-number">0</span>, <span class="hljs-number">-1</span>, logits[<span class="hljs-number">0</span>, <span class="hljs-number">-1</span>].argmax()]
</code></pre>
<p>Here's the correct approach:</p>
<pre><code class="lang-python"><span class="hljs-comment"># CORRECT - This explains why the token at position p was generated</span>
logit_position = target_token_position - <span class="hljs-number">1</span>
target_token_id = input_ids[<span class="hljs-number">0</span>, target_token_position]  <span class="hljs-comment"># The actual token</span>
target_logit = logits[<span class="hljs-number">0</span>, logit_position, target_token_id]
</code></pre>
<p>This distinction might seem pedantic, but it's the difference between coherent, focused explanations and noisy, meaningless heatmaps. When I fixed this in my implementation, the results went from confusing to crisp.</p>
<h3 id="heading-medgemmas-architecture">MedGemma's Architecture</h3>
<p>For those interested in the technical details, MedGemma 1.5 4B has some architectural features that required careful handling.</p>
<p>The model uses 34 transformer layers with grouped-query attention, where 8 query heads share 4 key-value heads. It also employs a 5:1 ratio of local to global attention layers, where local layers only attend within a 1024-token window. Global attention layers (at positions 5, 11, 17, 23, and 29) can attend to the full sequence.</p>
<p>Images are processed by a SigLIP vision encoder that produces 256 image tokens arranged in a 16×16 grid. These tokens occupy positions 6 through 261 in the input sequence, with text tokens following after.</p>
<p>Understanding this token structure is essential for correctly extracting and visualizing image relevancy. When you pull out the first 256 values from the relevancy vector and reshape them into a 16×16 grid, you get a spatial map that can be overlaid on the original image.</p>
<h3 id="heading-other-implementation-notes">Other Implementation Notes</h3>
<p>A few other details that matter:</p>
<ul>
<li><p><strong>Use eager attention</strong>: MedGemma's default SDPA (Scaled Dot Product Attention) implementation doesn't support <code>output_attentions=True</code>. You need to load the model with <code>attn_implementation="eager"</code>.</p>
</li>
<li><p><strong>Keep the model in eval mode</strong>: Use <code>torch.enable_grad()</code> context instead of calling <code>model.train()</code>. This preserves the inference behavior while allowing gradient computation.</p>
</li>
<li><p><strong>Convert to float32</strong>: Attention tensors come out as bfloat16. Convert them to float32 for stable gradient computation.</p>
</li>
<li><p><strong>Retain gradients</strong>: Call <code>attn.requires_grad_(True)</code> and <code>attn.retain_grad()</code> on the attention tensors before the backward pass.</p>
</li>
</ul>
<hr />
<h2 id="heading-part-8-implications-for-medical-ai-safety">Part 8: Implications for Medical AI Safety</h2>
<p>Let me return to where we started: the challenge of trusting AI in medicine.</p>
<p>Medical decisions carry enormous stakes. A false negative might mean a missed cancer. A false positive might mean unnecessary surgery. We can't simply trust AI systems because they score well on benchmarks. We need to verify they're reasoning correctly.</p>
<p>The Chefer method gives us a tool for this verification. When a model says "this X-ray shows signs of pneumonia," we can ask "show me what you're looking at." If the heatmap highlights the lung region with the suspicious opacity, our confidence increases. If it highlights the patient's ID number or the machine manufacturer's logo, we know something is wrong.</p>
<p>This isn't just about catching errors. It's about building appropriate trust. Explainability lets us calibrate our reliance on AI to match its actual capabilities. We might trust the model more in situations where its attention patterns look sensible, and trust it less when its reasoning seems confused.</p>
<h3 id="heading-the-limitation-to-remember">The Limitation to Remember</h3>
<p>One important caveat: attention-based explanations show us what the model looked at, not necessarily why. Two models might look at the same region but interpret it differently. One might correctly identify an abnormality, while another might misclassify it.</p>
<p>Think of it this way: if two doctors are examining the same X-ray, knowing they're both looking at the lower right lung is useful, but it doesn't guarantee they'll reach the same conclusion. The attention map is the "where," not the "what" or "why."</p>
<p>This means explainability methods are one tool among many. They're most powerful when combined with other approaches like testing on diverse datasets, comparing to expert annotations, and conducting systematic error analysis.</p>
<hr />
<h2 id="heading-conclusion-opening-doors-not-just-black-boxes">Conclusion: Opening Doors, Not Just Black Boxes</h2>
<p>We've covered a lot of ground in this post. We started with the problem of understanding what AI models are looking at, built up an understanding of how attention works in transformers, and walked through a method that traces relevance through multi-layer networks.</p>
<p>The Chefer method is elegant because it respects the actual computational structure of transformer models. Rather than treating the network as an inscrutable black box, it uses the model's own attention patterns and gradients to surface meaningful explanations.</p>
<p>For those working with medical AI, methods like this are essential. They transform the question "can we trust this model?" from philosophical hand-waving into concrete investigation. We can look at what the model sees, compare it to clinical expectations, and make informed decisions about deployment.</p>
<hr />
<h2 id="heading-try-it-yourself">Try It Yourself</h2>
<p>The complete implementation is available on GitHub:</p>
<p><a target="_blank" href="https://github.com/thedatasense/medgemma-explainer"><strong>github.com/thedatasense/medgemma-explainer</strong></a></p>
<p>The repository includes:</p>
<ul>
<li><p>Full source code for the explainability method</p>
</li>
<li><p>Jupyter notebook tutorials</p>
</li>
<li><p>Example scripts for medical image analysis</p>
</li>
<li><p>Visualization utilities</p>
</li>
</ul>
<p>Feel free to use it, extend it, and let me know what you discover.</p>
<hr />
<h2 id="heading-further-reading">Further Reading</h2>
<p>If you want to dive deeper into the technical details:</p>
<p><strong>The original paper</strong>: Chefer, H., Gur, S., &amp; Wolf, L. (2021). Transformer Interpretability Beyond Attention Visualization. CVPR 2021. <a target="_blank" href="https://arxiv.org/abs/2012.09838">arXiv:2012.09838</a></p>
<p><strong>Generic Attention Explainability paper</strong>: Chefer, H., Gur, S., &amp; Wolf, L. (2021). Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. <a target="_blank" href="https://arxiv.org/abs/2103.15679">arXiv:2103.15679</a></p>
<p><strong>The authors' code</strong>: <a target="_blank" href="https://github.com/hila-chefer/Transformer-MM-Explainability">github.com/hila-chefer/Transformer-MM-Explainability</a></p>
<p><strong>MedGemma</strong>: <a target="_blank" href="https://huggingface.co/google/medgemma-1.5-4b-it">huggingface.co/google/medgemma-1.5-4b-it</a></p>
<p><strong>Attention mechanisms</strong>: Vaswani, A., et al. (2017). Attention Is All You Need. <a target="_blank" href="https://arxiv.org/abs/1706.03762">arXiv:1706.03762</a></p>
<p><strong>AI explainability in healthcare</strong>: Ghassemi, M., et al. (2021). The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health.</p>
<hr />
<p><em>This post is part of ongoing research into clinically robust vision-language models. If you're working on similar problems or have questions about the implementation, feel free to reach out or open an issue on GitHub.</em></p>
]]></content:encoded></item><item><title><![CDATA[Data Generating Process]]></title><description><![CDATA[Data does not just appear. Something creates it. A coin flip. A measurement device. A biological process. A human decision. Understanding that something, the mechanism that generates observations, is the key to understanding uncertainty.
This mechani...]]></description><link>https://thedatasense.com/data-generating-process</link><guid isPermaLink="true">https://thedatasense.com/data-generating-process</guid><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 14 Jan 2026 05:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Data does not just appear. Something creates it. A coin flip. A measurement device. A biological process. A human decision. Understanding that something, the mechanism that generates observations, is the key to understanding uncertainty.</p>
<p>This mechanism has a name: the Data Generating Process, or DGP.</p>
<p>You can run these experiments in a free Google Colab environment: <a target="_blank" href="https://colab.research.google.com/drive/1_DKAG4dXC66WrIPlCpKy6mfToCa9QxsO?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>
<h2 id="heading-what-is-a-data-generating-process">What Is a Data Generating Process?</h2>
<p>A DGP is the real world system that produces the numbers you eventually analyze. It includes everything: the true underlying signal, the noise, the measurement error, the selection bias, the sampling method.</p>
<p>When you flip a coin 100 times and count heads, the DGP is the physics of the coin, the force of your thumb, the air resistance, and everything else that determines whether each flip lands heads or tails. In practice, we model this as a simple probability: each flip has some chance p of landing heads, independent of other flips.</p>
<p>When a hospital records patient outcomes, the DGP includes the disease biology, treatment effects, patient compliance, measurement protocols, and which patients showed up in the first place.</p>
<p>The data you see is just one possible output from the DGP. Run the process again and you get different numbers. This is where uncertainty comes from.</p>
<h2 id="heading-an-example">An Example</h2>
<p>Suppose I want to know whether a new drug lowers blood pressure. I run a trial with 50 patients. Half get the drug, half get placebo. I measure the difference in blood pressure between groups.</p>
<p>The traditional approach: calculate a t-statistic, look up a p-value, declare significance or not.</p>
<p>The DGP approach: first, write down what you think is generating the data.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_trial_data</span>(<span class="hljs-params">n_per_group, true_effect, noise_sd</span>):</span>
    <span class="hljs-comment"># Placebo group: just noise around baseline</span>
    placebo = np.random.normal(<span class="hljs-number">0</span>, noise_sd, n_per_group)

    <span class="hljs-comment"># Treatment group: true effect plus noise</span>
    treatment = np.random.normal(true_effect, noise_sd, n_per_group)

    <span class="hljs-keyword">return</span> placebo, treatment
</code></pre>
<p>This function is a DGP. It specifies exactly how the data comes into existence. The true effect is a parameter I control. The noise standard deviation is another parameter. When I call this function, I get one possible trial result.</p>
<p>Here is the insight: I can call it many times.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_simulation</span>(<span class="hljs-params">n_per_group, true_effect, noise_sd, n_simulations</span>):</span>
    observed_differences = []

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_simulations):
        placebo, treatment = generate_trial_data(n_per_group, true_effect, noise_sd)
        diff = treatment.mean() - placebo.mean()
        observed_differences.append(diff)

    <span class="hljs-keyword">return</span> np.array(observed_differences)
</code></pre>
<p>The figure below shows what happens when we do this. On the left is the DGP itself, just a box with parameters. In the middle, we run it five times and see five different trial results. On the right, we run it 10,000 times and see the full distribution of possible outcomes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768864552874/fa00af11-0414-4b47-8e18-b1ab131d5594.png" alt /></p>
<p>That distribution on the right is uncertainty made visible. The true effect is 5 mmHg, but any single trial might show anywhere from -5 to +15 just due to noise. This is why we need statistics: to separate signal from noise.</p>
<h2 id="heading-the-null-distribution-and-p-values">The Null Distribution and P-values</h2>
<p>If I set <code>true_effect=0</code> and run 10,000 simulations, I get the distribution of differences I would see if the drug does nothing. This is the null distribution. I built it by simulating the null world.</p>
<p>If my actual trial shows a difference of 8.5 mmHg, I can see where that falls in the null distribution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768864582121/97f2c136-7617-41f4-a91e-afd5b092e4e0.png" alt /></p>
<p>The red lines mark my observed value and its mirror. The p-value is just the fraction of the null distribution that falls beyond those lines. In this case, about 0.3% of the null simulations produced results as extreme as what I observed.</p>
<p>This is what "statistically significant" means. My result is unlikely to have come from the null world.</p>
<p>No formulas. No t-tables. Just direct simulation of what would happen if the drug did nothing.</p>
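<p>The whole calculation fits in a few lines. A minimal sketch, assuming 25 patients per group and a noise standard deviation of 10 mmHg (illustrative numbers, not taken from the trial above):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_group, noise_sd, n_sims = 25, 10.0, 10_000

# Simulate the null world: the drug does nothing, so both groups
# are just noise around the same baseline.
placebo_means = rng.normal(0, noise_sd, (n_sims, n_per_group)).mean(axis=1)
treatment_means = rng.normal(0, noise_sd, (n_sims, n_per_group)).mean(axis=1)
null_diffs = treatment_means - placebo_means

observed = 8.5  # the difference seen in the actual trial
p_value = np.mean(np.abs(null_diffs) >= observed)
print(f"p = {p_value:.4f}")
```

<p>The two-sided p-value is literally the fraction of simulated null differences at least as extreme as the observed one.</p>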
<h2 id="heading-why-this-changes-everything">Why This Changes Everything</h2>
<p>When you write the DGP, you confront your assumptions explicitly.</p>
<p>Look at my function again. I assumed both groups have the same noise level. I assumed the noise is normally distributed. I assumed each patient's outcome is independent of others. These assumptions are now visible in the code, not hidden in the derivation of a test statistic.</p>
<p>What if those assumptions are wrong?</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_trial_data_realistic</span>(<span class="hljs-params">n_per_group, true_effect, noise_sd</span>):</span>
    <span class="hljs-comment"># Some patients respond strongly, others barely respond</span>
    responder_fraction = <span class="hljs-number">0.3</span>

    placebo = np.random.normal(<span class="hljs-number">0</span>, noise_sd, n_per_group)

    treatment = []
    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_per_group):
        <span class="hljs-keyword">if</span> np.random.random() &lt; responder_fraction:
            <span class="hljs-comment"># Responder: large effect</span>
            treatment.append(np.random.normal(true_effect * <span class="hljs-number">2</span>, noise_sd))
        <span class="hljs-keyword">else</span>:
            <span class="hljs-comment"># Non-responder: small effect</span>
            treatment.append(np.random.normal(true_effect * <span class="hljs-number">0.2</span>, noise_sd))

    <span class="hljs-keyword">return</span> placebo, np.array(treatment)
</code></pre>
<p>Now I have a bimodal response. Some patients are responders, others are not. The average effect might be the same, but the distribution looks different. Does my statistical test still work? I can find out by running simulations with this new DGP and checking whether my test maintains its false positive rate.</p>
<p>This is the power of thinking generatively. You can stress test your methods against any scenario you can imagine and code.</p>
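<p>Here is one such stress test, sketched end to end. All parameters and the |t| &gt; 2 rejection threshold are my illustrative choices, and the pooled t-statistic is hand-rolled rather than taken from a library. Note that with <code>true_effect=0</code> the bimodal DGP collapses to the simple one, so this sketch instead violates the other assumption flagged above: equal noise in both groups.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def t_statistic(a, b):
    # Classic pooled-variance t-statistic (assumes equal noise in both groups)
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (b.mean() - a.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

def false_positive_rate(sd_a, sd_b, na, nb, n_sims=2000, threshold=2.0):
    # How often does the test reject when the true effect is zero?
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, sd_a, na)
        b = rng.normal(0, sd_b, nb)
        rejections += abs(t_statistic(a, b)) > threshold
    return rejections / n_sims

fpr_ok = false_positive_rate(10, 10, 25, 25)   # assumptions hold: near 5%
fpr_bad = false_positive_rate(10, 30, 40, 10)  # unequal noise + sizes: inflated
print(fpr_ok, fpr_bad)
```

<p>When the equal-noise assumption holds, the test rejects about 5% of null datasets, as advertised. Break the assumption and the false positive rate inflates well past its nominal level, something you would never see by staring at the formula alone.</p>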
<h2 id="heading-the-sampling-distribution-demystified">The Sampling Distribution Demystified</h2>
<p>The central mystery in introductory statistics is the sampling distribution. Students learn that if you take many samples and compute the mean of each, those means form a distribution that is approximately normal with standard deviation \(\sigma/\sqrt{n}\).</p>
<p>This is true. But why?</p>
<p>The DGP approach lets you see it happen.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">demonstrate_sampling_distribution</span>(<span class="hljs-params">population, sample_size, n_samples</span>):</span>
    sample_means = []

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_samples):
        sample = np.random.choice(population, size=sample_size, replace=<span class="hljs-literal">True</span>)
        sample_means.append(sample.mean())

    <span class="hljs-keyword">return</span> np.array(sample_means)

<span class="hljs-comment"># Create a weird, non-normal population</span>
population = np.concatenate([
    np.random.exponential(<span class="hljs-number">2</span>, <span class="hljs-number">5000</span>),
    np.random.normal(<span class="hljs-number">10</span>, <span class="hljs-number">1</span>, <span class="hljs-number">5000</span>)
])
</code></pre>
<p>Look at what happens:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768864603757/9edd8660-1039-4053-a21a-e0772bb3cf43.png" alt /></p>
<p>The top left panel shows the population. It is not normal at all. Two peaks, a long tail, nothing like a bell curve.</p>
<p>But watch what happens as we draw samples and compute means. At n=5, still pretty weird. At n=30, looking more normal. At n=100, almost perfectly bell-shaped.</p>
<p>The Central Limit Theorem is not a formula to memorize. It is something you can watch happen. The averaging process smooths out the weirdness. Larger samples smooth more. The standard deviation of the sampling distribution shrinks from 2.35 to 0.95 to 0.52, roughly following \(1/\sqrt{n}\).</p>
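<p>You can reproduce the shrinking spread yourself with a compact vectorized version of the function above (the population here is rebuilt the same way; exact numbers depend on your draw):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of weird, bimodal population as above
population = np.concatenate([
    rng.exponential(2, 5000),
    rng.normal(10, 1, 5000),
])

stds = {}
for n in (5, 30, 100):
    # 10,000 samples of size n, one sample per row, one mean per row
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    stds[n] = sample_means.std()
    print(f"n={n:>3}: sd of sample means = {stds[n]:.2f}, "
          f"sigma/sqrt(n) = {population.std() / np.sqrt(n):.2f}")
```

<p>The two columns track each other closely at every n, which is the \(\sigma/\sqrt{n}\) claim made visible.</p>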
<h2 id="heading-bootstrap-when-you-only-have-one-sample">Bootstrap: When You Only Have One Sample</h2>
<p>In real life, you run one trial. You collect one dataset. You cannot go back and sample the population again.</p>
<p>The bootstrap solves this by treating your sample as if it were the population.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">bootstrap_confidence_interval</span>(<span class="hljs-params">data, n_bootstrap, confidence=<span class="hljs-number">0.95</span></span>):</span>
    bootstrap_means = []
    n = len(data)

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_bootstrap):
        <span class="hljs-comment"># Resample with replacement from your actual data</span>
        resample = np.random.choice(data, size=n, replace=<span class="hljs-literal">True</span>)
        bootstrap_means.append(resample.mean())

    <span class="hljs-comment"># Find percentiles</span>
    lower = np.percentile(bootstrap_means, (<span class="hljs-number">1</span> - confidence) / <span class="hljs-number">2</span> * <span class="hljs-number">100</span>)
    upper = np.percentile(bootstrap_means, (<span class="hljs-number">1</span> + confidence) / <span class="hljs-number">2</span> * <span class="hljs-number">100</span>)

    <span class="hljs-keyword">return</span> lower, upper
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768865346070/a21070df-b2ef-4452-b335-245c7b05e991.png" alt /></p>
<p>On the left is your one sample of 30 observations. This is all you have. On the right is what happens when you resample from it 10,000 times. The spread of those bootstrap means gives you the confidence interval directly. The middle 95% spans from 94.5 to 106.8.</p>
<p>No formulas involving t-distributions. No assumptions about normality. Just simulation.</p>
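<p>Running the idea end to end looks like this, as a vectorized equivalent of the loop above. The sample itself is synthetic here, drawn from a normal with mean 100 and sd 15 purely for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Your one sample of 30 observations -- all you have
data = rng.normal(100, 15, 30)

# 10,000 bootstrap resamples, one per row, means down the rows
boot_means = rng.choice(data, size=(10_000, len(data)), replace=True).mean(axis=1)
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

<p>The percentile interval of the bootstrap means is the confidence interval, no distributional formula required.</p>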
<h2 id="heading-permutation-tests-simulating-the-null-world">Permutation Tests: Simulating the Null World</h2>
<p>Back to the drug trial. I want a p-value. How unlikely is my observed difference if the drug does nothing?</p>
<p>The permutation test answers this by explicitly constructing the null world.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">permutation_test</span>(<span class="hljs-params">group1, group2, n_permutations</span>):</span>
    observed_diff = group2.mean() - group1.mean()
    combined = np.concatenate([group1, group2])
    n1 = len(group1)

    null_diffs = []
    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_permutations):
        <span class="hljs-comment"># Shuffle all observations</span>
        np.random.shuffle(combined)
        <span class="hljs-comment"># Split into fake groups</span>
        fake_group1 = combined[:n1]
        fake_group2 = combined[n1:]
        null_diffs.append(fake_group2.mean() - fake_group1.mean())

    <span class="hljs-comment"># P-value: fraction of null differences as extreme as observed</span>
    null_diffs = np.array(null_diffs)
    p_value = np.mean(np.abs(null_diffs) &gt;= np.abs(observed_diff))

    <span class="hljs-keyword">return</span> p_value, null_diffs
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768865387610/a9b53e5a-2f07-46b0-b553-e4be1dbfc989.png" alt /></p>
<p>Panel 1 shows the original data. Control group in gray, treatment in blue. The observed difference is 3.8.</p>
<p>Panel 2 shows what happens when we shuffle the labels. If the drug truly does nothing, it should not matter which patients got which label. After shuffling, the difference is -2.3.</p>
<p>Panel 3 shows the null distribution from 10,000 shuffles. Most differences cluster around zero. My observed value of 3.8 is marked by the red line. The red bars show all permuted differences as extreme or more extreme than mine. That fraction is the p-value: 0.31.</p>
<p>In this case, a p-value of 0.31 means my result is not unusual under the null hypothesis. I cannot reject the possibility that the drug does nothing. The permutation test made that clear by showing me exactly what "nothing" looks like.</p>
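<p>For contrast, here is the same shuffling procedure on synthetic data where the treatment really works. This is a compact, self-contained restatement of <code>permutation_test</code> above; the group sizes and the 5-unit effect are my illustrative choices:</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic trial: the treatment truly shifts the outcome by 5 units
control = rng.normal(0, 2, 20)
treatment = rng.normal(5, 2, 20)

observed = treatment.mean() - control.mean()
combined = np.concatenate([control, treatment])

null_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(combined)  # relabel patients at random
    null_diffs.append(shuffled[20:].mean() - shuffled[:20].mean())
null_diffs = np.array(null_diffs)

p_value = np.mean(np.abs(null_diffs) >= np.abs(observed))
print(f"observed diff = {observed:.2f}, p = {p_value:.4f}")
```

<p>With a real effect this large, essentially no shuffled relabeling produces a difference as extreme as the observed one, so the p-value lands near zero.</p>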
<h2 id="heading-from-consumer-to-architect">From Consumer to Architect</h2>
<p>The shift from traditional statistics to simulation based thinking is a shift in identity.</p>
<p>In the traditional approach, you are a consumer of methods. Someone else derived the test. You apply it to your data. The uncertainty is someone else's problem, already solved, packaged into a formula.</p>
<p>In the DGP approach, you are an architect of models. You decide what mechanism generates your data. You write it down. You simulate it. You test your methods against it. The uncertainty becomes visible, manipulable, yours to explore.</p>
<p>This takes more work. You have to write code. You have to think carefully about what assumptions you are making and whether they match reality.</p>
<p>But the payoff is understanding. Not just knowing that a p-value below 0.05 means something. Knowing what it means because you built the null world yourself and watched where your data landed in it.</p>
<p>Data does not analyze itself. Something creates it. Learn to think like the creator.</p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: The Final Challenge]]></title><description><![CDATA[The Challenge That Ties Everything Together
After four modules of learning multimodal techniques, NVIDIA's certification throws you into the deep end with a beautifully designed assessment. The problem sounds almost paradoxical at first:
You have a c...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-the-final-challenge</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-the-final-challenge</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[#lidarsensor]]></category><category><![CDATA[NVIDIA]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sun, 11 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/qDgTQOYk6B8/upload/8b41e212edf58813764910931957f54b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-challenge-that-ties-everything-together">The Challenge That Ties Everything Together</h2>
<p>After four modules of learning multimodal techniques, NVIDIA's certification throws you into the deep end with a beautifully designed assessment. The problem sounds almost paradoxical at first:</p>
<p><strong>You have a classifier that works perfectly with LiDAR data. Make it work with RGB images instead, without retraining it on RGB labels.</strong></p>
<p>Wait, what? How do you make a model trained on depth data suddenly understand colors?</p>
<p>This is where everything you've learned comes together: contrastive learning, cross-modal projection, and embedding alignment. Let me walk you through my journey of solving this puzzle.</p>
<hr />
<h3 id="heading-understanding-the-problem-cubes-and-spheres">Understanding the Problem: Cubes and Spheres</h3>
<p>The scenario is elegant in its simplicity. You have a dataset of 3D scenes containing either cubes or spheres. Each scene is captured two ways:</p>
<ol>
<li><strong>RGB Images</strong>: Color photographs showing red, green, or blue objects</li>
<li><strong>LiDAR Depth Maps</strong>: Point cloud data showing the 3D shape</li>
</ol>
<p>Here's the catch:</p>
<ul>
<li>The pre-trained classifier only understands LiDAR data</li>
<li>At inference time, you only have RGB images</li>
<li>You cannot retrain the classifier on RGB labels</li>
</ul>
<p><strong>The Analogy</strong>: Imagine you have an expert sculpture appraiser who identifies shapes by touch alone (LiDAR). Now you need them to identify shapes from photographs (RGB) without teaching them what photographs are. Instead, you'll build a translator that converts photographs into "touch descriptions" the expert already understands.</p>
<hr />
<h3 id="heading-the-three-part-solution">The Three-Part Solution</h3>
<p>The assessment breaks down into three interconnected challenges. Each builds on the previous, and skipping steps or misunderstanding the flow will leave you stuck.</p>
<h4 id="heading-mental-model-the-translation-pipeline">Mental Model: The Translation Pipeline</h4>
<pre><code>What you have:     RGB Image of a cube
What you need:     "cube" prediction
What you can use:  A LiDAR classifier that's already perfect

The bridge:        RGB → [Something Magic] → LiDAR-like representation → Classifier
</code></pre><p>The "something magic" is what you'll build: a contrastive pre-training system plus a projector network.</p>
<hr />
<h3 id="heading-part-1-teaching-two-modalities-to-speak-the-same-language">Part 1: Teaching Two Modalities to Speak the Same Language</h3>
<p><strong>The Goal</strong>: Create embedders that place RGB and LiDAR representations of the same scene close together in embedding space.</p>
<p><strong>The Analogy</strong>: Imagine training two translators. One reads English books and creates summaries. The other reads French books and creates summaries. Your goal is to train them so that when they read the same story (one in English, one in French), their summaries are nearly identical.</p>
<h4 id="heading-the-architecture-i-built">The Architecture I Built</h4>
<p>Two separate CNN encoders:</p>
<ul>
<li><strong>Image Embedder</strong>: Takes 4-channel RGB input, outputs a compact embedding</li>
<li><strong>LiDAR Embedder</strong>: Takes 1-channel depth input, outputs an embedding of the same size</li>
</ul>
<p>The key insight is that both embedders output vectors of identical dimensions. This is crucial because you'll be comparing them directly.</p>
<h4 id="heading-the-training-objective">The Training Objective</h4>
<p>For each batch:</p>
<ol>
<li>Pass RGB images through the image embedder</li>
<li>Pass corresponding LiDAR data through the LiDAR embedder</li>
<li>Normalize both sets of embeddings (this is critical and easy to forget)</li>
<li>Calculate similarity between every RGB embedding and every LiDAR embedding</li>
<li>The diagonal of this similarity matrix should be high (matching pairs)</li>
<li>Off-diagonal entries should be low (non-matching pairs)</li>
</ol>
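<p>The six steps above can be sketched in NumPy. The synthetic embeddings (LiDAR as noisy copies of the image vectors, so the pairs are aligned by construction), the batch size, and the 0.07 temperature are all my assumptions for illustration; only the 200-dim embedding size comes from the assessment:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 200  # batch size, embedding dimension

# Stand-ins for the two embedders' outputs
img_emb = rng.normal(size=(N, D))
lidar_emb = img_emb + 0.1 * rng.normal(size=(N, D))

# Step 3: normalize both sets (easy to forget, critical)
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
lidar_emb /= np.linalg.norm(lidar_emb, axis=1, keepdims=True)

# Step 4: N x N similarity matrix -- entry (i, j) compares image i with LiDAR j
sim = img_emb @ lidar_emb.T

# Steps 5-6 via cross-entropy: the "correct class" for image i is index i
logits = sim / 0.07  # temperature scaling, assumed
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(N), np.arange(N)].mean()
print("diagonal wins every row:", bool((sim.argmax(axis=1) == np.arange(N)).all()))
print(f"contrastive loss = {loss:.4f}")
```

<p>Because the pairs are aligned by construction, the diagonal dominates each row and the loss is near zero; at the start of real training the matrix looks like noise and the loss is large.</p>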
<h4 id="heading-where-i-got-stuck-and-how-i-fixed-it">Where I Got Stuck (And How I Fixed It)</h4>
<p><strong>Problem 1: The Similarity Matrix</strong></p>
<p>My first attempt produced garbage results. The issue? I was calculating similarity wrong.</p>
<p>When you have a batch of N image embeddings and N LiDAR embeddings, you need an N×N matrix where entry (i,j) represents the similarity between image i and LiDAR j.</p>
<p>The trick is creating all pairwise combinations efficiently:</p>
<ul>
<li>Take your image embeddings and repeat each one N times</li>
<li>Take your LiDAR embeddings and tile the entire batch N times</li>
<li>Now you have N² pairs that you can compare</li>
</ul>
<p>I initially confused <code>repeat</code> with <code>repeat_interleave</code>. These do very different things:</p>
<ul>
<li><code>repeat_interleave</code>: [A, B, C] with repeats=2 → [A, A, B, B, C, C]</li>
<li><code>repeat</code>: [A, B, C] with repeats=2 → [A, B, C, A, B, C]</li>
</ul>
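<p>NumPy has the same split, which makes the distinction easy to poke at interactively: <code>np.repeat</code> behaves like PyTorch's <code>repeat_interleave</code>, while <code>np.tile</code> behaves like <code>Tensor.repeat</code> on a 1-D tensor:</p>

```python
import numpy as np

a = np.array([1, 2, 3])
interleaved = np.repeat(a, 2)  # like torch.repeat_interleave
tiled = np.tile(a, 2)          # like torch Tensor.repeat

print(interleaved)  # [1 1 2 2 3 3]
print(tiled)        # [1 2 3 1 2 3]
```

<p>For the similarity matrix you need one of each: interleave the image embeddings and tile the LiDAR embeddings, giving every (image, LiDAR) pair exactly once.</p>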
<p>Getting this wrong meant my similarity matrix had the wrong structure, and the model couldn't learn meaningful alignments.</p>
<p><strong>Problem 2: Cosine Similarity Dimensions</strong></p>
<p>Another subtle bug: when using cosine similarity on batched pairwise comparisons, you need to specify the correct dimension. The embedding dimension (not the batch dimension) is where the dot product happens.</p>
<p><strong>Problem 3: Loss Function Setup</strong></p>
<p>The contrastive loss treats this as a classification problem. For each image, the "correct class" is the index of its matching LiDAR pair. With proper normalization and similarity calculation, cross-entropy loss does the heavy lifting.</p>
<h4 id="heading-the-aha-moment">The "Aha" Moment</h4>
<p>Once I fixed the similarity matrix construction, training loss dropped dramatically. Watching the validation loss decrease below the threshold was satisfying, but the real test was visualizing the embeddings.</p>
<p>After training, RGB images of cubes clustered near LiDAR scans of cubes. Spheres clustered with spheres. The two modalities had learned a shared language.</p>
<hr />
<h3 id="heading-part-2-building-the-bridge-between-worlds">Part 2: Building the Bridge Between Worlds</h3>
<p><strong>The Goal</strong>: Project RGB embeddings into the space where the LiDAR classifier operates.</p>
<p>Here's a subtlety that tripped me up: the CILP embedders produce 200-dimensional vectors, but the pre-trained LiDAR classifier expects 3200-dimensional inputs (from its internal <code>get_embs()</code> method).</p>
<p><strong>The Analogy</strong>: You've taught two translators to write similar summaries. But the expert appraiser doesn't read summaries. They read detailed technical reports in a specific format. Now you need a "report writer" that converts summaries into the format the expert expects.</p>
<h4 id="heading-the-architecture">The Architecture</h4>
<p>A simple multi-layer perceptron (MLP) that:</p>
<ul>
<li>Takes 200-dim input (CILP image embeddings)</li>
<li>Outputs 3200-dim vectors (matching the LiDAR classifier's embedding space)</li>
</ul>
<h4 id="heading-the-training-strategy">The Training Strategy</h4>
<p>This is where the two-stage training approach from the course pays off:</p>
<ol>
<li><strong>Freeze the CILP embedders</strong>: They've already learned good representations</li>
<li><strong>Generate embedding pairs</strong>: For each training sample, get both the RGB embedding (from CILP) and the LiDAR embedding (from the pre-trained classifier's internal method)</li>
<li><strong>Train the projector</strong>: Minimize the MSE between projected RGB embeddings and actual LiDAR embeddings</li>
</ol>
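<p>Shape-wise, the projector stage can be sketched as a tiny two-layer MLP in NumPy. The 512 hidden width and the random stand-in embeddings are my assumptions; the 200 and 3200 dimensions come from the assessment:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer MLP: 200 (CILP) -> 512 hidden -> 3200 (classifier space)
W1, b1 = rng.normal(0, 0.05, (200, 512)), np.zeros(512)
W2, b2 = rng.normal(0, 0.05, (512, 3200)), np.zeros(3200)

def project(x):
    hidden = np.maximum(x @ W1 + b1, 0)  # ReLU non-linearity
    return hidden @ W2 + b2

# Stand-ins for one batch of frozen-embedder outputs
cilp_img_emb = rng.normal(size=(16, 200))   # frozen CILP image embedder output
lidar_target = rng.normal(size=(16, 3200))  # frozen lidar_cnn.get_embs() output

projected = project(cilp_img_emb)
mse = np.mean((projected - lidar_target) ** 2)
print(projected.shape, f"MSE = {mse:.3f}")
```

<p>Training then just minimizes that MSE with respect to the projector weights, leaving both frozen networks untouched.</p>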
<h4 id="heading-where-i-got-stuck-and-how-i-fixed-it-1">Where I Got Stuck (And How I Fixed It)</h4>
<p><strong>Problem: Dimension Mismatch</strong></p>
<p>My first projector architecture was too shallow. A single linear layer from 200 to 3200 dimensions struggled to capture the complex mapping. Adding intermediate layers with non-linearities helped significantly.</p>
<p><strong>Problem: Not Using the Right LiDAR Embeddings</strong></p>
<p>Initially, I tried to project to the CILP LiDAR embeddings (200-dim). Wrong target! The goal is to project to where the <em>classifier</em> expects its input, which is the 3200-dim space from <code>lidar_cnn.get_embs()</code>.</p>
<p>This distinction is crucial: CILP learns alignment, but the projector bridges to the classifier's specific representation space.</p>
<hr />
<h3 id="heading-part-3-assembling-the-complete-pipeline">Part 3: Assembling the Complete Pipeline</h3>
<p><strong>The Goal</strong>: Chain everything together so RGB images flow through to correct predictions.</p>
<h4 id="heading-the-final-architecture">The Final Architecture</h4>
<pre><code>RGB Image
    │
    ▼
┌─────────────────────┐
│  CILP Image Embedder │  ← Frozen (from Part 1)
│     (4ch → 200-dim)  │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│      Projector       │  ← Trainable (from Part 2)
│   (200 → 3200-dim)   │
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│   LiDAR Classifier   │  ← Frozen (pre-trained)
│  (3200-dim → class)  │
└─────────────────────┘
    │
    ▼
"cube" or "sphere"
</code></pre><h4 id="heading-the-final-training-loop">The Final Training Loop</h4>
<p>With the complete pipeline assembled:</p>
<ol>
<li>Pass RGB images through the frozen CILP image embedder</li>
<li>Project the embeddings to 3200 dimensions</li>
<li>Pass through the frozen LiDAR classifier</li>
<li>Compare predictions to ground truth labels</li>
<li>Backpropagate through the projector only</li>
</ol>
<h4 id="heading-the-moment-of-truth">The Moment of Truth</h4>
<p>Running validation on RGB images the model had never seen during training:</p>
<p><strong>Accuracy: 97.2%</strong></p>
<p>The model correctly classified cubes and spheres from color images, despite never being trained on RGB labels directly. All it learned was:</p>
<ol>
<li>How RGB and LiDAR representations relate (contrastive pre-training)</li>
<li>How to translate from CILP space to classifier space (projection)</li>
</ol>
<p>The classifier did what it always does. The magic was in the translation layers.</p>
<hr />
<h3 id="heading-key-insights-from-the-assessment">Key Insights from the Assessment</h3>
<h4 id="heading-1-contrastive-learning-creates-bridges-not-solutions">1. Contrastive Learning Creates Bridges, Not Solutions</h4>
<p>CILP doesn't solve the classification problem. It creates aligned representations that make downstream tasks possible. The embeddings have no inherent "cube-ness" or "sphere-ness." They only know that certain RGB patterns correspond to certain LiDAR patterns.</p>
<h4 id="heading-2-projection-is-surprisingly-simple">2. Projection is Surprisingly Simple</h4>
<p>I expected the projector to be complex. In reality, a few linear layers with activations suffice. The heavy lifting was done by CILP. The projector just needs to reshape the information.</p>
<h4 id="heading-3-freezing-is-your-friend">3. Freezing is Your Friend</h4>
<p>Trying to train everything end-to-end from scratch would be a nightmare. The staged approach (freeze CILP, train projector, freeze everything) provides stability and interpretability.</p>
<h4 id="heading-4-dimension-awareness-is-critical">4. Dimension Awareness is Critical</h4>
<p>Throughout the assessment, I had to track:</p>
<ul>
<li>RGB input channels: 4</li>
<li>LiDAR input channels: 1</li>
<li>CILP embedding dimension: 200</li>
<li>Classifier embedding dimension: 3200</li>
<li>Output classes: 2</li>
</ul>
<p>Mixing these up causes silent failures where the model trains but learns nothing useful.</p>
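<p>A lightweight guard against those silent failures is to assert shapes before wiring stages together. Here is a dependency-free sketch using the dimensions above; the variable names and the nested-list representation are mine, not the assessment's:</p>

```python
# Hypothetical dimensions tracked during the assessment.
BATCH, CILP_DIM, CLS_DIM = 16, 200, 3200

emb = [[0.0] * CILP_DIM for _ in range(BATCH)]   # CILP embeddings
W = [[0.0] * CLS_DIM for _ in range(CILP_DIM)]   # projector weights (single-layer sketch)

def shape(m):
    """Return (rows, cols) of a nested-list matrix."""
    return (len(m), len(m[0]))

# Fail fast on a dimension mismatch instead of training a model
# that runs but learns nothing useful.
assert shape(emb)[1] == shape(W)[0], (shape(emb), shape(W))
projected_shape = (shape(emb)[0], shape(W)[1])
print(projected_shape)  # (16, 3200)
```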
<h4 id="heading-5-the-similarity-matrix-is-the-heart-of-contrastive-learning">5. The Similarity Matrix is the Heart of Contrastive Learning</h4>
<p>If I could give one piece of advice: spend extra time understanding how the similarity matrix is constructed. Draw it out on paper. Trace through the tensor operations. This is where most bugs hide.</p>
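<p>To make that concrete, here is a minimal sketch of how a contrastive similarity matrix is built: normalize each embedding, then take pairwise dot products. The batch size, dimensionality, and random embeddings are illustrative, not taken from the assessment:</p>

```python
import math
import random

random.seed(0)

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical batch: 4 paired samples, 8-dimensional embeddings.
rgb = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(4)]
lidar = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(4)]

# sim[i][j] = cosine similarity between RGB sample i and LiDAR sample j.
# The contrastive loss pushes the diagonal (true pairs) toward 1
# and the off-diagonal (mismatched pairs) down.
sim = [[sum(a * b for a, b in zip(r, l)) for l in lidar] for r in rgb]

print(len(sim), len(sim[0]))  # 4 4
```

<p>Tracing this by hand — which entry compares which pair, and why the diagonal is special — is exactly the paper exercise I recommend above.</p>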
<hr />
<h3 id="heading-what-this-assessment-taught-me">What This Assessment Taught Me</h3>
<p>Beyond the technical implementation, this assessment crystallized why multimodal AI matters:</p>
<p><strong>You can transfer knowledge across modalities without paired labels.</strong></p>
<p>Think about the implications:</p>
<ul>
<li>Train a model on abundant labeled data in one modality</li>
<li>Transfer to a modality where labels are scarce or expensive</li>
<li>The bridge is learned from unlabeled paired data</li>
</ul>
<p>This is how modern AI systems handle:</p>
<ul>
<li>Medical imaging (transfer from annotated scans to new imaging techniques)</li>
<li>Robotics (transfer from simulation to real sensors)</li>
<li>Accessibility (convert between visual and audio representations)</li>
</ul>
<hr />
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>The CILP assessment is cleverly designed. It doesn't just test whether you can copy code from notebooks. It tests whether you understand:</p>
<ul>
<li>Why contrastive learning works</li>
<li>How embedding spaces relate</li>
<li>When to freeze and when to train</li>
<li>How information flows through multimodal pipelines</li>
</ul>
<p>If you're attempting this assessment, my advice:</p>
<ol>
<li>Draw the architecture before writing code</li>
<li>Print tensor shapes obsessively</li>
<li>Verify each component independently before combining</li>
<li>Trust the staged training approach</li>
</ol>
<p>The satisfaction of seeing 95%+ accuracy on a modality your classifier was never trained on is worth the debugging struggle.</p>
<hr />
<p><em>This post documents my experience completing the assessment for NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: Part 4]]></title><description><![CDATA[Video Understanding & Graph-RAG: AI That Watches, Remembers, and Reasons
This is Part 4 (Final) of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.

The Final Frontier: Understanding Video
We...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-4</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-4</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[graphrag]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sat, 10 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/2uwFEAGUm6E/upload/ba1dc2e22c6453fc22d4b1918c55a671.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-video-understanding-amp-graph-rag-ai-that-watches-remembers-and-reasons">Video Understanding &amp; Graph-RAG: AI That Watches, Remembers, and Reasons</h2>
<p><em>This is Part 4 (Final) of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.</em></p>
<hr />
<h3 id="heading-the-final-frontier-understanding-video">The Final Frontier: Understanding Video</h3>
<p>We've covered images, text, and documents. Now we tackle the most challenging modality: <strong>video</strong>.</p>
<p>Video isn't just a collection of images. It's a temporal sequence where:</p>
<ul>
<li>Actions unfold over time</li>
<li>Objects enter and exit scenes</li>
<li>Events have causes and effects</li>
<li>Context from the past informs the present</li>
</ul>
<p><strong>The Analogy</strong>: Imagine describing a movie to someone who hasn't seen it. You don't describe each frame. You summarize scenes, explain character motivations, and connect plot points. This requires understanding time, causality, and narrative structure.</p>
<p>Teaching AI to do this is the challenge of Video Search and Summarization (VSS).</p>
<hr />
<h3 id="heading-nvidias-video-search-and-summarization-pipeline">NVIDIA's Video Search and Summarization Pipeline</h3>
<p>NVIDIA provides a production-ready blueprint for video understanding. Let's break down how it works.</p>
<h4 id="heading-the-architecture-three-stage-processing">The Architecture: Three-Stage Processing</h4>
<pre><code>┌─────────────────────────────────────────────────────────────────────┐
│                        VIDEO INPUT                                  │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  STAGE <span class="hljs-number">1</span>: Dense Captioning                                         │
│  Video chunks ──&gt; VLM ──&gt; Detailed captions <span class="hljs-keyword">with</span> timestamps        │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  STAGE <span class="hljs-number">2</span>: Caption Aggregation                                      │
│  Overlapping captions ──&gt; LLM ──&gt; Condensed, coherent descriptions │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  STAGE <span class="hljs-number">3</span>: Summary Generation                                       │
│  All descriptions ──&gt; LLM ──&gt; Final coherent summary               │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
              Vector Database (Milvus) <span class="hljs-keyword">for</span> RAG queries
</code></pre><h4 id="heading-stage-1-dense-captioning-with-vision-language-models">Stage 1: Dense Captioning with Vision Language Models</h4>
<p><strong>What Happens</strong>: The video is split into chunks (e.g., 30-second segments with 5-second overlap). A Vision Language Model (VLM) watches each chunk and generates detailed captions.</p>
<p><strong>The Analogy</strong>: Like a court stenographer who watches a trial and creates detailed notes of everything that happens, with timestamps.</p>
<p><strong>Key Parameters</strong>:</p>
<ul>
<li><code>chunk_duration</code>: How long each segment is (trade-off between detail and processing time)</li>
<li><code>chunk_overlap_duration</code>: Overlap between segments to catch events at boundaries</li>
<li><code>prompt</code>: Instructions to the VLM on what to describe and how</li>
</ul>
<p><strong>Example VLM Output</strong>:</p>
<pre><code>[<span class="hljs-number">00</span>:<span class="hljs-number">00</span><span class="hljs-number">-00</span>:<span class="hljs-number">30</span>] A silver sedan approaches the intersection <span class="hljs-keyword">from</span> the north.
              The traffic light is green. Two pedestrians wait on the sidewalk.

[<span class="hljs-number">00</span>:<span class="hljs-number">25</span><span class="hljs-number">-00</span>:<span class="hljs-number">55</span>] The sedan enters the intersection. A red SUV approaches <span class="hljs-keyword">from</span>
              the east, running a yellow light. The pedestrians begin crossing.
</code></pre><p><strong>Why Overlap Matters</strong>: If a car crash happens exactly at second 30, without overlap, neither chunk fully captures it. The 5-second overlap ensures boundary events are seen by at least one chunk.</p>
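<p>The chunking logic itself is easy to sketch. Assuming the 30-second chunks and 5-second overlap from the example (the function name and signature are mine, not the blueprint's API):</p>

```python
def chunk_spans(duration, chunk_duration=30, overlap=5):
    """Return (start, end) second spans covering the video, overlapping
    so boundary events land fully inside at least one chunk."""
    spans, start = [], 0
    step = chunk_duration - overlap
    while start < duration:
        spans.append((start, min(start + chunk_duration, duration)))
        start += step
    return spans

print(chunk_spans(70))  # [(0, 30), (25, 55), (50, 70)]
```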
<h4 id="heading-stage-2-caption-aggregation">Stage 2: Caption Aggregation</h4>
<p><strong>The Problem</strong>: Overlapping chunks produce redundant descriptions. The same event might be described twice.</p>
<p><strong>The Solution</strong>: An LLM reads overlapping captions and condenses them, removing redundancy while preserving all unique information.</p>
<p><strong>Before Aggregation</strong>:</p>
<pre><code>Chunk <span class="hljs-number">1</span>: <span class="hljs-string">"A worker places a box on the shelf. The box appears heavy."</span>
Chunk <span class="hljs-number">2</span>: <span class="hljs-string">"A heavy box is placed on the shelf. It appears unstable."</span>
Chunk <span class="hljs-number">3</span>: <span class="hljs-string">"The unstable box falls from the shelf onto the floor."</span>
</code></pre><p><strong>After Aggregation</strong>:</p>
<pre><code><span class="hljs-string">"A worker places a heavy box on the shelf. The box appears unstable
and subsequently falls onto the floor."</span>
</code></pre><h4 id="heading-stage-3-summary-generation">Stage 3: Summary Generation</h4>
<p><strong>What Happens</strong>: All aggregated descriptions are combined into a final, coherent summary that reads like a narrative rather than a collection of observations.</p>
<p><strong>The Output</strong>: A comprehensive summary that can answer questions like:</p>
<ul>
<li>"What happened in this video?"</li>
<li>"Were there any safety violations?"</li>
<li>"Describe the sequence of events."</li>
</ul>
<hr />
<h3 id="heading-prompt-engineering-for-video-the-secret-sauce">Prompt Engineering for Video: The Secret Sauce</h3>
<p>The quality of video understanding depends heavily on prompt engineering. NVIDIA's training emphasizes three components of effective prompts:</p>
<h4 id="heading-1-persona">1. Persona</h4>
<p>Tell the VLM who it is and what expertise it has.</p>
<pre><code>You are a traffic safety analyst reviewing intersection footage.
You have expertise <span class="hljs-keyword">in</span> identifying traffic violations, near-misses,
and pedestrian safety concerns.
</code></pre><h4 id="heading-2-specific-details-to-capture">2. Specific Details to Capture</h4>
<p>List exactly what information you want extracted.</p>
<pre><code>For each scene, <span class="hljs-attr">note</span>:
- Vehicle types, colors, and directions <span class="hljs-keyword">of</span> travel
- Traffic signal states (red, yellow, green)
- Pedestrian positions and movements
- Any violations or concerning behaviors
- Timestamp <span class="hljs-keyword">of</span> each observation
</code></pre><h4 id="heading-3-output-format">3. Output Format</h4>
<p>Specify how results should be structured.</p>
<pre><code>Format your observations <span class="hljs-keyword">as</span>:
[TIMESTAMP] OBSERVATION
Include severity levels <span class="hljs-keyword">for</span> any safety concerns: LOW, MEDIUM, HIGH
</code></pre><p><strong>Why This Matters</strong>: Generic prompts like "Describe this video" produce generic results. Specific prompts produce actionable intelligence.</p>
<hr />
<h3 id="heading-from-summaries-to-qampa-vector-rag">From Summaries to Q&amp;A: Vector-RAG</h3>
<p>Once videos are processed, you can query them using Retrieval Augmented Generation.</p>
<p><strong>The Process</strong>:</p>
<ol>
<li>User asks: "Were there any safety violations in today's warehouse footage?"</li>
<li>Question is embedded as a vector</li>
<li>Similar caption segments are retrieved from Milvus</li>
<li>Retrieved context is fed to LLM with the question</li>
<li>LLM generates an answer grounded in the video content</li>
</ol>
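<p>Steps 2–4 can be sketched with a toy retriever. A real system embeds text with a model and searches Milvus; here, word overlap stands in for vector similarity, and the caption store is invented:</p>

```python
# Hypothetical caption store (stand-in for a Milvus collection).
captions = [
    "[00:02:15] A yellow forklift enters the warehouse from the loading dock.",
    "[00:02:45] The forklift operator picks up a pallet of boxes.",
    "[00:05:10] A worker stacks crates near the exit.",
]

def score(query, doc):
    # Word overlap as a crude proxy for embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

query = "forklift enters the warehouse"
best = max(captions, key=lambda c: score(query, c))
print(best)  # the 00:02:15 caption scores highest
```

<p>The retrieved caption is then passed to the LLM as grounding context alongside the question.</p>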
<p><strong>Example Query Flow</strong>:</p>
<pre><code>Question: <span class="hljs-string">"What time did the forklift enter the frame?"</span>

Retrieved Context:
[<span class="hljs-number">00</span>:<span class="hljs-number">02</span>:<span class="hljs-number">15</span>] A yellow forklift enters the warehouse <span class="hljs-keyword">from</span> the loading dock.
[<span class="hljs-number">00</span>:<span class="hljs-number">02</span>:<span class="hljs-number">45</span>] The forklift operator picks up a pallet <span class="hljs-keyword">of</span> boxes.

Answer: <span class="hljs-string">"The forklift entered the frame at approximately 2 minutes
and 15 seconds into the footage."</span>
</code></pre><hr />
<h3 id="heading-graph-rag-when-vector-search-isnt-enough">Graph-RAG: When Vector Search Isn't Enough</h3>
<p>Vector-RAG works great for simple queries. But what about complex reasoning?</p>
<p><strong>The Limitation of Vector Search</strong>:
Query: "What caused the accident?"</p>
<p>Vector search finds segments mentioning "accident" but may miss:</p>
<ul>
<li>The speeding vehicle 30 seconds before</li>
<li>The obscured stop sign 2 minutes earlier</li>
<li>The wet road conditions mentioned at the start</li>
</ul>
<p>These are causally related but semantically distant. Vector similarity misses the connection.</p>
<p><strong>The Analogy</strong>: Imagine a detective investigating a crime. They don't just search for clues similar to the crime scene. They build a web of relationships: who knew whom, who was where when, what events led to what. This web of connections is a <strong>knowledge graph</strong>.</p>
<hr />
<h3 id="heading-building-a-knowledge-graph-from-video">Building a Knowledge Graph from Video</h3>
<p>Graph-RAG extracts entities and relationships to build a queryable knowledge structure.</p>
<h4 id="heading-the-three-gs-of-graph-rag">The Three G's of Graph-RAG</h4>
<p><strong>1. G-Extraction (Building the Graph)</strong></p>
<p>An LLM analyzes video captions and extracts:</p>
<ul>
<li><strong>Entities</strong>: Objects, people, locations, events</li>
<li><strong>Relationships</strong>: How entities connect to each other</li>
<li><strong>Properties</strong>: Attributes of entities</li>
</ul>
<p><strong>Example Extraction</strong>:</p>
<pre><code>Caption: <span class="hljs-string">"A worker places a heavy box on the top shelf.
         The box falls due to improper placement."</span>

<span class="hljs-attr">Entities</span>:
- Worker (type: person)
- Box (type: object, <span class="hljs-attr">property</span>: heavy)
- Top Shelf (type: location)
- Fall Event (type: event)

<span class="hljs-attr">Relationships</span>:
- Worker PLACES Box
- Box ON Top Shelf
- Box FALLS_DUE_TO improper_placement
- improper_placement CAUSES Fall Event
</code></pre><p>This creates a graph structure:</p>
<pre><code>       [Worker]
          │
       PLACES
          │
          ▼
        [Box] ─── heavy
          │
          ON
          │
          ▼
     [Top Shelf]
          │
     FALLS_DUE_TO
          │
          ▼
  [improper_placement]
          │
       CAUSES
          │
          ▼
    [Fall Event]
</code></pre><p><strong>2. G-Retrieval (Querying the Graph)</strong></p>
<p>Instead of vector similarity, queries are converted to graph queries (Cypher for Neo4j):</p>
<pre><code class="lang-cypher">// Query: "What caused the box to fall?"
MATCH (b:Object {name: 'Box'})-[:FALLS_DUE_TO]-&gt;(cause)
RETURN cause

// Result: improper_placement
</code></pre>
<pre><code class="lang-cypher">// Query: "Show all safety incidents and their causes"
MATCH (event:Event)-[:CAUSED_BY]-&gt;(cause)
WHERE event.type = 'safety_incident'
RETURN event, cause
</code></pre>
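<p>If you don't have a Neo4j instance handy, the same retrieval idea can be sketched over in-memory (subject, relation, object) triples. The triples mirror the extraction example above; the helper function is mine:</p>

```python
# Triples as extracted from the caption: (subject, relation, object).
triples = [
    ("Worker", "PLACES", "Box"),
    ("Box", "ON", "Top Shelf"),
    ("Box", "FALLS_DUE_TO", "improper_placement"),
    ("improper_placement", "CAUSES", "Fall Event"),
]

def objects_of(subject, relation):
    """Follow one edge type out of a node, like a one-hop MATCH."""
    return [o for s, r, o in triples if s == subject and r == relation]

# "What caused the box to fall?"
print(objects_of("Box", "FALLS_DUE_TO"))  # ['improper_placement']
```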
<p><strong>3. G-Generation (Answering with Context)</strong></p>
<p>Retrieved graph data is fed to an LLM which synthesizes a natural language answer:</p>
<pre><code>Graph Data Retrieved:
- Box FALLS_DUE_TO improper_placement
- Worker PLACES Box
- improper_placement CAUSED_BY rushing

LLM Answer: <span class="hljs-string">"The box fell because of improper placement. The worker
placed the box hastily on the top shelf without ensuring stability.
This appears to be caused by rushing to meet a deadline."</span>
</code></pre><hr />
<h3 id="heading-vector-rag-vs-graph-rag-when-to-use-which">Vector-RAG vs. Graph-RAG: When to Use Which</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Aspect</td><td>Vector-RAG</td><td>Graph-RAG</td></tr>
</thead>
<tbody>
<tr>
<td>Best For</td><td>Simple fact retrieval</td><td>Complex reasoning</td></tr>
<tr>
<td>Query Type</td><td>"What happened at 2pm?"</td><td>"What caused the failure?"</td></tr>
<tr>
<td>Speed</td><td>Faster</td><td>Slower (graph traversal)</td></tr>
<tr>
<td>Setup</td><td>Simpler</td><td>Requires graph construction</td></tr>
<tr>
<td>Reasoning</td><td>Shallow (similarity)</td><td>Deep (relationships)</td></tr>
<tr>
<td>Storage</td><td>Vector database</td><td>Graph database (Neo4j)</td></tr>
</tbody>
</table>
</div><p><strong>Use Vector-RAG When</strong>:</p>
<ul>
<li>Questions are about specific facts or timestamps</li>
<li>Real-time response is critical (live streaming)</li>
<li>Relationships between events are not important</li>
</ul>
<p><strong>Use Graph-RAG When</strong>:</p>
<ul>
<li>Questions involve causality or chains of events</li>
<li>You need to understand how things connect</li>
<li>Complex multi-hop reasoning is required</li>
</ul>
<hr />
<h3 id="heading-practical-applications">Practical Applications</h3>
<h4 id="heading-traffic-monitoring">Traffic Monitoring</h4>
<ul>
<li>Detect violations and near-misses</li>
<li>Analyze accident causes</li>
<li>Track traffic patterns over time</li>
</ul>
<h4 id="heading-warehouse-safety">Warehouse Safety</h4>
<ul>
<li>Monitor worker compliance</li>
<li>Track inventory movement</li>
<li>Identify safety hazards</li>
</ul>
<h4 id="heading-bridge-inspection">Bridge Inspection</h4>
<ul>
<li>Detect structural anomalies</li>
<li>Track changes over time</li>
<li>Prioritize maintenance needs</li>
</ul>
<h4 id="heading-security-surveillance">Security Surveillance</h4>
<ul>
<li>Track persons of interest</li>
<li>Detect unusual behavior</li>
<li>Generate incident reports</li>
</ul>
<hr />
<h3 id="heading-the-complete-multimodal-picture">The Complete Multimodal Picture</h3>
<p>Looking back at this 4-part series, we've covered the full spectrum:</p>
<p><strong>Part 1</strong>: How to combine different data types (fusion strategies)
<strong>Part 2</strong>: How to align different modalities (contrastive learning)
<strong>Part 3</strong>: How to extract intelligence from documents (OCR + RAG)
<strong>Part 4</strong>: How to understand temporal content (Video + Graph-RAG)</p>
<p>Together, these techniques enable AI systems that can:</p>
<ul>
<li>See images and video</li>
<li>Read documents and text</li>
<li>Understand depth and 3D structure</li>
<li>Connect concepts across modalities</li>
<li>Reason about relationships and causality</li>
</ul>
<hr />
<h3 id="heading-key-takeaways-from-the-complete-series">Key Takeaways from the Complete Series</h3>
<ol>
<li><p><strong>Multimodal AI is about combining strengths</strong>: Each modality has unique capabilities. Fusion multiplies them.</p>
</li>
<li><p><strong>Embeddings are the universal language</strong>: Converting everything to vectors enables cross-modal comparison.</p>
</li>
<li><p><strong>Contrastive learning aligns modalities</strong>: Push matching pairs together, pull non-matches apart.</p>
</li>
<li><p><strong>RAG grounds AI in your data</strong>: Retrieval prevents hallucination and enables factual answers.</p>
</li>
<li><p><strong>Graphs capture relationships</strong>: When causality matters, knowledge graphs outperform vector search.</p>
</li>
<li><p><strong>Prompt engineering is crucial</strong>: Specific, well-structured prompts dramatically improve results.</p>
</li>
<li><p><strong>Production systems need pipelines</strong>: Real applications require chunking, batching, and careful orchestration.</p>
</li>
</ol>
<hr />
<h3 id="heading-where-to-go-from-here">Where to Go From Here</h3>
<p>This certification provides a foundation. To deepen your expertise:</p>
<ul>
<li><strong>Experiment</strong>: Build your own multimodal pipelines with the techniques learned</li>
<li><strong>Explore NVIDIA NIMs</strong>: Pre-built microservices for production deployment</li>
<li><strong>Study Attention Mechanisms</strong>: Transformers power most modern multimodal models</li>
<li><strong>Follow Research</strong>: Multimodal AI is evolving rapidly with new architectures monthly</li>
</ul>
<p>The future of AI is multimodal. The ability to process and reason across data types will define the next generation of intelligent systems.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience with these techniques, consider enrolling in their official courses.</em></p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Understanding Random Variables: A Practical Guide for Engineers]]></title><description><![CDATA[Part 1: Discrete Random Variables
Discrete random variables represent countable outcomes—like the roll of a die, the number of users on a site, or binary classification labels.
Expected Value
The expected value (or mean) tells us the "center of mass"...]]></description><link>https://thedatasense.com/understanding-random-variables-a-practical-guide-for-engineers</link><guid isPermaLink="true">https://thedatasense.com/understanding-random-variables-a-practical-guide-for-engineers</guid><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Thu, 08 Jan 2026 18:59:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/XIIsv6AshJY/upload/ce5c09a0c18383f7eb5bf29a99811bee.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-part-1-discrete-random-variables">Part 1: Discrete Random Variables</h2>
<p>Discrete random variables represent countable outcomes—like the roll of a die, the number of users on a site, or binary classification labels.</p>
<h3 id="heading-expected-value">Expected Value</h3>
<p>The expected value (or mean) tells us the "center of mass" of a distribution.</p>
<p><strong>Analogy:</strong> Imagine playing a carnival game thousands of times. Sometimes you win $10, sometimes you lose $5. The expected value is your average profit per game in the long run. It is the "steady state" of your luck.</p>
<p>For a discrete random variable \( X \) with probability mass function (PMF) \( p_X(x) \) :</p>
<p>$$E[X] = \sum_x x \cdot p_X(x)$$</p><p>You multiply each outcome by its probability and sum them up. Heavily weighted outcomes pull the average closer to them.</p>
<h3 id="heading-the-expected-value-rule-lotus">The Expected Value Rule (LOTUS)</h3>
<p>When you apply a function \( g \) to a random variable \( X \) , creating \( Y = g(X) \) , you don't need to find the distribution of \( Y \) first. You can calculate the expected value directly using \( X \) .</p>
<p>$$E[Y] = E[g(X)] = \sum_x g(x) \cdot p_X(x)$$</p><p><strong>Analogy:</strong> If \( X \) is the number of hours you work and your pay is \( g(X) = 15X + 50 \) , you can calculate your expected pay directly from the distribution of your hours.</p>
<h3 id="heading-pmf-of-a-transformed-variable">PMF of a Transformed Variable</h3>
<p>If \( Y = g(X) \) , the probability of \( Y \) taking a value \( y \) is the sum of probabilities of all \( x \) values that map to \( y \) .</p>
<p>$$p_Y(y) = \sum_{x: g(x) = y} p_X(x)$$</p><p><strong>Analogy:</strong> Think of \( g \) as a sorting machine. If \( Y=X^2 \) , both \( x=-2 \) and \( x=2 \) fall into the " \( y=4 \) " bucket. You combine their probabilities to get the total probability of observing 4.</p>
<h3 id="heading-important-warning-jensens-inequality">⚠️ Important Warning: Jensen's Inequality</h3>
<p>In general, expectation does not commute with non-linear functions.</p>
<p>$$g(E[X]) \neq E[g(X)]$$</p><p><strong>Analogy:</strong> The average of squares is not the square of the average. If your test scores are 0 and 100, your average is 50 ( \( 50^2 = 2500 \) ). But the average of your squared scores ( \( 0 \) and \( 10,000 \) ) is 5,000.</p>
<h3 id="heading-variance-and-standard-deviation">Variance and Standard Deviation</h3>
<p>Variance measures the "spread" or "risk" in a distribution.</p>
<p><strong>Analogy:</strong> Two archers both hit the bullseye on average. Archer A clusters shots tightly (low variance). Archer B hits the outer rings on opposite sides (high variance).</p>
<p><strong>Variance:</strong> $$ \text{Var}(X) = E[(X - \mu)^2] = \sum_x (x - \mu)^2 \cdot p_X(x) $$</p>
<p><strong>Standard Deviation:</strong> $$ \sigma_X = \sqrt{\text{Var}(X)} $$</p>
<h3 id="heading-variance-properties">Variance Properties</h3>
<p>These rules are essential for manipulating uncertainty:</p>
<p>$$\text{Var}(aX) = a^2 \cdot \text{Var}(X)$$</p><p>$$\text{Var}(X + b) = \text{Var}(X)$$</p><p>$$\text{Var}(aX + b) = a^2 \cdot \text{Var}(X)$$</p><p><strong>Key Insight:</strong> Adding a constant ( \( +b \) ) shifts the distribution but doesn't change the spread. Multiplying by a constant ( \( a \) ) scales the spread, and since variance is squared units, the factor becomes \( a^2 \) .</p>
<h3 id="heading-conditioning-on-an-event">Conditioning on an Event</h3>
<p>When you learn that event \( A \) has occurred, the probability space shrinks. You eliminate impossible outcomes and "renormalize" the remaining ones so they sum to 1.</p>
<p>$$p_{X|A}(x) = \begin{cases} \frac{p_X(x)}{P(A)} &amp; \text{if } x \in A \\ 0 &amp; \text{otherwise} \end{cases}$$</p><h3 id="heading-total-expectation-theorem">Total Expectation Theorem</h3>
<p>This is a "divide and conquer" strategy. You can find the overall average by weighting the averages of subpopulations.</p>
<p>$$E[X] = \sum_i P(A_i) \cdot E[X|A_i]$$</p><h3 id="heading-multiple-random-variables">Multiple Random Variables</h3>
<p>When dealing with multiple variables (like Age and Income), we use the <strong>Joint PMF</strong>:</p>
<p>$$p_{X,Y}(x,y) = P(X=x, Y=y)$$</p><p><strong>Marginalization:</strong> To get back the distribution of just \( X \) , you sum over all possible values of \( Y \) :</p>
<p>$$p_X(x) = \sum_y p_{X,Y}(x, y)$$</p><p><strong>Linearity of Expectation:</strong> This is one of the most powerful properties in probability. It holds <strong>even if variables are dependent</strong>.</p>
<p>$$E[X + Y] = E[X] + E[Y]$$</p><hr />
<h2 id="heading-part-2-continuous-random-variables">Part 2: Continuous Random Variables</h2>
<p>For continuous variables (time, distance, temperature), the probability of being exactly equal to a specific number is 0. Instead, we measure probability over intervals using a <strong>Probability Density Function (PDF)</strong>, \( f_X(x) \) .</p>
<h3 id="heading-probability-as-area">Probability as Area</h3>
<p>$$P(a \le X \le b) = \int_a^b f_X(x) \, dx$$</p><h3 id="heading-expectation-continuous">Expectation (Continuous)</h3>
<p>$$E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx$$</p><h3 id="heading-common-distributions">Common Distributions</h3>
<p><strong>1. Uniform Distribution ( \( X \sim \text{Uni}(a,b) \) )</strong> Every interval of the same length is equally likely. $$ f_X(x) = \frac{1}{b-a} \quad \text{for } a &lt; x &lt; b $$</p>
<p><strong>2. Exponential Distribution ( \( X \sim \text{Exp}(\lambda) \) )</strong> Models waiting times (e.g., time until the next server request). It has the unique <strong>Memoryless Property</strong>: $$ P(X &gt; s + t | X &gt; t) = P(X &gt; s) $$ If you've waited 10 minutes, the probability of waiting another 5 is the same as if you just started waiting.</p>
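<p>The memoryless property is easy to verify by simulation. This sketch draws exponential waiting times and compares the two tail probabilities; the rate and thresholds are arbitrary choices:</p>

```python
import random

random.seed(42)
lam = 0.5
samples = [random.expovariate(lam) for _ in range(200_000)]

def tail(s):
    """Empirical P(X > s)."""
    return sum(x > s for x in samples) / len(samples)

s, t = 2.0, 3.0
survivors = [x for x in samples if x > t]
cond = sum(x > s + t for x in survivors) / len(survivors)  # P(X > s+t | X > t)

print(round(tail(s), 2), round(cond, 2))  # both ≈ exp(-lam*s) ≈ 0.37
```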
<p><strong>3. Normal (Gaussian) Distribution ( \( X \sim \mathcal{N}(\mu, \sigma^2) \) )</strong> The bell curve. $$ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$ <strong>Linear Transformation:</strong> If \( X \) is Normal, then \( aX+b \) is also Normal.</p>
<h3 id="heading-cumulative-distribution-function-cdf">Cumulative Distribution Function (CDF)</h3>
<p>The CDF is the integral of the PDF. It represents the probability that \( X \) is less than or equal to \( x \) .</p>
<p>$$F_X(x) = P(X \le x) = \int_{-\infty}^x f_X(t) \, dt$$</p><p><strong>Pro Tip:</strong> When transforming continuous variables ( \( Y=g(X) \) ), it is often safer to work with the CDF first and then differentiate to find the new PDF.</p>
<p>$$f_Y(y) = \frac{d}{dy} F_Y(y)$$</p><hr />
<h2 id="heading-part-3-bayes-rule">Part 3: Bayes' Rule</h2>
<p>Bayes' rule allows us to "flip" conditional probabilities. It is the foundation of inference.</p>
<p>$$p_{X|Y}(x|y) = \frac{p_X(x) \cdot p_{Y|X}(y|x)}{p_Y(y)}$$</p><p><strong>Analogy:</strong></p>
<ul>
<li><p><strong>Prior</strong> \( p_X(x) \) : What you believed before seeing data.</p>
</li>
<li><p><strong>Likelihood</strong> \( p_{Y|X}(y|x) \) : How likely the data is given your belief.</p>
</li>
<li><p><strong>Posterior</strong> \( p_{X|Y}(x|y) \) : Your updated belief after seeing the data.</p>
</li>
</ul>
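<p>A classic numeric instance (all numbers invented for illustration): a disease with 1% prevalence and a test with 95% sensitivity and a 5% false-positive rate:</p>

```python
prior = 0.01                 # P(disease): belief before the test
p_pos_given_disease = 0.95   # likelihood of a positive test if diseased
p_pos_given_healthy = 0.05   # false-positive rate

# Evidence: total probability of testing positive.
evidence = prior * p_pos_given_disease + (1 - prior) * p_pos_given_healthy

# Posterior: updated belief after seeing a positive test.
posterior = prior * p_pos_given_disease / evidence
print(round(posterior, 3))  # 0.161 — a positive test still means only ~16% odds
```

<p>The small prior dominates: even a fairly accurate test can't overcome a rare condition's base rate, which is exactly the prior-likelihood interplay above.</p>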
<hr />
<h2 id="heading-quick-reference-table">Quick Reference Table</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Concept</td><td>Discrete</td><td>Continuous</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Distribution</strong></td><td>PMF: \( p_X(x) \)</td><td>PDF: \( f_X(x) \)</td></tr>
<tr>
<td><strong>Expectation</strong></td><td>\( \sum x \cdot p_X(x) \)</td><td>\( \int x \cdot f_X(x) \, dx \)</td></tr>
<tr>
<td><strong>Variance</strong></td><td>\( \sum (x-\mu)^2 \cdot p_X(x) \)</td><td>\( \int (x-\mu)^2 \cdot f_X(x) \, dx \)</td></tr>
<tr>
<td><strong>Independence</strong></td><td>\( p_{X,Y} = p_X \cdot p_Y \)</td><td>\( f_{X,Y} = f_X \cdot f_Y \)</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-study-tips">Study Tips</h2>
<ol>
<li><p><strong>Linearity of Expectation</strong> is your best friend. It works regardless of independence.</p>
</li>
<li><p><strong>Variance of Sums</strong> ( \( \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) \) ) only works if \( X \) and \( Y \) are <strong>independent</strong>. If they are dependent, you must add the Covariance term.</p>
</li>
<li><p>For continuous transformations, <strong>always go through the CDF</strong> if you are unsure. It prevents mistakes with boundaries and derivatives.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: Part 3]]></title><description><![CDATA[Document Intelligence: Teaching AI to Read, Understand, and Remember PDFs
This is Part 3 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.

The Challenge: Documents Are Messy
Think about a ...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-3</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-3</guid><category><![CDATA[#multimodalai]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Thu, 08 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/RLw-UC03Gwc/upload/002a43623f7287b9f4f925baa11e37a4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-document-intelligence-teaching-ai-to-read-understand-and-remember-pdfs">Document Intelligence: Teaching AI to Read, Understand, and Remember PDFs</h2>
<p><em>This is Part 3 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.</em></p>
<hr />
<h3 id="heading-the-challenge-documents-are-messy">The Challenge: Documents Are Messy</h3>
<p>Think about a typical business document. It might have:</p>
<ul>
<li>Paragraphs of text in multiple columns</li>
<li>Tables with financial data</li>
<li>Charts and graphs</li>
<li>Images and diagrams</li>
<li>Headers, footers, and page numbers</li>
<li>Different fonts and formatting</li>
</ul>
<p>For humans, navigating this is intuitive. But for AI, a PDF is just a jumble of pixels or raw text blobs with no inherent structure. Teaching AI to extract meaningful information from documents is one of the most practical applications of multimodal AI.</p>
<p>This is where <strong>Optical Character Recognition (OCR)</strong> meets <strong>Retrieval Augmented Generation (RAG)</strong> to create intelligent document processing systems.</p>
<hr />
<h3 id="heading-ocr-from-pixels-to-text">OCR: From Pixels to Text</h3>
<p><strong>The Analogy</strong>: Imagine you're teaching a child to read. First, they learn to recognize individual letters. Then words. Then sentences. Eventually, they understand that text flows in certain directions and formats.</p>
<p><strong>Optical Character Recognition</strong> follows a similar journey:</p>
<ol>
<li><strong>Image Processing</strong>: Clean up the document image (remove noise, fix rotation)</li>
<li><strong>Layout Detection</strong>: Find regions of text, tables, images</li>
<li><strong>Character Recognition</strong>: Convert pixel patterns to characters</li>
<li><strong>Post-Processing</strong>: Apply language models to fix errors</li>
</ol>
<p>Modern OCR goes far beyond simple text extraction. It understands document structure.</p>
<hr />
<h3 id="heading-the-document-processing-pipeline">The Document Processing Pipeline</h3>
<p>NVIDIA's training demonstrates a comprehensive pipeline for extracting multimodal data from PDFs. Let's break it down.</p>
<h4 id="heading-step-1-document-partitioning">Step 1: Document Partitioning</h4>
<p>Before extracting content, you need to identify what's in the document.</p>
<p><strong>The Analogy</strong>: Before renovating a house, you walk through each room and catalog what's there. "Living room has a couch, TV, and bookshelf. Kitchen has appliances and a dining table."</p>
<p>Document partitioning creates an inventory of elements:</p>
<ul>
<li>Text blocks</li>
<li>Tables</li>
<li>Images</li>
<li>Charts</li>
<li>Headers and titles</li>
</ul>
<p>Tools like the <code>unstructured</code> library do this automatically, identifying element types and their locations.</p>
<h4 id="heading-step-2-smart-chunking">Step 2: Smart Chunking</h4>
<p>Once you have text, you need to break it into digestible pieces for the AI. But how you chunk matters enormously.</p>
<p><strong>Naive Chunking (Bad Approach)</strong>:
Split text every 500 characters regardless of content.</p>
<p><em>Problem</em>: You might split a sentence mid-thought, separate a header from its content, or break apart related concepts.</p>
<pre><code>Chunk <span class="hljs-number">1</span>: <span class="hljs-string">"The quarterly revenue reached $5.2 million, an increase of 23%"</span>
Chunk <span class="hljs-number">2</span>: <span class="hljs-string">"compared to the previous quarter. Key drivers included..."</span>
</code></pre><p><strong>Semantic Chunking (Better Approach)</strong>:
Split at natural boundaries like titles, section breaks, or paragraph endings.</p>
<pre><code>Chunk <span class="hljs-number">1</span>: [Header: Financial Results]
         <span class="hljs-string">"The quarterly revenue reached $5.2 million, an increase of 23%
          compared to the previous quarter."</span>

Chunk <span class="hljs-number">2</span>: [Header: Key Drivers]
         <span class="hljs-string">"Key drivers included expanded market presence and new product
          launches in the enterprise segment..."</span>
</code></pre><p>The semantic approach preserves meaning and context. When the AI retrieves this chunk later, it gets complete thoughts.</p>
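<p>As a minimal sketch of this idea (the function name and size budget are hypothetical, and real pipelines split on detected titles rather than blank lines), a chunker can split at paragraph boundaries and merge paragraphs until a size budget is reached:</p>

```python
import re

def semantic_chunks(text, max_chars=500):
    # split at blank lines (paragraph boundaries), then merge paragraphs
    # until adding the next one would exceed the size budget
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "Financial Results\n\nRevenue reached $5.2M.\n\nKey Drivers\n\nNew launches drove growth."
print(semantic_chunks(doc, max_chars=60))
```

<p>Because the split points are paragraph boundaries, no chunk ever begins mid-sentence.</p>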
<h4 id="heading-step-3-table-extraction">Step 3: Table Extraction</h4>
<p>Tables are notoriously tricky. They encode relationships through spatial position, not linear text.</p>
<p><strong>The Challenge</strong>:</p>
<pre><code>| Product | Q1 Sales | Q2 Sales |
|---------|----------|----------|
| Widget  | $<span class="hljs-number">50</span>,<span class="hljs-number">000</span>  | $<span class="hljs-number">65</span>,<span class="hljs-number">000</span>  |
| Gadget  | $<span class="hljs-number">30</span>,<span class="hljs-number">000</span>  | $<span class="hljs-number">45</span>,<span class="hljs-number">000</span>  |
</code></pre><p>If you just extract text left-to-right, you get: "Product Q1 Sales Q2 Sales Widget $50,000 $65,000..."</p>
<p>This loses all the relational information. Which number belongs to which product?</p>
<p><strong>The Solution</strong>: Use specialized table extraction models that understand grid structure. NVIDIA's pipeline uses models like Microsoft's Table Transformer to:</p>
<ol>
<li>Detect table regions in the document</li>
<li>Identify rows and columns</li>
<li>Extract cell contents with their positions</li>
<li>Convert to structured formats (HTML, JSON)</li>
</ol>
<p>The extracted HTML preserves structure:</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">table</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Product<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Q1 Sales<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Q2 Sales<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">tr</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>Widget<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>$50,000<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">td</span>&gt;</span>$65,000<span class="hljs-tag">&lt;/<span class="hljs-name">td</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">tr</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">table</span>&gt;</span>
</code></pre>
<h4 id="heading-step-4-image-and-chart-extraction">Step 4: Image and Chart Extraction</h4>
<p>Documents often contain figures that carry critical information.</p>
<p><strong>The Approach</strong>:</p>
<ol>
<li><strong>Object Detection</strong>: Use models like YOLOX to find figures, charts, and diagrams</li>
<li><strong>Region Extraction</strong>: Crop these regions as separate images</li>
<li><strong>Metadata Preservation</strong>: Keep track of page number, position, and nearby text (captions)</li>
<li><strong>Visual Analysis</strong>: Optionally use Vision Language Models to describe the content</li>
</ol>
<p>This enables queries like "Show me all the architecture diagrams in this documentation."</p>
<hr />
<h3 id="heading-rag-retrieval-augmented-generation">RAG: Retrieval Augmented Generation</h3>
<p>Now you've extracted all this content. How do you make it useful?</p>
<p><strong>The Analogy</strong>: Imagine you're a researcher with a library of 10,000 books. When someone asks you a question, you don't read all 10,000 books. You:</p>
<ol>
<li>Search the catalog for relevant books</li>
<li>Pull those specific books off the shelf</li>
<li>Read the relevant sections</li>
<li>Synthesize an answer</li>
</ol>
<p>RAG does exactly this with AI.</p>
<h4 id="heading-the-rag-pipeline">The RAG Pipeline</h4>
<pre><code>User Question
     │
     ▼
┌─────────────┐
│  Embedding  │ ← Convert question to vector
└─────────────┘
     │
     ▼
┌─────────────┐
│  Retrieval  │ ← Find similar chunks <span class="hljs-keyword">in</span> vector database
└─────────────┘
     │
     ▼
┌─────────────┐
│   Context   │ ← Combine retrieved chunks
└─────────────┘
     │
     ▼
┌─────────────┐
│     LLM     │ ← Generate answer using context
└─────────────┘
     │
     ▼
   Answer
</code></pre><h4 id="heading-step-1-indexing-one-time-setup">Step 1: Indexing (One-Time Setup)</h4>
<p>Take all your extracted chunks and convert them to embeddings:</p>
<pre><code>Chunk <span class="hljs-number">1</span> ──&gt; [Encoder] ──&gt; [<span class="hljs-number">0.2</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.1</span>, ...]
Chunk <span class="hljs-number">2</span> ──&gt; [Encoder] ──&gt; [<span class="hljs-number">0.5</span>, <span class="hljs-number">0.3</span>, <span class="hljs-number">0.9</span>, ...]
Chunk <span class="hljs-number">3</span> ──&gt; [Encoder] ──&gt; [<span class="hljs-number">0.1</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.4</span>, ...]
...
</code></pre><p>Store these embeddings in a vector database like Milvus, Pinecone, or FAISS.</p>
<h4 id="heading-step-2-retrieval-at-query-time">Step 2: Retrieval (At Query Time)</h4>
<p>When a user asks a question:</p>
<ol>
<li>Convert the question to an embedding</li>
<li>Find the K most similar chunks (using cosine similarity)</li>
<li>Return those chunks as context</li>
</ol>
<pre><code class="lang-python">question = <span class="hljs-string">"What was Q2 revenue?"</span>
question_embedding = encoder.encode(question)
similar_chunks = vector_db.search(question_embedding, k=<span class="hljs-number">5</span>)
</code></pre>
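<p>The encoder and vector database above are placeholders. A self-contained toy version of the same search, with hand-made three-dimensional "embeddings" and brute-force cosine similarity (the chunk ids and vectors are invented), looks like this:</p>

```python
import math

def cosine(a, b):
    # normalized dot product between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def search(query_vec, index, k=2):
    # rank every stored chunk by similarity to the query (brute force;
    # a real vector database does this with approximate nearest neighbors)
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

index = {
    "q2-revenue-table": [0.9, 0.1, 0.0],
    "hiring-plan": [0.1, 0.9, 0.2],
    "q1-revenue-table": [0.8, 0.2, 0.1],
}
print(search([0.85, 0.15, 0.05], index))  # the two revenue chunks rank first
```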
<h4 id="heading-step-3-generation">Step 3: Generation</h4>
<p>Feed the retrieved context plus the question to an LLM:</p>
<pre><code>Context: [Retrieved chunks about Q2 revenue]
<span class="hljs-attr">Question</span>: What was Q2 revenue?

Answer: Based on the financial report, Q2 revenue was $<span class="hljs-number">65</span>,<span class="hljs-number">000</span> <span class="hljs-keyword">for</span>
the Widget product line and $<span class="hljs-number">45</span>,<span class="hljs-number">000</span> <span class="hljs-keyword">for</span> Gadgets, totaling $<span class="hljs-number">110</span>,<span class="hljs-number">000.</span>
</code></pre><p>The LLM generates an answer grounded in your actual documents, not its training data.</p>
<hr />
<h3 id="heading-object-detection-with-yolox">Object Detection with YOLOX</h3>
<p>For intelligent document analysis, you need to detect where different elements are located.</p>
<p><strong>The Model</strong>: NVIDIA provides specialized models like <code>nv-yolox-page-elements</code> trained specifically for document analysis.</p>
<p><strong>What It Detects</strong>:</p>
<ul>
<li>Tables</li>
<li>Charts and graphs</li>
<li>Titles and headers</li>
<li>Figures and images</li>
</ul>
<p><strong>How It Works</strong>:</p>
<ol>
<li>Process each page as an image</li>
<li>Model outputs bounding boxes with confidence scores</li>
<li>Use boxes to crop and extract specific regions</li>
</ol>
<pre><code>Page Image ──&gt; [YOLOX Model] ──&gt; Detected Regions:
  • Table at (<span class="hljs-number">100</span>, <span class="hljs-number">200</span>, <span class="hljs-number">500</span>, <span class="hljs-number">400</span>) - Confidence: <span class="hljs-number">0.95</span>
  • Chart at (<span class="hljs-number">100</span>, <span class="hljs-number">450</span>, <span class="hljs-number">500</span>, <span class="hljs-number">650</span>) - Confidence: <span class="hljs-number">0.89</span>
  • Title at (<span class="hljs-number">50</span>, <span class="hljs-number">50</span>, <span class="hljs-number">400</span>, <span class="hljs-number">80</span>) - Confidence: <span class="hljs-number">0.97</span>
</code></pre><p>This enables intelligent routing: text goes to OCR, tables go to table extractors, charts go to visual analysis models.</p>
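<p>A hypothetical sketch of that routing step, with invented labels, boxes, and handler names, is just a confidence filter plus a dispatch table:</p>

```python
# map each detected element type to a suitable downstream extractor
HANDLERS = {"table": "table-transformer", "chart": "vision-language-model", "title": "ocr"}

def route(regions, min_conf=0.8):
    # keep confident detections with a known handler; drop the rest
    plan = []
    for region in regions:
        if region["conf"] >= min_conf and region["label"] in HANDLERS:
            plan.append((region["label"], HANDLERS[region["label"]]))
    return plan

regions = [
    {"label": "table", "box": (100, 200, 500, 400), "conf": 0.95},
    {"label": "chart", "box": (100, 450, 500, 650), "conf": 0.89},
    {"label": "title", "box": (50, 50, 400, 80), "conf": 0.97},
    {"label": "chart", "box": (0, 0, 10, 10), "conf": 0.40},  # filtered out
]
print(route(regions))
```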
<hr />
<h3 id="heading-handling-large-documents">Handling Large Documents</h3>
<p>Real documents can be hundreds of pages. Processing all at once is impractical.</p>
<p><strong>The Solution</strong>: Batch processing with memory management.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Process in batches of 10 pages</span>
<span class="hljs-keyword">for</span> start_page <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, total_pages, <span class="hljs-number">10</span>):
    end_page = min(start_page + <span class="hljs-number">10</span>, total_pages)
    batch = extract_pages(document, start_page, end_page)
    process_batch(batch)
    save_results(batch)
    clear_memory()  <span class="hljs-comment"># Prevent memory overflow</span>
</code></pre>
<p>Each batch is processed independently, results are saved, and memory is cleared before the next batch.</p>
<hr />
<h3 id="heading-practical-example-processing-a-technical-datasheet">Practical Example: Processing a Technical Datasheet</h3>
<p>Let's walk through processing NVIDIA's Grace-Blackwell datasheet (a real example from the training):</p>
<p><strong>Input</strong>: 20-page PDF with specifications, architecture diagrams, and performance tables</p>
<p><strong>Processing Steps</strong>:</p>
<ol>
<li><strong>Partition</strong>: Identify 150+ elements across 20 pages</li>
<li><strong>Extract Text</strong>: Pull out 45 text blocks with semantic chunking</li>
<li><strong>Extract Tables</strong>: Identify 12 specification tables, convert to HTML</li>
<li><strong>Extract Figures</strong>: Locate 8 architecture diagrams</li>
<li><strong>Index</strong>: Embed all content into vector database</li>
<li><strong>Query</strong>: "What are the memory bandwidth specs?"</li>
</ol>
<p><strong>Result</strong>: System retrieves relevant table chunks and generates accurate answer with source citations.</p>
<hr />
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><p><strong>Document processing is inherently multimodal</strong>: Text, tables, images all carry information</p>
</li>
<li><p><strong>Smart chunking preserves meaning</strong>: Semantic boundaries beat arbitrary character limits</p>
</li>
<li><p><strong>Tables need special handling</strong>: Spatial structure encodes relationships that linear text loses</p>
</li>
<li><p><strong>Object detection enables routing</strong>: YOLOX identifies what's where so appropriate extractors can be used</p>
</li>
<li><p><strong>RAG grounds AI in your data</strong>: Retrieved context prevents hallucination and enables factual answers</p>
</li>
<li><p><strong>Batch processing handles scale</strong>: Process large documents in manageable chunks to control memory</p>
</li>
</ol>
<hr />
<h3 id="heading-when-to-use-document-rag">When to Use Document RAG</h3>
<ul>
<li><strong>Enterprise Knowledge Bases</strong>: Make company documentation searchable and queryable</li>
<li><strong>Legal Document Analysis</strong>: Extract clauses, find precedents, compare contracts</li>
<li><strong>Financial Analysis</strong>: Query annual reports, extract metrics from filings</li>
<li><strong>Technical Documentation</strong>: Create intelligent assistants for product manuals</li>
<li><strong>Research</strong>: Build queryable databases of academic papers</li>
</ul>
<hr />
<h3 id="heading-whats-next">What's Next?</h3>
<p>In Part 4, we'll explore the most exciting frontier: <strong>Video Understanding and Graph-RAG</strong>. You'll learn how AI can watch, understand, and answer questions about video content, and how knowledge graphs enable complex reasoning that simple vector search cannot achieve.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience, consider enrolling in their official courses.</em></p>
]]></content:encoded></item><item><title><![CDATA[OCR on Engineering Drawings with a 0.9B Vision-Language Model]]></title><description><![CDATA[Late last year, I started exploring how to extract metadata from product drawings. Part numbers, material specifications, revision history, manufacturing process notes. The kind of information that lives in title blocks and needs to end up in a PLM d...]]></description><link>https://thedatasense.com/ocr-on-engineering-drawings-with-a-09b-vision-language-model</link><guid isPermaLink="true">https://thedatasense.com/ocr-on-engineering-drawings-with-a-09b-vision-language-model</guid><category><![CDATA[Drawings]]></category><category><![CDATA[pdf]]></category><category><![CDATA[OCR ]]></category><category><![CDATA[paddlepaddle]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 07 Jan 2026 05:29:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767764396726/0f4d1dc4-5746-4201-9a13-10acb3eac70f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Late last year, I started exploring how to extract metadata from product drawings. Part numbers, material specifications, revision history, manufacturing process notes. The kind of information that lives in title blocks and needs to end up in a PLM database. I tried various OCR techniques, but with the tolerance callouts and dimensions it was a mess, and I stretched the limits of what regular expressions can do. Then I found <a target="_blank" href="https://ernie.baidu.com/blog/posts/paddleocr-vl/">PaddleOCR-VL</a>. It is a Vision Language Model (VLM) with a few preprocessors, fine-tuned for OCR tasks.</p>
<p>PaddleOCR-VL-0.9B integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model. It uses a two-stage approach. First, PP-DocLayoutV2 performs layout analysis, localizing semantic regions and predicting reading order. Then PaddleOCR-VL-0.9B recognizes the content. A post-processing module outputs structured Markdown and JSON. On OmniDocBench v1.5, it achieves an overall score of 92.56, surpassing MinerU2.5-1.2B (90.67) and general VLMs like Qwen2.5-VL-72B. A model 80 times smaller achieving higher accuracy.</p>
<p>For my use case, I used a two-stage pipeline:</p>
<pre><code class="lang-plaintext">PDF → Images → PaddleOCR-VL (OCR) → Qwen3-0.6B (Extraction) → Structured JSON
</code></pre>
<p>The input is the entire drawing as a PDF.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767763315772/2f1d9aba-64f3-4844-a9cc-1c442bb12dec.png" alt class="image--center mx-auto" /></p>
<p>PaddleOCR-VL handles the OCR. Then I pass the extracted text to Qwen3-0.6B, a 600M parameter LLM, for structured information extraction. No complex regex patterns. The LLM figures out which text corresponds to which field.</p>
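<p>A rough sketch of the extraction step is below. The prompt wording and the stubbed model reply are invented; only the field list and the JSON shape match my pipeline. The key idea is that the LLM returns JSON, which is then validated against the expected schema:</p>

```python
import json

FIELDS = ["part_number", "drawing_number", "material", "finish", "description", "product"]

def build_prompt(ocr_text):
    # hypothetical instruction for the extraction LLM
    return (
        "From the OCR text below, return JSON with exactly these keys: "
        + ", ".join(FIELDS) + ". Use null when a field is absent.\n\n" + ocr_text
    )

def parse_reply(reply):
    # validate the model's JSON reply against the expected schema
    data = json.loads(reply)
    return {key: data.get(key) for key in FIELDS}

# stand-in for the actual Qwen3-0.6B call
reply = '{"part_number": "3814200", "material": "Polycarbonate", "finish": "poli"}'
record = parse_reply(reply)
print(record["part_number"], record["product"])
```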
<p>The output looks like this:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"part_number"</span>: <span class="hljs-string">"3814200"</span>,
  <span class="hljs-attr">"drawing_number"</span>: <span class="hljs-string">"4095700.M00.027PI1/2"</span>,
  <span class="hljs-attr">"material"</span>: <span class="hljs-string">"Polycarbonate (makrolon cristal ref:2458)"</span>,
  <span class="hljs-attr">"finish"</span>: <span class="hljs-string">"poli"</span>,
  <span class="hljs-attr">"description"</span>: <span class="hljs-string">"CAPOT INTERRUPTEUR / SWITCH COVER"</span>,
  <span class="hljs-attr">"product"</span>: <span class="hljs-string">"LEGENDAIR XL2 US"</span>
}
</code></pre>
<p>The whole thing runs on a laptop with 16GB RAM. A GPU helps but is not required. Even after multiple waves of digital transformation, product manufacturers have accumulated vast archives of engineering drawings that contain the recipe: part numbers, material specifications, supplier references, revision histories. Cloud-based OCR means these documents leave your network, where they might be logged or used for training, which could lead to IP leaks.</p>
<p><strong><em>VLMs for OCR are promising, and a 0.9B parameter model changes the calculus: it runs locally on a machine without network access, so documents never leave your infrastructure. The Apache 2.0 license allows free commercial use.</em></strong></p>
<p>I have shared my extraction pipeline on GitHub: <a target="_blank" href="https://github.com/thedatasense/PaddleOCR_Engineering_Drawings">PaddleOCR_Engineering_Drawings</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models: Part 2]]></title><description><![CDATA[Contrastive Learning: Teaching AI That a Picture is Worth a Thousand Words
This is Part 2 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.

The Big Question: How Do You Connect Pictures an...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-2</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-2</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[contrastive learning]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 07 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/5Q07sS54D0Q/upload/6f2d8d4b9c8674cf6a049c008e040e89.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-contrastive-learning-teaching-ai-that-a-picture-is-worth-a-thousand-words">Contrastive Learning: Teaching AI That a Picture is Worth a Thousand Words</h2>
<p><em>This is Part 2 of a 4-part series based on learnings from NVIDIA's "Building AI Agents with Multimodal Models" certification.</em></p>
<hr />
<h3 id="heading-the-big-question-how-do-you-connect-pictures-and-words">The Big Question: How Do You Connect Pictures and Words?</h3>
<p>Here's a puzzle: You have a photo of a golden retriever playing fetch, and you have the text "a happy dog catching a frisbee." To you, these obviously go together. But to a computer, an image is just a grid of numbers, and text is a sequence of characters. They're completely different data types.</p>
<p>How do we teach AI that these two things represent the same concept?</p>
<p>The answer is <strong>Contrastive Learning</strong>, and it's the secret sauce behind revolutionary models like OpenAI's CLIP and forms the foundation of modern image search, text-to-image generation, and visual question answering.</p>
<hr />
<h3 id="heading-the-embedding-space-a-universe-where-ideas-live">The Embedding Space: A Universe Where Ideas Live</h3>
<p>Before we dive into contrastive learning, we need to understand <strong>embeddings</strong>.</p>
<p><strong>The Analogy</strong>: Imagine a massive library where every book has a specific location. Similar books are shelved near each other. Mystery novels are in one section, cooking books in another, and within cooking, Italian cuisine is close to French cuisine.</p>
<p>An <strong>embedding</strong> is like giving every piece of data (an image, a sentence, a sound) coordinates in this library. The magic is that similar concepts get similar coordinates, regardless of their original format.</p>
<p>So when we "embed" an image of a dog and the word "dog," if done correctly, both should end up in the same neighborhood of this mathematical space.</p>
<pre><code>Image <span class="hljs-keyword">of</span> dog  ──&gt; [Image Encoder] ──&gt; [<span class="hljs-number">0.8</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>, ...] ──┐
                                                              ├──&gt; Close together!
Text <span class="hljs-string">"a dog"</span>  ──&gt; [Text Encoder]  ──&gt; [<span class="hljs-number">0.79</span>, <span class="hljs-number">0.21</span>, <span class="hljs-number">0.48</span>, ...] ┘
</code></pre><hr />
<h3 id="heading-contrastive-learning-learning-by-comparison">Contrastive Learning: Learning by Comparison</h3>
<p><strong>The Analogy</strong>: Imagine you're teaching a child to identify animals using flashcards. You show them two cards and ask: "Are these the same animal?"</p>
<ul>
<li>Show a dog photo and say "dog" → "Yes, same!"</li>
<li>Show a dog photo and say "cat" → "No, different!"</li>
</ul>
<p>Through thousands of these comparisons, the child learns what "dog" means without you ever explicitly defining it.</p>
<p>Contrastive learning works the same way. You don't tell the model what a dog is. Instead, you show it:</p>
<ul>
<li><strong>Positive pairs</strong>: Image of dog + text "dog" (these should be similar)</li>
<li><strong>Negative pairs</strong>: Image of dog + text "cat" (these should be different)</li>
</ul>
<p>The model learns to push positive pairs together and pull negative pairs apart in the embedding space.</p>
<hr />
<h3 id="heading-the-math-behind-the-magic-cosine-similarity">The Math Behind the Magic: Cosine Similarity</h3>
<p>How do we measure if two embeddings are "close"?</p>
<p><strong>The Analogy</strong>: Imagine two arrows pointing from the center of a room. If they point in the same direction, they're similar. If they point in opposite directions, they're different. The angle between them tells you how similar they are.</p>
<p><strong>Cosine Similarity</strong> measures exactly this. It calculates the angle between two vectors (embeddings):</p>
<ul>
<li><strong>Score of 1.0</strong>: Pointing in the exact same direction (identical meaning)</li>
<li><strong>Score of 0.0</strong>: Perpendicular (unrelated)</li>
<li><strong>Score of -1.0</strong>: Opposite directions (opposite meaning)</li>
</ul>
<p>The formula normalizes vectors to unit length (arrows of length 1) so we only care about direction, not magnitude.</p>
<pre><code>Similarity = (A · B) / (|A| × |B|)

<span class="hljs-attr">Where</span>:
- A · B is the dot product
- |A| and |B| are the magnitudes
</code></pre><hr />
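<p>The cosine similarity formula in executable form, on toy two-dimensional vectors:</p>

```python
import math

def cosine_similarity(a, b):
    # normalized dot product: measures direction match, ignores magnitude
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

print(cosine_similarity([1, 0], [2, 0]))   # same direction: 1.0
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite: -1.0
```

<p>Note that [1, 0] and [2, 0] score a perfect 1.0 despite different magnitudes; only direction matters.</p>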
<h3 id="heading-building-a-clip-style-model-step-by-step">Building a CLIP-Style Model: Step by Step</h3>
<p>Let's walk through how this works in practice, using a simplified example from NVIDIA's training.</p>
<h4 id="heading-step-1-create-two-encoder-networks">Step 1: Create Two Encoder Networks</h4>
<p>You need one encoder for each modality:</p>
<pre><code>Image Encoder: Takes images → Produces image embeddings
Text Encoder:  Takes text   → Produces text embeddings
</code></pre><p>These can be any architecture (CNNs for images, Transformers for text). The key is that both produce vectors of the same size.</p>
<h4 id="heading-step-2-normalize-the-embeddings">Step 2: Normalize the Embeddings</h4>
<p>Before comparing, we normalize all embeddings to unit vectors. This ensures we're comparing direction only.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Normalize to unit vectors</span>
image_embedding = F.normalize(image_embedding, dim=<span class="hljs-number">1</span>)
text_embedding = F.normalize(text_embedding, dim=<span class="hljs-number">1</span>)
</code></pre>
<h4 id="heading-step-3-calculate-the-similarity-matrix">Step 3: Calculate the Similarity Matrix</h4>
<p>For a batch of N image-text pairs:</p>
<ul>
<li>Row i contains similarities between image i and all N texts</li>
<li>Column j contains similarities between text j and all N images</li>
<li>The diagonal should be high (matching pairs)</li>
<li>Off-diagonal should be low (non-matching pairs)</li>
</ul>
<pre><code>              Text1   Text2   Text3   Text4
Image1      [ <span class="hljs-number">0.95</span>   <span class="hljs-number">0.10</span>    <span class="hljs-number">0.05</span>    <span class="hljs-number">0.12</span> ]  ← Image1 matches Text1
Image2      [ <span class="hljs-number">0.08</span>   <span class="hljs-number">0.92</span>    <span class="hljs-number">0.15</span>    <span class="hljs-number">0.20</span> ]  ← Image2 matches Text2
Image3      [ <span class="hljs-number">0.12</span>   <span class="hljs-number">0.18</span>    <span class="hljs-number">0.89</span>    <span class="hljs-number">0.10</span> ]  ← Image3 matches Text3
Image4      [ <span class="hljs-number">0.05</span>   <span class="hljs-number">0.22</span>    <span class="hljs-number">0.08</span>    <span class="hljs-number">0.91</span> ]  ← Image4 matches Text4
</code></pre><h4 id="heading-step-4-apply-cross-entropy-loss">Step 4: Apply Cross-Entropy Loss</h4>
<p>We treat this as a classification problem. For each image, the correct text is its "class." We use cross-entropy loss to:</p>
<ul>
<li>Maximize diagonal values (correct pairs)</li>
<li>Minimize off-diagonal values (wrong pairs)</li>
</ul>
<p>The loss is computed in both directions:</p>
<ol>
<li>Given image, predict correct text</li>
<li>Given text, predict correct image</li>
</ol>
<p>Final loss = Average of both directions</p>
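<p>A minimal pure-Python sketch of this symmetric loss follows; the similarity matrix is a toy one, and real implementations also scale the scores by a learned temperature before the softmax:</p>

```python
import math

def cross_entropy_rows(matrix):
    # softmax cross-entropy where the correct "class" for row i is column i
    total = 0.0
    for i, row in enumerate(matrix):
        exps = [math.exp(score) for score in row]
        total -= math.log(exps[i] / sum(exps))
    return total / len(matrix)

def contrastive_loss(sim):
    # image-to-text direction uses rows; text-to-image uses columns
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(sim_t))

sim = [[0.95, 0.10, 0.05],
       [0.08, 0.92, 0.15],
       [0.12, 0.18, 0.89]]
uniform = [[0.5, 0.5, 0.5]] * 3
print(contrastive_loss(sim) < contrastive_loss(uniform))  # True
```

<p>Training drives the diagonal up and the off-diagonal down, which is exactly what lowers this loss.</p>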
<hr />
<h3 id="heading-a-practical-example-fashion-item-search">A Practical Example: Fashion Item Search</h3>
<p>NVIDIA's training demonstrates this with the FashionMNIST dataset. The twist? Instead of pairing images with text, they pair original images with their edge-detected outlines (using Sobel filters).</p>
<p><strong>The Use Case</strong>: Build a system where you can sketch a rough outline of clothing, and the system finds matching products.</p>
<p><strong>How It Works</strong>:</p>
<ol>
<li>Take images of t-shirts, pants, shoes, etc.</li>
<li>Extract edge outlines using Sobel filters (simulating hand-drawn sketches)</li>
<li>Train contrastively: Original image ↔ Outline should be close</li>
<li>At inference: User draws a sketch → System finds images with similar embeddings</li>
</ol>
<p>This is the foundation of <strong>visual search systems</strong> used by e-commerce platforms.</p>
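<p>The Sobel step can be sketched in pure Python on a tiny grayscale grid, standing in for the filter that turns product photos into sketch-like outlines:</p>

```python
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # horizontal-gradient kernel
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # vertical-gradient kernel

def sobel(img):
    # gradient magnitude at each interior pixel (borders left at zero)
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(KX[i][j] * img[y - 1 + i][x - 1 + j] for i in range(3) for j in range(3))
            gy = sum(KY[i][j] * img[y - 1 + i][x - 1 + j] for i in range(3) for j in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

img = [[0, 0, 1, 1]] * 4   # dark left half, bright right half
edges = sobel(img)
print(edges[1])            # strongest response sits along the boundary
```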
<hr />
<h3 id="heading-cross-modal-projection-the-bridge-between-worlds">Cross-Modal Projection: The Bridge Between Worlds</h3>
<p>Contrastive learning creates aligned embeddings, but sometimes you need to go further. What if you have a model trained on LiDAR data, and you want to use RGB images instead?</p>
<p><strong>The Analogy</strong>: Imagine you have an expert translator who only speaks Japanese. You speak English. Instead of training a new expert, you hire an interpreter (a projector) who converts your English into Japanese.</p>
<p><strong>Cross-Modal Projection</strong> trains a simple neural network to convert embeddings from one modality space to another:</p>
<pre><code>RGB Embedding ──&gt; [Projector Network] ──&gt; LiDAR Embedding Space
</code></pre><p>The projector is typically just a few linear layers, trained using Mean Squared Error (MSE) loss to match the target embeddings.</p>
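<p>A toy version of that training loop, with all embeddings invented: learn a 2×2 linear map from "RGB" vectors to "LiDAR" vectors by gradient descent on MSE. A real projector would be a small PyTorch module, but the mechanics are the same:</p>

```python
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # source-space embeddings
tgt = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]   # targets follow y = diag(2, 3) x

W = [[0.0, 0.0], [0.0, 0.0]]                 # projector weights, start at zero
lr = 0.1
for _ in range(500):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(src, tgt):
        # forward pass: pred = W x, then accumulate the MSE gradient
        pred = [sum(W[i][j] * x[j] for j in range(2)) for i in range(2)]
        for i in range(2):
            for j in range(2):
                grad[i][j] += 2 * (pred[i] - y[i]) * x[j] / len(src)
    for i in range(2):
        for j in range(2):
            W[i][j] -= lr * grad[i][j]

print(W)  # W converges toward [[2, 0], [0, 3]], the map that generated tgt
```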
<h4 id="heading-why-this-matters">Why This Matters</h4>
<ol>
<li><p><strong>Reuse Expensive Models</strong>: LiDAR models are expensive to train. Projection lets you reuse them with cheaper RGB data.</p>
</li>
<li><p><strong>Missing Modality at Inference</strong>: Your training data has both RGB and depth, but your deployment camera only captures RGB. Project to fill the gap.</p>
</li>
<li><p><strong>Transfer Learning</strong>: Project from a modality where you have lots of data to one where you have less.</p>
</li>
</ol>
<hr />
<h3 id="heading-two-stage-training-strategy">Two-Stage Training Strategy</h3>
<p>For complex multimodal systems, NVIDIA recommends a two-stage approach:</p>
<p><strong>Stage 1: Alignment</strong>
Train the projector to align embeddings using frozen pre-trained encoders.</p>
<ul>
<li>Freeze: Both encoders</li>
<li>Train: Projector only</li>
<li>Loss: MSE between projected and target embeddings</li>
</ul>
<p><strong>Stage 2: Fine-tuning</strong>
Optionally unfreeze everything and fine-tune end-to-end for your specific task.</p>
<ul>
<li>Unfreeze: Everything (or selectively)</li>
<li>Train: Whole pipeline</li>
<li>Loss: Task-specific (classification, regression, etc.)</li>
</ul>
<p>This staged approach prevents catastrophic forgetting and ensures stable training.</p>
<hr />
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><p><strong>Embeddings are coordinates in meaning-space</strong>: Similar concepts cluster together regardless of original data type</p>
</li>
<li><p><strong>Contrastive learning teaches through comparison</strong>: Push matching pairs together, pull non-matching pairs apart</p>
</li>
<li><p><strong>Cosine similarity measures directional alignment</strong>: Normalized dot product tells you how "same direction" two vectors point</p>
</li>
<li><p><strong>Cross-modal projection bridges modality gaps</strong>: A simple network can translate between embedding spaces</p>
</li>
<li><p><strong>Two-stage training is more stable</strong>: First align embeddings, then fine-tune for your task</p>
</li>
</ol>
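<p>The cosine similarity takeaway fits in a few lines of plain Python:</p>

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms: direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 0], [1, 0])   # 1.0: same direction
cosine_similarity([1, 0], [0, 1])   # 0.0: orthogonal
cosine_similarity([2, 0], [5, 0])   # 1.0: scaling a vector changes nothing
```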
<hr />
<h3 id="heading-real-world-applications">Real-World Applications</h3>
<ul>
<li><strong>Image Search</strong>: Type "sunset over mountains" → Find matching photos (CLIP)</li>
<li><strong>Product Discovery</strong>: Upload a photo → Find similar products (Pinterest, Amazon)</li>
<li><strong>Content Moderation</strong>: Align images with violation categories for detection</li>
<li><strong>Accessibility</strong>: Connect images to audio descriptions for visually impaired users</li>
<li><strong>Robotics</strong>: Align camera views with depth sensors for navigation</li>
</ul>
<hr />
<h3 id="heading-whats-next">What's Next?</h3>
<p>In Part 3, we'll explore how to extract and process multimodal data from documents using <strong>OCR and RAG pipelines</strong>. You'll learn how AI can read PDFs, extract tables and images, and build searchable knowledge bases from unstructured documents.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience, consider enrolling in their official courses.</em></p>
]]></content:encoded></item><item><title><![CDATA[Building AI Agents with Multimodal Models : Part 1]]></title><description><![CDATA[Understanding How AI Learns to See, Hear, and Feel All at Once
Why Do We Need Multimodal AI?
Imagine you're trying to identify a fruit in complete darkness. You can feel its round shape, its smooth skin, and smell its citrusy aroma. Now imagine you c...]]></description><link>https://thedatasense.com/building-ai-agents-with-multimodal-models-part-1</link><guid isPermaLink="true">https://thedatasense.com/building-ai-agents-with-multimodal-models-part-1</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[NVIDIA]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Mon, 05 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768192457344/6f0d1749-9534-4f1b-b413-0e9e561ad856.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-understanding-how-ai-learns-to-see-hear-and-feel-all-at-once">Understanding How AI Learns to See, Hear, and Feel All at Once</h2>
<h3 id="heading-why-do-we-need-multimodal-ai">Why Do We Need Multimodal AI?</h3>
<p>Imagine you're trying to identify a fruit in complete darkness. You can feel its round shape, its smooth skin, and smell its citrusy aroma. Now imagine you can only see a photo of it but can't touch or smell it. In either case alone, you might confuse an orange with a tangerine. But combine all your senses together, and suddenly the identification becomes much easier.</p>
<p>This is exactly the challenge AI faces. Traditional AI models are like humans with only one sense. A camera sees colors but doesn't understand depth. A LiDAR sensor measures precise distances but sees the world in points, not colors. Neither alone tells the complete story.</p>
<h2 id="heading-multimodal-ai-is-about-teaching-machines-to-combine-multiple-senses-to-understand-the-world-more-completely"><strong>Multimodal AI is about teaching machines to combine multiple "senses" to understand the world more completely.</strong></h2>
<h3 id="heading-the-core-problem-different-data-types-dont-speak-the-same-language">The Core Problem: Different Data Types Don't Speak the Same Language</h3>
<p>Here's where it gets interesting. When you combine senses, your brain does it effortlessly. But for computers, mixing an image (a grid of pixels) with depth data (a cloud of 3D points) is like trying to add apples and equations together. They're fundamentally different.</p>
<p>Think of it like this:</p>
<ul>
<li><strong>RGB Image Data</strong>: A painting on a flat canvas with colors</li>
<li><strong>LiDAR Point Cloud</strong>: A 3D sculpture made of tiny dots</li>
<li><strong>Text</strong>: A story written in words</li>
<li><strong>Audio</strong>: Vibrations over time</li>
</ul>
<h2 id="heading-the-magic-of-multimodal-ai-lies-in-finding-smart-ways-to-combine-these-completely-different-data-formats">The magic of multimodal AI lies in finding smart ways to combine these completely different data formats.</h2>
<h3 id="heading-the-three-fusion-strategies-when-to-combine-your-ingredients">The Three Fusion Strategies: When to Combine Your Ingredients</h3>
<p>Just like cooking, the order in which you combine ingredients matters. NVIDIA's training introduces three fundamental approaches to fusion, each with its own strengths.</p>
<h4 id="heading-1-early-fusion-mix-everything-at-the-start">1. Early Fusion: Mix Everything at the Start</h4>
<p><strong>The Analogy</strong>: Making a smoothie. You throw all your fruits into the blender right at the beginning and blend them together.</p>
<p><strong>How It Works</strong>: Concatenate (stack) all your input data together before feeding it into a single neural network. If your image has 3 color channels (RGB) and your depth map has 1 channel, you create a 4-channel input.</p>
<p><strong>When to Use It</strong>:</p>
<ul>
<li>When your modalities capture complementary low-level features</li>
<li>When the raw data naturally aligns (same resolution, same timestamps)</li>
<li>When you want a simpler, more efficient architecture</li>
</ul>
<p><strong>The Trade-off</strong>: You're betting that the network can figure out how to use both data types from the very beginning. Sometimes this works beautifully. Other times, the model gets confused trying to learn two things at once.</p>
<pre><code>Input A ─┐
         ├──&gt; [Concatenate] ──&gt; [Single Neural Network] ──&gt; Output
Input B ─┘
</code></pre><h4 id="heading-2-late-fusion-let-experts-work-separately-then-vote">2. Late Fusion: Let Experts Work Separately, Then Vote</h4>
<p><strong>The Analogy</strong>: A panel of specialist doctors. The eye doctor examines vision, the hearing specialist checks audio, and at the end they meet to discuss and reach a combined diagnosis.</p>
<p><strong>How It Works</strong>: Train separate neural networks for each modality. Each network becomes an expert at its own data type. At the very end, combine their predictions (by averaging, voting, or concatenating).</p>
<p><strong>When to Use It</strong>:</p>
<ul>
<li>When each modality has unique patterns that require specialized learning</li>
<li>When you want modality-specific interpretability</li>
<li>When you have pre-trained models for individual modalities</li>
</ul>
<p><strong>The Trade-off</strong>: You need more parameters (two full networks instead of one). But each network can fully focus on mastering its own domain without interference.</p>
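<p>A tiny NumPy sketch of late fusion by probability averaging, with hypothetical logits from each single-modality "expert" for a three-class problem:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits from each expert for three classes.
logits_rgb = np.array([2.0, 0.5, 0.1])
logits_lidar = np.array([1.5, 1.4, 0.2])

# Late fusion: each expert predicts on its own, then probabilities are averaged.
p = (softmax(logits_rgb) + softmax(logits_lidar)) / 2
prediction = int(np.argmax(p))  # both experts lean toward class 0 here
```

<p>Voting and concatenating predictions into a small "combiner" head are the other common variants of this final step.</p>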
<pre><code>Input A ──&gt; [Network A] ──&gt; Prediction A ─┐
                                          ├──&gt; [Combine] ──&gt; Final Output
Input B ──&gt; [Network B] ──&gt; Prediction B ─┘
</code></pre><h4 id="heading-3-intermediate-fusion-meet-in-the-middle">3. Intermediate Fusion: Meet in the Middle</h4>
<p><strong>The Analogy</strong>: Jazz musicians improvising together. Each plays their own instrument, but at key moments they sync up, listen to each other, and let one musician's riff influence another's response.</p>
<p><strong>How It Works</strong>: Each modality has its own pathway that extracts features. At intermediate layers (not the beginning, not the end), these pathways exchange information. This exchange can happen through:</p>
<ul>
<li><strong>Concatenation</strong>: Stacking feature maps together at a middle layer</li>
<li><strong>Matrix Multiplication</strong>: Having features from one modality modulate or gate the other</li>
</ul>
<p><strong>When to Use It</strong>:</p>
<ul>
<li>When you want the best of both worlds</li>
<li>When modalities need some individual processing before they can meaningfully interact</li>
<li>When you need rich cross-modal interactions</li>
</ul>
<p><strong>The Trade-off</strong>: More complex to design. You need to decide where and how fusion happens.</p>
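<p>Both exchange mechanisms fit in a few NumPy lines, with random stand-ins for the mid-network feature maps:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for mid-network features (batch of 8, 16 features each).
feat_rgb = rng.random((8, 16))
feat_lidar = rng.random((8, 16))

# Fusion by concatenation: stack the feature maps at a middle layer.
fused_cat = np.concatenate([feat_rgb, feat_lidar], axis=1)  # shape (8, 32)

# Fusion by gating: one modality modulates the other elementwise.
gate = 1.0 / (1.0 + np.exp(-feat_lidar))  # sigmoid, values in (0, 1)
fused_gate = feat_rgb * gate              # shape (8, 16)
```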
<pre><code>Input A ──&gt; [Early Layers A] ──┐
                               ├──&gt; [Fusion Layer] ──&gt; [Later Layers] ──&gt; Output
Input B ──&gt; [Early Layers B] ──┘
</code></pre><hr />
<h3 id="heading-a-practical-example-colored-cubes-with-rgb-and-lidar">A Practical Example: Colored Cubes with RGB and LiDAR</h3>
<p>NVIDIA's training uses a brilliant example to demonstrate these concepts. Imagine a scene with three cubes: one red, one green, and one blue. Your task is to classify which cube is which.</p>
<p><strong>Challenge 1: RGB Camera Only</strong>
The camera sees colors perfectly. Red cube? Check. Green cube? Check. But wait, where exactly are they in 3D space? The camera flattens everything to 2D. If the cubes overlap visually, things get confusing.</p>
<p><strong>Challenge 2: LiDAR Only</strong>
The LiDAR sensor knows exact 3D positions. It can tell you precisely where each cube sits in space. But all cubes look the same because LiDAR doesn't see color.</p>
<p><strong>The Solution: Combine Both</strong>
With multimodal fusion, the model gets the best of both worlds. LiDAR provides spatial precision while RGB provides color identification. Together, they solve what neither could alone.</p>
<p>This is multimodal AI in action: combining complementary strengths to overcome individual weaknesses.</p>
<hr />
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><p><strong>Multimodal AI combines different data types</strong> (images, text, audio, depth) to create more robust understanding</p>
</li>
<li><p><strong>Fusion timing matters</strong>:</p>
<ul>
<li>Early fusion is simple but requires data compatibility</li>
<li>Late fusion allows specialization but needs more resources</li>
<li>Intermediate fusion offers flexibility but adds complexity</li>
</ul>
</li>
<li><p><strong>Choose your strategy based on your data</strong>: If modalities complement each other at a low level, go early. If they need expertise first, go late. If you need both, go intermediate.</p>
</li>
<li><p><strong>The goal is complementary strengths</strong>: Each modality should bring something unique to the table</p>
</li>
</ol>
<hr />
<h3 id="heading-whats-next">What's Next?</h3>
<p>In Part 2, we'll explore how AI learns to connect completely different modalities through a technique called <strong>Contrastive Learning</strong>. Imagine teaching a computer that a photo of a dog and the word "dog" should live close together in the AI's understanding. This is the foundation of models like CLIP that power modern image search and generation.</p>
<hr />
<p><em>This content is inspired by NVIDIA's Deep Learning Institute course: <a target="_blank" href="https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1">Building AI Agents with Multimodal Models</a>. For hands-on experience, consider enrolling in their official courses.</em></p>
]]></content:encoded></item><item><title><![CDATA[Why a 0.9B VLM can be a serious OCR engine]]></title><description><![CDATA[In this post, I will discuss PaddleOCR-VL, focusing on what is important for OCR and document parsing: stable layout, high-resolution text capture, low error rates, and fast deployment.The paper’s main claim is simple but important: you can get state...]]></description><link>https://thedatasense.com/why-a-09b-vlm-can-be-a-serious-ocr-engine</link><guid isPermaLink="true">https://thedatasense.com/why-a-09b-vlm-can-be-a-serious-ocr-engine</guid><category><![CDATA[#paddleocr #ocr-vlm]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Sat, 03 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/BvqmW7VGRRk/upload/becd1fd58c147e8d263fae168abbb227.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I will discuss <a target="_blank" href="https://arxiv.org/pdf/2510.14528"><strong>PaddleOCR-VL</strong></a>, focusing on what is important for OCR and document parsing: stable layout, high-resolution text capture, low error rates, and fast deployment.The paper’s main claim is simple but important: you can get state of the art document parsing with an ultra compact vision language model, if you design the system around the real constraints of OCR.</p>
<p>For a classic OCR, the stack consists on mechanisms to detect regions with text, recognize the text using algorithms like <a target="_blank" href="https://sid2697.github.io/Blog_Sid/algorithm/2019/10/19/CTC-Loss.html">CTC</a>, with bolt on rules for tables, figures and so on.A VLM changes the contract. Instead of predicting characters from a section, you probe “Given pixels, generate a sequence that encodes the content I want”. I was quick to build a parallel with image captioning tasks but the more we dive in, a captioning task can miss a few axis on the tick labels and still say chart with sales rising and be accurate. But for OCR, the expectation is lossless meaning if you miss a character in a number it can break retrieval, matching and QA.</p>
<h3 id="heading-the-core-architectural-idea-decouple-layout-from-recognition">The core architectural idea - decouple layout from recognition</h3>
<p>The system has three stages:</p>
<ol>
<li><p><strong>PP-DocLayoutV2</strong> for layout detection and reading order</p>
</li>
<li><p><strong>PaddleOCR-VL-0.9B</strong> for element level recognition</p>
</li>
<li><p>A lightweight post step builds Markdown and JSON outputs</p>
</li>
</ol>
<p>The paper’s position is: do not ask the VLM to solve layout implicitly through generation. Make layout explicit with a fast detector plus ordering network, then let the VLM do what it is best at: recognition.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767842813987/d856611f-9dba-49a9-9903-a15bc9eabbec.png" alt class="image--center mx-auto" /></p>
<p>Now let's dig into each of these stages.</p>
<h2 id="heading-stage-1-pp-doclayoutv2">Stage 1: PP-DocLayoutV2</h2>
<p>PP-DocLayoutV2 combines <strong>RT-DETR</strong>, which detects and classifies layout elements, with a lightweight <strong>pointer network</strong> of 6 transformer layers for <strong>reading order prediction</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767844616814/a205df0c-1976-473f-a4f5-fda56361af47.png" alt class="image--center mx-auto" /></p>
<p>The ordering part has details that matter:</p>
<ul>
<li><p>it embeds proposals with absolute 2D positional encodings and class label embeddings</p>
</li>
<li><p>it adds a geometric bias mechanism inspired by Relation-DETR to model pairwise geometry</p>
</li>
<li><p>it predicts an N by N pairwise ordering matrix</p>
</li>
<li><p>it recovers a consistent reading order with a deterministic “win accumulation” decoding algorithm</p>
</li>
</ul>
<p>This is the backbone of the system. If reading order is wrong, our OCR can be perfect and our parsed document is still unusable.</p>
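<p>The paper does not spell out the decoding rule in full, but one plausible reading of "win accumulation" looks like this toy sketch, where <code>M[i][j]</code> is the predicted probability that element <code>i</code> precedes element <code>j</code>:</p>

```python
# An illustration only; the paper's exact decoding rule may differ.
M = [
    [0.0, 0.9, 0.8],  # element 0 likely precedes both others
    [0.1, 0.0, 0.7],  # element 1 likely precedes element 2
    [0.2, 0.3, 0.0],
]

# Accumulate each element's "wins" and sort: more wins = read earlier.
wins = [sum(row) for row in M]
order = sorted(range(len(M)), key=lambda i: wins[i], reverse=True)
```

<p>The point of a deterministic decode like this is that it always yields a single consistent ordering, even when the pairwise predictions are mildly contradictory.</p>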
<h2 id="heading-stage-2-paddleocr-vl-09b">Stage 2: PaddleOCR-VL-0.9B</h2>
<p>PaddleOCR-VL-0.9B follows a LLaVA-inspired structure: vision encoder, projector, language model. Instead of fixed-resolution resizing or tiling, the paper uses <strong>native dynamic high-resolution preprocessing</strong> and a <strong>NaViT-style encoder</strong> initialized from Keye-VL, designed to support native-resolution inputs without distortion. The authors claim this yields fewer hallucinations and stronger performance on text-heavy tasks. That is a big deal for dense documents and drawings, where tiny glyph details decide correctness.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767983051214/5e2a570b-a840-42a0-869c-fe8378abb484.png" alt class="image--center mx-auto" /></p>
<p>The projector is a randomly initialized 2-layer MLP with GELU, using a merge size of 2 to bridge vision features into the language embedding space efficiently. In plain terms: reduce the token burden before the decoder pays attention to everything. Autoregressive decoding cost is tied to decoder size, so the paper explicitly chooses <strong>ERNIE-4.5-0.3B</strong> for inference efficiency and adds <strong>3D-RoPE</strong> for positional representation. The element recognizer is also built via post-adaptation from pretrained weights: Keye-VL for the vision side and ERNIE-4.5-0.3B for the language side.</p>
<h2 id="heading-stage-3-post-processing">Stage 3: Post processing</h2>
<p>After layout and element recognition, PaddleOCR-VL runs a <strong>lightweight post processing module</strong> that aggregates outputs from both stages and formats the final result into <strong>structured Markdown and JSON</strong>.</p>
<p>This is where the system becomes a document parser instead of a bag of OCR strings. What this stage effectively does, based on the paper’s description, is:</p>
<ul>
<li><p>follow the reading order predicted by PP-DocLayoutV2</p>
</li>
<li><p>place each recognized element back into a page level representation</p>
</li>
<li><p>serialize the page into Markdown for human readable output</p>
</li>
<li><p>serialize the same content into JSON for programmatic use</p>
</li>
</ul>
<p>One way to think about it is that Stage 2 gives you “<strong>content</strong>” while Stage 3 gives you “a <strong>document</strong>”. If you care about RAG, this stage is not optional. The paper describes document parsing as a foundation for retrieval and downstream LLM use, especially when combined with RAG systems.</p>
<h2 id="heading-training-approach">Training Approach</h2>
<p>The VLM training is two stage:</p>
<ul>
<li><p>Stage 1 alignment on <strong>29M</strong> image text pairs</p>
</li>
<li><p>Stage 2 instruction fine tuning on <strong>2.7M</strong> samples</p>
</li>
</ul>
<p>The paper also describes a large scale data construction pipeline: over 30M samples collected via public acquisition and synthesis, refined using prompt driven labeling with larger models, plus cleaning to remove low quality or hallucinated annotations.</p>
<h2 id="heading-inference">Inference</h2>
<p>PaddleOCR-VL is also designed to run fast end to end. The paper describes multi-threaded asynchronous execution split into three parallel stages:</p>
<ul>
<li><p>data loading</p>
</li>
<li><p>layout model processing</p>
</li>
<li><p>VLM inference</p>
</li>
</ul>
<p>Data flows through queues. VLM batching triggers when the queue hits a threshold or when items have waited too long, so blocks from different pages can be aggregated for better parallelism. On their end-to-end benchmark, they report that with FastDeploy the system achieves <strong>53.1 percent higher page throughput</strong> and <strong>50.9 percent higher token throughput</strong> than MinerU2.5. In my own experience, I got a throughput of about 45 seconds per page of engineering drawing.</p>
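<p>The batching trigger (the queue hits a threshold, or the oldest item has waited too long) is a common serving pattern. A small stdlib sketch of the idea, not the actual FastDeploy implementation:</p>

```python
import queue
import time

def collect_batch(q, max_batch=8, max_wait=0.05):
    """Pull items until the batch is full or the first item has waited max_wait."""
    batch = [q.get()]  # block until at least one item arrives
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for crop in ["title_block", "table", "note"]:  # crops waiting for VLM inference
    q.put(crop)
batch = collect_batch(q)  # all three, after waiting up to 50 ms for more
```

<p>Because the queue mixes crops from different pages, a batch can fill up even when any single page only contributes a few elements.</p>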
<p>I used PaddleOCR-VL to extract key manufacturing information from engineering drawings. In my analysis, the advantages of this design are that:</p>
<ul>
<li><p>Stage 1 can isolate title blocks, revision tables, notes, and callouts so Stage 2 never has to guess what region matters</p>
</li>
<li><p>Stage 2 can run at high resolution on tight crops, which is exactly what tiny labels need</p>
</li>
<li><p>Stage 3 can output clean Markdown for inspection and JSON for downstream matching</p>
</li>
</ul>
<p>If you want to see this model applied to engineering drawings, read my post <a target="_blank" href="https://hashnode.com/edit/cmk3kwptz000a02kz2fezaphi">OCR on Engineering Drawings with a 0.9B Vision-Language Model</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Multimodal LLMs in Healthcare: What's Actually Working]]></title><description><![CDATA[If you've tried asking ChatGPT to interpret a chest X-ray, you know the answer: it can't. Not because the technology doesn't exist, but because most general-purpose models weren't built for medical imaging.
That's changing fast. A new generation of v...]]></description><link>https://thedatasense.com/multimodal-llms-in-healthcare-whats-actually-working</link><guid isPermaLink="true">https://thedatasense.com/multimodal-llms-in-healthcare-whats-actually-working</guid><category><![CDATA[#multimodalai]]></category><category><![CDATA[healthcare]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Fri, 02 Jan 2026 01:33:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_9vLJxxHrBo/upload/af6af3544c2446b4478251907689f1ae.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've tried asking ChatGPT to interpret a chest X-ray, you know the answer: it can't. Not because the technology doesn't exist, but because most general-purpose models weren't built for medical imaging.</p>
<p>That's changing fast. A new generation of vision-language models can now look at scans, read clinical notes, and answer questions about both. Some of these models are already matching specialist performance on diagnostic benchmarks.</p>
<p>Here's what's actually working, what's still experimental, and what it takes to deploy these systems safely.</p>
<h2 id="heading-why-healthcare-needs-multimodal-ai">Why Healthcare Needs Multimodal AI</h2>
<p>Healthcare data doesn't fit neatly into text or images alone. A single patient encounter might include X-rays, MRI scans, lab results, vital signs over time, and pages of clinical notes. Traditionally, each data type required its own specialized model.</p>
<p>Multimodal models change this. A single architecture can detect subtle abnormalities in a scan, summarize a 20-page discharge summary, spot concerning trends in vital signs, and explain its reasoning in plain language. The potential is obvious: faster diagnoses, fewer missed findings, less cognitive load on clinicians.</p>
<p>But potential and reality are different things. Let's look at the models that are actually delivering results.</p>
<h2 id="heading-vision-language-models-that-work-on-medical-images">Vision-Language Models That Work on Medical Images</h2>
<p>These models take an image and a question, then return a text answer. The architecture typically combines a vision encoder (to "see" the image) with a language model (to understand questions and generate responses).</p>
<h3 id="heading-llava-med-15">LLaVA-Med 1.5</h3>
<p>LLaVA-Med pairs a CLIP vision encoder with Vicuna, a 13B parameter language model. The team trained it on 200,000 image-text pairs from PubMed Central, supplemented with synthetic instructions generated by GPT-4.</p>
<p>The results are solid. On radiology and pathology question-answering benchmarks, it matches or beats prior approaches. The architecture is straightforward: the vision encoder extracts image features, an MLP projects them into the language model's embedding space, and the language model handles the rest.</p>
<h3 id="heading-visual-med-alpaca">Visual Med-Alpaca</h3>
<p>This one takes a different approach. Instead of a single end-to-end model, Visual Med-Alpaca uses a routing system. A classifier first determines what type of input it's dealing with, then dispatches to specialized experts (Med-GIT for general medical images, DePlot for charts and graphs). The outputs feed into a LLaMA-7B core with LoRA adapters.</p>
<p>Training data came from 54,000 Q&amp;A pairs drawn from BigBIO and ROCO radiology datasets. The team used GPT-3.5 to generate additional prompts, then filtered them with human review.</p>
<p>One caveat: this is strictly research-use only, with no FDA approval.</p>
<h3 id="heading-chexagent">CheXagent</h3>
<p>CheXagent focuses specifically on chest X-rays. The image encoder (SigLIP-Large) processes 512×512 pixel images through 24 transformer layers. A projection MLP maps those features into a Phi-2.7B decoder trained on medical and scientific text.</p>
<p>The training corpus is impressive: over one million chest X-ray and report pairs, plus 2.7 billion tokens from clinical notes and research articles. The intended use cases include drafting radiology reports, flagging abnormalities, and explaining findings to patients.</p>
<h3 id="heading-medgemma-4b-it">MedGemma-4B-IT</h3>
<p>Google's entry into this space launched in July 2025. It's a decoder-only transformer with 4B parameters, built on the Gemma 3 base. The SigLIP image encoder was pre-trained on de-identified data spanning chest X-rays, dermatology, ophthalmology, and histopathology.</p>
<p>The context window is generous: 128K tokens of text plus images (each 896×896 image converts to 256 tokens). Here's how it compares to the base Gemma model:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Task</td><td>Base Gemma 3 4B</td><td>MedGemma 4B-IT</td></tr>
</thead>
<tbody>
<tr>
<td>MIMIC-CXR macro F1 (top 5)</td><td>81.2</td><td>88.9</td></tr>
<tr>
<td>CheXpert macro F1 (top 5)</td><td>32.6</td><td>48.1</td></tr>
<tr>
<td>CXR14 macro F1 (3 conditions)</td><td>32.0</td><td>50.1</td></tr>
<tr>
<td>SLAKE VQA token F1</td><td>40.2</td><td>72.3</td></tr>
<tr>
<td>PathMCQA histopathology accuracy</td><td>37.1</td><td>69.8</td></tr>
<tr>
<td>EyePACS fundus accuracy</td><td>14.4</td><td>64.9</td></tr>
</tbody>
</table>
</div><p>The improvements are substantial across the board. MedGemma is available on Hugging Face under the Health AI Developer Foundations license, with fine-tuning notebooks on GitHub.</p>
<h2 id="heading-language-models-for-clinical-text">Language Models for Clinical Text</h2>
<p>Not everything in healthcare is an image. Electronic health records contain millions of words: admission notes, progress updates, discharge summaries, lab interpretations. Models trained specifically on this text outperform general-purpose LLMs.</p>
<h3 id="heading-gatortron">GatorTron</h3>
<p>GatorTron comes in sizes from 110M to 8.9B parameters, all trained on 82 billion words of de-identified clinical text. The researchers tested it on concept extraction, relation extraction, clinical inference, and question answering. The finding won't surprise anyone who's followed scaling laws: bigger models and more data improve everything.</p>
<h3 id="heading-few-shot-health-learners">Few-Shot Health Learners</h3>
<p>This work from Google explores whether large language models can handle time-series health data with minimal examples. Starting from PaLM-24B (pre-trained on 780B tokens), the team fine-tuned on ECG waveforms and vital signs using just a handful of examples per task.</p>
<p>The results suggest that LLMs can ground numeric health data surprisingly well. Applications include arrhythmia detection, activity recognition, and estimating calorie expenditure or stress levels from sensor data.</p>
<h2 id="heading-how-do-you-test-these-models">How Do You Test These Models?</h2>
<p>Benchmarks matter. A model that aces one dataset might fail completely in a real clinical setting. Here are the validation sets researchers are using:</p>
<p><strong>NEJM Clinicopathologic Cases</strong> contains 143 diagnostic puzzles from 2021 to 2024, scored on a Bond Scale (0-5) and Likert scale (0-2). These are the kind of cases that stump experienced clinicians.</p>
<p><strong>NEJM Healer Series</strong> walks models through 20 complete patient encounters across four stages: triage, examination, testing, and management. Scoring uses the R-IDEA rubric (0-10).</p>
<p><strong>Grey Matters Management</strong> presents 5 complex scenarios scored on a 100-point rubric. Notably, this benchmark compares GPT-4 against physicians working with and without AI assistance.</p>
<p><strong>MIMIC-IV-Ext Clinical Decision Making</strong> draws from 2,400 emergency department visits for abdominal pain, testing whether models can distinguish appendicitis, cholecystitis, diverticulitis, and pancreatitis.</p>
<p><strong>Probabilistic Reasoning Challenges</strong> test whether models can perform Bayesian inference with lab results. This matters because clinical decision-making is fundamentally probabilistic, and models that give false confidence are dangerous.</p>
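<p>The Bayesian updating these challenges test is easy to state in code. A sketch with hypothetical numbers (10% pre-test probability, a test with 90% sensitivity and 80% specificity):</p>

```python
def post_test_probability(prior, sensitivity, specificity):
    """Bayes' rule for the probability of disease given a positive test."""
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# Hypothetical inputs: 10% pre-test probability, 90% sensitivity, 80% specificity.
p = post_test_probability(prior=0.10, sensitivity=0.90, specificity=0.80)
# The positive result raises the probability only to about 33%.
```

<p>A model that jumps from a 10% prior to near-certainty on a single positive test is showing exactly the false confidence this benchmark is designed to catch.</p>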
<h2 id="heading-what-it-takes-to-deploy-safely">What It Takes to Deploy Safely</h2>
<p>Research performance doesn't guarantee safe clinical use. Several factors separate a promising paper from a deployable system.</p>
<p><strong>Privacy</strong> is non-negotiable. Patient data must be de-identified and encrypted. Models trained on identifiable data face both legal liability and the risk of memorizing sensitive information.</p>
<p><strong>Generalization</strong> trips up many models. Performance on one hospital's data often doesn't transfer to another institution with different patient populations, imaging equipment, or documentation practices. Diverse testing is essential.</p>
<p><strong>Explainability</strong> helps clinicians trust (and appropriately distrust) model outputs. Attention maps, saliency scores, and counterfactual explanations all help, though none fully solve the interpretability problem.</p>
<p><strong>Regulation</strong> remains unsettled. The FDA and CE marking bodies are still working out how to evaluate AI that learns and updates. Liability questions are largely unresolved.</p>
<h2 id="heading-where-this-is-heading">Where This Is Heading</h2>
<p>The immediate future is clear: these models will get better, handle more modalities, and integrate more tightly into clinical workflows.</p>
<p>Longer term, expect models that incorporate genomic data, wearable sensor streams, and even environmental factors. Real-time decision support integrated directly into EHRs is coming. Personalization based on individual patient histories will follow.</p>
<p>The harder problems are institutional and regulatory. Who's liable when an AI-assisted diagnosis is wrong? How do you validate a model that keeps learning? What does informed consent look like when AI is involved in care decisions?</p>
<p>Multimodal LLMs will transform healthcare. The technology is nearly ready. The question is whether our institutions can adapt fast enough to deploy it safely.</p>
<hr />
<h2 id="heading-references">References</h2>
<p>Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., &amp; Gao, J. (2023). LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. <em>arXiv preprint arXiv:2306.00890</em>.</p>
<p>Han, T., Adams, L. C., Papaioannou, J. M., Grundmann, P., Oberhauser, T., Löser, A., Truhn, D., &amp; Bressem, K. K. (2023). MedAlpaca: An open-source collection of medical conversational AI models and training data. <em>arXiv preprint arXiv:2304.08247</em>.</p>
<p>Chen, Z., Diao, S., Wang, B., Wang, H., Liu, T., Hu, Z., &amp; Jiang, L. (2024). CheXagent: Towards a foundation model for chest X-ray interpretation. <em>arXiv preprint arXiv:2401.12208</em>.</p>
<p>Google (2025). MedGemma: Medical vision-language models. <em>Google Health AI Developer Foundations</em>. Retrieved from <a target="_blank" href="https://huggingface.co/google/medgemma">https://huggingface.co/google/medgemma</a></p>
<p>Yang, X., Chen, A., PourNejatian, N., Shin, H. C., Smith, K. E., Parisien, C., Compas, C., Martin, C., Costa, A. B., Flores, M. G., Zhang, Y., Magoc, T., Harle, C. A., Lipori, G., Mitchell, D. A., Hogan, W. R., Shenkman, E. A., Bian, J., &amp; Wu, Y. (2022). GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records. <em>arXiv preprint arXiv:2203.03540</em>.</p>
<p>Rasul, K., Ashok, A., Williams, A. R., Khorasani, M., Adamopoulos, G., Bhagwatkar, R., Biloš, M., Ghonia, H., Hassen, N. V., Anderson, D., Schneider, J., Nevmyvaka, Y., &amp; Rätsch, G. (2023). Medical time-series data generation using generative adversarial networks. <em>Proceedings of Machine Learning Research</em>, 182.</p>
]]></content:encoded></item><item><title><![CDATA[When AI Radiologists Get Confused: The Critical Challenge of VLM Robustness in Medical Diagnostics]]></title><description><![CDATA[Picture this: You’re in the emergency room with chest pain and shortness of breath. The doctor orders a chest X-ray, and while waiting for the radiologist, you pull out your phone. Could ChatGPT help interpret what’s wrong? You’ve used it for math pr...]]></description><link>https://thedatasense.com/when-ai-radiologists-get-confused-the-critical-challenge-of-vlm-robustness-in-medical-diagnostics</link><guid isPermaLink="true">https://thedatasense.com/when-ai-radiologists-get-confused-the-critical-challenge-of-vlm-robustness-in-medical-diagnostics</guid><category><![CDATA[vlms]]></category><category><![CDATA[healthcare]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 15 Oct 2025 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/NMZdj2Zu36M/upload/0ff90e28843bf67f86bbe3318b790a8d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Picture this: You’re in the emergency room with chest pain and shortness of breath. The doctor orders a chest X-ray, and while waiting for the radiologist, you pull out your phone. Could ChatGPT help interpret what’s wrong? You’ve used it for math problems and recipe suggestions. Surely it could read an X-ray?</p>
<p>This isn’t a hypothetical anymore. We’re already there. An Australian study found that 9.9% of adults had used ChatGPT for health questions in just six months, with nearly 40% of non-users considering it. When people get health advice from ChatGPT, nearly half simply follow it. No questions asked, no double-checking with their doctor.</p>
<p>Our recent research reveals that when these sophisticated AI systems move from answering text questions to interpreting medical images, they become dangerously brittle. We evaluated 125 chest X-ray interpretations using state-of-the-art vision language models, including Google’s MedGemma 4B and GPT-4V. Simply changing “vascular dilation” to “vascular congestion” in a question made the same AI system provide completely different diagnoses for the same X-ray image.</p>
<hr />
<h2 id="heading-the-promise-of-vlm-based-radiologists"><strong>The Promise of VLM-Based Radiologists</strong></h2>
<p>When we first started working with vision language models for medical imaging, the promise seemed clear. Unlike traditional AI that just spits out labels like “pneumonia: 87% probability,” these models could actually explain what they saw. They’d tell you why they thought something looked abnormal. You could ask follow-up questions. Feed them a patient’s history and watch them adjust their interpretation accordingly.</p>
<p>But here’s what we discovered matters just as much as accuracy: consistency. We call it robustness in the lab, but what it really means is whether the AI gives you the same answer when you ask the same question slightly differently. Think about it. A radiologist doesn’t suddenly see pneumonia just because you say “chest radiograph” instead of “X-ray.” They know “lung volumes” and “lung capacity” mean the same thing.</p>
<p>Yet that’s exactly what happens with today’s most advanced models. And we’re not talking about edge cases or trick questions. We tested basic medical synonyms, the kind any first-year resident would recognize as identical. The models fell apart.</p>
<hr />
<h2 id="heading-when-terminology-becomes-a-diagnostic-trap"><strong>When Terminology Becomes a Diagnostic Trap</strong></h2>
<p>Let’s walk through real examples from our evaluation that show how catastrophically these models can fail.</p>
<h3 id="heading-case-1-the-vascular-confusion"><strong>Case 1: The Vascular Confusion</strong></h3>
<p>We showed a chest X-ray to one of the most advanced vision language models available and asked about vascular findings. The model correctly identified pulmonary vascular dilation, which is exactly what we’d expect. It’s a widening of blood vessels that might indicate various conditions but isn’t immediately life-threatening.</p>
<p>Then we changed two words. Just two. “Vascular dilation” became “vascular congestion.”</p>
<p><img src="https://bineshkumar.me/assets/case-1.webp" alt /></p>
<p>Suddenly the model was talking about cardiac congestion. Possible heart failure. Recommending completely different follow-up procedures. Same image, nearly identical question, completely different medical pathway. The clinical implications hit us immediately. A patient might get rushed into unnecessary cardiac workup while their actual condition goes untreated. Or worse, someone might start urgent cardiac treatment for what’s actually a non-cardiac issue.</p>
<h3 id="heading-case-2-the-imaginary-pneumonia"><strong>Case 2: The Imaginary Pneumonia</strong></h3>
<p>This one still makes us shake our heads. We had an X-ray showing clear pleural effusion. That’s fluid around the lungs, often serious enough to need drainage. The model saw it correctly when we asked about lung findings.</p>
<p>But when we added the phrase “chest radiograph” to our question? The model suddenly “saw” pneumonia that wasn’t there.</p>
<p><img src="https://bineshkumar.me/assets/case-2.webp" alt /></p>
<p>It didn’t just add pneumonia to its diagnosis. It completely forgot about the pleural effusion and started recommending antibiotics. This isn’t just wrong. It’s actively harmful. A patient with fluid crushing their lungs needs drainage, not antibiotics for an infection that doesn’t exist.</p>
<h3 id="heading-case-3-the-vanishing-diagnosis"><strong>Case 3: The Vanishing Diagnosis</strong></h3>
<p>Perhaps most concerning was when changing “lung volumes” to “lung capacity” made critical findings disappear entirely. The model went from correctly identifying pleural effusion and potential cardiac issues to completely missing the effusion and focusing only on cardiac problems.</p>
<p><img src="https://bineshkumar.me/assets/case-3.webp" alt /></p>
<p>Pleural effusion can kill you if it’s not treated. It can lead to respiratory failure. Yet a simple synonym made the AI blind to its presence. The model confidently described other findings while missing the one thing that might send someone to the ICU.</p>
<p>What makes these failures so unsettling is their unpredictability. You can’t train staff to avoid certain phrases or create a list of “safe” terminology. The brittleness runs deeper than that.</p>
<hr />
<h2 id="heading-making-sense-of-the-brittleness-what-were-learning-in-the-lab"><strong>Making Sense of the Brittleness: What We’re Learning in the Lab</strong></h2>
<p>The brittleness we observed in chest X-ray VLMs sent us down a research rabbit hole. How could models that seem so sophisticated fail so spectacularly when we barely changed our words? We needed a systematic way to measure this vulnerability, which led us to develop VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) here at the SAIL Lab at University of New Haven.</p>
<h3 id="heading-vsf-med-our-systematic-approach"><strong>VSF-Med: Our Systematic Approach</strong></h3>
<p>VSF-Med isn’t just another benchmark. It’s our attempt to quantify exactly how and why these models break. We evaluated 68,478 attack scenarios across five models:</p>
<ul>
<li><p><strong>CheXagent 8B</strong>: Medical-specialized model</p>
</li>
<li><p><strong>Llama 3.2 11B Vision</strong>: General-purpose VLM</p>
</li>
<li><p><strong>GPT-4o</strong>: State-of-the-art multimodal model</p>
</li>
<li><p><strong>Google MedGemma 4B</strong>: Medical-focused model</p>
</li>
<li><p><strong>GPT-4V</strong>: Previous generation flagship</p>
</li>
</ul>
<p>The framework measures vulnerability across nine different attack vectors, giving us concrete numbers for what we’d been observing anecdotally.</p>
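<p>To make the style of check concrete, here is a minimal consistency probe of the kind the framework automates. Everything in it is illustrative, not VSF-Med's actual API: <code>ask</code> stands in for whatever VLM query function you have, and the toy model and phrasing variants are invented to mirror the dilation/congestion failure described above.</p>

```python
def consistency_rate(ask, image_id, question, variants):
    """Fraction of rephrasings that leave the model's answer unchanged.
    `ask(image_id, question)` is a stand-in for any VLM query function."""
    baseline = ask(image_id, question)
    stable = sum(ask(image_id, v) == baseline for v in variants)
    return stable / len(variants)

# Toy stand-in model that flips its answer on the word "congestion",
# mimicking the terminology trap described in Case 1:
def toy_vlm(image_id, question):
    return "cardiac congestion" if "congestion" in question else "vascular dilation"

question = "Describe the vascular dilation findings."
variants = [
    "Describe the vascular widening findings.",    # benign rephrasing
    "Describe the vascular congestion findings.",  # the trap phrase
]
rate = consistency_rate(toy_vlm, "cxr_001", question, variants)  # 0.5
```

<p>A robust model scores 1.0 on such a probe; every point below that is a phrasing that changes the diagnosis.</p>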
<h3 id="heading-the-sobering-results"><strong>The Sobering Results</strong></h3>
<p>What we found was concerning. Even CheXagent 8B, a model specifically trained for medical imaging, showed moderate vulnerability (z-score: 0.68) to our prompt injection attacks. That “dilation vs congestion” problem we showed you earlier? Not an isolated incident.</p>
<p><strong>Key findings</strong>:</p>
<ul>
<li><p>Performance scores ranged from <strong>7 to 97</strong> across different phrasings of identical X-ray questions</p>
</li>
<li><p>Medical-specialized models demonstrated only <strong>36% better resilience</strong> compared to general-purpose VLMs</p>
</li>
<li><p>Current best-in-class models still have vulnerability spreads exceeding 0.3 standard deviations</p>
</li>
</ul>
<p>Better than general models, yes. Good enough for clinical use? Not even close.</p>
<h3 id="heading-why-are-these-models-so-fragile"><strong>Why Are These Models So Fragile?</strong></h3>
<p>We have theories we’re exploring:</p>
<ol>
<li><p><strong>Contrastive Learning Issues</strong>: Many vision-language models use contrastive learning during training, where they learn to match images with text descriptions. This might create brittle associations between specific phrases and visual features.</p>
</li>
<li><p><strong>The Alignment Problem</strong>: These models are fine-tuned to be helpful and responsive, which might make them overeager to provide different answers when prompted differently. They’re trying so hard to be useful that they forget to be consistent.</p>
</li>
<li><p><strong>Medical Language Complexity</strong>: Radiological language is precise but full of synonyms. Models trained on general text might not grasp that “increased opacity” and “increased density” mean the same thing in a chest X-ray context.</p>
</li>
<li><p><strong>Architectural Limitations</strong>: The transformer architecture itself might contribute to this sensitivity. Attention mechanisms that work beautifully for general language tasks might amplify small prompt variations in high-stakes medical contexts.</p>
</li>
</ol>
<hr />
<h2 id="heading-what-this-means-for-medical-ai"><strong>What This Means for Medical AI</strong></h2>
<p>The momentum behind medical AI is undeniable. Just this week, Mayo Clinic Press highlighted how AI is already being used for stroke diagnosis, heart failure detection, and cancer screening. They describe an optimistic future where “AI has the potential to improve the work of human healthcare teams, making care more personal and effective.”</p>
<p>While this enthusiasm is understandable given AI’s promise, our research suggests we need to address fundamental robustness issues before these systems can truly deliver on that potential.</p>
<h3 id="heading-safety-implications"><strong>Safety Implications</strong></h3>
<ol>
<li><p><strong>Diagnostic Inconsistency</strong>: Same image, different terminology = different diagnoses</p>
</li>
<li><p><strong>Clinical Risk</strong>: Unreliable AI could misguide medical decisions</p>
</li>
<li><p><strong>Trust Issues</strong>: Healthcare providers need consistent, predictable AI behavior</p>
</li>
<li><p><strong>Patient Safety</strong>: When people follow AI medical advice without verification, inconsistency becomes dangerous</p>
</li>
</ol>
<h3 id="heading-the-real-world-context"><strong>The Real-World Context</strong></h3>
<p>Remember those statistics we opened with? People are already using these tools for health decisions. An Australian study found that people with limited health literacy use ChatGPT at nearly twice the rate of others (18.4% vs 9.4%). Those from non-English speaking backgrounds? Even higher at 29.2%. These are exactly the populations who might be most vulnerable to inconsistent AI responses.</p>
<hr />
<h2 id="heading-future-directions"><strong>Future Directions</strong></h2>
<p>Our research highlights the urgent need for:</p>
<ol>
<li><p><strong>Robust Training Methods</strong>: VLMs that maintain consistency across terminology variations</p>
</li>
<li><p><strong>Comprehensive Testing</strong>: Systematic evaluation of medical AI before clinical deployment</p>
</li>
<li><p><strong>Safety Frameworks</strong>: Guidelines for reliable medical AI implementation</p>
</li>
<li><p><strong>Architectural Innovations</strong>: New approaches that improve multimodal robustness</p>
</li>
</ol>
<h3 id="heading-open-science-commitment"><strong>Open Science Commitment</strong></h3>
<p>Our VSF-Med framework is completely open source because we believe this problem is too important for any single team to tackle alone. We’ve made it so researchers anywhere can benchmark their medical VLM with a single command, generating over 30,000 adversarial test cases automatically.</p>
<p>We’re diving deeper into architectural modifications that might improve robustness. We’re exploring whether different training objectives could create more stable image-text associations. And we’re working with clinicians to understand which types of brittleness pose the greatest real-world risks.</p>
<hr />
<h2 id="heading-the-path-forward"><strong>The Path Forward</strong></h2>
<p>Medical AI has immense potential to revolutionize healthcare, from reducing diagnostic errors to making expertise available in underserved areas. But as our research shows, we’re not there yet. The brittleness we’ve uncovered isn’t just a technical curiosity—it’s a fundamental barrier to safe clinical deployment.</p>
<p>Until we can get vulnerability spreads much, much lower, these systems remain too fragile for autonomous clinical use. This isn’t about being pessimistic about AI. It’s about being realistic about what needs to be fixed before we can responsibly deploy these powerful tools in life-and-death situations.</p>
<p>Because ultimately, this isn’t just an interesting technical puzzle. It’s about making sure AI tools genuinely help rather than harm when lives are on the line.</p>
]]></content:encoded></item><item><title><![CDATA[A guide to LLM evaluation metrics]]></title><description><![CDATA[No single metric reliably captures LLM output quality. But the right combination of metrics, carefully chosen for your task, gets surprisingly close to human judgment. This guide covers mathematical formulations, failure modes, and runnable code for ...]]></description><link>https://thedatasense.com/a-guide-to-llm-evaluation-metrics</link><guid isPermaLink="true">https://thedatasense.com/a-guide-to-llm-evaluation-metrics</guid><category><![CDATA[LLM's ]]></category><category><![CDATA[Evaluation]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 17 Sep 2025 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770512236838/51c54efa-7ace-4efe-91e8-dc10de27d416.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>No single metric reliably captures LLM output quality. But the right combination of metrics, carefully chosen for your task, gets surprisingly close to human judgment. This guide covers mathematical formulations, failure modes, and runnable code for every major evaluation metric, from classical perplexity through modern LLM-as-judge approaches.</p>
<p>The field has shifted fast since 2023. LLM-based judges now achieve over 80% agreement with human annotators. Meanwhile, n-gram metrics like BLEU persist largely through institutional inertia. Knowing when each metric works, and when it fails, is the difference between rigorous evaluation and self-deception.</p>
<p><a target="_blank" href="https://colab.research.google.com/drive/1pxS1oznBOaS23QAHGcMsZ7sr5gUcZhlb?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>
<p><em>Note: You can run this experiment using the free tier of Google Colab.</em></p>
<h2 id="heading-1-perplexity-and-bits-per-byte-the-intrinsic-baselines">1. Perplexity and bits-per-byte: the intrinsic baselines</h2>
<p>Perplexity remains the default intrinsic metric for language models. It's defined as the exponentiated average negative log-likelihood over a token sequence:</p>
<p>$$\text{PPL}(X) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{&lt;i})\right)$$</p><p>This equals <code>exp(cross-entropy loss)</code>, making it a direct readout of training loss. Lower perplexity means the model assigns higher probability to observed text. GPT-2 large scores about 16.4 PPL on WikiText-2 with sliding-window evaluation (stride=512), compared to 19.4 without overlap. That's a methodological detail that matters more than many researchers realize.</p>
<p>Here's the critical pitfall: <strong>tokenizer dependence</strong>. Perplexity normalizes per token, but different tokenizers produce different token counts for the same text. The Weighted Perplexity Benchmark (2025) showed tokenization differences affect measurements by up to 21.6% across 19 models on WikiText-2. Comparing Llama 2 (32K vocabulary) to Llama 3 (128K vocabulary) on perplexity is meaningless. Llama 3's per-token perplexity is higher simply because each token covers more underlying bytes.</p>
<p><strong>Bits-per-byte (BPB)</strong> solves this by normalizing total information content by UTF-8 bytes rather than tokens:</p>
<p>$$\text{BPB} = \frac{\text{total NLL in nats}}{\ln(2) \times \text{total bytes}}$$</p><p>Since byte count stays fixed regardless of tokenization, BPB enables fair cross-model comparison. Shannon estimated English entropy at about 1.0 to 1.3 bits per character. GPT-2 achieved 0.93 BPB on enwik8. Two models with identical predictive quality but different tokenizers can show perplexities of 20.09 vs 7.39, yet produce identical BPB of 1.08.</p>
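<p>The two normalizations are easy to compare side by side. This from-scratch sketch (standard library only) takes per-token negative log-likelihoods in nats, as an evaluation harness would extract them; the NLL values and segmentations are invented purely to show why per-token perplexity moves with the tokenizer while BPB does not:</p>

```python
import math

def ppl_and_bpb(token_nlls, text):
    """Perplexity = exp(mean NLL per token); BPB normalizes the same
    total NLL by UTF-8 bytes, so it ignores the tokenizer entirely."""
    total_nll = sum(token_nlls)                      # total information, in nats
    ppl = math.exp(total_nll / len(token_nlls))      # per-token normalization
    bpb = total_nll / (math.log(2) * len(text.encode("utf-8")))  # nats -> bits, per byte
    return ppl, bpb

text = "hello world"                    # 11 UTF-8 bytes either way
coarse = [2.0, 2.0]                     # hypothetical 2-token segmentation
fine = [1.0, 1.0, 1.0, 1.0]             # hypothetical 4-token segmentation
ppl_c, bpb_c = ppl_and_bpb(coarse, text)
ppl_f, bpb_f = ppl_and_bpb(fine, text)
# Same total NLL (4.0 nats): PPL differs (e^2 ≈ 7.39 vs e^1 ≈ 2.72),
# but BPB is identical (≈ 0.52) because bytes don't change.
```
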
<p>Recent work has exposed deeper problems. Fang et al. (ICLR 2025) showed that standard perplexity averages across all tokens equally, masking poor performance on "key tokens" that are essential for long-context understanding. Their LongPPL metric focuses on key tokens via a long-short context contrastive method and achieves −0.96 Pearson correlation with downstream benchmarks, versus near-zero for standard PPL. Kuribayashi et al. separately demonstrated that lower perplexity doesn't always correlate with more human-like text processing.</p>
<p><strong>When to use perplexity:</strong> comparing checkpoints within the same model family. <strong>When to use BPB:</strong> cross-model comparison. <strong>When to avoid both:</strong> measuring output quality, fluency, or task performance. They measure model fit to data, not generation quality.</p>
<h2 id="heading-2-n-gram-overlap-metrics-still-everywhere-often-wrong">2. N-gram overlap metrics: still everywhere, often wrong</h2>
<p>Despite well-documented limitations, BLEU and ROUGE remain the most-cited evaluation metrics in NLP. A 2025 analysis of 14,171 papers across four major NLP conferences found that 63.6% of papers using BLEU provide no implementation details. That's a reproducibility crisis hiding in plain sight.</p>
<h3 id="heading-bleu-precision-over-substance">BLEU: precision over substance</h3>
<p>BLEU computes a weighted geometric mean of modified n-gram precisions, multiplied by a brevity penalty:</p>
<p>$$\text{BLEU} = \text{BP} \times \exp\left(\sum w_n \times \log p_n\right)$$</p><p>where BP = exp(1 − r/c) if c ≤ r, else 1. Modified precision clips n-gram counts against maximum reference counts to prevent gaming through repetition. Standard BLEU-4 uses uniform weights (w₁ = w₂ = w₃ = w₄ = 0.25).</p>
<p>The original designers built BLEU for corpus-level machine translation. Applying it to single sentences causes the geometric mean to collapse to zero when any n-gram precision hits zero, which happens frequently for short sentences. The <code>sacrebleu</code> library exists specifically to fix reproducibility. It produces a version signature string (e.g., <code>BLEU|nrefs:1|case:mixed|tok:13a|smooth:exp|version:2.0.0</code>) that ensures exact reproducibility. Always use <code>sacrebleu</code> for paper-reportable scores. Never roll your own tokenization.</p>
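<p>The clipping and collapse mechanics are easier to see in code than in the formula. This is a deliberately minimal from-scratch sketch for intuition only; for any reportable score, use <code>sacrebleu</code> as noted above:</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions combined
    by a uniform-weight geometric mean, times the brevity penalty."""
    c, r = len(candidate), len(reference)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())  # clip vs reference
        if clipped == 0:
            return 0.0          # one zero precision collapses the geometric mean
        log_p += math.log(clipped / sum(cand.values())) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_p)

cand = "the cat sat on the mat".split()
perfect = bleu(cand, cand)                                  # identical strings -> 1.0
paraphrase = bleu("the feline rested upon the rug".split(), cand)
# bigram precision is already zero, so the whole score collapses to 0.0
```

<p>The paraphrase scoring exactly zero, despite being a reasonable rewording, is the failure mode discussed throughout this guide.</p>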
<h3 id="heading-rouge-recall-oriented-but-semantically-blind">ROUGE: recall-oriented but semantically blind</h3>
<p>ROUGE computes n-gram recall (plus precision and F1):</p>
<p>$$\text{ROUGE-N recall} = \frac{\sum \text{Count\_match}(\text{gram}_n)}{\sum \text{Count}(\text{gram}_n \text{ in reference})}$$</p><p>ROUGE-L uses the Longest Common Subsequence (LCS), which captures word ordering without requiring contiguity. ROUGE-Lsum splits on newlines for multi-sentence evaluation. State-of-the-art summarization models typically achieve ROUGE-1: 40-47%, ROUGE-2: 18-28% on news benchmarks.</p>
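<p>Both ROUGE-N recall and the LCS core of ROUGE-L fit in a few lines. This is an educational sketch of the formulas above (the <code>rouge_score</code> package adds stemming and F1 handling that this omits):</p>

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams over total reference n-grams."""
    grams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    return matched / max(sum(ref.values()), 1)

def lcs_len(a, b):
    """Longest common subsequence length (the core of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = "the cat sat on the mat".split()
cand = "the mat the cat sat on".split()
r1 = rouge_n_recall(cand, ref, 1)   # 1.0: every reference unigram is covered
l = lcs_len(cand, ref)              # 4: only "the cat sat on" survives in order
```

<p>The gap between the two numbers is the point: bag-of-words recall is perfect while the LCS penalizes the scrambled ordering.</p>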
<h3 id="heading-meteor-the-forgotten-improvement">METEOR: the forgotten improvement</h3>
<p>METEOR creates alignments through four stages: exact match, stemming, synonym (WordNet), and paraphrase. It then computes a recall-weighted harmonic mean with a fragmentation penalty:</p>
<p>$$\text{METEOR} = F_\text{mean} \times (1 - \gamma \times (\text{chunks}/\text{matched})^\beta)$$</p><p>It achieves Pearson correlation of 0.964 at corpus level (vs. BLEU's 0.817). Yet it remains underused due to WordNet dependency and version sensitivity, where scores can differ ±10 points between v1.0 and v1.5.</p>
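<p>Given alignment statistics, the scoring step above is simple to reproduce. This sketch takes the match and chunk counts as inputs rather than computing the four alignment stages; the parameter defaults (α = 0.9, β = 3, γ = 0.5) follow the commonly cited METEOR settings and should be treated as assumptions here:</p>

```python
def meteor_score(matches, chunks, cand_len, ref_len,
                 alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR scoring from alignment statistics: recall-weighted
    harmonic mean of P and R, discounted by a fragmentation penalty."""
    if matches == 0:
        return 0.0
    p = matches / cand_len
    r = matches / ref_len
    f_mean = p * r / (alpha * p + (1 - alpha) * r)    # weights recall ~9:1 over precision
    penalty = gamma * (chunks / matches) ** beta      # more chunks = more fragmentation
    return f_mean * (1 - penalty)

# Perfect alignment in a single contiguous chunk over 6 tokens:
s = meteor_score(matches=6, chunks=1, cand_len=6, ref_len=6)  # just under 1.0
```
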
<p>A new contender worth watching: the GEM metric (ICLR 2025), a reference-free approach based on mutual information, now outperforms BLEU, ROUGE-L, BERTScore, and BARTScore in correlation with human annotations, while also resisting manipulation.</p>
<h2 id="heading-3-embedding-based-metrics-semantics-at-a-cost">3. Embedding-based metrics: semantics at a cost</h2>
<h3 id="heading-bertscore-greedy-matching-in-embedding-space">BERTScore: greedy matching in embedding space</h3>
<p>BERTScore extracts contextual embeddings from a pre-trained model, then uses greedy cosine-similarity matching between candidate and reference tokens:</p>
<p>$$R_\text{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \cos(x_i, \hat{x}_j)$$</p><p>$$P_\text{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \cos(x_i, \hat{x}_j)$$</p><p>$$F_\text{BERT} = 2 \cdot P \cdot R / (P + R)$$</p><p>The default model is <code>roberta-large</code> (layer 17), but <code>microsoft/deberta-xlarge-mnli</code> achieves the highest Pearson correlation with human judgments. Without <strong>baseline rescaling</strong>, scores cluster in a narrow range (0.92 to 1.0 for RoBERTa), making interpretation hard. Rescaling maps a raw average around 0.93 onto a more readable 0.58.</p>
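<p>The greedy-matching step is the whole trick, and it works on any token embeddings. In this sketch the 2-d vectors are toy stand-ins for the contextual embeddings a model like <code>roberta-large</code> would produce:</p>

```python
def greedy_bertscore(cand_emb, ref_emb):
    """BERTScore-style greedy matching: each token grabs its best
    cosine match on the other side; recall averages over reference
    tokens, precision over candidate tokens."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))
    recall = sum(max(cos(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cos(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-d "embeddings": each candidate token nearly matches one reference token.
ref = [[1.0, 0.0], [0.0, 1.0]]
cand = [[0.9, 0.1], [0.1, 0.9]]
p, r, f1 = greedy_bertscore(cand, ref)   # all three land just below 1.0
```

<p>Note the matching is greedy and 1-to-many in principle: nothing stops two reference tokens from claiming the same candidate token, which is exactly what MoverScore's optimal transport formulation fixes.</p>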
<p>Three limits matter here. First, a 512-token maximum means longer texts get silently truncated. Second, Sun et al. (EMNLP 2022) demonstrated social bias across 6 sensitive attributes ("BERTScore is Unfair"). Third, changing the underlying model can flip rankings between systems.</p>
<h3 id="heading-moverscore-optimal-transport-over-embeddings">MoverScore: optimal transport over embeddings</h3>
<p>MoverScore formulates evaluation as an Earth Mover's Distance problem. Instead of BERTScore's greedy 1-to-1 matching, it uses globally optimal soft alignment:</p>
<p>$$\text{MoverScore}(x, \hat{x}) = 1 - \text{EMD}(x, \hat{x})$$</p><p>This allows many-to-one alignments, which matter when one concept gets expressed with multiple words. On WMT17, MoverScore achieved Pearson correlation of 0.743 vs BERTScore's 0.719. But the improvement is marginal, the <code>moverscore</code> PyPI package is inactive, and the O(n³) optimal transport computation runs substantially slower.</p>
<p><strong>When to use BERTScore:</strong> paraphrase detection and semantic similarity evaluation. <strong>When to avoid it:</strong> texts exceeding 512 tokens, fairness-sensitive applications, or when factual correctness (not semantic similarity) is the target.</p>
<h2 id="heading-4-llm-as-judge-the-new-standard-with-known-failure-modes">4. LLM-as-judge: the new standard, with known failure modes</h2>
<h3 id="heading-g-eval-structured-llm-scoring-with-probability-weighting">G-Eval: structured LLM scoring with probability weighting</h3>
<p>G-Eval (Liu et al., EMNLP 2023) achieves Spearman ρ = 0.514 on SummEval, the highest automated correlation with human judgment at its time. The algorithm works in three steps.</p>
<p>First, define evaluation criteria and generate Chain-of-Thought evaluation steps via the LLM. Second, present the text with these steps and ask for a 1-5 score. Third, and this is the key innovation, extract token logprobs for score tokens {1, 2, 3, 4, 5} and compute a probability-weighted score:</p>
<p>$$\text{score} = \frac{\sum(i \times P(i))}{\sum P(i)}, \quad i \in \{1,2,3,4,5\}$$</p><p>This produces continuous, fine-grained scores that avoid the tie problem plaguing direct integer scoring.</p>
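<p>The weighting step itself is a one-liner once you have logprobs. The dictionary below is a hypothetical judge response, shaped the way a logprobs-exposing API (OpenAI-compatible or vLLM) might return it:</p>

```python
import math

def probability_weighted_score(score_logprobs):
    """G-Eval style scoring: convert logprobs of the score tokens
    "1".."5" into a probability-weighted, continuous score."""
    probs = {int(tok): math.exp(lp) for tok, lp in score_logprobs.items()}
    return sum(i * p for i, p in probs.items()) / sum(probs.values())

# Hypothetical logprobs a judge model might assign to each score token:
logprobs = {"1": -5.0, "2": -3.0, "3": -0.7, "4": -0.9, "5": -2.3}
score = probability_weighted_score(logprobs)   # ≈ 3.5, finer than any integer vote
```

<p>Two outputs that would both round to "3" or "4" under direct integer scoring get distinguishable continuous scores here, which is why the logprob weighting avoids the tie problem.</p>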
<p>The <code>deepeval</code> library provides a production-ready G-Eval wrapper. For open-source implementation, serve models via vLLM (which supports logprobs natively) and use the same OpenAI-compatible client interface.</p>
<h3 id="heading-alpacaeval-20-and-the-length-control-breakthrough">AlpacaEval 2.0 and the length-control breakthrough</h3>
<p>AlpacaEval 2.0 (Dubois et al., COLM 2024) introduced Length-Controlled (LC) win rate, fitting a GLM to predict win probability conditioned on zero length difference. This increased Spearman correlation with Chatbot Arena from 0.94 to 0.98 and reduced gameability from 21% to 6%.</p>
<p>The numbers tell the story clearly. Without LC, GPT-4-1106's win rates fluctuate from 35.3% to 64.3% based purely on verbosity prompts. With LC, the range narrows to 41.9% to 51.6%.</p>
<h3 id="heading-mt-bench-and-arena-hard">MT-Bench and Arena-Hard</h3>
<p>MT-Bench evaluates 80 multi-turn questions across 8 categories (writing, roleplay, extraction, reasoning, math, coding, STEM, humanities) using GPT-4 as a 1-10 grader. Arena-Hard-Auto (2024) extends this with 500 challenging prompts, achieving 89.1% agreement with Chatbot Arena and 87.4% separability. That's far better than MT-Bench at distinguishing frontier models.</p>
<h3 id="heading-2024-2025-developments-worth-tracking">2024-2025 developments worth tracking</h3>
<p><strong>JudgeBench</strong> (ICLR 2025) is a sobering benchmark for evaluating judges themselves. The best model achieves only 64% accuracy (Claude-3.5-Sonnet), and fine-tuned judges often perform below random baseline.</p>
<p>The <strong>CALM framework</strong> (ICLR 2025) identified 12 distinct bias types in LLM judges: position, verbosity, fallacy oversight, sentiment, authority, beauty, self-enhancement, refinement, knowledge, format, cultural, and anchoring biases. That's a long list, and it explains why single-run LLM evaluations are unreliable.</p>
<p><strong>WildBench</strong> achieves Pearson 0.98 correlation with Chatbot Arena using real-world tasks with task-specific checklists and length penalties.</p>
<p>And the multi-agent trend is accelerating. Self-MoA (2025) samples a single top LLM multiple times and achieves 65.7% LC win rate on AlpacaEval 2.0, outperforming heterogeneous multi-model ensembles at 59.1%.</p>
<h2 id="heading-5-combining-metrics-practical-recommendations">5. Combining metrics: practical recommendations</h2>
<p>No single metric captures all quality dimensions. The LMSYS team found that triangulating relative model performance with MT-Bench and AlpacaEval provides the best benchmark. And Tang et al. (NAACL 2024) showed that simply diversifying references via LLM-generated paraphrases significantly improves the correlation of even classical metrics with human judgments.</p>
<p>Here's what works by task:</p>
<p><strong>Machine translation:</strong> sacrebleu + COMET (now dominant in WMT shared tasks) + chrF. Optionally add GEMBA-MQM for LLM-based quality estimation.</p>
<p><strong>Summarization:</strong> ROUGE-L + BERTScore + a factual consistency metric + G-Eval for coherence and fluency.</p>
<p><strong>Open-ended generation:</strong> LLM-as-judge with structured rubrics (G-Eval style) + MAUVE for distribution-level comparison + human spot-checks.</p>
<p><strong>Code generation:</strong> pass@k for functional correctness + CodeBLEU. SWE-Judge for more realistic scenarios.</p>
<p><strong>Instruction following:</strong> IFEval for verifiable constraints + MT-Bench for multi-turn quality.</p>
<p>One more thing. Anthropic's paper "Adding Error Bars to Evals" (Miller, Nov 2024) provides essential statistical guidance. Clustered standard errors can be 3× larger than naive standard errors when questions are grouped. Paired difference tests eliminate question-difficulty variance when comparing models. And power analysis determines required evaluation set sizes. Always report confidence intervals. A 2-point improvement is meaningless without knowing the standard error.</p>
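<p>A paired comparison with a bootstrap CI is easy to add to any harness. This is a generic sketch of the idea, not the paper's exact procedure, and the per-question scores below are invented for illustration:</p>

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap over per-question score differences. Pairing
    removes question-difficulty variance from the comparison; the
    bootstrap percentiles give an approximate 95% CI on the mean gap."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    boot_means = []
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in diffs]   # resample pairs, not models
        boot_means.append(sum(resample) / len(resample))
    boot_means.sort()
    mean = sum(diffs) / len(diffs)
    return mean, boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]

# Hypothetical per-question scores for two models on the same eval set:
a = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.75, 0.9]
b = [0.85, 0.8, 0.65, 0.9, 0.55, 0.8, 0.7, 0.88]
mean, ci_lo, ci_hi = paired_bootstrap(a, b)
# Report mean with (ci_lo, ci_hi); if the CI straddles 0, don't claim a win.
```
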
<h2 id="heading-6-what-the-comparison-reveals">6. What the comparison reveals</h2>
<p>The Colab experiment (see companion notebook) exposes predictable but instructive patterns.</p>
<p>The <strong>paraphrase example</strong> is the acid test. BLEU-4 drops near zero because there's no 4-gram overlap. BERTScore F1 stays high, correctly identifying semantic equivalence. This is exactly the kind of divergence that tells you something: the candidate is semantically correct but lexically different.</p>
<p>The <strong>verbose padding</strong> example shows ROUGE recall inflating (the reference content is all there) while ROUGE precision drops. BERTScore gives a moderate score. An LLM judge would likely penalize the filler text.</p>
<p>The <strong>hallucination</strong> case reveals the deepest limitation of surface metrics. ROUGE-1 can still score above zero on completely wrong content if individual words happen to overlap.</p>
<p>Three trends define where evaluation is heading. First, dynamic benchmarks like LiveBench and WildBench are replacing static test sets to combat contamination. The problem is so severe that Codeforces performance plummets after training cutoff dates. Second, the statistical rigor revolution means reporting scores without confidence intervals is increasingly unacceptable. Third, fine-tuned evaluation models continue to disappoint relative to general-purpose frontier LLMs as judges: on JudgeBench, the best fine-tuned judge hits only 57% accuracy while the best general model reaches 64%. This suggests evaluation capability scales with general capability, not with specialized training.</p>
<h2 id="heading-takeaway">Takeaway</h2>
<p>Use BPB (not perplexity) for intrinsic model comparison. Use sacrebleu + COMET for translation. Use ROUGE-L + BERTScore for summarization baselines. Use G-Eval or MT-Bench-style LLM judges as the primary quality signal for open-ended generation.</p>
<p>Always combine at least three metrics that measure different dimensions. Always report confidence intervals. And never trust a single number to capture text quality.</p>
<p>Metric disagreement is itself informative. When BLEU says a paraphrase is terrible but BERTScore says it's good, that gap tells you the candidate is semantically correct but lexically different. Building pipelines that surface these disagreements, rather than collapsing everything to a single score, produces evaluation systems that approximate the multi-dimensional judgments humans actually make.</p>
<p>The field is converging on LLM-as-judge as the primary evaluation approach. But the 12 identified bias types and 64% accuracy ceiling on challenging inputs mean we're far from a solved problem. Use frontier LLMs as judges, mitigate their known biases through position swapping, length control, and multi-run averaging, and maintain human spot-checking for high-stakes decisions.</p>
]]></content:encoded></item><item><title><![CDATA[Bayesian Optimization]]></title><description><![CDATA[Most of this is from my class notes for the session - DSCI 6653 - Bayesian Data Analysis at the University of New Haven.
Bayesian optimization is a strategy for global optimization of black-box functions. In simpler terms, it is a smart way to find t...]]></description><link>https://thedatasense.com/bayesian-optimization</link><guid isPermaLink="true">https://thedatasense.com/bayesian-optimization</guid><category><![CDATA[Bayesian optimization]]></category><dc:creator><![CDATA[Binesh Kumar]]></dc:creator><pubDate>Wed, 13 Nov 2024 05:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Most of this is from my class notes for the session - DSCI 6653 - Bayesian Data Analysis at the University of New Haven.</p>
<p>Bayesian optimization is a strategy for global optimization of black-box functions. In simpler terms, it is a smart way to find the best settings for a complex system where checking each setting is costly or time-consuming.</p>
<p>Instead of guessing randomly or checking a grid of points, it builds a probabilistic model of the function (often called a "surrogate") to decide where to sample next.</p>
<p>This process relies on balancing two competing goals: <strong>Exploration</strong> (looking in places where we don't know much yet) and <strong>Exploitation</strong> (refining our knowledge in areas that already look promising).</p>
<h3 id="heading-step-1-the-intuition">Step 1: The Intuition</h3>
<p>Imagine you are a gold prospector on a vast, rugged piece of land. Your goal is to find the highest concentration of gold (the global maximum), but there is a catch:</p>
<ol>
<li><p><strong>Drilling is expensive:</strong> It costs a lot of money and time to set up a rig and drill a test hole. You can't just drill everywhere.</p>
</li>
<li><p><strong>Blind Search:</strong> You can't see the gold from the surface. You only know how much gold is there <em>after</em> you drill.</p>
</li>
</ol>
<p>This is exactly the problem Bayesian Optimization solves. It helps you decide <strong>where to drill next</strong> to get the best results with the fewest drills.</p>
<p>To make this decision, you use two main tools in your "mental map":</p>
<ul>
<li><p><strong>The Surrogate Model (The Map):</strong> After every drill, you update your sketch of the terrain. If you found gold in one spot, you guess there might be more nearby. If you found nothing, you assume that area is barren. This sketch gives you a <em>probability</em> of finding gold across the map.</p>
</li>
<li><p><strong>The Acquisition Function (The Strategy):</strong> This is the rule you use to pick the next spot. You have to balance two instincts:</p>
<ul>
<li><p><strong>Exploitation:</strong> Drilling near where you previously found gold. It's a safer bet, but you might get stuck finding only small nuggets (a local maximum).</p>
</li>
<li><p><strong>Exploration:</strong> Drilling in a completely empty part of the map. It's risky (you might find nothing), but it's the only way to find a massive vein of gold hidden in the unknown (the global maximum).</p>
</li>
</ul>
</li>
</ul>
<p>If you just stick to what you know (Pure Exploitation), you might be standing right next to a massive vein of gold (the global maximum) and never find it because you're too busy digging up small nuggets elsewhere (local maximum).</p>
<p>So, the "Golden Rule" of Bayesian Optimization is that we need a strategy to balance these two:</p>
<ul>
<li><p><strong>Exploration:</strong> Checking the unknown.</p>
</li>
<li><p><strong>Exploitation:</strong> Refining the known.</p>
</li>
</ul>
<h3 id="heading-step-2-the-mechanics">Step 2: The Mechanics</h3>
<p>Now that we have the intuition, let's look at the actual machinery that makes this work. In the math world, we don't have a physical map or a gut feeling. We have two specific components:</p>
<ol>
<li><p><strong>The Surrogate Model (Gaussian Process):</strong> This acts as our probability map. It estimates what the function looks like based on the points we've already checked. It gives us a mean (expected value) and uncertainty (variance) for every point.</p>
</li>
<li><p><strong>The Acquisition Function:</strong> This is the formula that decides where to sample next. It takes the "map" from the Surrogate Model and calculates a score for every possible point, balancing exploration and exploitation.</p>
</li>
</ol>
<p>Let's focus on the <strong>Surrogate Model</strong> first.</p>
<p>Imagine we have drilled two holes. One found a little gold, the other found none. We have no idea what is happening <em>between</em> those two holes.</p>
<p>If we want to build a model that guesses what the terrain looks like in the gaps, how confident should we be about our guess in the middle of those two distant points compared to a spot right next to a hole we already drilled? Less confident, right? The further we are from a drilled hole, the "fuzzier" our map becomes. We just don't know what's out there.</p>
<p>This is why the <strong>Gaussian Process (GP)</strong> is the standard tool here. For every single point on the map, it doesn't just give us a single guess; it gives us a probability distribution (a bell curve).</p>
<p>This gives us two key pieces of data for every coordinate:</p>
<ul>
<li><p><strong>The Mean \( \mu \):</strong> Our best guess for how much gold is there.</p>
</li>
<li><p><strong>The Standard Deviation \( \sigma \):</strong> Our uncertainty.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768326753552/01b0fef4-b26f-4308-84b8-056901cb538d.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><strong>Note:</strong> Notice in the diagram how the shaded region (uncertainty) gets "pinched" tight near the black dots (data points) and balloons out in the empty spaces? That ballooning is the math telling us, <em>"I have no idea what's happening here!"</em></p>
</blockquote>
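<p>The "pinching" behavior is easy to reproduce. Here is a minimal GP posterior in plain NumPy (an RBF kernel and made-up drill data; production code would use a library like scikit-learn or GPyTorch instead of a raw matrix inverse):</p>

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: similarity decays with distance."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at the query points."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_query, x_train)
    K_ss = rbf_kernel(x_query, x_query)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Two "drill holes": a little gold at x=1, none at x=4 (made-up data).
x_obs = np.array([1.0, 4.0])
y_obs = np.array([0.8, 0.0])
mu, sigma = gp_posterior(x_obs, y_obs, np.array([1.0, 2.5, 10.0]))
# sigma is pinched near the drilled holes and balloons far away.
```

<p>Querying at a drilled hole returns the observed value with near-zero uncertainty; at x = 10, far from any data, the posterior falls back to the zero-mean prior with full uncertainty.</p>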
<h3 id="heading-the-acquisition-function">The Acquisition Function</h3>
<p>Now we need a rule to look at that GP and say, <em>"Drill here next."</em> This rule is the <strong>Acquisition Function</strong>.</p>
<p>A very common one is called <strong>Upper Confidence Bound (UCB)</strong>. The formula looks roughly like this:</p>
<p>$$\text{Score} = \text{Mean} + (\kappa \times \text{Uncertainty})$$</p><ul>
<li><p><strong>Mean:</strong> High predicted value (<strong>Exploitation</strong>)</p>
</li>
<li><p><strong>Uncertainty:</strong> High potential to learn something new (<strong>Exploration</strong>)</p>
</li>
<li><p><strong>\( \kappa \) (Kappa):</strong> A number we choose to tune our strategy.</p>
</li>
</ul>
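<p>In code, the rule is one line. With made-up surrogate output at three candidate drill sites, watch the winner flip as \( \kappa \) grows:</p>

```python
import numpy as np

def ucb(mean, std, kappa):
    """Upper Confidence Bound: reward high predictions AND high uncertainty."""
    return mean + kappa * std

# Made-up surrogate output at three candidate drill sites.
mean = np.array([0.9, 0.5, 0.1])   # site 0 looks best on average
std  = np.array([0.05, 0.2, 0.8])  # but site 2 is almost unexplored

print(np.argmax(ucb(mean, std, kappa=0.5)))  # prints 0: exploit the safe bet
print(np.argmax(ucb(mean, std, kappa=2.0)))  # prints 2: explore the unknown
```
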
<p>If we set that \( \kappa \) (kappa) value to be very, very high, are we acting more like a safe, conservative miner or a risky, adventurous explorer? An adventurous explorer: a high \( \kappa \) boosts the "<strong>Uncertainty</strong>" part of the equation, so the algorithm is willing to ignore the safe bets (high Mean) to go check out the mysterious, unknown areas (high Uncertainty).</p>
<p>So, the full <strong>Mechanics</strong> cycle looks like this:</p>
<ol>
<li><p><strong>Update Model:</strong> The Gaussian Process looks at the data we have so far.</p>
</li>
<li><p><strong>Pick a Point:</strong> The Acquisition Function (like UCB) calculates the score for every point and picks the winner.</p>
</li>
<li><p><strong>Evaluate:</strong> We actually "drill" at that spot (calculate the real result).</p>
</li>
<li><p><strong>Repeat:</strong> We add that new data to our model and loop again.</p>
</li>
</ol>
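<p>Putting the cycle together, here is a toy end-to-end run in NumPy (the one-dimensional "gold" function, the candidate grid, and \( \kappa = 2 \) are all made-up choices for illustration):</p>

```python
import numpy as np

def kern(a, b, ls=1.0):
    """RBF kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def posterior(x, y, xq, noise=1e-4):
    """Zero-mean GP posterior mean and std at the query points xq."""
    K_inv = np.linalg.inv(kern(x, x) + noise * np.eye(len(x)))
    K_s = kern(xq, x)
    mu = K_s @ K_inv @ y
    var = 1.0 - np.sum((K_s @ K_inv) * K_s, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def objective(x):
    """The hidden 'gold concentration' -- expensive to query in real life."""
    return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-((x - 6.0) ** 2) / 4)

candidates = np.linspace(0.0, 8.0, 201)
x_obs = np.array([0.5, 7.5])          # two initial "drill holes"
y_obs = objective(x_obs)

for _ in range(10):                   # ten more expensive evaluations
    mu, sigma = posterior(x_obs, y_obs, candidates)    # 1. update model
    x_next = candidates[np.argmax(mu + 2.0 * sigma)]   # 2. pick a point (UCB)
    x_obs = np.append(x_obs, x_next)                   # 3. evaluate for real
    y_obs = np.append(y_obs, objective(x_next))        # 4. repeat

best_x = x_obs[np.argmax(y_obs)]      # lands near the true peak at x = 2
```

<p>With only a dozen evaluations the loop homes in on the global maximum; a grid fine enough to do the same would need far more. This is roughly the loop that libraries such as scikit-optimize and BoTorch run, with better-conditioned GP updates and smarter acquisition optimizers.</p>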
]]></content:encoded></item></channel></rss>