ai ml · intermediate · 12m · 2026-06-09

Multimodal LLMs — Vision, Language, Audio, Tool Use Combined

Session 36 of the 48-session learning series.

Date: Tue, 2026-07-07 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 23 · Est. read: 2 h

Why this session matters

This is Session 36 of 48 in the LLM track. By 2026, "LLM" really means "multimodal foundation model": Claude, GPT, and Gemini all natively handle images, audio, and tools. Knowing how the modalities are stitched into a shared representation, and what fails when you cross them, separates AI engineers from prompt-jockeys.

Agenda

How vision is plugged into a transformer (CLIP, ViT, vision-language models)
Audio models — Whisper, audio tokens, native audio in/out
Late fusion vs early fusion vs interleaved modalities
Tool use across modalities — image → action → answer
Evaluation — VQA, MMMU, Audio-LM benchmarks; the new failure modes

Deep dive

1. What "multimodal" actually means

Three patterns are conflated under the same word:

Cross-modal embedding (CLIP) — text and image in the same vector space for similarity / search.
Vision-language generation (LLaVA, GPT-4V) — model takes image + text, generates text.
Any-to-any (GPT-4o, Gemini Native) — image / audio / video / text in; image / audio / text out.

Each is a different architecture problem. We'll cover all three.

2. CLIP and cross-modal embeddings

Train two encoders side by side:

Image encoder (ViT or ResNet) → image embedding.
Text encoder (transformer) → text embedding.

Train with a contrastive loss: for a batch of N (image, caption) pairs, the N×N similarity matrix image_emb · text_emb.T should be high on the diagonal (matching pairs) and low off the diagonal (mismatched pairs).
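The objective above can be sketched in plain Python. This is a minimal illustration, not CLIP's actual training code: the function names are made up for this example, the embeddings are toy 2-D vectors, and the temperature of 0.07 is an assumed fixed value (in CLIP it is a learned parameter).

```python
import math

def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    # Cosine similarity between every image and every caption in the batch.
    imgs = [l2_normalise(v) for v in image_embs]
    txts = [l2_normalise(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_on_diagonal(logits):
    # Mean of -log softmax(row)[k] for row k: the correct "class" is the diagonal.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    # Symmetric loss: image-to-text rows plus text-to-image columns.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))

# Matching pairs on the diagonal give a low loss; swapped captions give a high one.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.6f} shuffled={shuffled:.3f}")
```

The loss is low only when each image is closest to its own caption, which is exactly the "diagonal high, off-diagonal low" condition.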

After training:

Zero-shot image classification by checking cos(image, "a photo of a {class}") for each class.
Image search by text query.
Text-image retrieval at scale.
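Zero-shot classification from the list above reduces to a nearest-prompt lookup. The sketch below assumes a text encoder is available as a callable; here it is faked with a hypothetical lookup table of toy 3-D vectors, so only the scoring logic is real.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-in for a real text encoder (values are invented for illustration).
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def zero_shot_classify(image_emb, class_names, embed_text):
    # Score the image against one prompt per class and pick the best match.
    scores = {
        name: cosine(image_emb, embed_text(f"a photo of a {name}"))
        for name in class_names
    }
    best = max(scores, key=scores.get)
    return best, scores

label, scores = zero_shot_classify(
    [0.8, 0.3, 0.1], ["cat", "dog", "car"], TOY_TEXT_EMBEDDINGS.__getitem__
)
print(label, scores)
```

No classifier head is trained: adding a class only means adding another prompt.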

CLIP-style encoders are the embedding spine of many modern multimodal stacks (Stable Diffusion's text conditioning, most "search images by text" features).

3. From CLIP to vision-language generation

CLIP doesn't generate text. To get a VLM:

image → CLIP-like vision encoder → image tokens (~256–576 per image)
                                          ↓
                       [ MLP projection: image-emb-dim → LLM-emb-dim ]
                                          ↓
                       interleave with text tokens:
                       <image_tokens> "Describe this." <generated_text>
                                          ↓
                                    LLM transformer

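That's the LLaVA recipe: a frozen CLIP vision encoder, a small projection (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5), and an LLM, trained on image-instruction pairs. The LLM stays frozen only while the projection is first aligned; it is then fine-tuned together with the projection. Cheap to build; impressively capable.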
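To make the shapes in the diagram concrete, here is a minimal sketch of the projection step and the sequence assembly. Everything is an assumption for illustration: the dimensions are tiny, the weights are random, the activation is ReLU for brevity, and the helper names are invented. Real projectors and embedding sizes differ.

```python
import random

def linear(x, weight, bias):
    # weight has d_out rows of d_in values each.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_linear(d_in, d_out, rng):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def project_image_tokens(image_feats, layer1, layer2):
    # Two-layer MLP applied independently to every vision token.
    projected = []
    for feat in image_feats:
        hidden = [max(0.0, h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

def build_llm_input(projected_image_tokens, prompt_embeddings):
    # Image tokens first, then the text prompt, as in the diagram above.
    return projected_image_tokens + prompt_embeddings

rng = random.Random(0)
VISION_DIM, LLM_DIM = 4, 6               # toy sizes, far smaller than real models
N_IMAGE_TOKENS, N_PROMPT_TOKENS = 3, 2

layer1 = make_linear(VISION_DIM, LLM_DIM, rng)
layer2 = make_linear(LLM_DIM, LLM_DIM, rng)
image_feats = [[rng.random() for _ in range(VISION_DIM)] for _ in range(N_IMAGE_TOKENS)]
prompt_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(N_PROMPT_TOKENS)]

sequence = build_llm_input(
    project_image_tokens(image_feats, layer1, layer2), prompt_embeddings
)
print(len(sequence), "tokens of width", len(sequence[0]))
```

The only job of the projection is to make vision features the same width as the LLM's token embeddings so both can sit in one sequence.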

4. Native multimodal architectures

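GPT-4o, Gemini, and (for image input) Claude 3.5+ go further. Architectural details are only partly public; the commonly described picture is: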

Image, audio, video all encoded into tokens that share the model's token space.
Trained end-to-end — no frozen modules.
Generation can output across modalities (text + image, text + audio).

Architectural variants:

Late fusion — encode each modality separately, concat embeddings, pass to LLM. Simplest; least cross-modal reasoning.
Early fusion — interleave tokens from all modalities in one stream from the start. Strongest reasoning across modalities; hardest to train.
Cross-attention fusion (Flamingo) — image tokens accessed via cross-attention in dedicated layers. Compromise; mainstream pre-2024.
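Of the three variants, cross-attention fusion is the easiest to show as a computation. The sketch below is a single attention head with identity projections, which is a simplifying assumption: a Flamingo-style layer also has learned query/key/value projections and gating, omitted here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_tokens):
    # Each text position queries the image tokens and receives a weighted mix of them.
    d = len(image_tokens[0])
    outputs, weights = [], []
    for q in text_states:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in image_tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, image_tokens)) for i in range(d)])
    return outputs, weights

# Toy 2-D vectors: each text state lines up with exactly one image token.
image_tokens = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
text_states = [[4.0, 0.0], [0.0, 4.0]]
attended, attn = cross_attention(text_states, image_tokens)
print([round(w, 4) for w in attn[0]])
```

The text stream never contains image tokens here; it only reads from them, which is why this design could be bolted onto an existing LLM.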

Trend in 2026: pure early fusion with shared tokeniser.

5. Vision encoder choices

CLIP-ViT — most common; 224×224 or 336×336 inputs; with 14×14 patches that is 256 or 576 tokens out.
SigLIP — sigmoid contrastive loss in place of softmax; sharper at smaller sizes.
DINOv2 — self-supervised; better at fine detail, less semantic.
ConvNeXt / EVA-CLIP — a pure-CNN backbone and a scaled-up CLIP-style ViT, respectively; sometimes higher quality.

For document understanding / OCR-heavy: higher resolution + multi-crop sliding window.

6. Image tokens are expensive

A 336×336 image split into 14×14 patches gives a 24×24 grid, i.e. 576 vision tokens. That eats your context window fast.

4-image prompt: 2304 tokens for images alone.
High-res / multi-crop: easily 4000+ tokens per image.
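The arithmetic behind these numbers is small enough to keep as a helper. The sketch assumes one token per 14×14 patch with no pooling or extra special tokens; the crop count of 7 is a hypothetical multi-crop setting chosen only to show how a single image can pass 4000 tokens.

```python
def vision_tokens(width, height, patch=14):
    # One token per non-overlapping patch; assumes the image is already resized.
    if width % patch or height % patch:
        raise ValueError("width and height must be multiples of the patch size")
    return (width // patch) * (height // patch)

def image_budget(n_images, width=336, height=336, patch=14, crops_per_image=1):
    return n_images * crops_per_image * vision_tokens(width, height, patch)

def fraction_of_context(image_tokens, context_window):
    return image_tokens / context_window

print(vision_tokens(336, 336))             # 576
print(vision_tokens(224, 224))             # 256
print(image_budget(4))                     # 2304
print(image_budget(1, crops_per_image=7))  # 4032
print(f"{fraction_of_context(image_budget(4), 8192):.1%}")
```

Running the budget before building a prompt makes it obvious when resizing or dropping an image is cheaper than moving to a longer-context model.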

Inference cost scales with this. Practical advice:

Resize aggressively before sending.
Use a smaller model for image-only queries; reserve the big model for cross-modal reasoning.
Cache image embeddings if you'll re-query the same image multiple times.
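For the caching advice, a content-hash key is usually enough. This sketch assumes you run the vision encoder yourself; `toy_encoder` is a placeholder, and the class name is invented for the example. With a hosted API you would rely on whatever caching the provider offers instead.

```python
import hashlib

class ImageEmbeddingCache:
    """Cache embeddings by the SHA-256 of the raw image bytes."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._store = {}
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only pay for the encoder on the first sighting of these bytes.
            self.misses += 1
            self._store[key] = self._encode(image_bytes)
        return self._store[key]

def toy_encoder(image_bytes):
    # Placeholder for a real vision encoder; returns a fake 2-D "embedding".
    return [len(image_bytes), sum(image_bytes) % 251]

cache = ImageEmbeddingCache(toy_encoder)
first = cache.get(b"\x89PNG fake image bytes")
second = cache.get(b"\x89PNG fake image bytes")
print(first == second, cache.misses)
```

Hashing the bytes rather than the filename means a renamed or re-uploaded copy of the same image still hits the cache.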

7. Audio in / audio out

Whisper-style encoders are dominant for audio→text. Modern any-to-any models:

Audio in — mel-spectrogram → audio tokens (Whisper-like encoder) → interleaved with text tokens.
Audio out — model emits audio tokens; a neural codec decoder (SoundStream, EnCodec) converts them to a waveform.
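To get a feel for how much sequence length audio costs, the framing arithmetic can be sketched as below. The defaults are assumptions matching the commonly cited Whisper front end (16 kHz audio, a 10 ms hop, and a stride-2 convolution before the encoder); other audio tokenisers use different rates.

```python
def mel_frames(duration_s, sample_rate=16_000, hop_length=160):
    # One spectrogram frame per hop of samples.
    return int(duration_s * sample_rate) // hop_length

def encoder_positions(n_frames, conv_stride=2):
    # A strided convolution in front of the encoder halves the sequence length.
    return n_frames // conv_stride

frames = mel_frames(30)
positions = encoder_positions(frames)
positions_per_second = positions / 30
print(frames, positions, positions_per_second)
```

Under these assumptions a 30-second clip becomes 1500 encoder positions, so a minute of audio costs roughly as much sequence as five 576-token images.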

End-to-end voice latency: OpenAI reported GPT-4o responding to audio input in as little as 232 ms, about 320 ms on average. The bottleneck moved from speech recognition to LLM decoding.

8. Document understanding — the killer business use case

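OCR-first — PaddleOCR for text, LayoutLM for layout-aware structure on top of it; extract text + structure; pass to LLM.
OCR-free — feed the image directly to a VLM (or an OCR-free model such as Donut); ask it to extract a table. Works for clean docs, often fails on dense layouts.
Hybrid — OCR + VLM ensemble. What most production systems use.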

For invoices, contracts, forms: hybrid + structured output schema + per-field validation is the only reliable approach.
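Per-field validation can be as simple as the sketch below, which uses stdlib dataclasses as a stand-in for a Pydantic model. The schema, field rules, and currency allowlist are all hypothetical; the point is that every field is checked on its own and failures are reported by name.

```python
import json
import re
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "INR"}  # hypothetical allowlist

@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    issue_date: date
    total: Decimal
    currency: str

def validate_invoice(raw_json):
    """Return (invoice, errors); invoice is None if any field fails."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, {"_json": str(exc)}
    if not isinstance(data, dict):
        return None, {"_json": "expected a JSON object"}

    errors = {}
    number = str(data.get("invoice_number", "")).strip()
    if not re.fullmatch(r"[A-Z0-9-]{3,20}", number):
        errors["invoice_number"] = "expected 3-20 chars of A-Z, 0-9 or '-'"

    try:
        issued = date.fromisoformat(str(data.get("issue_date", "")))
    except ValueError:
        issued = None
        errors["issue_date"] = "expected an ISO date (YYYY-MM-DD)"

    try:
        total = Decimal(str(data.get("total", "")))
    except InvalidOperation:
        total = None
    if total is None or not total.is_finite() or total < 0:
        errors["total"] = "expected a non-negative decimal amount"

    currency = str(data.get("currency", "")).upper()
    if currency not in ALLOWED_CURRENCIES:
        errors["currency"] = "unsupported currency code"

    if errors:
        return None, errors
    return Invoice(number, issued, total, currency), {}

good, good_errors = validate_invoice(
    '{"invoice_number": "INV-0042", "issue_date": "2026-03-31", '
    '"total": "1250.50", "currency": "eur"}'
)
# "12O.5" contains a letter O, a typical OCR confusion that fluent prose would hide.
bad, bad_errors = validate_invoice(
    '{"invoice_number": "??", "issue_date": "31/03/2026", '
    '"total": "12O.5", "currency": "EUR"}'
)
print(good)
print(bad_errors)
```

Named errors let you re-prompt for just the failing fields or route the document to a human, instead of rejecting the whole extraction.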

9. Multimodal RAG

When your corpus has images, charts, PDFs:

Embed both text and image with a CLIP-class encoder.
Index in a vector DB with type tag.
At query time, retrieve top-K across both; pass everything to a VLM.
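The three steps above can be sketched with a tiny in-memory index. This is a toy under stated assumptions: the class and item ids are invented, the embeddings are hand-written 3-D vectors standing in for a CLIP-class encoder, and the brute-force scan stands in for a real vector DB.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class MultimodalIndex:
    """In-memory stand-in for a vector DB with a per-item type tag."""

    def __init__(self):
        self.items = []

    def add(self, item_id, kind, embedding):
        if kind not in {"text", "image"}:
            raise ValueError(f"unknown kind: {kind}")
        self.items.append({"id": item_id, "kind": kind, "embedding": embedding})

    def search(self, query_emb, k=3, kinds=None):
        # Optional filter on the type tag, then rank by cosine similarity.
        candidates = [it for it in self.items if kinds is None or it["kind"] in kinds]
        ranked = sorted(
            candidates, key=lambda it: cosine(query_emb, it["embedding"]), reverse=True
        )
        return [(it["id"], it["kind"]) for it in ranked[:k]]

index = MultimodalIndex()
index.add("para-12", "text", [0.9, 0.1, 0.0])
index.add("chart-3", "image", [0.8, 0.2, 0.1])
index.add("photo-7", "image", [0.0, 0.1, 0.9])
index.add("para-40", "text", [0.1, 0.9, 0.1])

top_mixed = index.search([1.0, 0.0, 0.0], k=2)
top_images = index.search([1.0, 0.0, 0.0], k=1, kinds={"image"})
print(top_mixed, top_images)
```

Keeping the type tag on each hit lets the prompt builder treat retrieved charts differently from retrieved paragraphs before anything reaches the VLM.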

Easier said than done: chart interpretation is still a weakness. A chart-aware prompt plus structured extraction typically beats naive image RAG.

10. Tool use × multimodal

The compound pattern:

Vision identifies a chart → call the parse_chart_to_csv() tool → retrieve numbers → LLM reasons → answer.
Vision identifies a UI screenshot → call read_button(label) → click → screenshot again → continue.
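The chart flow can be sketched end to end with stubs. Everything here is hypothetical: `parse_chart_to_csv` returns canned data, `fake_vlm_step` stands in for the model choosing a tool, and the final "reasoning" is a deterministic max so the example stays runnable without any model.

```python
import csv
import io

def parse_chart_to_csv(image_ref):
    """Hypothetical tool stub; a real one would run chart parsing on the image."""
    return "quarter,revenue\nQ1,120\nQ2,150\nQ3,90\n"

TOOLS = {"parse_chart_to_csv": parse_chart_to_csv}

def fake_vlm_step(image_ref, question):
    # Stand-in for the model's first turn: it sees a chart and requests a tool.
    return {"tool": "parse_chart_to_csv", "args": {"image_ref": image_ref}}

def answer_chart_question(image_ref, question):
    call = fake_vlm_step(image_ref, question)
    if call["tool"] not in TOOLS:
        raise KeyError(f"model requested an unknown tool: {call['tool']}")
    csv_text = TOOLS[call["tool"]](**call["args"])
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Stand-in for the LLM reasoning over the tool's numbers.
    best = max(rows, key=lambda r: float(r["revenue"]))
    return f"{best['quarter']} had the highest revenue ({best['revenue']})."

answer = answer_chart_question("chart.png", "Which quarter had the highest revenue?")
print(answer)
```

The numbers in the answer come from the tool output, not from the model reading pixels, which is the main reason to route charts through a tool.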

This is the "computer use" / agentic-vision frontier. Anthropic shipped computer use as a public beta in October 2024, and OpenAI followed with Operator in January 2025. Latency and reliability are the open problems.

11. Evaluation

VQA (Visual Question Answering) — image + question, expect short answer.
MMMU — multimodal multi-discipline understanding; college-level.
MathVista — visual math.
MMBench — broad rubric-based eval.
OCRBench — OCR-style tasks.
VideoMME — video understanding at length.

Same caveats as LLM eval (S25): contamination, judge bias, your-product-eval > public benchmark.
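For a product-specific eval, a short-answer scorer in the VQA style is a reasonable starting point. The sketch below is a simplified exact-match metric with light normalisation; it is an assumption-laden stand-in, not the official scoring script of any benchmark listed above.

```python
import re
import string

def normalise_answer(text):
    # Lowercase, strip punctuation and English articles, collapse whitespace.
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_accuracy(predictions, references):
    """references[i] is a list of acceptable answers for predictions[i]."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty prediction and reference lists")
    hits = 0
    for pred, refs in zip(predictions, references):
        accepted = {normalise_answer(r) for r in refs}
        hits += normalise_answer(pred) in accepted
    return hits / len(predictions)

preds = ["A red bus.", "two", "On the table"]
refs = [["red bus"], ["2", "two"], ["under the table"]]
acc = exact_match_accuracy(preds, refs)
print(f"{acc:.3f}")
```

Exact match is strict on purpose: it under-counts paraphrases, so pair it with a sampled manual or judge-based audit rather than loosening the normalisation.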

12. New failure modes

Hallucinating across modalities — model invents details not in the image ("the man is wearing a red hat" — no hat).
Reading text in image wrong — OCR errors hidden inside fluent prose.
Spatial reasoning — "to the left of the cup" questions still fail roughly 30% of the time in 2026.
Chart precision — values read off bar charts can be 10% off; never trust without validation.
Cross-modal jailbreak — instructions hidden inside images (sticker says "ignore the user").

Mitigations: structured output, verification tools, second-pass validation, never auto-action on visual claims without confirmation.
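Two of these mitigations, second-pass validation and gating actions on confirmation, can be sketched as follows. The function names, the 2% tolerance, and the sample values are illustrative assumptions; the second source could be a chart-parsing tool or an OCR pass.

```python
def within_tolerance(claimed, reference, rel_tol=0.02):
    if reference == 0:
        return abs(claimed) <= rel_tol
    return abs(claimed - reference) / abs(reference) <= rel_tol

def review_visual_claims(vlm_values, tool_values, rel_tol=0.02):
    """Return the keys where the VLM's reading disagrees with the second source."""
    flagged = []
    for key, claimed in vlm_values.items():
        reference = tool_values.get(key)
        if reference is None or not within_tolerance(claimed, reference, rel_tol):
            flagged.append(key)
    return flagged

def may_act(flagged, user_confirmed):
    # Act only when nothing is flagged AND a human has confirmed.
    return user_confirmed and not flagged

vlm_values = {"Q1": 120, "Q2": 165, "Q3": 90}    # read off the chart by the model
tool_values = {"Q1": 120, "Q2": 150, "Q3": 91}   # from a second, independent pass
flagged = review_visual_claims(vlm_values, tool_values)
print(flagged, may_act(flagged, user_confirmed=True))
```

A missing reference counts as a failure here, so a value the second pass could not find is escalated rather than silently trusted.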

13. Reality check

A pragmatic multimodal stack:

Use API for general multimodal (GPT-4o, Claude, Gemini).
For document extraction at scale: open OCR (PaddleOCR) + structured prompt + Pydantic validator + LLM-as-judge sample audit.
For images-only retrieval: CLIP + Faiss.
Avoid building your own multimodal training pipeline unless this is the company's IP.

LeetCode practice problem

Link: https://leetcode.com/problems/image-overlap/
Difficulty: Medium
Why this problem: Given two binary image grids, find the shift that maximises their overlap. Forces the same kind of spatial-reasoning mental model that VLMs struggle with.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain CLIP's contrastive training and what it enables.
Sketch the LLaVA architecture (CLIP encoder + projection + LLM).
Compare late fusion, early fusion, cross-attention fusion.
Estimate how many tokens an image consumes and design context budget around it.
Pick the right approach for document extraction (OCR-first, OCR-free, hybrid).
Solve image-overlap — grid alignment & shift counting, mirror of the spatial-reasoning gap.
