Multimodal LLMs — Vision, Language, Audio, Tool Use Combined
Session 36 of the 48-session learning series.
Date: Tue, 2026-07-07 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 23 · Est. read: 2 h
Why this session matters
This is Session 36 of 48 in the LLM track. By 2026, "LLM" really means "multimodal foundation model" — Claude, GPT, Gemini, all natively handle images, audio, and tools. Knowing how the modalities are stitched into a shared representation, and what fails when you cross them, separates AI engineers from prompt-jockeys.
Agenda
- How vision is plugged into a transformer (CLIP, ViT, vision-language models)
- Audio models — Whisper, audio tokens, native audio in/out
- Late fusion vs early fusion vs interleaved modalities
- Tool use across modalities — image → action → answer
- Evaluation — VQA, MMMU, Audio-LM benchmarks; the new failure modes
Pre-read (skim before the session)
- CLIP paper (Radford et al., 2021)
- Flamingo paper (Alayrac et al., 2022)
- LLaVA paper (Liu et al., 2023)
- Whisper paper (Radford et al., 2022)
Deep dive
1. What "multimodal" actually means
Three patterns are conflated under the same word:
- Cross-modal embedding (CLIP) — text and image in the same vector space for similarity / search.
- Vision-language generation (LLaVA, GPT-4V) — model takes image + text, generates text.
- Any-to-any (GPT-4o, Gemini) — image / audio / video / text in; image / audio / text out.
Each is a different architecture problem. We'll cover all three.
2. CLIP — the foundational cross-modal model
Train two encoders side by side:
- Image encoder (ViT or ResNet) → image embedding.
- Text encoder (transformer) → text embedding.
Train with contrastive loss: for a batch of (image, caption) pairs, the diagonal of image_emb · text_emb.T should be high, off-diagonal low.
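The objective above can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's training code: the function names are made up for this example, and the temperature is fixed at 0.07 (the initial value in the CLIP paper, where it is a learned parameter). Real training would use a tensor library and large batches.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cos(image_i, text_j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image -> text: each row should put its mass on the diagonal entry.
    img_to_txt = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text -> image: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_loss(aligned, aligned)      # captions line up with images
loss_mismatched = clip_loss(aligned, swapped)   # captions paired with the wrong image
```

The matched batch yields a loss near zero and the mismatched batch a large one, which is exactly the "high diagonal, low off-diagonal" pressure described above.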
After training:
- Zero-shot image classification by checking cos(image, "a photo of a {class}") for each class.
- Image search by text query.
- Text-image retrieval at scale.
CLIP-style encoders are the embedding spine of many modern multimodal stacks (Stable Diffusion's text conditioning, most "search images by text" features).
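Zero-shot classification with the prompt template is short enough to sketch directly. The embeddings here are toy three-dimensional vectors and `toy_embed_text` is a hypothetical stand-in for a real CLIP text encoder; only the scoring logic is the point.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, class_names, embed_text):
    """Score the image against one templated prompt per class; the best cosine wins."""
    scores = {
        name: cosine(image_emb, embed_text(f"a photo of a {name}"))
        for name in class_names
    }
    return max(scores, key=scores.get), scores

# Toy text embeddings (assumption: a real encoder would produce these).
TOY_TEXT_EMBS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}

def toy_embed_text(prompt):
    return TOY_TEXT_EMBS[prompt]

label, scores = zero_shot_classify([0.8, 0.2, 0.1], ["cat", "dog"], toy_embed_text)
```

No classifier head is trained: adding a class only means adding another prompt string.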
3. From CLIP to vision-language generation
CLIP doesn't generate text. To get a VLM:
image → CLIP-like vision encoder → image tokens (~256–576 per image)
↓
[ MLP projection: image-emb-dim → LLM-emb-dim ]
↓
interleave with text tokens:
<image_tokens> "Describe this." <generated_text>
↓
LLM transformer
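That's the LLaVA recipe: a frozen CLIP encoder, a projection layer (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5), and an LLM, trained on image-instruction pairs. Stage 1 trains only the projection while the encoder and LLM stay frozen; stage 2 fine-tunes the projection and the LLM together. Cheap to build; impressively capable.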
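The projection step is the only new piece of plumbing, so here is a rough sketch of it with toy dimensions. Everything in it is illustrative: the weights are random, the dimensions (4 and 6) are tiny stand-ins for real widths, and ReLU is used for brevity where a real projector would more likely use GELU.

```python
import random

def linear(x, weight, bias):
    """y = W x + b, with the weight stored as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def project_patch(patch_emb, params):
    """Two-layer MLP: vision dimension -> LLM embedding dimension."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(patch_emb, w1, b1)]  # ReLU for brevity
    return linear(hidden, w2, b2)

def build_llm_inputs(patch_embs, text_token_embs, params):
    """Projected image tokens first, then the text prompt's token embeddings."""
    image_tokens = [project_patch(p, params) for p in patch_embs]
    return image_tokens + text_token_embs

def random_params(vision_dim, llm_dim, rng):
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return (matrix(llm_dim, vision_dim), [0.0] * llm_dim,
            matrix(llm_dim, llm_dim), [0.0] * llm_dim)

rng = random.Random(0)
VISION_DIM, LLM_DIM = 4, 6                      # toy sizes, not real model widths
params = random_params(VISION_DIM, LLM_DIM, rng)
patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]
prompt = [[rng.random() for _ in range(LLM_DIM)] for _ in range(2)]
sequence = build_llm_inputs(patches, prompt, params)
```

After projection the LLM cannot tell image tokens from text tokens by shape: both are vectors of the same width in one sequence.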
4. Native multimodal architectures
GPT-4o, Gemini, Claude 3.5+ go further:
- Image, audio, video all encoded into tokens that share the model's token space.
- Trained end-to-end — no frozen modules.
- Generation can output across modalities (text + image, text + audio).
Architectural variants:
- Late fusion — encode each modality separately, concat embeddings, pass to LLM. Simplest; least cross-modal reasoning.
- Early fusion — interleave tokens from all modalities in one stream from the start. Strongest reasoning across modalities; hardest to train.
- Cross-attention fusion (Flamingo) — image tokens accessed via cross-attention in dedicated layers. Compromise; mainstream pre-2024.
Trend in 2026: pure early fusion with shared tokeniser.
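To make "one stream, shared tokeniser" concrete, here is one possible layout, sketched under loud assumptions: images are already quantised into discrete codebook indices by some image tokeniser, and the vocabulary sizes and the begin/end-image markers are invented for this example.

```python
TEXT_VOCAB_SIZE = 32_000          # assumed text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192       # assumed number of discrete image codes
IMAGE_OFFSET = TEXT_VOCAB_SIZE    # image codes live right after the text ids
BEGIN_IMAGE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
END_IMAGE = BEGIN_IMAGE + 1

def early_fusion_stream(segments):
    """Flatten [("text", ids) | ("image", codes)] into one shared-vocabulary stream."""
    stream = []
    for kind, ids in segments:
        if kind == "text":
            if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
                raise ValueError("text id out of range")
            stream.extend(ids)
        elif kind == "image":
            if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in ids):
                raise ValueError("image code out of range")
            stream.append(BEGIN_IMAGE)
            stream.extend(IMAGE_OFFSET + c for c in ids)
            stream.append(END_IMAGE)
        else:
            raise ValueError(f"unknown modality: {kind}")
    return stream

stream = early_fusion_stream([("text", [5, 6]), ("image", [0, 1]), ("text", [7])])
```

Because every position is just an id in one vocabulary, the same next-token objective covers text and image positions alike, which is what makes this variant both powerful and hard to train.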
5. Vision encoder choices
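- CLIP-ViT — most common; 224×224 or 336×336 inputs; 256 or 576 tokens out respectively with ViT-L/14.
- SigLIP — replaces CLIP's softmax contrastive loss with a pairwise sigmoid loss; holds up better at small batch sizes.
- DINOv2 — self-supervised; better at fine detail, weaker on text-aligned semantics because it is trained without captions.
- ConvNeXt / EVA-CLIP — alternative backbones (ConvNeXt is a pure CNN, EVA-CLIP a scaled-up ViT); sometimes higher quality.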
For document understanding / OCR-heavy: higher resolution + multi-crop sliding window.
6. Image tokens are expensive
A 336×336 image cut into 14×14 patches → (336/14)² = 24² = 576 vision tokens. That eats your context window fast.
- 4-image prompt: 2304 tokens for images alone.
- High-res / multi-crop: easily 4000+ tokens per image.
Inference cost scales with this. Practical advice:
- Resize aggressively before sending.
- Use a smaller model for image-only queries; reserve the big model for cross-modal reasoning.
- Cache image embeddings if you'll re-query the same image multiple times.
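The arithmetic behind those numbers is worth having as a helper when planning a context budget. This sketch assumes a plain ViT with 14×14 patches and no multi-crop tiling; hosted APIs apply their own token-counting rules, so treat it as a planning estimate only.

```python
def vision_tokens(width, height, patch_size=14):
    """Patch tokens for a plain ViT: one token per non-overlapping patch."""
    if width % patch_size or height % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (width // patch_size) * (height // patch_size)

def remaining_context(context_window, num_images, width, height, patch_size=14):
    """Tokens left for text after the images are accounted for."""
    used = num_images * vision_tokens(width, height, patch_size)
    return context_window - used

tokens_336 = vision_tokens(336, 336)                 # 24 * 24
tokens_224 = vision_tokens(224, 224)                 # 16 * 16
left_for_text = remaining_context(8192, 4, 336, 336) # 8192 - 4 * 576
```

Dropping from 336×336 to 224×224 cuts the per-image cost from 576 to 256 tokens, which is why resizing is the first lever to pull.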
7. Audio in / audio out
Whisper-style encoders are dominant for audio→text. Modern any-to-any models:
- Audio in — mel-spectrogram → audio tokens (Whisper-like encoder) → interleaved with text tokens.
- Audio out — model emits audio tokens; a neural codec decoder (SoundStream, EnCodec) converts them to a waveform.
End-to-end voice latency: OpenAI reported GPT-4o responding to audio in as little as 232 ms (about 320 ms on average). The bottleneck moved from speech recognition to LLM decoding.
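Audio has a token budget too. As a rough guide, the sketch below uses Whisper's published front-end settings (16 kHz audio, a 10 ms hop, and a stride-2 convolution before the encoder); other audio encoders will differ, and the function names are just for this example.

```python
SAMPLE_RATE = 16_000   # Hz, Whisper's input sample rate
HOP_LENGTH = 160       # samples between mel frames (10 ms)
CONV_STRIDE = 2        # the encoder's convolutional stem halves the frame count

def mel_frames(seconds):
    """Number of mel-spectrogram frames for a clip of the given length."""
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH

def encoder_positions(seconds):
    """Sequence length the transformer encoder actually sees."""
    return mel_frames(seconds) // CONV_STRIDE

frames_30s = mel_frames(30)
positions_30s = encoder_positions(30)
```

So a 30-second window costs 1500 encoder positions, about 50 per second of audio, before any interleaving with text.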
8. Document understanding — the killer business use case
- OCR-first — PaddleOCR (optionally with a layout-aware model such as LayoutLM) extracts text + structure; pass to LLM.
- OCR-free — feed the image directly to a VLM, or to an OCR-free model such as Donut, and ask it to extract a table. Works for clean docs, fails on dense ones.
- Hybrid — OCR + VLM ensemble. The most common choice in production.
For invoices, contracts, forms: hybrid + structured output schema + per-field validation is the only reliable approach.
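A minimal sketch of "structured output schema + per-field validation", using only the standard library as a stand-in for a schema library such as Pydantic. The `Invoice` fields and the validation rules are hypothetical; a real schema would follow your own documents.

```python
import json
import re
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation

@dataclass
class Invoice:
    invoice_number: str
    issue_date: date
    total: Decimal
    currency: str

def validate_invoice(raw_json):
    """Return (Invoice, {}) on success or (None, {field: problem}) on failure."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, {"_json": f"not valid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return None, {"_json": "expected a JSON object"}

    errors = {}
    number = str(data.get("invoice_number", "")).strip()
    if not re.fullmatch(r"[A-Z0-9-]+", number):
        errors["invoice_number"] = "missing or has unexpected characters"

    issue_date = None
    try:
        issue_date = date.fromisoformat(str(data.get("issue_date", "")))
    except ValueError:
        errors["issue_date"] = "expected YYYY-MM-DD"

    total = None
    try:
        total = Decimal(str(data.get("total", "")))
        if not total.is_finite() or total < 0:
            raise InvalidOperation
    except InvalidOperation:
        errors["total"] = "expected a non-negative number"

    currency = str(data.get("currency", ""))
    if not re.fullmatch(r"[A-Z]{3}", currency):
        errors["currency"] = "expected a 3-letter code"

    if errors:
        return None, errors
    return Invoice(number, issue_date, total, currency), {}

good = '{"invoice_number": "INV-001", "issue_date": "2026-07-07", "total": "1250.50", "currency": "INR"}'
bad = '{"invoice_number": "INV-002", "issue_date": "07/07/2026", "total": "12,50", "currency": "INR"}'
invoice, no_errors = validate_invoice(good)
rejected, field_errors = validate_invoice(bad)
```

Returning errors per field, rather than one pass/fail flag, lets you re-prompt the model for just the fields that failed or route only those to human review.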
9. Multimodal RAG
When your corpus has images, charts, PDFs:
- Embed both text and image with a CLIP-class encoder.
- Index in a vector DB with type tag.
- At query time, retrieve top-K across both; pass everything to a VLM.
Easier said than done — chart interpretation is still a weakness. A chart-aware prompt + structured extraction outperforms naive image RAG by a lot.
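The retrieval half of that pipeline can be sketched with an in-memory list standing in for the vector DB. The embeddings are toy two-dimensional vectors and the item ids are made up; in practice both text and images would be embedded by the same CLIP-class encoder so that their scores are comparable.

```python
import math
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    kind: str          # type tag: "text" or "image"
    embedding: list

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_emb, index, k=3, kind=None):
    """Top-k by cosine similarity, optionally restricted to one type tag."""
    candidates = [item for item in index if kind is None or item.kind == kind]
    ranked = sorted(candidates, key=lambda item: cosine(query_emb, item.embedding),
                    reverse=True)
    return ranked[:k]

index = [
    Item("para-12", "text", [0.9, 0.1]),
    Item("chart-03", "image", [0.8, 0.3]),
    Item("para-40", "text", [0.0, 1.0]),
]
query = [1.0, 0.0]
top_mixed = [item.item_id for item in retrieve(query, index, k=2)]
top_images = [item.item_id for item in retrieve(query, index, k=2, kind="image")]
```

Keeping the type tag on each item is what lets the downstream prompt treat a retrieved chart differently from a retrieved paragraph.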
10. Tool use × multimodal
The compound:
- Vision identifies a chart → call the parse_chart_to_csv() tool → retrieve numbers → LLM reasons → answer.
- Vision identifies a UI screenshot → call read_button(label) → click → screenshot again → continue.
This is the "computer use" / agentic-vision frontier. Anthropic shipped computer use in beta in October 2024, and OpenAI followed with Operator in January 2025. Latency and reliability are the open problems.
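The chart flow can be sketched as a tiny dispatcher. Everything here is a stub: `parse_chart_to_csv` returns canned data instead of reading an image, the tool call is hard-coded where a real VLM would emit it, and the "reasoning" step is a plain `max`. The shape of the loop is the point, not the components.

```python
import csv
import io

def parse_chart_to_csv(image_id):
    """Stub tool: a real one would extract the plotted series from the image."""
    return "quarter,revenue\nQ1,10\nQ2,14\nQ3,12\n"

TOOLS = {"parse_chart_to_csv": parse_chart_to_csv}

def run_tool_call(call):
    """Dispatch a {"name": ..., "arguments": {...}} request to a registered tool."""
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

def answer_chart_question(image_id):
    # Step 1 (vision): assume the VLM recognised a chart and requested the tool.
    call = {"name": "parse_chart_to_csv", "arguments": {"image_id": image_id}}
    # Step 2 (action): run the tool and parse its structured output.
    rows = list(csv.DictReader(io.StringIO(run_tool_call(call))))
    # Step 3 (answer): reason over the numbers, not over pixels.
    best = max(rows, key=lambda row: float(row["revenue"]))
    return f"{best['quarter']} had the highest revenue ({best['revenue']})."

answer = answer_chart_question("chart-03")
```

Routing through a tool means the final answer depends on parsed numbers that can be logged and checked, rather than on values the model read off the image.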
11. Evaluation
- VQA (Visual Question Answering) — image + question, expect short answer.
- MMMU — multimodal multi-discipline understanding; college-level.
- MathVista — visual math.
- MMBench — broad rubric-based eval.
- OCRBench — OCR-style tasks.
- VideoMME — video understanding at length.
Same caveats as LLM eval (S25): contamination, judge bias, your-product-eval > public benchmark.
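For a product-specific eval, a normalised exact-match scorer is often enough to start. This is a simplified sketch: the official VQA metric instead weights each answer by how many human annotators gave it, and the normalisation rules below are a common choice rather than a standard.

```python
import re
import string

def normalize(answer):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match_accuracy(predictions, references):
    """references[i] is the list of acceptable answers for predictions[i]."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)

accuracy = exact_match_accuracy(
    ["A red hat.", "two", "blue"],
    [["red hat"], ["2", "two"], ["green"]],
)
```

Two of the three predictions match after normalisation, so the score is 2/3.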
12. New failure modes
- Hallucinating across modalities — model invents details not in the image ("the man is wearing a red hat" — no hat).
- Reading text in image wrong — OCR errors hidden inside fluent prose.
- Spatial reasoning — "to the left of the cup" still fails roughly 30% of the time in 2026.
- Chart precision — values read off bar charts can be 10% off; never trust without validation.
- Cross-modal jailbreak — instructions hidden inside images (sticker says "ignore the user").
Mitigations: structured output, verification tools, second-pass validation, never auto-action on visual claims without confirmation.
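Second-pass validation can be as simple as comparing what the model read against an independent source and refusing to act on disagreement. In this sketch the reference values are assumed to come from a parsing tool or the underlying data, and the 2% tolerance is an arbitrary example threshold.

```python
def find_mismatches(vlm_values, reference_values, rel_tol=0.02):
    """Return {key: (vlm_value, reference_value)} for missing or out-of-tolerance reads."""
    mismatches = {}
    for key, ref in reference_values.items():
        got = vlm_values.get(key)
        if got is None or abs(got - ref) > rel_tol * max(abs(ref), 1e-9):
            mismatches[key] = (got, ref)
    return mismatches

def safe_to_act(vlm_values, reference_values, rel_tol=0.02):
    """Gate any automatic action on the two sources agreeing."""
    return not find_mismatches(vlm_values, reference_values, rel_tol)

vlm_read = {"Q1": 10.0, "Q2": 15.5}     # values the VLM claims to see in the chart
tool_read = {"Q1": 10.0, "Q2": 14.0}    # values from an independent parse
problems = find_mismatches(vlm_read, tool_read)
ok_to_proceed = safe_to_act(vlm_read, tool_read)
```

When the gate fails, the mismatched keys give the reviewer something specific to check instead of a blanket "low confidence" flag.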
13. Reality check
A pragmatic multimodal stack:
- Use API for general multimodal (GPT-4o, Claude, Gemini).
- For document extraction at scale: open OCR (PaddleOCR) + structured prompt + Pydantic validator + LLM-as-judge sample audit.
- For images-only retrieval: CLIP + Faiss.
- Avoid building your own multimodal training pipeline unless this is the company's IP.
Reading material
- CLIP (Radford et al., 2021)
- Flamingo (Alayrac et al., 2022)
- LLaVA (Liu et al., 2023)
- Whisper (Radford et al., 2022)
In-depth research material
- GPT-4o announcement (OpenAI)
- Gemini technical report
- BLIP-2 (Li et al., 2023)
- SigLIP (Zhai et al., 2023)
Video reference
▶︎ Multimodal LLMs Explained (Yannic Kilcher)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Image Overlap
- Link: https://leetcode.com/problems/image-overlap/
- Difficulty: Medium
- Why this problem: Treat two image grids; find the best overlap. Forces the same kind of spatial-reasoning mental model that VLMs struggle with.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain CLIP's contrastive training and what it enables.
- Sketch the LLaVA architecture (CLIP encoder + projection + LLM).
- Compare late fusion, early fusion, cross-attention fusion.
- Estimate how many tokens an image consumes and design context budget around it.
- Pick the right approach for document extraction (OCR-first, OCR-free, hybrid).
- Solve image-overlap — grid alignment & shift counting, a mirror of the spatial-reasoning gap.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.