Search Tech Journey

Find topics, journeys and posts

back to blog
ai mlintermediate 12m2026-06-09

Multimodal LLMs — Vision, Language, Audio, Tool Use Combined

Session 36 of the 48-session learning series.

Date: Tue, 2026-07-07 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 23 · Est. read: 2 h

Why this session matters

This is Session 36 of 48 in the LLM track. By 2026, "LLM" really means "multimodal foundation model" — Claude, GPT, Gemini, all natively handle images, audio, and tools. Knowing how the modalities are stitched into a shared representation, and what fails when you cross them, separates AI engineers from prompt-jockeys.

Agenda

  • How vision is plugged into a transformer (CLIP, ViT, vision-language models)
  • Audio models — Whisper, audio tokens, native audio in/out
  • Late fusion vs early fusion vs interleaved modalities
  • Tool use across modalities — image → action → answer
  • Evaluation — VQA, MMMU, Audio-LM benchmarks; the new failure modes

Pre-read (skim before the session)

Deep dive

1. What "multimodal" actually means

Three patterns conflated under the same word:

  • Cross-modal embedding (CLIP) — text and image in same vector space for similarity / search.
  • Vision-language generation (LLaVA, GPT-4V) — model takes image + text, generates text.
  • Any-to-any (GPT-4o, Gemini Native) — image / audio / video / text in; image / audio / text out.

Each is a different architecture problem. We'll cover all three.

2. CLIP — the foundational cross-modal model

Train two encoders side by side:

  • Image encoder (ViT or ResNet) → image embedding.
  • Text encoder (transformer) → text embedding.

Train with contrastive loss: for a batch of (image, caption) pairs, the diagonal of image_emb · text_emb.T should be high, off-diagonal low.

After training:

  • Zero-shot image classification by checking cos(image, "a photo of a {class}") for each class.
  • Image search by text query.
  • Text-image retrieval at scale.

CLIP is the embedding spine of most modern multimodal stacks (Stable Diffusion, Midjourney's text understanding, every "search images by text" feature).

3. From CLIP to vision-language generation

CLIP doesn't generate text. To get a VLM:

image → CLIP-like vision encoder → image tokens (~256–576 per image)
                                          ↓
                       [ MLP projection: image-emb-dim → LLM-emb-dim ]
                                          ↓
                       interleave with text tokens:
                       <image_tokens> "Describe this." <generated_text>
                                          ↓
                                    LLM transformer

That's the LLaVA recipe. CLIP encoder + projection MLP + frozen LLM, trained on image-instruction pairs. Cheap to build; impressively capable.

4. Native multimodal architectures

GPT-4o, Gemini, Claude 3.5+ go further:

  • Image, audio, video all encoded into tokens that share the model's token space.
  • Trained end-to-end — no frozen modules.
  • Generation can output across modalities (text + image, text + audio).

Architectural variants:

  • Late fusion — encode each modality separately, concat embeddings, pass to LLM. Simplest; least cross-modal reasoning.
  • Early fusion — interleave tokens from all modalities in one stream from the start. Strongest reasoning across modalities; hardest to train.
  • Cross-attention fusion (Flamingo) — image tokens accessed via cross-attention in dedicated layers. Compromise; mainstream pre-2024.

Trend in 2026: pure early fusion with shared tokeniser.

5. Vision encoder choices

  • CLIP-ViT — most common; 224×224 or 336×336 inputs; ~256 tokens out.
  • SigLIP — better contrastive loss; sharper at smaller sizes.
  • DinoV2 — self-supervised; better at fine detail, less semantic.
  • ConvNext / EVA-CLIP — hybrid CNN-ViT; sometimes higher quality.

For document understanding / OCR-heavy: higher resolution + multi-crop sliding window.

6. Image tokens are expensive

A 336×336 image → 576 vision tokens. That eats your context window fast.

  • 4-image prompt: 2304 tokens for images alone.
  • High-res / multi-crop: easily 4000+ tokens per image.

Inference cost scales with this. Practical advice:

  • Resize aggressively before sending.
  • Use a smaller model for image-only queries; reserve the big model for cross-modal reasoning.
  • Cache image embeddings if you'll re-query the same image multiple times.

7. Audio in / audio out

Whisper-style encoders are dominant for audio→text. Modern any-to-any models:

  • Audio in — mel-spectrogram → audio tokens (Whisper-like encoder) → interleaved with text tokens.
  • Audio out — model emits audio tokens; vocoder (SoundStream, EnCodec, neural codec) converts to waveform.

End-to-end voice latency: GPT-4o demonstrated <300 ms. The bottleneck moved from speech recognition to LLM decoding.

8. Document understanding — the killer business use case

  • OCR-first — Donut, LayoutLM, PaddleOCR; extract text + structure; pass to LLM.
  • OCR-free — feed image directly to VLM; ask it to extract a table. Works for clean docs, fails on dense.
  • Hybrid — OCR + VLM ensemble. Most production.

For invoices, contracts, forms: hybrid + structured output schema + per-field validation is the only reliable approach.

9. Multimodal RAG

When your corpus has images, charts, PDFs:

  • Embed both text and image with a CLIP-class encoder.
  • Index in a vector DB with type tag.
  • At query time, retrieve top-K across both; pass everything to a VLM.

Easier said than done — chart interpretation is still a weakness. A chart-aware prompt + structured extraction outperforms naive image RAG by a lot.

10. Tool use × multimodal

The compound:

  • Vision identifies a chart → ask parse_chart_to_csv() tool → retrieve numbers → LLM reasons → answer.
  • Vision identifies a UI screenshot → call read_button(label) → click → screenshot again → continue.

This is the "computer use" / agentic-vision frontier. Anthropic and OpenAI both shipped basic UI-control in 2024. Latency and reliability are the open problems.

11. Evaluation

  • VQA (Visual Question Answering) — image + question, expect short answer.
  • MMMU — multimodal multi-discipline understanding; college-level.
  • MathVista — visual math.
  • MMBench — broad rubric-based eval.
  • OCRBench — OCR-style tasks.
  • VideoMME — video understanding at length.

Same caveats as LLM eval (S25): contamination, judge bias, your-product-eval > public benchmark.

12. New failure modes

  • Hallucinating across modalities — model invents details not in the image ("the man is wearing a red hat" — no hat).
  • Reading text in image wrong — OCR errors hidden inside fluent prose.
  • Spatial reasoning — "to the left of the cup" still fails 30% in 2026.
  • Chart precision — values read off bar charts can be 10% off; never trust without validation.
  • Cross-modal jailbreak — instructions hidden inside images (sticker says "ignore the user").

Mitigations: structured output, verification tools, second-pass validation, never auto-action on visual claims without confirmation.

13. Reality check

A pragmatic multimodal stack:

  • Use API for general multimodal (GPT-4o, Claude, Gemini).
  • For document extraction at scale: open OCR (PaddleOCR) + structured prompt + Pydantic validator + LLM-as-judge sample audit.
  • For images-only retrieval: CLIP + Faiss.
  • Avoid building your own multimodal training pipeline unless this is the company's IP.

Reading material

In-depth research material

Video reference

▶︎ Multimodal LLMs Explained (Yannic Kilcher)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Image Overlap

  • Link: https://leetcode.com/problems/image-overlap/
  • Difficulty: Medium
  • Why this problem: Treat two image grids; find the best overlap. Forces the same kind of spatial-reasoning mental model that VLMs struggle with.
  • Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

  • Explain CLIP's contrastive training and what it enables.
  • Sketch the LLaVA architecture (CLIP encoder + projection + LLM).
  • Compare late fusion, early fusion, cross-attention fusion.
  • Estimate how many tokens an image consumes and design context budget around it.
  • Pick the right approach for document extraction (OCR-first, OCR-free, hybrid).
  • Solve image-overlap — grid alignment & shift counting, mirror of the spatial-reasoning gap.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.