ai ml | advanced | 12m | 2026-06-21

Day 23 — Multimodal LLMs — Vision-Language, Audio, and Tool-Use Combined

2025 is the year multimodal went default. GPT-4o, Claude 3.5 Sonnet vision, Gemini 1.5/2 — every serious agent now sees and hears. Understanding how visual toke…

Multimodal LLMs unify text, images, audio, sometimes video and code into a single token stream. The architecture is more about adapters and tokenisers than new model classes — the LLM is still a transformer; the inputs just get richer.

🧠 Concept

Why it matters & the mental model.

1. The recipe

2. Vision encoders

  • ViT (Vision Transformer): split the image into fixed-size patches (classically 16×16 pixels; CLIP ViT-L/14 uses 14×14), linearly embed each patch, add position embeddings, run the sequence through a transformer. Modern variants: SigLIP, CLIP ViT-L/14, DINOv2.
  • CLIP-style pretraining (image-text contrastive) aligns vision with language space — that alignment is what makes the projector cheap.
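
The patch-splitting step can be illustrated with a minimal, standard-library sketch. It assumes a toy grayscale image stored as a list of rows; the function name `patchify` is made up for this example, and a real encoder would go on to apply a learned linear embedding and position embeddings, which are not shown.

```python
def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches.

    Patches are returned in row-major order, which is the order a ViT
    would feed them to the transformer as a sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch square into a single vector.
            flat = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            patches.append(flat)
    return patches


# Toy 4x4 image with 2x2 patches: 4 patches of 4 values each.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, 2)
print(len(patches), patches[0])
```

Each flattened patch plays the role of one "visual word"; the sequence length grows with image area, which is where the token cost discussed later comes from.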

3. The projector

A small MLP (sometimes Q-Former or perceiver resampler) maps vision tokens into the LLM's text embedding space. This is the only piece that has to be trained from scratch when you bolt vision onto an existing text LLM (LLaVA recipe).
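
As a rough illustration of the MLP variant, the sketch below maps toy vision tokens to a toy LLM embedding width with two linear layers and a GELU in between. The dimensions, the random initialisation and the `Projector` class name are assumptions for the example; a real projector would be a trained module in a deep-learning framework.

```python
import math
import random


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weight, bias):
    # weight has shape [out_dim][in_dim].
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_layer(out_dim, in_dim, rng):
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class Projector:
    """Two-layer MLP: vision_dim -> llm_dim -> llm_dim (untrained, toy sizes)."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_layer(llm_dim, vision_dim, rng)
        self.w2, self.b2 = init_layer(llm_dim, llm_dim, rng)

    def __call__(self, vision_tokens):
        projected = []
        for token in vision_tokens:
            hidden = [gelu(h) for h in linear(token, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


VISION_DIM, LLM_DIM = 8, 16  # toy sizes, far smaller than real models
projector = Projector(VISION_DIM, LLM_DIM)
vision_tokens = [[0.1] * VISION_DIM for _ in range(4)]
projected = projector(vision_tokens)
print(len(projected), len(projected[0]))
```

The token count is unchanged; only the width of each token changes so that it can sit next to text embeddings in the LLM's input sequence.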

4. Tokenisation explosion

A single 1024×1024 image becomes thousands of patches → thousands of "tokens". Vision tokens dominate context. Tricks:

  • Multiple crops at different resolutions ("dynamic resolution", InternVL, GPT-4V).
  • Visual token compression (a Q-Former or perceiver resampler emits a fixed number of tokens regardless of image size, e.g. 32 in BLIP-2 or 64 in Flamingo).
  • Routing: low-importance regions get fewer tokens.
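
The arithmetic behind the explosion is easy to check. The sketch below assumes 16×16 patches and, as one illustrative compression scheme, 2×2 pooling of neighbouring patches; the function names and the fixed budget of 64 are choices made for this example rather than any particular model's settings.

```python
def patch_tokens(width, height, patch=16):
    """Number of patch tokens for an image, one token per patch."""
    return (width // patch) * (height // patch)


def pooled_tokens(width, height, patch=16, pool=2):
    """Token count after merging each pool x pool block of patches into one."""
    cols, rows = width // patch, height // patch
    return (cols // pool) * (rows // pool)


FIXED_QUERIES = 64  # illustrative fixed-size resampler output

full = patch_tokens(1024, 1024)     # 64 * 64 patches
pooled = pooled_tokens(1024, 1024)  # 32 * 32 after 2x2 pooling
print(full, pooled, FIXED_QUERIES)
```

Doubling the image side quadruples the uncompressed count, while a fixed-size resampler stays at the same budget whatever the resolution.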

🛠 Deep Dive

Internals, code, architecture.

5. Audio

  • Whisper-style: log-mel spectrogram → encoder → decoder. Strong ASR baseline.
  • Speech-as-tokens (AudioLM, VALL-E): quantise audio into discrete tokens with a neural codec (VQ-VAE-style, e.g. SoundStream or EnCodec); the language model then generates those tokens.
  • Real-time conversational (GPT-4o, Gemini Live, Moshi): low-latency duplex; user can interrupt; tokens flow both ways.
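
For the Whisper-style path, the sequence length can be worked out by hand. The constants below are the ones used by the original Whisper release as far as I know (16 kHz audio, 10 ms hop, 30-second chunks, a stride-2 convolution before the encoder); treat them as assumptions and check them against the model you actually use.

```python
SAMPLE_RATE = 16_000   # samples per second
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)
CHUNK_SECONDS = 30     # audio is padded or trimmed to this length
CONV_STRIDE = 2        # downsampling before the transformer encoder


def mel_frames(seconds):
    """Spectrogram frames produced for a clip of the given length."""
    return seconds * SAMPLE_RATE // HOP_LENGTH


def encoder_positions(seconds):
    """Sequence length seen by the encoder after strided convolution."""
    return mel_frames(seconds) // CONV_STRIDE


frames = mel_frames(CHUNK_SECONDS)
positions = encoder_positions(CHUNK_SECONDS)
print(frames, positions)
```

Under these assumptions a 30-second chunk costs the encoder 1,500 positions, which is the audio analogue of the image patch count.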

6. Training strategies

  1. Pretrain the vision encoder against a text encoder with a contrastive loss (CLIP). The projector is not involved yet.
  2. Freeze LLM, train projector on (image, caption) pairs (LLaVA stage 1).
  3. Instruction tuning on (image, instruction, answer) triples — the secret sauce.
  4. Optional: end-to-end fine-tune for top accuracy.
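
The contrastive objective in step 1 can be sketched in a few lines. This is a simplified, standard-library version of a CLIP-style symmetric loss with a fixed temperature of 0.07 (CLIP learns its temperature); the embeddings are toy 2-D vectors and the function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    # Numerically stable log-sum-exp minus the target logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of images matches pair i of texts."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    n = len(images)
    # Image -> text: each row should peak on the diagonal.
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image: each column should peak on the diagonal.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


aligned = clip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)
```

The loss is near zero when matching pairs point the same way and large when captions are swapped, which is the pressure that pulls image and text embeddings into a shared space.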

7. Capabilities & limits today

  • OCR + document understanding: Claude, GPT-4o, Gemini all strong.
  • Chart / table reading: improved, but small text + complex layouts still fail.
  • Spatial reasoning (count, locate): improving fast but lossy.
  • Video (multiple frames): mostly handled as multi-image input; true temporal understanding is still early (Gemini 1.5 is one model that accepts video directly).
  • Generation: mostly separate models (DALL·E, Imagen, Veo, Sora); generation unified with the LLM is only beginning to reach production (GPT-4o's native image output is an early example).

8. Tool use + vision = browser agents

The 2024-2025 wave: GPT-4V + a browser tool → "look at screenshot, click here". Real products (Anthropic Computer Use, OpenAI Operator, Adept). Failure modes: visual grounding errors, hallucinated UI elements, page latency.
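
A minimal sketch of such a loop is shown below, with a step cap and a bounds check on click coordinates as a cheap guard against one class of grounding error. Everything here is hypothetical: `propose_action`, `take_screenshot` and `click` stand in for a model call and a browser driver, and are stubbed so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str  # "click" or "done"
    x: int = 0
    y: int = 0


def in_bounds(action, width, height):
    return 0 <= action.x < width and 0 <= action.y < height


def run_agent(propose_action, take_screenshot, click, max_steps=5):
    """Screenshot -> action loop that stops on "done" or at the step cap."""
    log = []
    for _ in range(max_steps):
        width, height, pixels = take_screenshot()
        action = propose_action(pixels)
        if action.kind == "done":
            log.append("done")
            return log
        if not in_bounds(action, width, height):
            # Off-screen coordinates: skip the click and look again.
            log.append("rejected")
            continue
        click(action.x, action.y)
        log.append("clicked")
    log.append("step_limit")
    return log


# Stubs: a scripted "model" and a fake 1280x800 browser.
scripted = iter([Action("click", 10, 20), Action("click", 5000, 20), Action("done")])
clicks = []
log = run_agent(
    propose_action=lambda pixels: next(scripted),
    take_screenshot=lambda: (1280, 800, b""),
    click=lambda x, y: clicks.append((x, y)),
)
print(log, clicks)
```

The bounds check only catches coordinates that fall off the screen; a click on the wrong but visible element still needs a stronger check, such as matching against the page's actual element boxes.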

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Evaluation

  • VQA / MMMU / DocVQA / ChartQA: benchmarks.
  • For products: build a golden set of (image, question, expected answer) — same eval discipline as text-only.
  • Safety: image jailbreaks (text in image), CSAM filtering, PII (faces, plates).
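
A golden-set harness for the product case can be very small. The sketch below assumes exact-match scoring after light normalisation, which suits short factual answers; the file names, questions and the `fake_model` stub are invented for illustration, and `answer_fn` is where a real model call would go.

```python
def normalize(text):
    """Lowercase, trim, drop a trailing period, collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def evaluate(golden_set, answer_fn):
    """Return (accuracy, failures) for a list of image/question/expected cases."""
    failures = []
    for case in golden_set:
        predicted = answer_fn(case["image"], case["question"])
        if normalize(predicted) != normalize(case["expected"]):
            failures.append((case["image"], predicted))
    accuracy = 1 - len(failures) / len(golden_set)
    return accuracy, failures


golden_set = [
    {"image": "invoice_01.png", "question": "What is the total?", "expected": "42.00"},
    {"image": "chart_07.png", "question": "Which bar is tallest?", "expected": "Q3"},
]


def fake_model(image, question):
    # Stand-in for a real multimodal model call.
    return {"invoice_01.png": " 42.00 ", "chart_07.png": "q2"}[image]


accuracy, failures = evaluate(golden_set, fake_model)
print(accuracy, failures)
```

Keeping the failures list, not just the score, makes it easy to see whether regressions cluster on charts, small text or a particular document type.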

10. Cost

Image inputs are token-expensive (a single ~1024² image often costs as much as 1000-2000 text tokens). Resize on the client side; use the model's "low detail" mode when high resolution isn't needed.
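
A back-of-the-envelope estimator makes the saving from resizing concrete. The divisor of 750 pixels per token is the rule of thumb Anthropic documents for Claude, as far as I know; other providers count differently (tile-based schemes, for example), so treat the numbers as an assumption-laden estimate rather than a bill.

```python
import math


def estimated_image_tokens(width, height, pixels_per_token=750):
    """Rough token estimate: image area divided by a pixels-per-token constant."""
    return math.ceil(width * height / pixels_per_token)


def fit_within(width, height, max_side):
    """Downscale so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


full = estimated_image_tokens(1024, 1024)
small_w, small_h = fit_within(1024, 1024, 512)
small = estimated_image_tokens(small_w, small_h)
print(full, (small_w, small_h), small)
```

Halving each side cuts the estimate to roughly a quarter, which is why a client-side resize is usually the first cost lever to pull.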

11. Practical agent patterns

  • Screenshot → action loop with bounded steps.
  • Multimodal RAG: retrieve images + text snippets; pass top-k to LLM.
  • OCR pre-pass: extract text from images with a fast OCR, give LLM both image and OCR text → better grounding.
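
The retrieval half of multimodal RAG can be sketched as a top-k search over one index that holds both images and text. The example assumes both kinds of item were already embedded into a shared space (for instance by a CLIP-style model); the vectors, ids and the `top_k` helper are toy values for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k items most similar to the query, images and text alike."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


# Toy index: pretend these vectors came from a shared image/text embedder.
index = [
    {"id": "fig3.png", "kind": "image", "vec": [0.9, 0.1, 0.0]},
    {"id": "para12", "kind": "text", "vec": [0.8, 0.2, 0.1]},
    {"id": "logo.png", "kind": "image", "vec": [0.0, 0.1, 0.9]},
]
hits = top_k([1.0, 0.0, 0.0], index, k=2)
print([(hit["id"], hit["kind"]) for hit in hits])
```

The retrieved images and snippets are then passed to the LLM together; given the token cost of images, k for images is usually kept smaller than k for text.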

12. What to take away

"How does a multimodal model see an image?" Strong answers: ViT patches → projector → text embedding space, mention CLIP pretraining, talk about token cost, name one product (Claude / GPT-4o) and one failure mode.
