The Multimodal Turn

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

The multimodal turn closed the gap between what a language model could see and what it could read. GPT-4V, Claude 3 with vision, and Gemini all shipped vision capabilities within a six-month window. Document parsing, screenshot reasoning, chart reading, and visual-question-answering moved from research demos to production features inside a year.

GPT-4V

In September 2023, OpenAI added vision capability to GPT-4. It was the first frontier model that could read a chart and answer questions about it without an OCR pre-pass.

Claude 3 with vision

In March 2024, all three Claude 3 sizes shipped with vision. Competitive parity was the news; the capability had become table stakes.

Audio and video

Gemini led on video understanding through 2024 and 2025. GPT-4o introduced real-time audio. By the end of 2025 every frontier model supported at least text, image, and audio inputs.

Signature models of the era

GPT-4V
Claude 3 (Opus, Sonnet, Haiku)
Gemini 1.5 Pro

Technical shifts

Vision becomes universal
Document parsing collapses as a separate product category
Audio inputs and outputs ship across providers

Market shifts

OCR-as-a-service companies pivot or shrink
Document-AI startups consolidate

Authentication — is the document from this era?

Tell	Meaning
Mentions 'GPT-4V' as if separate from 'GPT-4'	Document dates from late 2023 to mid-2024.

Agents catalogued in this era

Stable Diffusion 3 — Open-weights image generation flagship. Text rendering finally works.
Midjourney v7 — The image generation model that made AI art a creative category. Photorealism and style control best-in-class.

Primary sources

[1] OpenAI: GPT-4V system card — 2023-09-25
