Era VI · 2024
The Multimodal Turn
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
The multimodal turn closed the gap between what a language model could see and what it could read. GPT-4V, Claude 3 with vision, and Gemini all shipped vision capabilities within a six-month window. Document parsing, screenshot reasoning, chart reading, and visual question answering moved from research demos to production features inside a year.
GPT-4V
September 2023, vision capability added to GPT-4. The first frontier model that could read a chart and answer questions about it without an OCR pre-pass.
Claude 3 with vision
March 2024, all three Claude 3 sizes shipped with vision. The competitive parity was the news; the capability had become table stakes.
Audio and video
Gemini led on video understanding through 2024 and 2025. GPT-4o introduced realtime audio. By the end of 2025 every frontier model supported at least text, image, and audio inputs.
Signature models of the era
- GPT-4V
- Claude 3 (Opus, Sonnet, Haiku)
- Gemini 1.5 Pro
Technical shifts
- Vision becomes universal
- Document parsing collapses as a separate product category
- Audio inputs and outputs ship across providers
Market shifts
- OCR-as-a-service companies pivot or shrink
- Document-AI startups consolidate
Authentication — is the document from this era?
| Tell | Meaning |
|---|---|
| Mentions 'GPT-4V' as if separate from 'GPT-4' | Document dates from late 2023 to mid-2024. |
Agents catalogued in this era
- Stable Diffusion 3 — Open-weights image generation flagship. Text rendering finally works.
- Midjourney v7 — The image generation model that made AI art a creative category. Photorealism and style control best-in-class.
Primary sources
- [1] OpenAI: GPT-4V system card — 2023-09-25