The Domesday Book of KWJ · AI

Era VI · 2024

The Multimodal Turn

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

The multimodal turn closed the gap between what a language model could see and what it could read. GPT-4V, Claude 3 with vision, and Gemini all shipped vision capabilities within a six-month window. Document parsing, screenshot reasoning, chart reading, and visual question answering moved from research demos to production features inside a year.

GPT-4V

September 2023, vision capability added to GPT-4. The first frontier model that could read a chart and answer questions about it without an OCR pre-pass.

Claude 3 with vision

March 2024, all three Claude 3 sizes shipped with vision. The competitive parity was the news; the capability had become table stakes.

Audio and video

Gemini led on video understanding through 2024 and 2025. GPT-4o introduced real-time audio. By the end of 2025 every frontier model supported at least text, image, and audio inputs.

Signature models of the era

  • GPT-4V
  • Claude 3 (Opus, Sonnet, Haiku)
  • Gemini 1.5 Pro

Technical shifts

  • Vision becomes universal
  • Document parsing collapses as a separate product category
  • Audio inputs and outputs ship across providers

Market shifts

  • OCR-as-a-service companies pivot or shrink
  • Document-AI startups consolidate

Authentication — is the document from this era?

Tell: Mentions 'GPT-4V' as if separate from 'GPT-4'
Meaning: Document dates from late 2023 to mid-2024.

Agents catalogued in this era

  • Stable Diffusion 3: Open-weights image generation flagship. Text rendering finally works.
  • Midjourney v7: The image generation model that made AI art a creative category. Photorealism and style control best-in-class.
