OpenAI · The Tool-Use Inflection
GPT-4o
The omni model. Text, image, audio natively in one system. Speed doubled vs. GPT-4.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
GPT-4o ('o' for omni) was the first OpenAI model to handle text, image, and audio natively in one system without a modality handoff. The voice mode update enabled sub-300ms conversational response, making it feel qualitatively different from prior voice AI. On text benchmarks it matched GPT-4 Turbo at roughly half the latency.
Field signature
Voice mode responds in under 300ms with emotion and tone tracking.
Specifications
| Released | 2024-05-13 |
|---|---|
| Context window | 128,000 tokens |
| Pricing | $2.50 / $10 per million tokens |
| Modalities | text · image · audio |
| License | Commercial API only |
| Era | The Tool-Use Inflection |
Strengths
- Native multimodal
- Speed
- Voice quality
Weaknesses
- Reasoning depth behind o-series
- Context window shorter than Gemini
Authentication markers
The fingerprints by which GPT-4o can be identified from its output alone.
| Tell | Meaning |
|---|---|
| Native voice mode without a TTS post-processing step. | GPT-4o or GPT-4o-mini. |
Notable works
- First sub-300ms conversational voice AI at frontier quality
Market position
$2.50-$10 per million tokens
Partner offer
OpenAI's API surface remains the broadest commercial offering.
Try OpenAI →Affiliate link — see disclosure.
Primary sources
- [1] OpenAI: GPT-4o
From the Almanac shop
The Operator's Compendium
Every agent harness, every routing pattern, every cost trick. 90-page PDF.
$29 — Coming soon