Era I · 2017–2018
The Transformer Genesis
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
In June of 2017 a team of Google researchers published 'Attention Is All You Need' at NeurIPS. Within fifteen months, two architectures derived from it (BERT, from the same Google team, and GPT-1, from a small lab in San Francisco) had set new state-of-the-art on most benchmarks the field cared about. The transformer, as the architecture became known, was not the only viable design at the time. It was simply the only one that scaled.
The paper
Vaswani et al., 'Attention Is All You Need', published as arXiv 1706.03762 on the 12th of June 2017. Eight authors, fifteen pages, one diagram. The diagram showed an encoder-decoder stack with multi-head self-attention; the prose argued that recurrence and convolution were unnecessary. The paper has been cited well over a hundred thousand times.
What the paper did not say, but became immediately apparent, was that the architecture was specifically friendly to parallel training on modern GPUs. Recurrent networks had been bottlenecked by their sequential nature. Transformers were not. The scaling that followed was a direct consequence.
BERT
October 2018, also from Google, also derived from the transformer block. BERT was bidirectional, encoder-only, and pre-trained on a masked-language-modelling objective. It set state-of-the-art on eleven benchmarks simultaneously. For a brief moment it looked like encoder-only models would dominate.
GPT-1
June 2018, from OpenAI, decoder-only, generative. Smaller than BERT, less hyped, less benchmarked. The architecture choice that ultimately won.
Signature models of the era
- Transformer (the architecture, not a model)
- BERT-base / BERT-large
- GPT-1
Technical shifts
- Self-attention replaces recurrence
- Pre-training on raw text becomes the default starting point
- Parallel GPU training scales smoothly
Market shifts
- Hugging Face begins as a chatbot company; pivots to model hosting
- Google Research and OpenAI emerge as the two reference labs
Authentication — is the document from this era?
| Tell | Meaning |
|---|---|
| Architecture references 'multi-head self-attention' | Transformer-derived (i.e., almost everything since 2017). |
| Pre-training described as 'masked language modelling' | BERT lineage. |
Primary sources
- [1] Vaswani et al.: Attention Is All You Need — 2017-06-12
- [2] Devlin et al.: BERT — 2018-10-11
From the Almanac shop
The AI Eras — Pocket Field Guide
Ten eras of AI on a single foldable. The Almanac in your pocket.
$19 — Coming soon