The Transformer Genesis

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

In June 2017 a team of Google researchers posted 'Attention Is All You Need' to arXiv; it was presented at NIPS (now NeurIPS) that December. Within sixteen months, two architectures derived from it (BERT, from another group at Google, and GPT-1, from a small lab in San Francisco) had set new state-of-the-art results on most benchmarks the field cared about. The transformer, as the architecture became known, was not the only viable design at the time. It was simply the only one that scaled.

The paper

Vaswani et al., 'Attention Is All You Need', posted as arXiv 1706.03762 on the 12th of June 2017. Eight authors, fifteen pages, one much-reproduced architecture diagram. The diagram showed an encoder-decoder stack built on multi-head self-attention; the prose argued that recurrence and convolution were unnecessary. The paper has been cited well over a hundred thousand times.

What the paper claimed, and what practice quickly confirmed, was that the architecture was especially friendly to parallel training on modern GPUs. Recurrent networks had been bottlenecked by their sequential nature: each step had to wait for the one before it. Transformers could process every position of a training sequence at once. The scaling that followed was a direct consequence.
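The contrast between the two computation patterns can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the function names (self_attention, recurrent) are hypothetical, and the learned query, key and value projections and the multiple heads are left out to keep it short.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity projections.

    x is a list of token vectors. Assumption: no learned projections
    and a single head, unlike a real transformer layer.
    """
    d = len(x[0])
    out = []
    for q in x:  # no iteration reads another iteration's result
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def recurrent(x, decay=0.5):
    """Toy recurrence: step t needs the hidden state from step t-1."""
    h = [0.0] * len(x[0])
    out = []
    for v in x:
        h = [decay * hi + vi for hi, vi in zip(h, v)]
        out.append(h)
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)  # rows could be computed in any order
states = recurrent(tokens)         # rows must be computed in sequence
```

The loop in self_attention is written sequentially only for readability. Because no iteration depends on another, a GPU can evaluate all of them at once as matrix multiplications, which the loop in recurrent does not allow.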

BERT

October 2018, also from Google, also built from the transformer block. BERT was bidirectional, encoder-only, and pre-trained on a masked-language-modelling objective. It set a new state of the art on eleven NLP tasks at once. For a brief moment it looked like encoder-only models would dominate.
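A masked-language-modelling objective can be sketched roughly as below. The 15% selection rate and the 80/10/10 split between the mask token, a random token and the unchanged token follow the BERT paper. The function name mask_tokens, the per-token coin flip and the toy vocabulary are simplifying assumptions for illustration, not the released code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rate=0.15, seed=0):
    """Return (inputs, labels) for one masked-language-modelling example.

    labels[i] holds the original token where a prediction is required
    and None elsewhere. Assumption: each token is selected by an
    independent coin flip, a simplification of the original procedure.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: left unchanged
        else:
            labels.append(None)  # position is not scored
            inputs.append(tok)
    return inputs, labels

sentence = "the cat sat on the mat".split()
inputs, labels = mask_tokens(sentence, vocab=sentence, seed=1)
```

Because the model sees the whole corrupted sentence at once, it can use context on both sides of each hidden token, which is what 'bidirectional' refers to here.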

GPT-1

June 2018, four months before BERT, from OpenAI, decoder-only, generative. Smaller than BERT-large, less hyped, less benchmarked. The architecture choice that ultimately won.
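What makes a decoder-only model generative is a causal mask: each position may attend only to itself and to earlier positions, so the model can be trained to predict the next token. A minimal sketch, with hypothetical helper names (causal_mask, masked_softmax) and uniform toy scores standing in for real attention scores:

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, ignoring disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy scores
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
# weights[0] == [1.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0]
```

An encoder-only model such as BERT uses no such mask, so every position sees the whole sequence.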

Signature models of the era

Transformer (the architecture, not a model)
BERT-base / BERT-large
GPT-1

Technical shifts

Self-attention replaces recurrence
Pre-training on raw text becomes the default starting point
Parallel GPU training scales smoothly

Market shifts

Hugging Face begins as a chatbot company; pivots to model hosting
Google Research and OpenAI emerge as the two reference labs

Authentication — is the document from this era?

Tell	Meaning
Architecture references 'multi-head self-attention'	Transformer-derived (i.e., almost everything since 2017).
Pre-training described as 'masked language modelling'	BERT lineage.

Primary sources

[1] Vaswani et al.: Attention Is All You Need — 2017-06-12
[2] Devlin et al.: BERT — 2018-10-11
