The Domesday Book of KWJ · AI

Era I · 2017–2018

The Transformer Genesis

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

In June 2017 a team of Google researchers posted 'Attention Is All You Need' to arXiv; the paper was presented at NIPS (now NeurIPS) that December. Within sixteen months, two architectures derived from it (BERT, also from Google, and GPT-1, from a small lab in San Francisco) had set new state-of-the-art results on most benchmarks the field cared about. The transformer, as the architecture became known, was not the only viable design at the time. It was simply the only one that scaled.

The paper

Vaswani et al., 'Attention Is All You Need', published as arXiv 1706.03762 on the 12th of June 2017. Eight authors, fifteen pages, and one architecture diagram that has been reproduced ever since. The diagram showed an encoder-decoder stack with multi-head self-attention; the prose argued that recurrence and convolution were unnecessary. The paper has been cited well over a hundred thousand times.

What the paper noted, and what quickly proved decisive, was that the architecture was unusually well suited to parallel training on modern GPUs. Recurrent networks process a sequence one step at a time, each step waiting on the one before it. A transformer layer computes every position at once. The scaling that followed was a direct consequence.
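The contrast can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the paper's implementation: a single attention head, no learned projections (queries, keys and values are all the input itself), and a toy recurrence standing in for an RNN. The names `self_attention` and `recurrent_states` are hypothetical.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x is a list of n vectors (lists of floats) of equal dimension d.
    Assumption: queries, keys and values are x itself; a real layer
    would first apply learned projection matrices.
    """
    d = len(x[0])
    out = []
    for q in x:
        # This iteration reads only the inputs, never an earlier output,
        # so all n iterations could run at the same time.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def recurrent_states(x, decay=0.5):
    """Toy recurrence: step t needs the state from step t-1, so it is sequential."""
    h = [0.0] * len(x[0])
    states = []
    for v in x:
        h = [decay * hi + vi for hi, vi in zip(h, v)]
        states.append(h)
    return states
```

On a GPU the loop in `self_attention` collapses into a handful of matrix multiplications over the whole sequence, while the loop in `recurrent_states` cannot be unrolled that way because each state depends on the previous one.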

BERT

October 2018, also from Google, also derived from the transformer block. BERT was bidirectional, encoder-only, and pre-trained on a masked-language-modelling objective. It set new state-of-the-art results on eleven NLP tasks at once. For a brief moment it looked like encoder-only models would dominate.
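A masked-language-modelling example can be built roughly as follows. This is a sketch, not BERT's actual preprocessing code: it works on whole words rather than subword pieces, and the function name `mask_tokens` and its parameters are hypothetical. The 15% selection rate and the 80/10/10 replacement split follow the recipe described in the BERT paper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, targets) for one masked-language-modelling example.

    targets[i] is the original token where position i was selected for
    prediction, and None everywhere else.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: the token passes through and is not predicted.
            corrupted.append(tok)
            targets.append(None)
            continue
        targets.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: leave the token unchanged
    return corrupted, targets


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    print(mask_tokens(words, sorted(set(words)), random.Random(0)))
```

The model sees `corrupted` and is trained to recover `targets` at the selected positions, using context from both the left and the right. That is what makes the objective bidirectional.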

GPT-1

June 2018, from OpenAI, decoder-only, generative. Comparable in size to BERT-base and much smaller than BERT-large, less hyped, less benchmarked. The architecture choice that ultimately won.
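What 'decoder-only, generative' means in practice can be sketched as two small helpers. Both names are hypothetical and the code illustrates the idea rather than OpenAI's implementation: each position may attend only to itself and earlier positions, and the training signal is the next token.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_pairs(tokens):
    """Pair every non-empty prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    for row in causal_mask(3):
        print(row)
    print(next_token_pairs(["a", "b", "c"]))
```

Because the mask hides the future, a single pass over a sequence yields a prediction target at every position, and the same model can generate text by repeatedly appending its own prediction.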

Signature models of the era

  • Transformer (the architecture, not a model)
  • BERT-base / BERT-large
  • GPT-1

Technical shifts

  • Self-attention replaces recurrence
  • Pre-training on raw text becomes the default starting point
  • Parallel GPU training scales smoothly

Market shifts

  • Hugging Face begins as a chatbot company; pivots to model hosting
  • Google Research and OpenAI emerge as the two reference labs

Authentication — is the document from this era?

Tell | Meaning
Architecture references 'multi-head self-attention' | Transformer-derived (i.e., almost everything since 2017).
Pre-training described as 'masked language modelling' | BERT lineage.
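The two tells above can be applied mechanically. The sketch below is a hypothetical helper, not part of any existing tool; the phrase patterns come from the table, while the tolerance for hyphens and for the American spelling 'modeling' is an added assumption.

```python
import re

# (pattern, meaning) pairs taken from the table of tells.
# Assumption: hyphenated forms and the spelling 'modeling' should also match.
TELLS = [
    (re.compile(r"multi-head self-attention", re.IGNORECASE), "Transformer-derived"),
    (re.compile(r"masked[- ]language[- ]model+ing", re.IGNORECASE), "BERT lineage"),
]


def find_tells(text):
    """Return the meanings of every tell whose phrase appears in text."""
    return [meaning for pattern, meaning in TELLS if pattern.search(text)]


if __name__ == "__main__":
    print(find_tells("Pre-trained with masked language modelling."))
```

A match is a hint about lineage rather than proof of date: as the table notes, the first tell covers almost everything published since 2017.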

Primary sources

  1. Vaswani et al.: Attention Is All You Need (2017-06-12)
  2. Devlin et al.: BERT (2018-10-11)
