Field identification
Fine-tuning: what's possible, what's worth it
Fine-tuning closed models is now table-stakes. Whether it's worth your money is a different question.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
OpenAI, Google, and Anthropic all offer fine-tuning APIs on production models. The training cost is in the hundreds to thousands of dollars; the served-token cost is 2-3× the base rate. The question is rarely 'can we fine-tune'; it is 'will the lift justify the costs over our deployment lifetime'.
Supervised fine-tuning
The default. Submit a few hundred to a few thousand high-quality example pairs. Iterate. Most production fine-tunes top out at a 5-15% capability lift on the targeted task and pay it back over months of use.
Preference fine-tuning
DPO and related techniques are now exposed in commercial fine-tuning APIs. The data requirements are different — pairs of better/worse responses rather than single targets.
When to fine-tune vs. prompt-engineer
Fine-tune if the task is narrow, the volume is high, the prompt is hitting context limits, or the production cost of large-prompt-engineering exceeds the fine-tune amortised cost. Prompt-engineer otherwise.
Open-weights fine-tuning
Cheaper, faster iteration, full data control. Required if the workload involves data the operator cannot send to a third-party API.
Tells
| Marker | Meaning |
|---|---|
| Fine-tune produces wildly different style from base model on identical prompts | Training data was off-distribution; expect overfitting. |
| Fine-tune outperforms base on training-set tasks but not held-out | Classical overfitting; gather more data or simplify the task. |
Frequently asked
Can I fine-tune Claude?
Yes — Anthropic exposes fine-tuning on Sonnet via partners as of 2026.
Is RLHF available for fine-tuning?
Limited — commercial APIs expose DPO-style preference tuning, not full RLHF pipelines.
From the Almanac shop
Model Tells — Flashcard Deck
Identify any frontier model from a paragraph of output. 60 cards.
$14 — Coming soon