Fine-tuning: what's possible, what's worth it

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

OpenAI, Google, and Anthropic all offer fine-tuning APIs on production models. Training typically costs hundreds to thousands of dollars, and tokens served from the tuned model cost 2-3× the base rate. The question is rarely 'can we fine-tune'; it is 'will the lift justify the cost over the deployment's lifetime'.

Supervised fine-tuning

The default. Submit a few hundred to a few thousand high-quality input/output example pairs, evaluate the result, and iterate. Most production fine-tunes top out at a 5-15% capability lift on the targeted task and recoup the training cost over months of use.
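
As a rough illustration of preparing such example pairs, the sketch below shuffles a set of pairs, reserves a held-out slice for evaluation, and serialises the rest as JSON Lines. The field names `prompt` and `completion`, the 10% held-out fraction, and the toy ticket data are all assumptions made for the example; each provider defines its own upload schema, so check the relevant documentation before submitting anything.

```python
import json
import random


def split_examples(examples, held_out_fraction=0.1, seed=0):
    """Shuffle example pairs and split them into (train, held_out) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def to_jsonl(examples):
    """Serialise pairs as one JSON object per line (field names are illustrative)."""
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in examples
    )


# Toy data standing in for a few hundred curated pairs.
examples = [
    (f"Classify ticket {i}", "billing" if i % 2 else "technical")
    for i in range(200)
]
train, held_out = split_examples(examples)
train_jsonl = to_jsonl(train)
```

Keeping the held-out slice out of the training file from the start makes it possible to measure lift on examples the tuned model has never seen.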

Preference fine-tuning

Direct Preference Optimization (DPO) and related techniques are now exposed in commercial fine-tuning APIs. The data requirements differ: each training example is a pair of responses to the same prompt, one better and one worse, rather than a single target.
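
For intuition about what the better/worse pairs are used for, here is a minimal sketch of the DPO loss for a single pair. It assumes the summed log-probabilities of each response under the model being tuned and under a frozen reference model have already been computed; the record layout and the `beta` value of 0.1 are illustrative choices, not any provider's API. Commercial services run this kind of computation internally and only ask for the pairs.

```python
import math

# Hypothetical record layout for one preference pair.
pair = {
    "prompt": "Summarise the refund policy in one sentence.",
    "better": "Refunds are available within 30 days with a receipt.",
    "worse": "We have a policy about refunds.",
}


def dpo_loss(policy_logp_better, policy_logp_worse,
             ref_logp_better, ref_logp_worse, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (better gain - worse gain)))."""
    better_gain = policy_logp_better - ref_logp_better
    worse_gain = policy_logp_worse - ref_logp_worse
    margin = beta * (better_gain - worse_gain)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the tuned model raises the probability of the better response, relative to the reference model, by more than it raises the probability of the worse one.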

When to fine-tune vs. prompt-engineer

Fine-tune if the task is narrow, the volume is high, the prompt is hitting context limits, or the cost of serving a large engineered prompt exceeds the amortised cost of the fine-tune. Otherwise, prompt-engineer.
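
The cost comparison can be sketched as a break-even calculation: how many requests it takes for the per-request saving from a shorter prompt to cover the one-off training cost. Every number in the usage line is a made-up placeholder, and the function ignores output tokens, retraining, and evaluation effort, so treat it as a first approximation only.

```python
import math


def break_even_requests(training_cost, base_price_per_1k_tokens,
                        tuned_price_multiplier, base_prompt_tokens,
                        tuned_prompt_tokens):
    """Requests needed before a fine-tune pays for itself, or None if it never does."""
    base_cost = base_prompt_tokens / 1000 * base_price_per_1k_tokens
    tuned_cost = (tuned_prompt_tokens / 1000
                  * base_price_per_1k_tokens * tuned_price_multiplier)
    saving_per_request = base_cost - tuned_cost
    if saving_per_request <= 0:
        # The tuned model costs at least as much per request; no break-even point.
        return None
    return math.ceil(training_cost / saving_per_request)


# Placeholder inputs: $1,000 training, $0.003 per 1k tokens, 2.5x tuned rate,
# a 6,000-token engineered prompt versus an 800-token prompt after tuning.
requests_needed = break_even_requests(1000, 0.003, 2.5, 6000, 800)
```

If the prompt shrinks only slightly, the 2-3× serving premium can outweigh the saving and the function returns `None`, which is the "prompt-engineer otherwise" case.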

Open-weights fine-tuning

Cheaper, faster to iterate on, and with full control over the data. Required if the workload involves data the operator cannot send to a third-party API.

Tells

Marker	Meaning
Fine-tune produces wildly different style from base model on identical prompts	Training data was off-distribution; expect overfitting.
Fine-tune outperforms base on training-set tasks but not held-out	Classical overfitting; gather more data or simplify the task.
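
The second tell can be checked mechanically by scoring both models on the training set and on a held-out set. The sketch below assumes accuracy-style scores between 0 and 1 that have already been measured; the 0.02 minimum-lift threshold and the label strings are arbitrary choices for illustration.

```python
def diagnose(base_train, tuned_train, base_held_out, tuned_held_out,
             min_lift=0.02):
    """Label a fine-tune by where its lift over the base model shows up."""
    train_lift = tuned_train - base_train
    held_out_lift = tuned_held_out - base_held_out
    if held_out_lift >= min_lift:
        return "generalising"
    if train_lift >= min_lift:
        # Better on seen examples only: the pattern described in the table.
        return "overfitting"
    return "no lift"


# Illustrative scores: large gain on training data, almost none on held-out.
verdict = diagnose(base_train=0.70, tuned_train=0.90,
                   base_held_out=0.70, tuned_held_out=0.705)
```

A verdict of "overfitting" points back to the remedies in the table: gather more data or simplify the task.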

Frequently asked

Can I fine-tune Claude?

Yes — Anthropic exposes fine-tuning on Sonnet via partners as of 2026.

Is RLHF available for fine-tuning?

Limited — commercial APIs expose DPO-style preference tuning, not full RLHF pipelines.
