Operations · 11 min
Self-hosting open-weights: when it pays and when it doesn't
Self-hosting Llama or Qwen makes sense at the scale where you stop counting dollars and start counting hours.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Most operators who consider self-hosting underestimate the operational burden and overestimate the savings. Some operators benefit enormously. The trick is knowing which group you are in before you commit.
Volume threshold
Self-hosting beats API pricing roughly when you exceed several hundred million tokens per day at sustained utilisation. Below that volume, dedicated cloud beats both self-hosting and the shared API. At lower volumes still, the shared API wins.
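To see where a threshold of this kind comes from, here is a minimal break-even sketch. Every figure in it (API price per million tokens, GPU count, hourly GPU rate, engineer cost) is a hypothetical placeholder rather than a quoted price, and the function names are illustrative. It also assumes the fleet can actually serve the volume at sustained utilisation, which is the part most estimates get wrong.

```python
def daily_api_cost(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Cost of sending the whole daily volume through a metered API."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


def daily_self_host_cost(gpu_count: int, usd_per_gpu_hour: float,
                         engineer_usd_per_day: float) -> float:
    """Fixed daily cost of a self-hosted fleet: GPUs run 24h, plus ops time."""
    return gpu_count * usd_per_gpu_hour * 24 + engineer_usd_per_day


def break_even_tokens_per_day(usd_per_million_tokens: float, gpu_count: int,
                              usd_per_gpu_hour: float,
                              engineer_usd_per_day: float) -> float:
    """Daily token volume at which API spend equals the fixed self-host cost."""
    fixed = daily_self_host_cost(gpu_count, usd_per_gpu_hour, engineer_usd_per_day)
    return fixed / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $1.00 per million tokens, 4 GPUs at $2.50/hour,
    # and $200/day of engineering time allocated to operations.
    tokens = break_even_tokens_per_day(1.0, 4, 2.5, 200.0)
    print(f"Break-even at roughly {tokens / 1e6:.0f} million tokens per day")
```

Plug in your own quotes. If the break-even volume sits well above what you actually serve, the API or dedicated cloud is still the cheaper option.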
Operational burden
Inference servers fail. Model behaviour in production can drift after vLLM upgrades. Quantisation choices cost capability in non-obvious ways. Plan for at least a part-time engineer on operations.
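One way to catch behavioural drift is to replay a fixed set of prompts before and after an upgrade and compare the outputs. The sketch below assumes you have already captured both sets of outputs as plain dictionaries keyed by a prompt ID. The function name, the similarity threshold, and the sample data are all hypothetical, and character-level similarity is a crude stand-in for a proper task-specific evaluation.

```python
from difflib import SequenceMatcher


def find_drift(baseline: dict[str, str], candidate: dict[str, str],
               min_similarity: float = 0.9) -> list[str]:
    """Return prompt IDs whose output changed noticeably after an upgrade.

    baseline:  outputs captured before the upgrade, keyed by prompt ID.
    candidate: outputs captured after the upgrade, same keys expected.
    A prompt missing from `candidate` is also reported as drifted.
    """
    drifted = []
    for prompt_id, old_output in baseline.items():
        new_output = candidate.get(prompt_id)
        if new_output is None:
            drifted.append(prompt_id)
            continue
        # Ratio is 1.0 for identical strings and falls toward 0.0 as they diverge.
        similarity = SequenceMatcher(None, old_output, new_output).ratio()
        if similarity < min_similarity:
            drifted.append(prompt_id)
    return drifted


if __name__ == "__main__":
    # Hypothetical captured outputs for two golden prompts.
    before = {"capital": "Paris is the capital of France.", "sum": "4"}
    after = {"capital": "Paris is the capital of France.", "sum": "The answer is probably five."}
    print(find_drift(before, after))
```

Run with deterministic sampling settings where possible, so that a flagged prompt points at the upgrade rather than at sampling noise.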
Privacy gain
Privacy is the most common honest reason to self-host: data never leaves the VPC. Regulated industries (healthcare, banking, defence) often have no other option.
Frequently asked
Which open-weights model right now?
Llama 4 if you need scale, Qwen 3 if you need multilingual, DeepSeek R1 if you need reasoning.