The Domesday Book of KWJ · AI

Operations · 11 min

Self-hosting open-weights: when it pays and when it doesn't

Self-hosting Llama or Qwen makes sense at the scale where you stop counting dollars and start counting hours.

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Most operators who consider self-hosting underestimate the operational burden and overestimate the savings. Some operators benefit enormously. The trick is knowing which group you are in before you commit.

Volume threshold

Self-hosting beats API pricing roughly when you exceed several hundred million tokens per day at sustained utilisation. Below that threshold, dedicated cloud capacity beats both self-hosting and shared APIs. At lower volumes still, a shared API wins.
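To see where your own threshold falls, a rough break-even calculation helps. The sketch below is illustrative only: the `SelfHostSetup` fields, the helper names, and every number in the example (GPU price, throughput, utilisation, operations cost, API price) are hypothetical placeholders, not measurements or quotes. Substitute your own figures.

```python
from dataclasses import dataclass


@dataclass
class SelfHostSetup:
    """All fields are hypothetical inputs; substitute your own quotes."""
    gpu_hourly_usd: float          # rental or amortised cost per GPU-hour
    gpus: int                      # GPUs kept running around the clock
    tokens_per_sec_per_gpu: float  # peak throughput you measured per GPU
    utilisation: float             # fraction of peak actually sustained (0..1)
    ops_monthly_usd: float         # engineer time spent on operations


def self_host_daily_cost(s: SelfHostSetup) -> float:
    """Fixed daily cost: GPUs bill whether or not they serve traffic."""
    return s.gpu_hourly_usd * s.gpus * 24 + s.ops_monthly_usd / 30


def self_host_daily_capacity(s: SelfHostSetup) -> float:
    """Tokens per day the cluster can serve at the sustained utilisation."""
    return s.tokens_per_sec_per_gpu * s.gpus * 86_400 * s.utilisation


def break_even_tokens_per_day(s: SelfHostSetup, api_usd_per_million: float) -> float:
    """Daily volume at which the API bill equals the fixed self-hosting cost."""
    return self_host_daily_cost(s) / api_usd_per_million * 1_000_000


# Hypothetical example values, chosen only to exercise the arithmetic.
example = SelfHostSetup(
    gpu_hourly_usd=2.0,
    gpus=8,
    tokens_per_sec_per_gpu=1500.0,
    utilisation=0.6,
    ops_monthly_usd=6000.0,
)
threshold = break_even_tokens_per_day(example, api_usd_per_million=1.0)
capacity = self_host_daily_capacity(example)
print(f"break-even: {threshold:,.0f} tokens/day; capacity: {capacity:,.0f} tokens/day")
```

If the break-even volume comes out above the cluster's daily capacity, the setup cannot pay for itself at that API price, however busy it is. Utilisation matters for the same reason: the cost is fixed, so idle hours push the effective price per token up.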

Operational burden

Inference servers fail. Model behaviour in production can drift after vLLM upgrades. Quantisation choices cost capability in non-obvious ways. Plan for at least one engineer working part-time on operations.
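One way to catch drift after an upgrade or a quantisation change is to replay a small set of saved prompts and compare the new outputs with answers you previously accepted. The sketch below assumes a hypothetical `generate` callable that wraps whatever inference server you run; it calls no real server API. The `drift_report` name, the 0.9 similarity cut-off, and the character-level comparison are illustrative choices, and a task-specific metric would usually serve better.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def drift_report(
    generate: Callable[[str], str],   # hypothetical wrapper around your server
    golden: Dict[str, str],           # prompt -> previously accepted output
    min_similarity: float = 0.9,      # illustrative cut-off, tune per task
) -> List[Tuple[str, float]]:
    """Return (prompt, similarity) for every prompt whose output drifted."""
    drifted = []
    for prompt, expected in golden.items():
        actual = generate(prompt)
        # Character-level similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, expected, actual).ratio()
        if score < min_similarity:
            drifted.append((prompt, score))
    return drifted


# Stand-in for a real model call, used here only to exercise the check.
def fake_generate(prompt: str) -> str:
    canned = {"2+2=": "4", "Capital of France?": "Lyon"}
    return canned[prompt]


golden_set = {"2+2=": "4", "Capital of France?": "Paris"}
report = drift_report(fake_generate, golden_set)
for prompt, score in report:
    print(f"drifted: {prompt!r} (similarity {score:.2f})")
```

Run a check like this with deterministic decoding settings before and after each upgrade. With sampling enabled, outputs vary between runs and the comparison produces false alarms.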

Privacy gain

Privacy is the most common honest reason to self-host: data never leaves the VPC. Regulated industries (healthcare, banking, defence) often have no other option.

Frequently asked

Which open-weights model right now?

Llama 4 if you need scale, Qwen 3 if you need multilingual, DeepSeek R1 if you need reasoning.
