Field identification
Deployment: APIs, dedicated, self-hosted
Where the model runs changes everything about cost, latency, privacy, and operational burden.
By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026
Frontier models run on the provider's infrastructure. Open-weights models run wherever you have a GPU. The choice is not purely technical: it is a contract about privacy, cost, and the operations work you are willing to take on.
Shared API
The default. Pay per token, with no infrastructure to manage. Latency varies with load. Suitable for any workload that isn't bound by privacy or cost constraints.
Provisioned throughput
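Anthropic and OpenAI both offer reserved capacity for a committed fee, directly or through cloud partners; product names and terms differ by provider. It becomes cost-effective above a volume threshold set by the committed price and the per-token rate it replaces (typically tens of millions of tokens per day).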
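The volume threshold is simple arithmetic: it is the daily token count at which pay-per-token spend equals the flat fee. A minimal sketch, assuming a single blended per-token price; the function name and both dollar figures are hypothetical, not any provider's actual pricing.

```python
def break_even_tokens_per_day(flat_monthly_usd: float,
                              price_per_million_usd: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which a flat monthly fee equals pay-per-token spend."""
    if price_per_million_usd <= 0:
        raise ValueError("price_per_million_usd must be positive")
    # Tokens the flat fee would buy in a month at the pay-per-token rate.
    monthly_tokens = flat_monthly_usd / price_per_million_usd * 1_000_000
    return monthly_tokens / days_per_month


# Hypothetical figures, for illustration only.
threshold = break_even_tokens_per_day(flat_monthly_usd=6_000.0,
                                      price_per_million_usd=5.0)
print(f"Break-even: {threshold:,.0f} tokens/day")
```

Below the threshold the shared API is cheaper; above it the reservation is. Real quotes usually price input and output tokens separately, so a blended rate is only a first approximation.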
Dedicated deployment
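Amazon Bedrock, Google Vertex AI, Azure OpenAI: the model is served by your cloud provider and consumed from your cloud account, under your existing billing and access controls. With private endpoints, traffic can stay off the public internet, though the model itself still runs on infrastructure the cloud provider manages. Costs can be higher per token but are more predictable.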
Self-hosted open-weights
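Llama 4, Qwen 3, DeepSeek. Run them on your own GPUs, or use a hosted inference provider such as Together, Fireworks, or Groq. On your own hardware you get full control and carry the full operational burden. Becomes cost-effective at very high volume, and may be required when privacy demands it.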
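Whether self-hosting pays off depends largely on how busy the GPUs are. A rough sketch of the effective cost per million tokens for hourly-billed hardware; the function name, the hourly price, and the throughput figure are all assumptions for illustration, and the calculation leaves out engineering time.

```python
def self_host_cost_per_million(gpu_hourly_usd: float,
                               tokens_per_second: float,
                               utilisation: float) -> float:
    """Effective USD per million tokens for a GPU billed by the hour.

    utilisation is the fraction of each hour spent serving real traffic (0-1].
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical hardware: $4/hour, 2,000 tokens/second at full load.
for utilisation in (0.10, 0.25, 1.00):
    cost = self_host_cost_per_million(4.0, 2000.0, utilisation)
    print(f"utilisation {utilisation:.0%}: ${cost:.2f} per million tokens")
```

The cost per token scales inversely with utilisation, which is why the same hardware can look cheap at sustained high volume and expensive for bursty traffic.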
Tells
| Marker | Meaning |
|---|---|
| Latency varies day-to-day on a shared API | Shared-tenant load. Provisioned throughput or dedicated will stabilise it. |
| GPU bill exceeds equivalent API spend | Volume is too low to justify self-hosting yet; reconsider. |
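The first tell can be checked with your own request logs. A minimal sketch, assuming you already record per-request latency in milliseconds grouped by day; the function names and sample values are hypothetical, and what counts as too much variation is left to you.

```python
def percentile_95(samples: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def daily_p95_spread(samples_by_day: dict[str, list[float]]) -> float:
    """Ratio of the worst day's p95 latency to the best day's."""
    daily = [percentile_95(samples) for samples in samples_by_day.values()]
    return max(daily) / min(daily)


# Hypothetical latency samples in milliseconds.
logs = {
    "mon": [400.0, 410.0, 420.0, 430.0],
    "tue": [800.0, 820.0, 790.0, 1500.0],
}
print(f"p95 spread across days: {daily_p95_spread(logs):.2f}x")
```

A spread close to 1 means latency is stable from day to day; a large spread on a shared API points to shared-tenant load rather than anything in your own stack.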
Frequently asked
Where does my data go on a shared API?
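Generally not used for training by default (Anthropic, OpenAI), though requests may still be stored for a limited period, for example for abuse monitoring. Read each provider's terms. Enterprise agreements often offer explicit limited-retention or no-retention terms.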
Can I self-host Claude?
No. The weights are not released.