Deployment: APIs, dedicated, self-hosted

By C.W. Jameson · Published 19 May 2026 · Last reviewed 19 May 2026

Frontier models run on the provider's infrastructure. Open-weights models run wherever you have a GPU. The choice is less a technical one than a contract about privacy, cost, and how much operations work you are willing to take on.

Shared API

The default. Pay per token, no infrastructure. Latency varies with load on the shared service. Suitable for any workload that isn't bound by privacy or cost constraints.

Provisioned throughput

Anthropic and OpenAI both offer reserved capacity at a flat monthly rate. It becomes cost-effective once steady usage passes the volume at which the flat fee undercuts the equivalent per-token bill (typically tens of millions of tokens per day).
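The threshold is simple arithmetic, sketched below. The function name and the dollar figures are hypothetical, chosen only to illustrate the calculation. It also assumes a single blended per-token price, whereas real price lists usually charge input and output tokens differently.

```python
def break_even_tokens_per_day(flat_monthly_usd: float,
                              price_per_million_usd: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which a flat monthly fee equals per-token spend."""
    if price_per_million_usd <= 0:
        raise ValueError("price_per_million_usd must be positive")
    # Tokens the flat fee would buy at the pay-per-token price.
    monthly_tokens = flat_monthly_usd / price_per_million_usd * 1_000_000
    return monthly_tokens / days_per_month


# Hypothetical figures: a $9,000/month reservation vs $10 per million tokens.
threshold = break_even_tokens_per_day(9_000, 10.0)
print(f"{threshold:,.0f} tokens/day")  # prints "30,000,000 tokens/day"
```

Below that daily volume the shared API is cheaper; above it the reservation is, provided the traffic is steady enough to use the capacity.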

Dedicated deployment

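Amazon Bedrock, Google Vertex AI, Azure OpenAI: the model is provisioned through your own cloud account and governed by its access controls. With private endpoints configured, requests do not have to leave your VPC over the public internet. Costs can be higher per token but are predictable.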

Self-hosted open-weights

Llama 4, Qwen 3, DeepSeek. Run them on your own GPUs, or rent hosted inference from Together, Fireworks, or Groq. Running them yourself means full control and full operational burden. It pays off at very high volume, and is sometimes the only option when privacy demands it.
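Whether volume is "very high" enough depends mostly on how busy the GPUs stay. A minimal sketch of the effective price per million tokens follows. The function name, the hourly rate, and the throughput figure are assumptions for illustration, not measurements, and the sketch ignores engineering and on-call time.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float,
                                 tokens_per_second: float,
                                 utilisation: float) -> float:
    """Effective USD per million tokens for a GPU billed by the hour.

    utilisation is the fraction of billed time spent serving traffic.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    # Tokens actually served during one billed hour.
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical: a $4/hour GPU sustaining 2,000 tokens/s when busy.
for utilisation in (0.1, 0.5, 0.9):
    cost = self_hosted_cost_per_million(4.0, 2000, utilisation)
    print(f"utilisation {utilisation:.0%}: ${cost:.2f} per million tokens")
```

Comparing the result with the API's per-million price shows where self-hosting starts to win. Idle hours are billed too, so low utilisation raises the effective price quickly.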

Tells

Marker	Meaning
Latency varies day-to-day on a shared API	Shared-tenant load. Provisioned throughput or dedicated will stabilise it.
GPU bill exceeds equivalent API spend	Self-host scale isn't there yet; reconsider.
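The first tell can be checked from request logs. The sketch below compares each day's 95th-percentile latency and reports the ratio of the worst day to the best. The function names and sample values are hypothetical, and what ratio counts as "varies" is a judgment call.

```python
from statistics import quantiles


def daily_p95(latencies_ms: list[float]) -> float:
    """95th percentile of one day's request latencies (needs 2+ samples)."""
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(latencies_ms, n=20)[-1]


def day_to_day_spread(days: dict[str, list[float]]) -> float:
    """Ratio of the worst daily p95 to the best; 1.0 means perfectly stable."""
    p95s = [daily_p95(samples) for samples in days.values()]
    return max(p95s) / min(p95s)


# Hypothetical samples in milliseconds, keyed by day.
samples = {
    "mon": [420.0, 450.0, 480.0, 510.0, 530.0],
    "tue": [430.0, 900.0, 1200.0, 1500.0, 1700.0],
}
print(f"p95 spread: {day_to_day_spread(samples):.1f}x")
```

A spread that stays near 1.0 on days with similar traffic suggests the variation is in your own workload rather than in shared-tenant load.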

Frequently asked

Where does my data go on a shared API?

API data is generally not used for training by default (Anthropic, OpenAI), but read each provider's terms, since retention periods differ. Many enterprise tiers offer explicit no-retention agreements.

Can I self-host Claude?

No. The weights are not released.
