Three diverging AI deployment pathways — API, hosted, and self-hosted

1.1 — The Three Deployment Models

NovaSpark

Priya walks you through the three teams.

Team Alpha, the support chatbot team, is calling OpenAI's API directly. Their code does something like: "Send this customer message to GPT-4o, get back a reply, show it to the customer." They pay per API call. No infrastructure to manage. No GPU to provision. Just a credit card and an API key.

Team Beta, the internal search team, is using AWS Bedrock to run Llama 3.1. Same idea as Team Alpha, except the model is open-source and hosted on AWS's infrastructure. They still pay per call, but at a lower per-token rate than OpenAI's GPT-4o.

Team Gamma, the ML research team, actually bought GPU capacity. Four EC2 p4d instances, reserved for 12 months. They're running their own fine-tuned model on hardware they control. Fixed monthly cost, regardless of how much (or how little) they use it.

Three teams. Three completely different cost structures. Same output: AI responses.

The question Priya needs answered first is simple: which structure is right for which workload? Because Team Gamma is paying $156K/month for hardware that runs at 40% utilization. And Team Alpha just had a Tuesday where their chatbot cost 3× a normal day, and nobody knows why.

Let's start with the fundamentals.

The Three AI Deployment Models

Every AI workload sits in one of three infrastructure categories. Understanding which category you're looking at determines how you measure it, forecast it, optimize it, and govern it.

Model 1 — Closed-Source API (Third-Party)

You call an API. You pay for what you use. You have no access to the model weights, no control over the infrastructure, and no ability to run it anywhere other than the provider's servers.

Examples: OpenAI (GPT-4o, o1, o3), Anthropic (Claude Sonnet, Haiku, Opus), Google (Gemini 2.0 Flash, Gemini 2.5 Pro), Mistral AI.

How billing works: Per token — you pay separately for input tokens (the text you send) and output tokens (the text you receive back). Price is quoted per million tokens. More on this in Section 1.2.
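As a rough sketch of that arithmetic, the Python snippet below computes a per-token bill. The function name and parameters are illustrative, and the prices are the ones printed on Invoice A later in this lesson, not a statement of any provider's current list price.

```python
def token_cost(input_tokens, output_tokens,
               input_price_per_million, output_price_per_million):
    """Return the dollar cost of a billing period under per-token pricing."""
    # Input and output tokens are metered and priced separately.
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Token counts and prices copied from Invoice A in this lesson (assumed,
# not live pricing).
bill = token_cost(4_200_000, 890_000, 5.00, 15.00)
print(f"${bill:.2f}")  # prints $34.35
```

Notice that the output tokens are about a fifth of the input volume here but contribute well over a third of the bill, because they are priced three times higher.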

When this model makes economic sense:

Workloads under roughly 1 million tokens per day
Teams that need to move fast without managing infrastructure
Proof-of-concept and early-stage products
When model quality matters more than cost per token

When it doesn't:

High-volume, predictable workloads where per-token costs compound quickly
When data privacy requirements prohibit sending data to third-party endpoints
When you need to fine-tune or modify the model itself

Model 2 — Third-Party Hosted Open Source

An open-source model (one whose weights are publicly available) running on someone else's managed infrastructure. You still call an API. You still pay per token. But the model is open-source, so the per-token rate is typically lower than closed-source alternatives.

Examples: AWS Bedrock (Llama 3, Mistral), Azure AI Foundry (Llama, Phi), Google Vertex AI (Llama, Gemma), Together.ai, Groq, Replicate.

How billing works: Same token-based pricing as closed-source APIs, but usually 5–20× cheaper per token for comparable model sizes. Some providers also offer provisioned throughput (sometimes sold as PTUs, Provisioned Throughput Units): reserved capacity with predictable billing.

When this model makes economic sense:

Mid-scale workloads: roughly 1 million to 100 million tokens per day (about 30 million to 3 billion tokens per month)
When you want open-source flexibility without managing GPU infrastructure
When you need predictable pricing via provisioned throughput
When data residency requirements can be met by choosing specific cloud regions

When it doesn't:

When volume is high enough that managing your own GPU fleet becomes cheaper
When you need maximum customization (deeper fine-tuning, model modifications)

Model 3 — Self-Hosted / DIY

You own (or rent long-term) the GPU hardware. You run the model yourself. You manage the infrastructure, the scaling, the uptime, and the MLOps pipeline.

Examples: EC2 p4d/p5 or g5 instances (AWS), A100/H100 VMs (Azure), TPU v4/v5 (Google Cloud), on-premises GPU servers.

How billing works: Fixed capacity cost, billed per instance-hour or as an annual reserved-instance commitment, regardless of how many tokens you process. If your utilization is low, your effective per-token cost skyrockets. If your utilization is high, your effective per-token cost drops dramatically.
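To make the utilization effect concrete, here is a minimal sketch of the effective per-token cost of a fixed-cost fleet. The monthly cost is Team Gamma's figure from the scenario. The peak capacity of 50 billion tokens per month is a made-up number for illustration only, since the lesson does not state the fleet's throughput.

```python
def effective_cost_per_million(monthly_fixed_cost, peak_tokens_per_month,
                               utilization):
    """Effective dollars per 1M tokens for a fleet with a fixed monthly cost."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be greater than 0 and at most 1")
    # The bill is the same at any utilization; only the token count changes.
    tokens_processed = peak_tokens_per_month * utilization
    return monthly_fixed_cost / tokens_processed * 1_000_000


MONTHLY_COST = 156_000         # Team Gamma's monthly spend (from the scenario)
PEAK_TOKENS = 50_000_000_000   # hypothetical capacity at 100% utilization

for utilization in (0.4, 0.7, 1.0):
    cost = effective_cost_per_million(MONTHLY_COST, PEAK_TOKENS, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f} per 1M tokens")
```

Whatever capacity you plug in, the relationship is the same: halving utilization doubles the effective per-token cost.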

When this model makes economic sense:

Very high volume: above roughly 100 million tokens per day sustained
When you need to fine-tune models extensively or modify architecture
When GPU utilization can be maintained above 60–70% continuously
When data cannot leave your infrastructure under any circumstances

When it doesn't:

At low to moderate token volumes — the fixed cost makes it far more expensive than pay-as-you-go APIs
When you don't have an MLOps team to manage the infrastructure

The Economic Crossover

Think of it like car ownership. A taxi (closed-source API) is expensive per mile but has zero fixed cost — perfect if you travel occasionally. A rental car (hosted open-source) is cheaper per mile with moderate commitment. Buying a car (self-hosted) is cheapest per mile but only if you drive enough to justify the fixed cost.

NovaSpark's Team Gamma paid $156K last month for GPUs running at 40% utilization. At that utilization rate, they are in the most expensive quadrant possible — high fixed cost, low volume output. Their effective per-token cost is higher than if they'd just used OpenAI's API.
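The crossover itself is simple arithmetic: it is the volume at which a per-token bill equals the fixed cost. The sketch below shows the calculation with hypothetical inputs (a $20,000/month fleet and a blended $2.00 per 1M tokens). Using one blended price instead of separate input and output rates is a simplification, and the result only matters if the fleet can actually serve that volume.

```python
def break_even_tokens_per_month(monthly_fixed_cost, price_per_million_tokens):
    """Monthly token volume at which per-token billing equals a fixed cost."""
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return monthly_fixed_cost / price_per_million_tokens * 1_000_000


# Hypothetical figures for illustration only.
fixed_cost = 20_000      # dollars per month for the self-hosted fleet
blended_price = 2.00     # dollars per 1M tokens on a pay-per-token API

volume = break_even_tokens_per_month(fixed_cost, blended_price)
print(f"{volume:,.0f} tokens/month")  # prints 10,000,000,000 tokens/month
```

Below the break-even volume the pay-per-token option is cheaper; above it, the fixed-cost fleet wins, provided utilization stays high.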

Key Concepts

Closed-Source API

A third-party hosted model you access via API, paying per token with no access to model weights or infrastructure control.

Third-Party Hosted Open Source

An open-source model running on a cloud provider's managed infrastructure, billed per token at rates typically 5–20× cheaper than closed-source.

Self-Hosted / DIY

You rent or own GPU hardware and run the model yourself, paying a fixed capacity cost regardless of token volume processed.

Economic Crossover

The volume threshold where one deployment model becomes cheaper than another — roughly 1M tokens/day for API vs. hosted, 100M tokens/day for hosted vs. self-hosted.

FinOps Foundation Source

Choosing an AI Approach and Infrastructure Strategy, FinOps Foundation Working Group — finops.org/wg/choosing-an-ai-approach-and-infrastructure-strategy/

Exam Tip

The FinOps for AI exam tests your ability to identify which deployment model is appropriate for a given scenario, not just name them. Know the economic crossover points: roughly 1M tokens/day (API vs. hosted open-source), and roughly 100M tokens/day (hosted vs. self-hosted). Scenarios often involve a team using the wrong model for its volume.
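One way to practice is to encode the two crossover points as a volume-only rule of thumb. The sketch below is a study aid, not an official decision procedure: the function and constant names are invented here, the thresholds are the approximate figures from this lesson, and it deliberately ignores the other factors the lesson lists (data privacy, fine-tuning needs, GPU utilization, MLOps capacity).

```python
# Approximate crossover points from this lesson, in tokens per day.
API_TO_HOSTED = 1_000_000
HOSTED_TO_SELF_HOSTED = 100_000_000


def suggest_deployment_model(tokens_per_day):
    """Suggest a deployment model from sustained daily token volume alone."""
    if tokens_per_day < 0:
        raise ValueError("tokens_per_day cannot be negative")
    if tokens_per_day < API_TO_HOSTED:
        return "closed-source API"
    if tokens_per_day < HOSTED_TO_SELF_HOSTED:
        return "third-party hosted open source"
    return "self-hosted"


for volume in (200_000, 25_000_000, 400_000_000):
    print(f"{volume:,} tokens/day -> {suggest_deployment_model(volume)}")
```

In an exam scenario, compare the result with what the team is actually running, then check whether a non-volume factor such as data residency overrides it.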

Match the Invoice

NovaSpark received three invoices this month. Each comes from a different deployment model. Drag each invoice to the correct category.

Invoices

Invoice A — OpenAI

Line items: gpt-4o input tokens: 4,200,000 @ $5.00/1M = $21.00 | gpt-4o output tokens: 890,000 @ $15.00/1M = $13.35 | Total: $34.35

Invoice B — AWS

Line items: Amazon Bedrock — meta.llama3-1-70b-instruct — 8,400,000 input tokens @ $0.72/1M = $6.05 | 1,780,000 output tokens @ $0.72/1M = $1.28 | Total: $7.33

Invoice C — AWS

Line items: EC2 p4d.24xlarge × 4 instances, 720 hours @ $32.77/hr = $94,378 | EBS gp3 storage 4TB = $320 | Data transfer out 12TB = $1,080 | Total: $95,778