Foundational
What Are We Even Paying For?
The Fundamentals of AI Cost Models
1.1 — The Three AI Deployment Models
Every AI workload sits in one of three infrastructure categories. Understanding which category you're looking at determines how you measure it, forecast it, optimize it, and govern it.
Model 1 — Closed-Source API (Third-Party)
You call an API. You pay for what you use. You have no access to the model weights, no control over the infrastructure, and no ability to run it anywhere other than the provider's servers.
Examples: OpenAI (GPT-4o, o1, o3), Anthropic (Claude Sonnet, Haiku, Opus), Google (Gemini 2.0 Flash, Gemini 2.5 Pro), Mistral AI.
How billing works: Per token — you pay separately for input tokens (the text you send) and output tokens (the text you receive back). Price is quoted per million tokens. More on this in Section 1.2.
When this model makes economic sense:
- Workloads under roughly 1 million tokens per day
- Teams that need to move fast without managing infrastructure
- Proof-of-concept and early-stage products
- When model quality matters more than cost per token
When it doesn't:
- High-volume, predictable workloads where per-token costs compound quickly
- When data privacy requirements prohibit sending data to third-party endpoints
- When you need to fine-tune or modify the model itself
Model 2 — Third-Party Hosted Open Source
An open-source model (one whose weights are publicly available) running on someone else's managed infrastructure. You still call an API. You still pay per token. But the model is open-source, so the per-token rate is typically lower than closed-source alternatives.
Examples: AWS Bedrock (Llama 3, Mistral, Titan), Azure AI Foundry (Llama, Phi), Google Vertex AI (Llama, Gemma), Together.ai, Groq, Replicate.
How billing works: Same token-based pricing as closed-source APIs, but usually 5–20× cheaper per token for comparable model sizes. Some providers also offer PTUs (Provisioned Throughput Units) — reserved capacity with predictable billing.
When this model makes economic sense:
- Mid-scale workloads: roughly 1 billion to 10 billion tokens per month
- When you want open-source flexibility without managing GPU infrastructure
- When you need predictable pricing via provisioned throughput
- When data residency requirements can be met by choosing specific cloud regions
When it doesn't:
- When volume is high enough that managing your own GPU fleet becomes cheaper
- When you need maximum customization (deeper fine-tuning, model modifications)
Model 3 — Self-Hosted / DIY
You own (or rent long-term) the GPU hardware. You run the model yourself. You manage the infrastructure, the scaling, the uptime, and the MLOps pipeline.
Examples: EC2 p4d/p5 or g5 instances (AWS), A100/H100 VMs (Azure), TPU v4/v5 (Google Cloud), on-premises GPU servers.
How billing works: Fixed capacity cost — instance per hour or reserved instance annual commitment — regardless of how many tokens you process. If your utilization is low, your effective per-token cost skyrockets. If your utilization is high, your effective per-token cost drops dramatically.
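The utilization math can be sketched directly. A minimal sketch, assuming hypothetical numbers: the hourly rate and throughput below are illustrative placeholders, not quoted vendor prices.

```python
# Sketch: effective per-token cost for self-hosted capacity.
# All numbers are hypothetical placeholders, not quoted vendor prices.

def effective_cost_per_million_tokens(
    hourly_rate: float,              # fixed cost per GPU-hour
    tokens_per_hour_at_full: float,  # throughput at 100% utilization
    utilization: float,              # fraction of capacity actually used
) -> float:
    """Fixed cost divided by tokens actually processed, scaled to per-1M."""
    tokens_processed = tokens_per_hour_at_full * utilization
    return hourly_rate / tokens_processed * 1_000_000

# Hypothetical GPU node: $40/hour, 50M tokens/hour at full load.
print(effective_cost_per_million_tokens(40.0, 50_000_000, 0.90))  # high utilization
print(effective_cost_per_million_tokens(40.0, 50_000_000, 0.20))  # low utilization
```

Under these made-up numbers, dropping from 90% to 20% utilization raises the effective per-1M-token cost about 4.5×, with no change in hardware spend.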
When this model makes economic sense:
- Very high volume: above roughly 100 million tokens per day sustained
- When you need to fine-tune models extensively or modify architecture
- When GPU utilization can be maintained above 60–70% continuously
- When data cannot leave your infrastructure under any circumstances
When it doesn't:
- At low to moderate token volumes — the fixed cost makes it far more expensive than pay-as-you-go APIs
- When you don't have an MLOps team to manage the infrastructure
The Economic Crossover
Think of it like car ownership. A taxi (closed-source API) is expensive per mile but has zero fixed cost — perfect if you travel occasionally. A rental car (hosted open-source) is cheaper per mile with moderate commitment. Buying a car (self-hosted) is cheapest per mile but only if you drive enough to justify the fixed cost.
NovaSpark's Team Gamma paid $156K last month for GPUs running at 40% utilization. At that utilization rate, they are in the most expensive quadrant possible — high fixed cost, low volume output. Their effective per-token cost is higher than if they'd just used OpenAI's API.
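The crossover can be estimated with a break-even calculation. In this sketch, only the $156K fixed spend comes from the scenario above; the $4-per-1M-token blended API rate is a hypothetical assumption.

```python
# Sketch: monthly token volume at which a fixed-cost GPU fleet matches
# pay-per-token API spend. The API rate is an illustrative assumption.

def break_even_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Tokens per month at which fixed capacity and API spend are equal."""
    return fixed_monthly_cost / api_price_per_million * 1_000_000

# Team Gamma's fixed spend: $156,000/month.
# Hypothetical blended API rate: $4.00 per 1M tokens.
tokens = break_even_tokens(156_000, 4.0)
print(f"{tokens / 1e9:.0f}B tokens/month to break even")
```

Below that volume, the fixed capacity costs more per token than the API would; above it, self-hosting wins, provided utilization stays high.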
Choosing an AI Approach and Infrastructure Strategy, FinOps Foundation Working Group — finops.org/wg/choosing-an-ai-approach-and-infrastructure-strategy/
The FinOps for AI exam tests your ability to identify which deployment model is appropriate for a given scenario — not just name them. Know the economic crossover points: roughly 1M tokens/day (API vs. hosted open-source), and roughly 100M tokens/day (hosted vs. self-hosted). Scenarios often involve a team at the wrong model for their volume.
1.2 — Tokens: The New Unit of Compute
What Is a Token?
A token is the basic unit of text that a language model processes. Not a word — a piece of a word, a word, or sometimes multiple short words together.
A rough rule of thumb: 1 token ≈ 0.75 words in English. More precisely:
- "NovaSpark" → 2 tokens (Nova + Spark)
- "the" → 1 token
- "AI" → 1 token
- "cost" → 1 token
- "optimization" → 2–4 tokens depending on the tokenizer (e.g., optim + ization)
- A typical business email (300 words) ≈ 400 tokens
- A detailed system prompt (800 words) ≈ 1,066 tokens
- A full legal contract (10,000 words) ≈ 13,333 tokens
Different models tokenize text slightly differently. OpenAI's models use the tiktoken tokenizer. Anthropic's Claude models use a different tokenizer. The 0.75 words-per-token ratio is a useful approximation, not an exact conversion.
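For back-of-envelope forecasting, the 0.75 words-per-token rule of thumb can be encoded as a tiny helper. Exact, billing-accurate counts require the provider's own tokenizer (such as tiktoken for OpenAI models); this heuristic is for estimates only.

```python
# Rough token estimator using the 0.75 words-per-token rule of thumb.
# Not billing-accurate: use the provider's tokenizer for real counts.

def estimate_tokens(word_count: int) -> int:
    return round(word_count / 0.75)

print(estimate_tokens(300))     # typical business email -> 400
print(estimate_tokens(10_000))  # full legal contract -> 13333
```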
Input Tokens vs. Output Tokens
Every API call has two parts:
Input tokens — everything you send to the model:
- The system prompt (instructions for how the model should behave)
- The conversation history (all previous messages in a multi-turn chat)
- The current user message
- Any documents or context you've injected (RAG retrieval results, file contents)
Output tokens — everything the model sends back:
- The model's response
This distinction matters because output tokens cost more than input tokens — typically 3× to 8× more, depending on the model.
Why the premium? Generating each output token requires a full forward pass through the model. Reading input tokens is comparatively cheap (the model processes them in parallel). Writing output tokens is sequential — the model generates one token at a time, each dependent on the previous. That sequential computation is why providers charge a premium.
The Token Cost Formula
Cost = (Input Tokens / 1,000,000 × Input Price per 1M)
     + (Output Tokens / 1,000,000 × Output Price per 1M)

Current benchmark pricing (February 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output/Input ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4× |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5× |
| Gemini 2.0 Flash | $0.10 | $0.40 | 4× |
| GPT-4o mini | $0.15 | $0.60 | 4× |
| Claude 3 Haiku | $0.25 | $1.25 | 5× |
| Llama 3.1 70B (Bedrock) | $0.72 | $0.72 | 1× |
| Llama 3.1 8B (Groq) | $0.05 | $0.08 | 1.6× |
Prices change frequently — always verify against provider documentation before forecasting.
A Worked Example
NovaSpark's support chatbot receives a customer message:
- System prompt: 600 tokens
- Conversation history (3 prior turns): 800 tokens
- Current user message: 45 tokens
- Total input: 1,445 tokens
The model responds:
- Response: 220 tokens
- Total output: 220 tokens
At GPT-4o pricing ($2.50 input / $10.00 output per 1M tokens):
Input cost: 1,445 / 1,000,000 × $2.50 = $0.0036
Output cost: 220 / 1,000,000 × $10.00 = $0.0022
Total per call: $0.0058

That feels tiny. But NovaSpark's chatbot handles 180,000 conversations per month:

$0.0058 × 180,000 = $1,044/month

Now the product team adds a richer, more detailed system prompt — 2,400 tokens instead of 600:
New input: 2,400 + 800 + 45 = 3,245 tokens
Input cost: 3,245 / 1,000,000 × $2.50 = $0.0081
Output cost unchanged: $0.0022
New total per call: $0.0103
$0.0103 × 180,000 = $1,854/month

A 1,800-token increase to the system prompt → $810/month in extra costs. Multiply that across a larger chatbot or a higher-traffic product, and a single engineering commit can add tens of thousands to the monthly bill.
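The arithmetic above can be reproduced with a few lines of Python implementing the token cost formula from this section:

```python
# The token cost formula applied to the chatbot example above.

def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# GPT-4o pricing from the table: $2.50 input / $10.00 output per 1M tokens.
before = call_cost(1_445, 220, 2.50, 10.00)  # original 600-token system prompt
after = call_cost(3_245, 220, 2.50, 10.00)   # richer 2,400-token system prompt

print(round(before, 4), round(after, 4))      # per-call costs
print(round((after - before) * 180_000))      # monthly delta at 180K conversations
```

The monthly delta reproduces the $810 figure: the extra prompt tokens cost $0.0045 per call, multiplied by 180,000 calls.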
Why This Changes How You Think About Costs
In traditional cloud FinOps, cost scales with infrastructure decisions — instance sizes, storage tiers, network throughput. These are relatively stable and predictable.
In AI FinOps, cost scales with content decisions — what you put in your prompts, how long your conversations are, how verbose your model's responses are. A developer making what looks like a UX change (longer, more helpful responses) is simultaneously making a cost decision. Most developers don't know this yet. Your job is to help bridge that gap.
GenAI FinOps: How Token Pricing Really Works, FinOps Foundation Working Group — finops.org/wg/genai-finops-how-token-pricing-really-works/
The token cost formula is almost certainly on the exam — both as a direct calculation question and embedded in scenario questions. Memorize: Cost = (Input / 1M × Input price) + (Output / 1M × Output price). Also know why output costs more than input (sequential generation vs. parallel processing).
1.3 — The Context Window Tax
Why APIs Are Stateless
Language models don't have persistent memory between API calls. Each call is completely independent. The model doesn't "remember" that it talked to this user five minutes ago. To create the experience of a continuous conversation, your application code must resend the entire conversation history with every new message.
This is a fundamental architectural reality of how current LLM APIs work. It is not a bug or a limitation that will be patched — it is the design.
The Cost Growth Pattern
Consider a 10-turn customer support conversation. Each turn, the token count grows:
| Turn | New tokens added | Cumulative input tokens sent | Input cost at GPT-4o ($2.50/1M) |
|---|---|---|---|
| 1 | 100 (user msg) | 700 (system + msg) | $0.0018 |
| 2 | 250 (user + response) | 950 | $0.0024 |
| 3 | 250 | 1,200 | $0.0030 |
| 5 | 250 | 1,700 | $0.0043 |
| 10 | 250 | 2,950 | $0.0074 |
| 20 | 250 | 5,450 | $0.0136 |
Turn 20 costs 8× more than Turn 1 — not because the user's message is longer, but because the history is. A customer who has a long back-and-forth with your chatbot costs significantly more to serve than one who resolves their issue in two messages.
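The growth pattern can be sketched as a function of turn number, assuming a 600-token system prompt, a 100-token opening message, and roughly 250 new tokens (user message plus prior response) per subsequent turn:

```python
# Sketch of the compounding context cost, under the assumptions above.

def input_tokens_at_turn(n: int, system: int = 600, first_msg: int = 100,
                         per_turn: int = 250) -> int:
    """Cumulative input tokens resent on turn n of a conversation."""
    return system + first_msg + per_turn * (n - 1)

def input_cost(tokens: int, price_per_m: float = 2.50) -> float:
    """Input-side cost at an assumed $2.50 per 1M tokens."""
    return tokens / 1_000_000 * price_per_m

for turn in (1, 10, 20):
    t = input_tokens_at_turn(turn)
    print(turn, t, round(input_cost(t), 4))
```

Linear growth in turns produces linear growth in per-call cost, so total conversation cost grows quadratically with conversation length.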
What the Context Window Tax Means in Practice
For NovaSpark's chatbot — 180,000 conversations per month, average 8 turns:
- Without context management: average cost ~$0.012/conversation = $2,160/month
- With context trimming (keep last 4 turns): average cost ~$0.007/conversation = $1,260/month
- Savings from one architectural change: $900/month, $10,800/year
For a higher-volume product — 2 million conversations/month:
- Same optimization: $120,000/year in savings
Three Mitigation Approaches
1. Context windowing — Keep only the last N turns of conversation history. Discard older turns. Simple to implement, slight UX risk if conversations reference early context.
2. Summarization compression — Periodically summarize earlier turns into a compact summary, replacing the full transcript. Higher quality retention, moderate implementation complexity.
3. RAG-based memory — Store conversation history externally, retrieve only the semantically relevant parts for each new message. Most sophisticated, best UX, highest implementation cost.
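A minimal sketch of approach 1, context windowing: the function name and the flat message list are illustrative, not any particular SDK's API.

```python
# Minimal sketch of context windowing: resend the system prompt plus
# only the last N turns instead of the whole transcript.

def windowed_context(system_prompt: str, history: list[str],
                     keep_last_turns: int = 4) -> list[str]:
    """history alternates user/assistant messages; one turn = 2 messages."""
    return [system_prompt] + history[-2 * keep_last_turns:]

history = [f"msg {i}" for i in range(20)]  # a 10-turn conversation
ctx = windowed_context("You are a support agent.", history)
print(len(ctx))  # 1 system prompt + 8 messages = 9
```

The trade-off from the text applies: any reference to a discarded early turn is lost, which is why summarization or RAG-based memory may be worth the extra complexity.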
The FinOps Angle
The Context Window Tax is a cost pattern that comes from a product decision (stateful conversation UX), not from infrastructure choices. Engineers building chatbots are making cost decisions every time they choose how much history to include. FinOps practitioners need to work with engineers to surface these cost patterns — not as critique, but as shared visibility. Most engineers building chatbots have never run the compounding math on conversation length.
GenAI FinOps vs. Cloud FinOps, FinOps Foundation Working Group — finops.org/wg/genai-finops-vs-cloud-finops/
The Context Window Tax is tested as a "hidden cost" question and as an optimization scenario. Key facts: APIs are stateless by design; conversation history is resent with every call; cost grows with conversation length, not just volume.
1.4 — The Full Bill of Materials: What's Actually on Your AI Invoice
The API cost — the token charges you calculated in the previous section — is often less than half of the total cost of running an AI workload. Here are the five categories that make up the rest.
TCO questions appear in two forms: "What are the hidden cost components beyond API charges?" (knowledge) and "Why is this company's bill higher than expected?" (scenario). Know all five categories. Data egress is the most commonly missed and often the largest surprise.
1.5 — AI vs. Traditional Cloud: What's Different About Cost Governance
Traditional cloud FinOps is built around infrastructure — virtual machines, storage, databases, network. AI FinOps introduces a new layer: the cost of computation encoded in content.
| Dimension | Traditional Cloud | AI Workloads |
|---|---|---|
| Primary cost unit | CPU-hours, GB-hours, requests | Tokens (input + output) |
| What drives cost | Infrastructure decisions (instance size, storage class) | Content decisions (prompt length, response verbosity, conversation depth) |
| Who controls costs | Infrastructure and platform teams | Engineers, product managers, prompt engineers — anyone who touches prompts |
| Pricing model | Relatively stable, predictable tiers | Volatile: prices dropping ~10× per year (LLMflation); new model SKUs constantly |
| Idle cost | Significant (running but unused instances) | Minimal for API model; high for self-hosted (same as cloud) |
| Tagging and attribution | Mature tooling (AWS Cost Explorer, native tags) | Immature — shared API keys, non-standard units, limited vendor tooling |
| Forecasting | Trend analysis works well | Unreliable without understanding usage patterns AND price trajectory |
| Optimization levers | Right-sizing, Reserved Instances, Savings Plans | Prompt compression, model selection, caching, context windowing, quantization |
| Anomaly profile | Gradual drift, infrastructure scaling events | Sharp spikes from runaway loops, prompt changes, traffic events |
| Governance maturity | Well-established (FOCUS spec, native dashboards) | Emerging (FOCUS 1.2–1.3 adding AI support, tooling fragmented) |
What Transfers from Cloud FinOps
- Unit economics thinking (cost per unit of value delivered)
- Tagging and attribution discipline
- The Crawl-Walk-Run maturity model
- Showback and chargeback governance
- Budget alerts and anomaly detection concepts
- Cross-functional collaboration model (FinOps practitioner as bridge)
What Doesn't Transfer Directly
- Right-sizing has no equivalent — you don't pick an "instance size" for API calls; you pick a model and prompt strategy
- Reserved Instance savings logic doesn't apply to per-token billing (though Provisioned Throughput Units serve a similar role)
- Standard cost per request metrics ignore token volume, making comparisons misleading
- Tagging infrastructure at the API key level doesn't give you per-team or per-feature attribution without additional proxy or gateway tooling
The Practitioner's Mental Model Shift
In cloud FinOps, you ask: "What infrastructure are we running, and is it the right size?"
In AI FinOps, you ask: "What content are we processing, at what volume, with what model, through what architecture — and is every component justified by the value it delivers?"
GenAI FinOps vs. Cloud FinOps, FinOps Foundation Working Group — finops.org/wg/genai-finops-vs-cloud-finops/
The FinOps for AI exam tests this comparison directly. Know: (1) token vs. CPU-hour as cost units, (2) content decisions vs. infrastructure decisions as cost drivers, (3) why traditional right-sizing doesn't map to AI APIs, (4) what Provisioned Throughput Units replace in the AI context.