FinOps for AI
5 sections

Foundational

What Are We Even Paying For?

The Fundamentals of AI Cost Models

90 min · 5 sections · 10 quiz questions
Exam topics: AI Cost Models · Token Billing Mechanics · Deployment Infrastructure
NovaSpark
It's Monday morning, day three at NovaSpark. You've barely found the good coffee machine when Priya — VP Engineering, the person who hired you — drops a laptop on your desk and pulls up an AWS Billing console. The number at the top is $847,000. That's not the annual budget. That's last month. "AI spend," she says. "Up 340% from two months ago. Finance is asking questions. The board meeting is Friday." She slides a printed spreadsheet across the desk — three teams, three cost centers, all contributing to a single consolidated bill that looks like it was generated by three different companies using three different currencies. "I need to understand what we're paying for," Priya says. "Not the total. The mechanics. Where does this money actually go?" You look at the spreadsheet. Three lines catch your eye immediately. Line 1: "OpenAI API — gpt-4o — $214,440" Line 2: "AWS Bedrock — Llama 3.1 70B — $89,200" Line 3: "EC2 p4d.24xlarge (4× reserved) — $156,800" Same goal — run AI workloads — three completely different billing structures. This is where you start.

1.1 The Three Deployment Models

NovaSpark
Priya walks you through the three teams. Team Alpha — the support chatbot team — is calling OpenAI's API directly. Their code does something like: "Send this customer message to GPT-4o, get back a reply, show it to the customer." They pay per API call. No infrastructure to manage. No GPU to provision. Just a credit card and an API key. Team Beta — the internal search team — is using AWS Bedrock to run Llama 3.1. Same idea as Team Alpha, except the model is open-source and hosted on AWS's infrastructure. They still pay per call, but at a lower per-token rate than OpenAI's GPT-4o. Team Gamma — the ML research team — actually bought GPU capacity. Four EC2 p4d instances, reserved for 12 months. They're running their own fine-tuned model on hardware they control. Fixed monthly cost, regardless of how much (or how little) they use it. Three teams. Three completely different cost structures. Same output: AI responses. The question Priya needs answered first is simple: which structure is right for which workload? Because Team Gamma is paying $156K/month for hardware that runs at 40% utilization. And Team Alpha just had a Tuesday where their chatbot cost 3× a normal day — and nobody knows why. Let's start with the fundamentals.

The Three AI Deployment Models

Every AI workload sits in one of three infrastructure categories. Understanding which category you're looking at determines how you measure it, forecast it, optimize it, and govern it.


Model 1 — Closed-Source API (Third-Party)

You call an API. You pay for what you use. You have no access to the model weights, no control over the infrastructure, and no ability to run it anywhere other than the provider's servers.

Examples: OpenAI (GPT-4o, o1, o3), Anthropic (Claude Sonnet, Haiku, Opus), Google (Gemini 2.0 Flash, Gemini 2.5 Pro), Mistral AI.

How billing works: Per token — you pay separately for input tokens (the text you send) and output tokens (the text you receive back). Price is quoted per million tokens. More on this in Section 1.2.

When this model makes economic sense:

  • Workloads under roughly 1 million tokens per day
  • Teams that need to move fast without managing infrastructure
  • Proof-of-concept and early-stage products
  • When model quality matters more than cost per token

When it doesn't:

  • High-volume, predictable workloads where per-token costs compound quickly
  • When data privacy requirements prohibit sending data to third-party endpoints
  • When you need to fine-tune or modify the model itself

Model 2 — Third-Party Hosted Open Source

An open-source model (one whose weights are publicly available) running on someone else's managed infrastructure. You still call an API. You still pay per token. But the model is open-source, so the per-token rate is typically lower than closed-source alternatives.

Examples: AWS Bedrock (Llama 3, Mistral, Titan), Azure AI Foundry (Llama, Phi), Google Vertex AI (Llama, Gemma), Together.ai, Groq, Replicate.

How billing works: Same token-based pricing as closed-source APIs, but usually 5–20× cheaper per token for comparable model sizes. Some providers also offer PTUs (Provisioned Throughput Units) — reserved capacity with predictable billing.

When this model makes economic sense:

  • Mid-scale workloads: roughly 1 million to 100 million tokens per day (about 30 million to 3 billion per month)
  • When you want open-source flexibility without managing GPU infrastructure
  • When you need predictable pricing via provisioned throughput
  • When data residency requirements can be met by choosing specific cloud regions

When it doesn't:

  • When volume is high enough that managing your own GPU fleet becomes cheaper
  • When you need maximum customization (deeper fine-tuning, model modifications)

Model 3 — Self-Hosted / DIY

You own (or rent long-term) the GPU hardware. You run the model yourself. You manage the infrastructure, the scaling, the uptime, and the MLOps pipeline.

Examples: EC2 p4d/p5 or g5 instances (AWS), A100/H100 VMs (Azure), TPU v4/v5 (Google Cloud), on-premises GPU servers.

How billing works: Fixed capacity cost — instance per hour or reserved instance annual commitment — regardless of how many tokens you process. If your utilization is low, your effective per-token cost skyrockets. If your utilization is high, your effective per-token cost drops dramatically.

When this model makes economic sense:

  • Very high volume: above roughly 100 million tokens per day sustained
  • When you need to fine-tune models extensively or modify architecture
  • When GPU utilization can be maintained above 60–70% continuously
  • When data cannot leave your infrastructure under any circumstances

When it doesn't:

  • At low to moderate token volumes — the fixed cost makes it far more expensive than pay-as-you-go APIs
  • When you don't have an MLOps team to manage the infrastructure

The Economic Crossover

Think of it like car ownership. A taxi (closed-source API) is expensive per mile but has zero fixed cost — perfect if you travel occasionally. A rental car (hosted open-source) is cheaper per mile with moderate commitment. Buying a car (self-hosted) is cheapest per mile but only if you drive enough to justify the fixed cost.

NovaSpark's Team Gamma paid $156K last month for GPUs running at 40% utilization. At that utilization rate, they are in the most expensive quadrant possible — high fixed cost, low volume output. Their effective per-token cost is higher than if they'd just used OpenAI's API.
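The crossover arithmetic can be sketched in a few lines. Only the $156,800 invoice line and the 40% utilization come from the scenario; the fleet capacity figure is a made-up illustration, not a measured p4d throughput number.

```python
# Effective per-token cost of fixed-capacity hardware: the fixed bill
# divided by the tokens you actually push through it. The capacity
# figure below is a hypothetical illustration, NOT a p4d benchmark.

def effective_price_per_1m(monthly_cost: float,
                           capacity_tokens_per_month: float,
                           utilization: float) -> float:
    """Effective $ per 1M tokens for reserved GPU capacity."""
    tokens_processed = capacity_tokens_per_month * utilization
    return monthly_cost / tokens_processed * 1_000_000

# Team Gamma's invoice line: $156,800/month at 40% utilization.
# Hypothetical fleet capacity: 60 billion tokens/month at full load.
low = effective_price_per_1m(156_800, 60e9, 0.40)   # ~$6.53 per 1M
high = effective_price_per_1m(156_800, 60e9, 0.70)  # ~$3.73 per 1M
```

Under these assumed numbers, raising utilization from 40% to 70% nearly halves the effective per-token rate without changing the bill — which is why the "when it makes economic sense" criteria above hinge on sustained utilization, not on the sticker price of the hardware.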

FinOps Foundation Source

Choosing an AI Approach and Infrastructure Strategy, FinOps Foundation Working Group finops.org/wg/choosing-an-ai-approach-and-infrastructure-strategy/

Exam Tip

The FinOps for AI exam tests your ability to identify which deployment model is appropriate for a given scenario — not just name them. Know the economic crossover points: roughly 1M tokens/day (API vs. hosted open-source), and roughly 100M tokens/day (hosted vs. self-hosted). Scenarios often involve a team at the wrong model for their volume.

1.2 Tokens: The New Unit of Compute

NovaSpark
"Why did the chatbot cost three times as much on Tuesday?" You pull up Team Alpha's usage logs. Tuesday was a normal traffic day — same number of users, same number of conversations. But the bill was $890. Monday was $310. The difference: on Tuesday, the product team ran a test. They changed the system prompt — the set of instructions sent to the model with every conversation. The new system prompt was roughly 2,400 tokens long instead of the usual 600. And because it went out with every single API call that day, those extra 1,800 tokens per call multiplied across the day's traffic — roughly 232 million extra input tokens in total. At GPT-4o's $2.50 per million input tokens: about $580 in extra charges, the entire Monday-to-Tuesday jump. From a text edit. This is what makes AI cost management different from everything you've done before. The unit of cost isn't a server-hour or a request. It's a token. And until you understand exactly what a token is and why it's priced the way it is, the bills will keep surprising you.

Tokens: The New Unit of Compute


What Is a Token?

A token is the basic unit of text that a language model processes. Not a word — a piece of a word, a word, or sometimes multiple short words together.

A rough rule of thumb: 1 token ≈ 0.75 words in English. More precisely:

  • "NovaSpark" → 2 tokens (Nova + Spark)
  • "the" → 1 token
  • "AI" → 1 token
  • "cost" → 1 token
  • "optimization" → 2 tokens (optim + ization — exact splits vary by model)
  • A typical business email (300 words) ≈ 400 tokens
  • A detailed system prompt (800 words) ≈ 1,066 tokens
  • A full legal contract (10,000 words) ≈ 13,333 tokens

Different models tokenize text slightly differently. OpenAI's models use the tiktoken tokenizer. Anthropic's Claude models use a different tokenizer. The 0.75 words-per-token ratio is a useful approximation, not an exact conversion.
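A rough estimator built on that rule of thumb — an approximation only; for billing-accurate counts use the provider's own tokenizer (e.g. tiktoken for OpenAI models). The function name here is ours:

```python
# Rough token estimator using the ~0.75 words-per-token rule of thumb.
# This is a planning approximation, not a billing-accurate count.

def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    return round(word_count / words_per_token)

estimate_tokens(300)     # ≈ 400 tokens (typical business email)
estimate_tokens(10_000)  # ≈ 13,333 tokens (full legal contract)
```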


Input Tokens vs. Output Tokens

Every API call has two parts:

Input tokens — everything you send to the model:

  • The system prompt (instructions for how the model should behave)
  • The conversation history (all previous messages in a multi-turn chat)
  • The current user message
  • Any documents or context you've injected (RAG retrieval results, file contents)

Output tokens — everything the model sends back:

  • The model's response

This distinction matters because output tokens cost more than input tokens — typically 3× to 8× more, depending on the model.

Why the premium? Generating each output token requires a full forward pass through the model. Reading input tokens is comparatively cheap (the model processes them in parallel). Writing output tokens is sequential — the model generates one token at a time, each dependent on the previous. That sequential computation is why providers charge a premium.


The Token Cost Formula

Cost = (Input Tokens / 1,000,000 × Input Price per 1M)
     + (Output Tokens / 1,000,000 × Output Price per 1M)

Current benchmark pricing (February 2026):

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output/Input ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4× |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5× |
| Gemini 2.0 Flash | $0.10 | $0.40 | 4× |
| GPT-4o mini | $0.15 | $0.60 | 4× |
| Claude 3 Haiku | $0.25 | $1.25 | 5× |
| Llama 3.1 70B (Bedrock) | $0.72 | $0.72 | 1× |
| Llama 3.1 8B (Groq) | $0.05 | $0.08 | 1.6× |

Prices change frequently — always verify against provider documentation before forecasting.
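The formula translates directly into a small helper. The prices used here are the GPT-4o benchmark rates from the table; the function name is ours:

```python
# The token cost formula from this section, as a helper function.
# Prices are quoted per 1M tokens; verify current rates before use.

def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_1m: float, output_price_per_1m: float) -> float:
    return (input_tokens / 1_000_000 * input_price_per_1m
            + output_tokens / 1_000_000 * output_price_per_1m)

# A 1,445-token input / 220-token output call at GPT-4o rates:
cost = call_cost(1_445, 220, input_price_per_1m=2.50,
                 output_price_per_1m=10.00)  # ~$0.0058 per call
```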


A Worked Example

NovaSpark's support chatbot receives a customer message:

  • System prompt: 600 tokens
  • Conversation history (3 prior turns): 800 tokens
  • Current user message: 45 tokens
  • Total input: 1,445 tokens

The model responds:

  • Response: 220 tokens
  • Total output: 220 tokens

At GPT-4o pricing ($2.50 input / $10.00 output per 1M tokens):

Input cost:  1,445 / 1,000,000 × $2.50 = $0.0036
Output cost:   220 / 1,000,000 × $10.00 = $0.0022
Total per call: $0.0058

That feels tiny. But NovaSpark's chatbot handles 180,000 conversations per month:

$0.0058 × 180,000 = $1,044/month

Now the product team adds a richer, more detailed system prompt — 2,400 tokens instead of 600:

New input: 2,400 + 800 + 45 = 3,245 tokens
Input cost: 3,245 / 1,000,000 × $2.50 = $0.0081
Output cost unchanged: $0.0022
New total per call: $0.0103

$0.0103 × 180,000 = $1,854/month

A 1,800-token system prompt change → $810/month in extra costs. Multiply that across a larger chatbot or a higher-traffic product, and a single engineering commit can add tens of thousands to the monthly bill.
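The impact of a prompt change like this reduces to one multiplication, sketched here with the section's figures (1,800 extra input tokens per call, 180,000 monthly calls, GPT-4o's $2.50 input rate); the function name is ours:

```python
# Monthly cost delta from adding tokens to a prompt that ships with
# every API call. Only input price matters: the output is unchanged.

def monthly_prompt_delta(extra_input_tokens: int, calls_per_month: int,
                         input_price_per_1m: float = 2.50) -> float:
    return extra_input_tokens / 1_000_000 * input_price_per_1m * calls_per_month

delta = monthly_prompt_delta(1_800, 180_000)  # ≈ $810/month
```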


Why This Changes How You Think About Costs

In traditional cloud FinOps, cost scales with infrastructure decisions — instance sizes, storage tiers, network throughput. These are relatively stable and predictable.

In AI FinOps, cost scales with content decisions — what you put in your prompts, how long your conversations are, how verbose your model's responses are. A developer making what looks like a UX change (longer, more helpful responses) is simultaneously making a cost decision. Most developers don't know this yet. Your job is to help bridge that gap.

FinOps Foundation Source

GenAI FinOps: How Token Pricing Really Works, FinOps Foundation Working Group finops.org/wg/genai-finops-how-token-pricing-really-works/

Exam Tip

The token cost formula is almost certainly on the exam — both as a direct calculation question and embedded in scenario questions. Memorize: Cost = (Input / 1M × Input price) + (Output / 1M × Output price). Also know why output costs more than input (sequential generation vs. parallel processing).

1.3 The Context Window Tax

NovaSpark
Two days in, you've identified the Tuesday problem. But there's a second anomaly in Team Alpha's data that's harder to explain. The chatbot's cost per conversation isn't flat. It starts low — $0.004 for the first message — and climbs with every turn. By turn 10, a single conversation is costing $0.031. By turn 20, it's $0.089. You call Marcus, the engineer who built the chatbot. He explains how it works. "Every time a user sends a message," he says, "we send the entire conversation history to the API. So the model has context — it remembers what was said before." "Every time?" "Every time. That's how it works. The API doesn't remember anything. It's stateless. So if you want it to feel like a continuous conversation, you include all the previous messages in every new call." You do the math. Turn 1: one user message. Turn 2: two messages + one response. Turn 10: ten user messages + nine model responses. Turn 20: twenty user messages + nineteen model responses. Each new turn, you're resending everything that came before. The token count — and the cost — grows with every exchange. Not linearly. The conversation itself keeps getting longer, so each new turn is more expensive than the last. This is the Context Window Tax.

The Context Window Tax


Why APIs Are Stateless

Language models don't have persistent memory between API calls. Each call is completely independent. The model doesn't "remember" that it talked to this user five minutes ago. To create the experience of a continuous conversation, your application code must resend the entire conversation history with every new message.

This is a fundamental architectural reality of how current LLM APIs work. It is not a bug or a limitation that will be patched — it is the design.


The Cost Growth Pattern

Consider an extended customer support conversation — say a 600-token system prompt, 100-token user messages, and 150-token model responses. Each turn, the resent token count grows:

| Turn | New tokens added | Cumulative input tokens sent | Input cost at GPT-4o ($2.50/1M) |
|---|---|---|---|
| 1 | 700 (system prompt + user msg) | 700 | $0.0018 |
| 2 | 250 (prior response + new msg) | 950 | $0.0024 |
| 3 | 250 | 1,200 | $0.0030 |
| 5 | 250 per turn | 1,700 | $0.0043 |
| 10 | 250 per turn | 2,950 | $0.0074 |
| 20 | 250 per turn | 5,450 | $0.0136 |

Turn 20 costs 8× more than Turn 1 — not because the user's message is longer, but because the history is. A customer who has a long back-and-forth with your chatbot costs significantly more to serve than one who resolves their issue in two messages.
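The growth pattern can be simulated directly. The message sizes here are illustrative assumptions (600-token system prompt, 100-token user messages, 150-token replies), not measured values:

```python
# Simulating the Context Window Tax: how many input tokens are resent
# on each turn of a stateless-API conversation. Sizes are illustrative.

def cumulative_input_tokens(turn: int, system: int = 600,
                            user_msg: int = 100, response: int = 150) -> int:
    # Turn n resends the system prompt, all n user messages,
    # and the n-1 prior model responses.
    return system + user_msg * turn + response * (turn - 1)

GPT4O_INPUT = 2.50  # $ per 1M input tokens

for t in (1, 2, 5, 10, 20):
    tokens = cumulative_input_tokens(t)
    print(t, tokens, round(tokens / 1_000_000 * GPT4O_INPUT, 4))
```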


What the Context Window Tax Means in Practice

For NovaSpark's chatbot — 180,000 conversations per month, average 8 turns:

  • Without context management: average cost ~$0.012/conversation = $2,160/month
  • With context trimming (keep last 4 turns): average cost ~$0.007/conversation = $1,260/month
  • Savings from one architectural change: $900/month, $10,800/year

For a higher-volume product — 2 million conversations/month:

  • Same optimization: $120,000/year in savings

Three Mitigation Approaches

1. Context windowing — Keep only the last N turns of conversation history. Discard older turns. Simple to implement, slight UX risk if conversations reference early context.

2. Summarization compression — Periodically summarize earlier turns into a compact summary, replacing the full transcript. Higher quality retention, moderate implementation complexity.

3. RAG-based memory — Store conversation history externally, retrieve only the semantically relevant parts for each new message. Most sophisticated, best UX, highest implementation cost.
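A minimal sketch of approach 1 (context windowing), assuming the common chat-completions message format — a list of role/content dicts. The function and parameter names are ours:

```python
# Context windowing: keep the system prompt plus only the last
# `keep_turns` turns of history when building the API payload.

def windowed_messages(system_prompt: str, history: list,
                      keep_turns: int = 4) -> list:
    # One turn = one user message plus one assistant reply (2 messages).
    trimmed = history[-keep_turns * 2:]
    return [{"role": "system", "content": system_prompt}, *trimmed]
```

Trimming to the last 4 turns is the change behind the ~$900/month savings estimate above; the UX risk is that the model loses sight of anything mentioned only in discarded turns.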


The FinOps Angle

The Context Window Tax is a cost pattern that comes from a product decision (stateful conversation UX), not from infrastructure choices. Engineers building chatbots are making cost decisions every time they choose how much history to include. FinOps practitioners need to work with engineers to surface these cost patterns — not as critique, but as shared visibility. Most engineers building chatbots have never run the compounding math on conversation length.

FinOps Foundation Source

GenAI FinOps vs. Cloud FinOps, FinOps Foundation Working Group finops.org/wg/genai-finops-vs-cloud-finops/

Exam Tip

The Context Window Tax is tested as a "hidden cost" question and as an optimization scenario. Key facts: APIs are stateless by design; conversation history is resent with every call; cost grows with conversation length, not just volume.

1.4 The Full Bill of Materials

NovaSpark
You've accounted for Team Alpha's API charges. But when you add up all three teams' API and compute costs, you get $460,000. The actual bill Priya showed you was $847,000. There's $387,000 you can't explain yet. You dig deeper. Line by line, you find five categories that weren't in your initial mental model of "AI costs." They were right there in the invoice — you just didn't know what you were looking at.

The Full Bill of Materials — What's Actually on Your AI Invoice

The API cost — the token charges you calculated in the previous section — is often less than half of the total cost of running an AI workload. Here are the five categories that make up the rest.

Exam Tip

TCO questions appear in two forms: "What are the hidden cost components beyond API charges?" (knowledge) and "Why is this company's bill higher than expected?" (scenario). Know all five categories. Data egress is the most commonly missed and often the largest surprise.

1.5 AI vs. Traditional Cloud: What's Different

NovaSpark
On Friday morning, you walk into the all-hands meeting with a two-page summary. You've explained the $847,000. You've found the system prompt issue. You've identified the data egress problem. Priya's final question before the meeting: "Can we just apply our standard cloud cost governance to this?" The honest answer is: partly. Some tools transfer. But enough is different that applying cloud FinOps patterns directly will leave blind spots. Here's what changes — and what doesn't.

What's Different About AI Cost Governance

Traditional cloud FinOps is built around infrastructure — virtual machines, storage, databases, network. AI FinOps introduces a new layer: the cost of computation encoded in content.


| Dimension | Traditional Cloud | AI Workloads |
|---|---|---|
| Primary cost unit | CPU-hours, GB-hours, requests | Tokens (input + output) |
| What drives cost | Infrastructure decisions (instance size, storage class) | Content decisions (prompt length, response verbosity, conversation depth) |
| Who controls costs | Infrastructure and platform teams | Engineers, product managers, prompt engineers — anyone who touches prompts |
| Pricing model | Relatively stable, predictable tiers | Volatile: prices dropping ~10× per year ("LLMflation"); new model SKUs constantly |
| Idle cost | Significant (running but unused instances) | Minimal for API model; high for self-hosted (same as cloud) |
| Tagging and attribution | Mature tooling (AWS Cost Explorer, native tags) | Immature — shared API keys, non-standard units, limited vendor tooling |
| Forecasting | Trend analysis works well | Unreliable without understanding usage patterns AND price trajectory |
| Optimization levers | Right-sizing, Reserved Instances, Savings Plans | Prompt compression, model selection, caching, context windowing, quantization |
| Anomaly profile | Gradual drift, infrastructure scaling events | Sharp spikes from runaway loops, prompt changes, traffic events |
| Governance maturity | Well-established (FOCUS spec, native dashboards) | Emerging (FOCUS 1.2–1.3 adding AI support; tooling fragmented) |

What Transfers from Cloud FinOps

  • Unit economics thinking (cost per unit of value delivered)
  • Tagging and attribution discipline
  • The Crawl-Walk-Run maturity model
  • Showback and chargeback governance
  • Budget alerts and anomaly detection concepts
  • Cross-functional collaboration model (FinOps practitioner as bridge)

What Doesn't Transfer Directly

  • Right-sizing has no equivalent — you don't pick an "instance size" for API calls; you pick a model and prompt strategy
  • Reserved Instance savings logic doesn't apply to per-token billing (though Provisioned Throughput Units serve a similar role)
  • Standard cost per request metrics ignore token volume, making comparisons misleading
  • Tagging infrastructure at the API key level doesn't give you per-team or per-feature attribution without additional proxy or gateway tooling
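A quick illustration of the cost-per-request trap, with made-up token profiles (these are not NovaSpark figures): two endpoints with identical request counts can differ by an order of magnitude in cost.

```python
# Why "cost per request" misleads for AI workloads: equal request
# counts, very different token volumes. Token profiles are invented
# for illustration; prices are the GPT-4o benchmark rates.

requests = 10_000
short_calls = {"input": 500, "output": 100}   # avg tokens per call
long_calls = {"input": 6_000, "output": 800}  # long-context calls

def monthly_cost(avg_tokens: dict, n_requests: int,
                 in_price: float = 2.50, out_price: float = 10.00) -> float:
    per_call = (avg_tokens["input"] / 1_000_000 * in_price
                + avg_tokens["output"] / 1_000_000 * out_price)
    return per_call * n_requests

monthly_cost(short_calls, requests)  # -> $22.50
monthly_cost(long_calls, requests)   # -> $230.00
```

Same request count, roughly 10× the spend — which is why token-aware unit metrics (cost per conversation, cost per 1K tokens) replace cost per request in AI FinOps.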

The Practitioner's Mental Model Shift

In cloud FinOps, you ask: "What infrastructure are we running, and is it the right size?"

In AI FinOps, you ask: "What content are we processing, at what volume, with what model, through what architecture — and is every component justified by the value it delivers?"

FinOps Foundation Source

GenAI FinOps vs. Cloud FinOps, FinOps Foundation Working Group finops.org/wg/genai-finops-vs-cloud-finops/

Exam Tip

The FinOps for AI exam tests this comparison directly. Know: (1) token vs. CPU-hour as cost units, (2) content decisions vs. infrastructure decisions as cost drivers, (3) why traditional right-sizing doesn't map to AI APIs, (4) what Provisioned Throughput Units replace in the AI context.
