Calculating AI Inference Unit Economics for Machine Learning Pricing Models
This skill teaches you how to measure and model the real per-request cost of AI inference—including token consumption, GPU compute, API call expenses, and infrastructure overhead—so you can set pricing floors and build profitable machine learning pricing models.
To calculate AI inference unit economics, decompose every request into its cost components: input tokens, output tokens, GPU compute time, orchestration overhead, and fixed infrastructure amortization. Sum the variable costs per request, add a proportional share of fixed costs based on projected volume, then validate the total against your target gross margin. This cost-to-serve number becomes the floor for your machine learning pricing models.
Outcome: You produce a validated cost-per-request model that gives you the exact dollar amount it costs to serve each AI-powered interaction, enabling you to set price floors, forecast COGS at scale, and make confident decisions about your machine learning pricing models.
Prerequisites
- Basic understanding of how LLM tokenization works (input tokens vs. output tokens)
- Access to your AI provider's billing dashboard or invoices (OpenAI, Anthropic, AWS Bedrock, etc.)
- Familiarity with spreadsheet modeling or a tool like Google Sheets / Excel
- Knowledge of your current request volume or a reasonable estimate of projected usage
- Understanding of gross margin concepts (gross margin = (revenue − COGS) / revenue)
Overview
Every AI-powered product has a cost structure that looks nothing like traditional SaaS. In a classic software product, the marginal cost of serving one more user is effectively zero—your servers handle another request and the incremental expense is fractions of a penny. With AI inference, every single request burns real money: tokens are consumed, GPU cycles spin, API meters tick. If you don't know your cost-to-serve with precision, you're either leaving money on the table or quietly bleeding margin on every interaction. Calculating AI inference unit economics is the foundational skill in the AI Pricing Playbook: Unit Economics & Tiering because every downstream pricing decision—tier design, markup strategy, overage pricing—depends on a reliable cost floor.
This skill walks you through building a per-request cost model from scratch. You'll decompose a single AI-powered interaction into its component costs: the tokens consumed by the language model (both input and output), the compute time on GPUs or inference endpoints, any orchestration overhead like retrieval-augmented generation (RAG) lookups or embedding calls, and the amortized share of fixed infrastructure like vector databases, caching layers, and monitoring. The output is a single spreadsheet or model that maps request types to their fully-loaded cost, giving you numbers like '$0.0023 per summarization request' or '$0.018 per complex agentic workflow.' These numbers are the bedrock of sound machine learning pricing models.
The reason this skill exists as a standalone practice—rather than a back-of-napkin estimate—is that AI costs are deceptively variable. A request that generates 50 output tokens costs radically less than one that generates 2,000. A cached prompt is cheaper than a fresh one. A batch-processed request at off-peak hours on a reserved GPU instance is a different economic animal than a real-time request on on-demand compute. Without a structured model, teams routinely underestimate costs by 2-5x, discover they're underwater only after scaling, and then face the painful choice of raising prices or cutting features. The artifact you'll produce here—a cost model with per-request granularity—prevents that surprise and gives you the confidence to price aggressively where margins allow and conservatively where they don't.
When done well, this model becomes a living document. You'll update it quarterly as model providers change pricing, as you shift between models or providers, and as your request mix evolves. It feeds directly into sibling skills like modeling token cost pass-through and managing gross margins on AI features, and it's the first thing you'll reference when designing usage-based pricing tiers.
How It Works
The core mental model behind AI inference unit economics is full-cost decomposition per logical request. Instead of looking at your monthly AI bill as a single number and dividing by total requests (which gives you an average that obscures the variance), you break each type of request your product serves into its atomic cost components, model each independently, then reassemble them into a fully-loaded cost figure.
Think of it like a restaurant costing a dish. You don't just divide total food spend by dishes served. You cost the protein, the vegetables, the sauce, the gas for cooking, the plate depreciation, and the labor—per dish, per variant. A steak dinner has different unit economics than a salad. Similarly, a simple classification request has different unit economics than a multi-step agentic workflow with tool calls, RAG retrieval, and streaming output.
Why decomposition matters more than averaging: AI cost structures have extremely high variance between request types. In a typical AI product, the most expensive 10% of requests might consume 60-70% of total cost. If you average, you'll underprice heavy requests and overprice light ones. Machine learning pricing models that use blended averages create adverse selection—power users flock to your underpriced heavy features, and light users leave because they're subsidizing everyone else.
The model works in three layers:
Layer 1: Variable costs per request. These scale linearly (or near-linearly) with each request. Token costs are the biggest variable: input tokens (what you send to the model) and output tokens (what the model generates) are priced differently by every provider, with output tokens typically costing 3-5x more per token. Compute time matters if you're running self-hosted models—each second of GPU time has a known cost. API orchestration costs include embedding calls for RAG, vector database queries, tool-use calls, and any chained model calls in agentic workflows.
Layer 2: Semi-variable costs. These scale with usage but not linearly per request. Caching infrastructure costs more as your cache grows, but each cached hit avoids a full model call—so caching is both a cost and a savings. Logging and monitoring scale with request volume but are often tiered. Bandwidth costs for streaming responses scale with output size.
Layer 3: Fixed costs amortized per request. These exist regardless of volume: vector database hosting, fine-tuning amortization, GPU reserved instances, prompt engineering labor, evaluation pipeline costs. You amortize these across your projected monthly request volume to get a per-request share. This is where the model gets tricky—if you project 1M requests/month and only hit 200K, your per-request fixed cost is 5x higher than planned.
The formula for a single request type becomes:
Cost per request = (input_tokens × input_price) + (output_tokens × output_price) + (embedding_calls × embedding_price) + (retrieval_cost) + (compute_time × compute_rate) + (fixed_monthly_costs / projected_monthly_requests)
This is the number that feeds into every machine learning pricing model you build. It's your cost floor. Your price must sit above this number by enough to hit your target gross margin (typically 60-80% for software, though many early AI products operate at 40-60% while optimizing). Understanding why each component exists and how it behaves at different scales is what separates teams that price profitably from those that discover margin problems at scale. The AI Pricing Playbook treats this number as the gravitational constant of your pricing universe—everything else orbits around it.
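The formula above can be sketched as a small function. This is a minimal illustration, not a definitive implementation: the `RequestProfile` type, the function name, and every parameter name are assumptions made for this sketch, and all prices are expected in dollars per token, per call, or per second.

```python
from dataclasses import dataclass


@dataclass
class RequestProfile:
    """Measured inputs for one request type (field names are illustrative)."""
    input_tokens: float
    output_tokens: float
    embedding_calls: float = 0.0
    retrieval_cost: float = 0.0    # dollars per request
    compute_seconds: float = 0.0   # self-hosted or orchestration compute time


def cost_per_request(profile, input_price, output_price, embedding_price,
                     compute_rate, fixed_monthly_costs,
                     projected_monthly_requests):
    """Fully loaded cost: variable components plus the amortized fixed share."""
    variable = (
        profile.input_tokens * input_price
        + profile.output_tokens * output_price
        + profile.embedding_calls * embedding_price
        + profile.retrieval_cost
        + profile.compute_seconds * compute_rate
    )
    fixed_share = fixed_monthly_costs / projected_monthly_requests
    return variable + fixed_share


# Hypothetical inputs, chosen only to exercise the function.
example = RequestProfile(input_tokens=1000, output_tokens=200,
                         embedding_calls=1, retrieval_cost=0.0001)
example_cost = cost_per_request(
    example,
    input_price=2.50 / 1_000_000,
    output_price=10.00 / 1_000_000,
    embedding_price=0.00002,
    compute_rate=0.0,
    fixed_monthly_costs=1000.0,
    projected_monthly_requests=100_000,
)
```

Keeping the fixed share as a separate term makes it easy to rerun the same profile under different volume assumptions.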
Step-by-Step
Step 1: Catalog Your AI Request Types
Before you can cost anything, you need to know exactly what you're costing. Open your product and list every distinct type of AI-powered interaction a user can trigger. Be specific—'summarization' and 'question answering over documents' are different request types even if they use the same model, because their token profiles differ dramatically. For each request type, note: which model it calls (GPT-4o, Claude 3.5 Sonnet, a fine-tuned model, etc.), whether it involves RAG retrieval, how many chained calls it makes (e.g., an agentic workflow might call the model 3-5 times per user request), and whether responses are streamed or batched. The output of this step is a simple table with one row per request type and columns for model, chain depth, RAG involvement, and average frequency (what percentage of total requests does this type represent). Most products have 3-8 distinct request types. If you have more than 15, look for types you can group—the cost model needs to be maintainable.
Tip: Check your application logs or API call logs, not your product spec. Engineers often add model calls that PMs don't know about—retry logic, fallback models, pre-classification calls to route requests. These hidden calls are real costs that must be in your model.
Step 2: Measure Token Consumption Per Request Type
For each request type, you need the actual token counts—not estimates, not what the prompt 'should' use, but real measured data. Pull a sample of 100-500 requests per type from your logs. For each request, record input tokens and output tokens separately (your API provider's response headers or billing API will have these). Calculate the median, mean, P75, and P95 for both input and output tokens per request type. The median is your planning number; the P95 is your risk number. If you're using RAG, also count the tokens consumed by the retrieved context—these are input tokens but they vary based on how many chunks you retrieve and how large they are. If you have agentic or multi-step workflows, sum all model calls in the chain. A single user-facing request that triggers 4 model calls consumes 4x the tokens of a single call. Record this multiplier per request type. The output is a table: request type, median input tokens, median output tokens, P95 input tokens, P95 output tokens, average chain depth.
Tip: Output tokens are the most volatile cost driver. A summarization request might produce 50-500 output tokens depending on document length and user instructions. If your P95 output tokens are more than 3x your median, you likely have a bimodal distribution—investigate whether you actually have two distinct request types hiding in one bucket.
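The statistics described in this step can be computed with the Python standard library. The sketch below assumes you have already exported per-request token counts into a list; the function names and the nearest-rank percentile method are choices made for this illustration, and the 3x threshold comes from the tip above.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def token_summary(samples):
    """Median, mean, P75, and P95 for one token series (input or output)."""
    return {
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "p75": percentile(samples, 75),
        "p95": percentile(samples, 95),
    }


def looks_bimodal(summary, ratio=3.0):
    """Flag a request type whose P95 exceeds `ratio` times its median."""
    return summary["p95"] > ratio * summary["median"]
```

Run `token_summary` separately on input and output tokens for each request type, and treat a `looks_bimodal` flag as a prompt to inspect the distribution, not as proof of two request types.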
Step 3: Map Current Provider Pricing to Each Request Type
Pull the current pricing from every AI provider you use. Create a pricing reference table with columns for provider, model, input price per 1K tokens (or per 1M tokens; just be consistent), and output price in the same unit. Include all models you call, including embedding models, and any per-call fees (some providers charge a flat fee per API call on top of token costs). For self-hosted models, calculate the effective per-token cost by dividing your GPU cost (instance cost per hour) by the hourly throughput of that instance (tokens per second × 3,600 seconds). This is less precise than API pricing because throughput varies with batch size, sequence length, and model, so use your measured throughput from production, not the vendor's benchmarks. Now multiply: for each request type, take the median token counts from Step 2 and multiply by the per-token prices, converting per-1K or per-1M prices to per-token prices first. This gives you the raw model cost per request type. Record it to at least three significant figures (for example, $0.000255 rather than $0.0003): these numbers are small individually but massive at scale.
Tip: Watch for pricing that differs between cached and uncached tokens. Anthropic and OpenAI both offer prompt caching that can cut input token costs by 50-90% for repeated system prompts. If you use caching, you need two cost figures per request type: cache-hit cost and cache-miss cost, then weight them by your actual cache hit rate.
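Two calculations from this step lend themselves to a short sketch: the effective per-token price of a self-hosted instance, and the expected input cost once cache hits and misses are weighted by the hit rate. The function names are hypothetical, and the caching function assumes that only a repeated prefix (such as a system prompt) is cacheable and that there is no separate cache-write surcharge; check your provider's pricing before relying on either assumption.

```python
def self_hosted_price_per_token(instance_cost_per_hour, measured_tokens_per_second):
    """Hourly instance cost divided by tokens processed per hour."""
    return instance_cost_per_hour / (measured_tokens_per_second * 3600)


def expected_input_cost(prefix_tokens, fresh_tokens, full_price, cached_price,
                        cache_hit_rate):
    """Expected input cost per request, weighted by the cache hit rate.

    On a hit, the prefix is billed at cached_price and the rest at full_price.
    On a miss, every input token is billed at full_price.
    """
    hit_cost = prefix_tokens * cached_price + fresh_tokens * full_price
    miss_cost = (prefix_tokens + fresh_tokens) * full_price
    return cache_hit_rate * hit_cost + (1 - cache_hit_rate) * miss_cost


# Hypothetical numbers: an 800-token cached prefix, 400 fresh tokens,
# $3.00/1M full price, cached tokens billed at 10% of full price, 90% hit rate.
example_input_cost = expected_input_cost(
    prefix_tokens=800, fresh_tokens=400,
    full_price=3.00 / 1_000_000, cached_price=0.30 / 1_000_000,
    cache_hit_rate=0.90,
)
```

Keeping the hit and miss costs as separate intermediate values gives you the two figures per request type that the tip calls for.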
Step 4: Add Orchestration and Infrastructure Costs
The model call is rarely the only cost. List every other service that gets invoked during a request: vector database queries (Pinecone, Weaviate, pgvector on a database), embedding generation for the query, re-ranking model calls, web search API calls, tool-use endpoints, image processing, or any other external service. For each, find the per-call or per-query cost. Vector database costs are often a combination of storage (per GB/month) and query costs (per query or per compute unit). Embedding costs are typically per-token, just like LLM calls but much cheaper. Sum all of these per-request-type. Then add compute overhead: if your orchestration layer (LangChain, your custom agent framework, etc.) runs on application servers, estimate the compute time per request and multiply by your server cost per second. For most cloud-hosted applications, this is $0.00001-$0.0005 per request—small but not zero, and it adds up. The output is an updated cost table with a new column: 'orchestration and infra cost per request.'
Tip: Don't forget egress and bandwidth costs. If you're streaming large responses or serving results that include retrieved document chunks, cloud egress fees can be $0.01-0.05 per GB. At high volumes with large payloads, this becomes material.
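The per-request orchestration and infrastructure column can be assembled from the pieces listed above. The following is a rough sketch with hypothetical names and made-up unit costs; substitute the per-call prices from your own vendor bills.

```python
def orchestration_cost(service_calls, compute_seconds, server_cost_per_second,
                       response_bytes=0, egress_price_per_gb=0.0):
    """Non-model cost of one request.

    service_calls maps a service name to (calls_per_request, cost_per_call).
    """
    services = sum(calls * unit_cost for calls, unit_cost in service_calls.values())
    compute = compute_seconds * server_cost_per_second
    egress = response_bytes / 1e9 * egress_price_per_gb
    return services + compute + egress


# Hypothetical request: one vector query, one query embedding,
# 0.2 s of application-server time, and a 50 KB response.
example_overhead = orchestration_cost(
    service_calls={
        "vector_query": (1, 0.0001),
        "query_embedding": (1, 0.00002),
    },
    compute_seconds=0.2,
    server_cost_per_second=0.0001,
    response_bytes=50_000,
    egress_price_per_gb=0.05,
)
```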
Step 5: Amortize Fixed Monthly Costs Across Projected Volume
List every fixed cost that exists regardless of request volume: GPU reserved instances or committed-use discounts, vector database base hosting fees, fine-tuning costs (amortized over the useful life of the fine-tune, typically 3-6 months before retraining), monitoring and observability tool subscriptions, prompt engineering and evaluation labor (if you have dedicated staff), and any minimum spend commitments with providers. Sum these into a total monthly fixed cost. Now divide by your projected monthly request volume. This is the trickiest number in the model because it's a forecast, not a measurement. Use three scenarios: pessimistic (50% of target volume), expected (your planning number), and optimistic (150% of target). Calculate the per-request fixed cost allocation under each scenario. The spread between pessimistic and optimistic shows your volume risk—if the pessimistic scenario makes your unit economics unprofitable, you need to rethink your fixed cost structure or your volume assumptions before you price anything.
Tip: For early-stage products with low volume, fixed cost amortization can dominate your unit economics and make per-request costs look terrifying. Separate your model into 'at current volume' and 'at target volume (12 months out)' views. Price for where you're going, not where you are—but track actuals monthly to make sure you're getting there.
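The three-scenario amortization described in this step is a one-line division repeated at different volumes. A minimal sketch follows, with a hypothetical function name and the 50% / 100% / 150% multipliers taken from the step above.

```python
def fixed_cost_scenarios(fixed_monthly_costs, target_monthly_requests):
    """Per-request fixed cost at 50%, 100%, and 150% of target volume."""
    multipliers = {"pessimistic": 0.5, "expected": 1.0, "optimistic": 1.5}
    return {
        name: fixed_monthly_costs / (target_monthly_requests * multiplier)
        for name, multiplier in multipliers.items()
    }


# Hypothetical inputs: $2,270/month in fixed costs, 200,000 requests/month target.
scenarios = fixed_cost_scenarios(2270.0, 200_000)
```

Calling the same function with current actual volume as the target gives the "at current volume" view that the tip recommends keeping alongside the forecast.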
Step 6: Build the Fully-Loaded Cost-Per-Request Model
Now assemble everything into a single spreadsheet or model. Create one row per request type. Columns: request type, median input tokens, median output tokens, model cost (tokens × price), orchestration cost, fixed cost allocation, and total fully-loaded cost per request. Add a weighted average row at the bottom that weights each request type by its share of total volume. This gives you a blended cost per request that's useful for back-of-envelope checks but should never be used for actual pricing decisions (use the per-type costs instead). Add a sensitivity analysis: what happens to costs if output tokens increase 50%? If your provider raises prices 20%? If volume drops 30%? Build these as toggleable scenarios. Finally, add a column for your target gross margin (start with 70% as a benchmark) and calculate the minimum price per request type that achieves that margin. This is your price floor: price floor = cost / (1 − target margin), so a $0.03 request at a 70% target margin needs a price of at least $0.10. The output of this step is the artifact: a cost model spreadsheet with per-request-type costs, scenarios, and price floors.
Tip: Color-code the cells: green for request types where your current pricing exceeds the price floor by 2x+ (healthy margin), yellow for 1-2x (tight), red for below the floor (losing money). This visual makes it immediately obvious where you have pricing problems.
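For teams that keep the model in code instead of a spreadsheet, the price floor, the blended row, and the color coding can be sketched as below. The function names are illustrative; the 2x and 1x thresholds mirror the tip above, and the floor follows from the gross margin definition, (price − cost) / price.

```python
def price_floor(cost_per_request, target_margin=0.70):
    """Minimum price at which (price - cost) / price reaches the target margin."""
    return cost_per_request / (1.0 - target_margin)


def blended_cost(cost_by_type, volume_share_by_type):
    """Volume-weighted average cost; for sanity checks, not for pricing."""
    return sum(cost_by_type[name] * volume_share_by_type[name]
               for name in cost_by_type)


def margin_status(current_price, floor):
    """Green at 2x the floor or more, yellow between 1x and 2x, red below the floor."""
    ratio = current_price / floor
    if ratio >= 2.0:
        return "green"
    if ratio >= 1.0:
        return "yellow"
    return "red"
```

A sensitivity scenario is then a matter of recomputing `cost_per_request` with one input changed (for example, output tokens multiplied by 1.5) and passing the result through `price_floor` again.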
Step 7: Validate Against Actual Spend
A model is only useful if it matches reality. Take your last full month of actual AI provider invoices, infrastructure bills, and any other costs included in the model. Calculate what your model would have predicted for that month's spend given the actual request volume and mix. Compare predicted vs. actual. If they're within 10%, your model is solid. If they diverge by more than that, investigate: Are there request types you missed? Is your token measurement sample unrepresentative? Are there costs not captured (support, incident response, model evaluation)? Reconcile until the model predicts last month within 10%. Then run it forward: predict next month's cost based on your growth trajectory and see if the prediction feels reasonable. This validation step is what separates a useful cost model from a theoretical exercise. Document the validation date and accuracy so you know when it's time to revalidate.
Tip: The most common source of divergence is retries and error handling. If 5% of your requests fail and get retried, you're consuming tokens on the failed attempts too. Check your error rate and add a retry multiplier (e.g., 1.05x for a 5% retry rate) to your token consumption estimates.
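The predicted-versus-actual comparison can be sketched as follows. The names are hypothetical, `variable_cost_by_type` is assumed to hold the per-request variable cost (model plus orchestration) for each type, and the retry multiplier is applied to the whole variable total as a simplification of the tip above.

```python
def predicted_monthly_spend(variable_cost_by_type, actual_requests_by_type,
                            fixed_monthly_costs, retry_multiplier=1.0):
    """Spend the model would have predicted for a month's actual volume and mix."""
    variable = sum(variable_cost_by_type[name] * count
                   for name, count in actual_requests_by_type.items())
    return variable * retry_multiplier + fixed_monthly_costs


def divergence(predicted, actual):
    """Absolute gap between prediction and invoice, as a fraction of the invoice."""
    return abs(predicted - actual) / actual


# Hypothetical month: two request types, 5% retries, $100 in fixed costs.
predicted = predicted_monthly_spend(
    {"light": 0.01, "heavy": 0.02},
    {"light": 1000, "heavy": 500},
    fixed_monthly_costs=100.0,
    retry_multiplier=1.05,
)
gap = divergence(predicted, actual=110.0)
```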
Step 8: Establish a Refresh Cadence and Cost Monitoring
AI inference costs are not stable. Model providers change pricing (often downward, but not always). Your request mix shifts as users adopt new features. Your engineering team optimizes prompts, adds caching, or switches models. Set a calendar reminder to refresh this model monthly for the first quarter, then quarterly once it's stable. Create a simple dashboard or alert that tracks your actual cost-per-request against the model's prediction—if they diverge by more than 15%, trigger an immediate refresh. Also monitor your cost-per-request trend over time: is it going up (more complex features, larger contexts) or down (prompt optimization, caching, cheaper models)? This trend line is critical input for your pricing strategy. If costs are declining 10% per quarter, you can either improve margins or pass savings to customers to drive adoption. If costs are rising, you need to adjust pricing or optimize before margins erode.
Tip: Set up a Slack or email alert when your daily average cost-per-request exceeds 120% of the modeled value. This catches problems like a prompt regression (someone accidentally removed caching), a model version change with different token economics, or a sudden shift in usage patterns—before they show up on your monthly invoice.
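The alert rule in the tip reduces to a ratio check. The sketch below only builds the alert message; how it is delivered (Slack, email, a pager) is left out, and the function name and message format are assumptions made for this illustration.

```python
def cost_alert(daily_spend, daily_requests, modeled_cost_per_request,
               threshold=1.20):
    """Return an alert message when actual cost per request exceeds
    `threshold` times the modeled value; otherwise return None."""
    if daily_requests == 0:
        return None
    actual = daily_spend / daily_requests
    ratio = actual / modeled_cost_per_request
    if ratio > threshold:
        return (f"Cost per request is {ratio:.0%} of model "
                f"(${actual:.4f} actual vs ${modeled_cost_per_request:.4f} modeled)")
    return None
```

For example, `cost_alert(130.0, 10_000, 0.01)` produces a message, while `cost_alert(110.0, 10_000, 0.01)` returns `None` because the ratio stays under 120%.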
Examples
Example: B2B SaaS with RAG-Powered Document Q&A
A 15-person startup sells a document intelligence platform to legal teams. Users upload contracts and ask questions. The product uses OpenAI GPT-4o for generation, a Pinecone vector database for retrieval, and OpenAI's embedding model for query and document embedding. Monthly volume: ~200,000 requests. The team currently has no cost model and just watches the monthly OpenAI invoice grow.
The team catalogs three request types: simple factual Q&A (60% of volume, ~800 input tokens including context, ~150 output tokens), complex analytical questions (30% of volume, ~3,200 input tokens with more retrieved chunks, ~600 output tokens), and document summarization (10% of volume, ~8,000 input tokens for long documents, ~1,200 output tokens). They pull 300 production requests per type and confirm these medians. At GPT-4o pricing of $2.50/1M input tokens and $10.00/1M output tokens, simple Q&A costs $0.0035/request, complex Q&A costs $0.014/request, and summarization costs $0.032/request. Adding Pinecone query costs ($0.08/query for their plan amortized) and embedding costs ($0.00002/query), orchestration adds ~$0.0001-$0.0003 per request. Fixed costs: Pinecone hosting at $70/month, monitoring at $200/month, and a 20% allocation of one ML engineer's time for prompt maintenance ($2,000/month). At 200K requests, fixed allocation is $0.0114/request. The blended weighted average of variable costs is $0.0098/request (about $0.021 once the fixed allocation is added), while the fully-loaded per-type costs range from roughly $0.015 (simple) to $0.043 (summarization). At their current pricing of $500/seat/month with average usage of 5,000 requests per seat, they're earning $0.10/request, which leaves healthy margins across all types. But they identify that if complex and summarization requests grow to 60% of mix (which usage trends suggest), the blended cost rises 40%. They decide to implement per-request-type tracking to watch the mix shift and prepare tier adjustments.
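The model-cost arithmetic in this example can be reproduced in a few lines. This sketch uses only the token counts, volume shares, prices, and fixed costs stated above; the variable names are illustrative, and orchestration costs are left out of the blended figure.

```python
GPT4O_INPUT = 2.50 / 1_000_000    # dollars per input token
GPT4O_OUTPUT = 10.00 / 1_000_000  # dollars per output token

# request type -> (median input tokens, median output tokens, share of volume)
request_types = {
    "simple_qa": (800, 150, 0.60),
    "complex_qa": (3200, 600, 0.30),
    "summarization": (8000, 1200, 0.10),
}


def model_cost(input_tokens, output_tokens):
    """Raw model cost of one request at GPT-4o list prices."""
    return input_tokens * GPT4O_INPUT + output_tokens * GPT4O_OUTPUT


model_costs = {name: model_cost(i, o) for name, (i, o, _) in request_types.items()}

# Volume-weighted model cost, before orchestration and fixed allocation.
blended_model_cost = sum(share * model_costs[name]
                         for name, (_, _, share) in request_types.items())

# Fixed allocation: Pinecone hosting + monitoring + engineer time, over 200K requests.
fixed_per_request = (70 + 200 + 2000) / 200_000
```

The per-type results come out to $0.0035, $0.014, and $0.032, the blended model cost to about $0.0095, and the fixed allocation to about $0.0114 per request.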
Example: Solo Developer Building an AI Writing Assistant
A solo developer is launching an AI writing assistant targeting freelance copywriters. The product uses Anthropic Claude 3.5 Sonnet via API for content generation and editing. No RAG, no vector database—just direct model calls with a system prompt and user input. Projected volume: 50,000 requests/month initially. Budget is tight; the developer needs to ensure they don't lose money on a $20/month subscription price.
There are two request types: short-form editing (70% of volume—user pastes a paragraph, model rewrites it, ~400 input tokens, ~350 output tokens) and long-form generation (30%—user provides a brief, model writes 500+ words, ~300 input tokens, ~800 output tokens). At Claude 3.5 Sonnet pricing of $3/1M input and $15/1M output tokens, short-form costs $0.0065/request and long-form costs $0.013/request. The developer implements prompt caching for the system prompt (~800 tokens, always identical), achieving a 92% cache hit rate, which cuts input costs by ~40% on cached requests. After caching, short-form drops to $0.0054 and long-form to $0.011. Fixed costs are minimal: $0 for infrastructure (serverless deployment on Vercel), $29/month for error monitoring, and $0 for the developer's own time (they're not counting labor as COGS yet). At 50K requests/month, fixed allocation is $0.00058/request—negligible. The weighted blended cost is $0.0068/request. With 50K requests/month split across 100 users (500 requests/user), each user's COGS is ~$3.40/month against $20/month revenue—an 83% gross margin. The developer validates by checking last month's Anthropic invoice: $310 actual vs. $340 modeled (within 9%). They're comfortable with their pricing but build a scenario for what happens if users average 1,000 requests/month: COGS rises to $6.80/user, still 66% margin. The model gives them confidence to launch.
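The uncached per-request costs and the per-user margin scenarios in this example can be checked with a short sketch. It takes the example's blended figure of $0.0068/request as given instead of re-deriving it, and the function names are illustrative.

```python
SONNET_INPUT = 3.00 / 1_000_000    # dollars per input token
SONNET_OUTPUT = 15.00 / 1_000_000  # dollars per output token


def model_cost(input_tokens, output_tokens):
    """Uncached model cost of one request at the prices stated above."""
    return input_tokens * SONNET_INPUT + output_tokens * SONNET_OUTPUT


short_form = model_cost(400, 350)  # about $0.0065
long_form = model_cost(300, 800)   # about $0.013


def gross_margin(price_per_user, requests_per_user, cost_per_request):
    """Gross margin for one subscriber at a given monthly usage level."""
    cogs = requests_per_user * cost_per_request
    return (price_per_user - cogs) / price_per_user


margin_at_500 = gross_margin(20.0, 500, 0.0068)
margin_at_1000 = gross_margin(20.0, 1000, 0.0068)
```

Sweeping `requests_per_user` over a wider range shows the usage level at which a flat subscription stops covering its own COGS.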
Example: Enterprise Platform with Self-Hosted Models and API Fallback
A 200-person company runs a customer support automation platform. They self-host Llama 3 70B on AWS GPU instances for routine classification and response drafting, and fall back to GPT-4o via API for complex escalations. Monthly volume: 2 million requests. Four NVIDIA A100 GPU instances run 24/7 for the self-hosted model.
The team identifies four request types: ticket classification (50% of volume, self-hosted Llama, ~200 input/~10 output tokens), routine response drafting (30%, self-hosted Llama, ~1,500 input/~400 output tokens), complex response generation (15%, GPT-4o API, ~3,000 input/~800 output tokens), and sentiment analysis (5%, self-hosted Llama, ~300 input/~5 output tokens). For self-hosted Llama, they calculate the cost differently: 4× A100 instances at $3.67/hour each = $10,732/month total GPU cost. They measure throughput at 850 tokens/second aggregate across all instances for their workload. Monthly token capacity: 850 × 3,600 × 24 × 30 = ~2.2 billion tokens. At 1.7M self-hosted requests consuming ~1.8B tokens total, their utilization is 82%. Effective cost: $10,732 / 1.8B tokens = $0.00000596/token, or about $5.96 per 1M tokens. At this throughput that is roughly 2.4x GPT-4o's input price of $2.50/1M tokens, although it is below GPT-4o's $10.00/1M output price. On that basis, self-hosted classification costs about $0.0013/request and routine drafting about $0.0113/request. The 15% of requests hitting GPT-4o (300K/month) cost $0.0155/request, or about $4,650/month. Because the GPU instances run 24/7 whether or not they are busy, the team treats them as a fixed cost, which makes the GPT-4o calls the only model spend that scales with each additional request. Fixed costs include GPU instances ($10,732), ML ops engineer allocation ($8,000), monitoring ($500), and model evaluation pipeline ($1,200). Total fixed: $20,432/month, or $0.0102/request at 2M volume. The key insight from the model: lowering the GPT-4o fallback rate from 15% to 10% would move 100K requests/month off the API and save about $1,550/month in variable costs, as long as the self-hosted instances have capacity to absorb them. The team prioritizes improving their routing classifier to keep more requests on the self-hosted model, and sets their machine learning pricing models to charge a premium for features that trigger GPT-4o escalation.
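The capacity, utilization, and per-token figures in this example follow from a handful of divisions, sketched below. All inputs are the numbers stated in the example (including the $10,732 monthly GPU bill and the ~1.8B tokens consumed, which are taken as given); the variable names are illustrative.

```python
GPU_MONTHLY_COST = 10_732.0        # 4 A100 instances, as stated in the example
AGGREGATE_TOKENS_PER_SECOND = 850  # measured across all instances
TOKENS_CONSUMED = 1.8e9            # self-hosted tokens per month, as stated

# Tokens the fleet could process in a 30-day month at the measured throughput.
monthly_capacity = AGGREGATE_TOKENS_PER_SECOND * 3600 * 24 * 30
utilization = TOKENS_CONSUMED / monthly_capacity

# Effective self-hosted cost per token actually served.
self_hosted_per_token = GPU_MONTHLY_COST / TOKENS_CONSUMED

# GPT-4o fallback request: ~3,000 input and ~800 output tokens.
gpt4o_per_request = 3000 * 2.50 / 1_000_000 + 800 * 10.00 / 1_000_000

# Fixed allocation: GPUs + ML ops + monitoring + evaluation, over 2M requests.
fixed_monthly = 10_732 + 8_000 + 500 + 1_200
fixed_per_request = fixed_monthly / 2_000_000
```

Because `self_hosted_per_token` divides a fixed bill by tokens served, it falls as utilization rises, which is why the measured throughput matters more than the instance price.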
Example: Consumer AI App with Freemium Model
A consumer startup offers an AI-powered meal planning app. Free users get 5 meal plans/week; paid users ($9.99/month) get unlimited plans plus grocery list optimization. The product uses GPT-4o-mini for meal plans and GPT-4o for the premium grocery optimization feature. 500,000 MAU, 15,000 paid users. Volume: ~3M free-tier requests and ~800K paid-tier requests per month.
Request types: basic meal plan generation (free tier, GPT-4o-mini, ~500 input/~300 output tokens), premium meal plan with dietary optimization (paid tier, GPT-4o-mini, ~900 input/~500 output tokens), and grocery list optimization (paid only, GPT-4o, ~1,200 input/~400 output tokens). GPT-4o-mini at $0.15/1M input and $0.60/1M output: basic meal plan costs $0.000255/request, premium meal plan costs $0.000435/request. GPT-4o grocery optimization costs $0.007/request. Free tier COGS: 3M × $0.000255 = $765/month, or $0.00153/user/month. Paid tier COGS: 600K premium plans × $0.000435 + 200K grocery optimizations × $0.007 = $261 + $1,400 = $1,661/month, or $0.111/user/month. Fixed costs: $2,000/month infrastructure + $3,000 ML ops allocation = $5,000/month. At 3.8M total requests, that's $0.0013/request. Paid user fully-loaded COGS: $0.111 + $0.33 (the full $5,000 in fixed costs spread across the 15,000 paid users, a conservative allocation) = ~$0.44/user/month against $9.99 revenue, a 95.6% gross margin. The grocery optimization feature is 84% of paid tier variable costs despite being only 25% of paid requests. The team realizes they could offer an intermediate tier without grocery optimization at $4.99/month with 97%+ margins, capturing users who want more meal plans but don't need grocery lists. The cost model directly informed their machine learning pricing models and tier design.
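The tier-level COGS in this example can be recomputed directly from the stated token counts, prices, and volumes. The sketch below covers variable costs only; the variable names are illustrative.

```python
MINI_INPUT, MINI_OUTPUT = 0.15 / 1_000_000, 0.60 / 1_000_000     # GPT-4o-mini
GPT4O_INPUT, GPT4O_OUTPUT = 2.50 / 1_000_000, 10.00 / 1_000_000  # GPT-4o

basic_plan = 500 * MINI_INPUT + 300 * MINI_OUTPUT       # $0.000255
premium_plan = 900 * MINI_INPUT + 500 * MINI_OUTPUT     # $0.000435
grocery_opt = 1200 * GPT4O_INPUT + 400 * GPT4O_OUTPUT   # $0.007

free_tier_cogs = 3_000_000 * basic_plan                 # dollars per month

paid_premium_cogs = 600_000 * premium_plan
paid_grocery_cogs = 200_000 * grocery_opt
paid_variable_cogs = paid_premium_cogs + paid_grocery_cogs

paid_variable_per_user = paid_variable_cogs / 15_000
grocery_share_of_paid = paid_grocery_cogs / paid_variable_cogs
```

Splitting the paid tier by feature is what surfaces the concentration: `grocery_share_of_paid` is about 0.84 even though grocery optimization is a quarter of paid requests.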
Best Practices
Measure tokens from production, not development. Development prompts are shorter, simpler, and more predictable than production prompts with real user inputs. Always base your cost model on production log samples of at least 100 requests per type. Teams that cost from development data consistently underestimate production costs by 30-60%, and the gap grows as users discover creative (expensive) ways to use your product.
Model input and output tokens separately with distinct distributions. Input tokens are relatively stable (your system prompt + retrieved context is predictable) while output tokens are highly variable (model verbosity, user request complexity). Treating them as a single number hides the variance that drives your cost risk. If your output token P95 is 4x+ your median, that tail is where your margin disappears.
Always include a 'worst case request' analysis. Identify the single most expensive possible request a user could make in your product—maximum context window, maximum output, multiple tool calls, full RAG retrieval. Cost it out fully. This number tells you your maximum exposure per request and informs decisions about rate limiting, output caps, and whether you need guardrails. If your worst-case request costs $0.50 and your pricing charges $0.01, one adversarial or pathological user can destroy your margins.
Use the per-request-type costs, never the blended average, for pricing decisions. The blended average is a vanity metric that hides cross-subsidization. A product with two request types—one costing $0.001 and another costing $0.05—has a blended average that depends entirely on mix, which you don't control. Price each tier or feature against its specific cost, not the average. This prevents the adverse selection spiral where heavy users get subsidized by light users until the light users leave.
Build your model to be provider-swappable. Structure the spreadsheet so that changing the model provider's pricing is a single cell edit, not a full rebuild. AI providers change pricing frequently, new models launch monthly, and your team will want to model 'what if we switch from GPT-4o to Claude 3.5 Sonnet?' quickly. A provider-agnostic structure also makes it easy to model multi-provider strategies where you route different request types to different models based on cost-quality tradeoffs.
Include the cost of quality: evaluation, monitoring, and guardrails. Running an AI feature in production requires evaluating output quality, monitoring for regressions, content filtering, and safety checks. These are real per-request costs (or per-batch costs that amortize per request) that teams routinely omit from cost models. If your evaluation pipeline runs a second model call on 10% of responses, that's a cost. If your guardrail check adds latency that requires higher-tier compute, that's a cost. Omitting quality costs leads to the painful discovery that 'making it good enough for production' costs 20-40% more than the raw inference.
Document your assumptions explicitly in the model. Every cost model contains assumptions: projected volume, cache hit rate, average chain depth, retry rate. Write each assumption in a dedicated cell or section with the date and source. When assumptions change (and they will), you can trace exactly which parts of the model are affected. Without this, refreshing the model becomes a full rebuild because nobody remembers why a particular number was chosen.
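The worst-case request analysis recommended above can be sketched as a ceiling calculation. Everything here is hypothetical: the function names, the context and output limits, the prices, and the assumption that each chained call can hit the maximum independently.

```python
def worst_case_cost(max_input_tokens, max_output_tokens, input_price,
                    output_price, max_model_calls=1, per_call_overhead=0.0):
    """Upper bound on one user request if every chained call hits its limits."""
    per_call = (max_input_tokens * input_price
                + max_output_tokens * output_price
                + per_call_overhead)
    return per_call * max_model_calls


def exposure_ratio(worst_case, price_per_request):
    """How many paid requests it takes to cover one worst-case request."""
    return worst_case / price_per_request


# Hypothetical limits: 128K input tokens, 4K output tokens, up to 3 chained calls.
single_call = worst_case_cost(128_000, 4_000, 2.50 / 1_000_000, 10.00 / 1_000_000)
chained = worst_case_cost(128_000, 4_000, 2.50 / 1_000_000, 10.00 / 1_000_000,
                          max_model_calls=3)
```

A high `exposure_ratio` is the signal to add output caps, context limits, or rate limiting before relying on average-case margins.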
Common Mistakes
Using the API provider's published 'average cost per request' instead of measuring your own token consumption.
Correction
Provider averages are based on their entire customer base, which has a completely different request mix than your product. A provider might quote $0.002 average per request, but your product's RAG-augmented requests with large context windows might cost $0.02-$0.08 each. Always measure your own token distributions from production logs. The signal that you've made this mistake is that your modeled monthly cost is consistently 2-5x lower than your actual invoice. Pull a sample of 200+ real requests, measure tokens, and re-baseline.
Ignoring the cost of multi-step and agentic workflows by counting only the 'main' model call.
Correction
Modern AI features often involve chains: a classifier routes the request, RAG retrieves context, the main model generates a response, and a second model validates or reformats it. Each step consumes tokens and compute. Teams frequently cost only the 'main' generation call and miss 40-70% of actual token consumption. The diagnostic sign is a cost model that predicts accurately for simple requests but wildly underestimates complex ones. Trace one complete request through your system, log every external API call, and sum them all.
Building the cost model once and never refreshing it, even as models, providers, and request patterns change.
Correction
AI inference economics shift faster than almost any other cost input in software. Model providers change pricing quarterly, your engineering team optimizes prompts and adds caching, and your user base shifts toward different features. A model built in January can be 30-50% inaccurate by June. The warning sign is a growing gap between your modeled cost-per-request and your actual invoice divided by requests. Set a quarterly refresh cadence at minimum, and revalidate immediately after any model migration, major prompt change, or provider pricing update.
Amortizing fixed costs across optimistic volume projections, making unit economics look artificially good.
Correction
When you divide $10,000/month in fixed infrastructure costs by a projected 1M requests, it's only $0.01 per request. But if you're currently at 100K requests, your actual fixed cost allocation is $0.10 per request—10x higher. This mistake makes early-stage unit economics look profitable on paper while the company loses money in practice. Always model at current actual volume alongside projected volume, and be honest about which number you're using for pricing decisions. If your product can't achieve target margins at current volume with fixed costs included, acknowledge that gap and plan for it explicitly rather than hiding it behind a forecast.
Treating all requests as equal cost when designing machine learning pricing models, leading to a single blended price.
Correction
This is the most strategically dangerous mistake. If your product offers both a lightweight autocomplete feature ($0.001/request) and a deep document analysis feature ($0.05/request), a single blended price creates a massive cross-subsidy. Users of the expensive feature get a bargain, users of the cheap feature overpay, and as your user base shifts toward the expensive feature (which they will, because it's underpriced), your blended cost rises but your blended price doesn't. The fix is to cost each request type independently and use those per-type costs as inputs to your tiering and pricing strategy. You can still present a simple price to users, but the internal model must understand the cost structure per feature.
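To see how the cross-subsidy plays out, the sketch below computes the blended cost for two request mixes using the per-request costs from the example above. The request-type names, the two mixes, and the $0.01 blended price are assumptions chosen for illustration.

```python
def blended_cost(mix: dict[str, float], cost_by_type: dict[str, float]) -> float:
    """Weighted average cost per request for a given request mix (shares sum to 1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return sum(share * cost_by_type[kind] for kind, share in mix.items())


def gross_margin(price: float, cost: float) -> float:
    return (price - cost) / price


cost_by_type = {"autocomplete": 0.001, "document_analysis": 0.05}  # from the example
BLENDED_PRICE = 0.01  # USD per request, hypothetical single price

launch_mix = {"autocomplete": 0.9, "document_analysis": 0.1}   # assumed
shifted_mix = {"autocomplete": 0.5, "document_analysis": 0.5}  # assumed

for label, mix in [("launch", launch_mix), ("shifted", shifted_mix)]:
    cost = blended_cost(mix, cost_by_type)
    print(f"{label}: blended cost ${cost:.4f}, margin {gross_margin(BLENDED_PRICE, cost):.0%}")
```

Under these assumed numbers the same price that is profitable at the launch mix loses money once usage shifts toward the expensive feature, which is why the internal model needs per-type costs even if the public price stays simple.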
Forgetting to account for prompt caching economics, leading to either over- or under-estimation.
Correction
If you've implemented prompt caching (where repeated system prompts are cached by the provider at reduced cost), but your cost model uses the full uncached token price, you're overestimating costs. Conversely, if you assume caching for all requests but your actual cache hit rate is only 40%, you're underestimating. The fix is to measure your actual cache hit rate from the usage data your provider returns with each API response (most providers report how many input tokens were served from cache), then calculate a weighted average input price: (cache_hit_rate × cached_price) + (cache_miss_rate × full_price), where cache_miss_rate = 1 - cache_hit_rate. Update this monthly as your cache behavior changes with new prompt versions or user patterns.
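The weighted average above is a one-line calculation. In this sketch the prices are illustrative, and the hit rate is assumed to be measured as the share of input tokens served from cache rather than the share of requests; any extra charge a provider applies for writing to the cache is left out.

```python
def effective_input_price(cache_hit_rate: float,
                          cached_price: float,
                          full_price: float) -> float:
    """Weighted average input price: hits pay cached_price, misses pay full_price."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cache_miss_rate = 1.0 - cache_hit_rate
    return cache_hit_rate * cached_price + cache_miss_rate * full_price


FULL_PRICE = 3.00    # USD per 1M input tokens, illustrative
CACHED_PRICE = 0.30  # USD per 1M cached input tokens, illustrative

measured = effective_input_price(0.40, CACHED_PRICE, FULL_PRICE)
print(f"effective input price at a 40% hit rate: ${measured:.2f} per 1M tokens")
print(f"assuming no caching:   ${effective_input_price(0.0, CACHED_PRICE, FULL_PRICE):.2f}")
print(f"assuming full caching: ${effective_input_price(1.0, CACHED_PRICE, FULL_PRICE):.2f}")
```

The two boundary cases show the size of the over- and under-estimate the paragraph above warns about.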
Other Skills in This Method
Designing Usage-Based Pricing Tiers for AI Products
How to structure tiered pricing plans around usage metrics like API calls, tokens, or seats that align customer value with your cost structure.
Choosing Between AI Pricing Models: Seat vs. Usage vs. Outcome
A decision framework for selecting the right pricing model—per-seat, per-token, per-outcome, or hybrid—based on your AI product's value delivery and cost profile.
Modeling Token Cost Pass-Through and Markup Strategy
How to build financial models that account for underlying LLM token costs, apply sustainable markups, and forecast margin impact as token prices fluctuate.
Managing Gross Margins on AI-Powered Features
Techniques for monitoring, protecting, and improving gross margins when variable AI compute costs threaten profitability at scale.
Benchmarking AI Product Pricing Against Competitors
A systematic approach to researching, comparing, and positioning your AI product's pricing relative to competitors and market expectations.
Migrating from Flat Subscription to Usage-Based AI Pricing
A step-by-step playbook for transitioning existing customers from fixed subscription plans to usage-based or hybrid pricing without excessive churn.
Setting Rate Limits and Overage Pricing for AI APIs
How to define usage caps, throttling policies, and overage charges that protect margins while preserving a positive customer experience.
Frequently Asked Questions
How do I calculate inference costs when I'm using multiple models in a single request chain?
Trace the full request lifecycle through your system, logging every external model or API call. Sum the token costs of each call in the chain separately—don't estimate the 'main' call and add a percentage buffer. For a typical RAG workflow that calls an embedding model, then a reranker, then a generation model, you'll have three distinct cost lines. If you have agentic workflows where the number of calls varies per request, measure the average and P95 chain depth from production logs and model both. The P95 is particularly important because a few runaway chains can dominate your monthly bill.
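A minimal sketch of the average-versus-P95 comparison for agentic chains follows. The chain depths and the flat cost per call are invented for illustration, and the percentile uses the simple nearest-rank method; your logging stack may compute percentiles differently.

```python
import math
from statistics import mean


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical number of model calls per request, pulled from production logs.
chain_depths = [3] * 17 + [6, 9, 12]
COST_PER_CALL = 0.004  # USD, illustrative average cost of one call in the chain

avg_depth = mean(chain_depths)
p95_depth = nearest_rank_percentile(chain_depths, 95)

print(f"average depth: {avg_depth:.1f} calls, about ${avg_depth * COST_PER_CALL:.4f} per request")
print(f"P95 depth:     {p95_depth} calls, about ${p95_depth * COST_PER_CALL:.4f} per request")
```

Modeling both lines separately keeps the long tail visible instead of averaging it away.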
How long should the initial cost model take to build from scratch?
For a product with 3-5 request types and a single API provider, expect 2-4 hours for the initial build: about 1 hour to catalog request types and pull log samples, 1 hour to measure token distributions, 30 minutes to map pricing and calculate costs, and 30-60 minutes to add fixed costs, build scenarios, and validate against actual spend. Products with self-hosted models, multiple providers, or complex agentic workflows can take a full day. The good news is that refreshes take 30-60 minutes once the structure exists—you're just updating numbers, not rebuilding the framework.
Should I calculate inference unit economics before or after designing my pricing tiers?
Always before. Your cost-per-request is the foundation that every pricing decision rests on. Without it, you're guessing at tier boundaries, markup rates, and overage pricing. The [AI Pricing Playbook](/methods/ai-pricing-playbook) sequences this skill first deliberately—you need the cost floor before you can decide how much to mark up, which features to bundle, or where to set tier thresholds. Think of it as measuring ingredients before writing a recipe. That said, the cost model and pricing tiers iterate together: once you design tiers, you'll want to model the expected request mix per tier and verify the economics still work.
How do I handle inference costs for self-hosted models vs. API-based models?
For API-based models, cost-per-token is explicit in the provider's pricing. For self-hosted models, you need to calculate an effective per-token cost: take your total infrastructure cost for the inference cluster (GPU instances, networking, storage, DevOps allocation) and divide by your measured throughput in tokens per month. The key metric to measure is tokens-per-second at your production batch size and sequence lengths, not the peak throughput cited in benchmarks. Also factor in utilization: if your GPUs are only 60% utilized (common during off-peak hours), your effective per-token cost is about 67% higher than it would be at full utilization (1 / 0.6 ≈ 1.67). Model both the fully-utilized and actual-utilization costs so you understand your optimization opportunity.
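The self-hosted calculation can be sketched as follows. The cluster cost and throughput are illustrative, the function name is hypothetical, and a 30-day month is assumed for simplicity.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def self_hosted_cost_per_mtok(monthly_cluster_cost: float,
                              tokens_per_second: float,
                              utilization: float = 1.0) -> float:
    """Effective USD cost per 1M tokens for a self-hosted inference cluster.

    tokens_per_second should be the throughput measured at production batch
    size and sequence lengths; utilization is the fraction of time the
    cluster is actually serving traffic.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilization
    return monthly_cluster_cost / tokens_per_month * 1_000_000


CLUSTER_COST = 20_000.0  # USD per month, illustrative
THROUGHPUT = 2_500.0     # tokens per second, illustrative

full = self_hosted_cost_per_mtok(CLUSTER_COST, THROUGHPUT)
actual = self_hosted_cost_per_mtok(CLUSTER_COST, THROUGHPUT, utilization=0.60)

print(f"at full utilization: ${full:.2f} per 1M tokens")
print(f"at 60% utilization:  ${actual:.2f} per 1M tokens ({actual / full - 1:.0%} higher)")
```

The gap between the two figures is the optimization opportunity: better batching or traffic shaping moves the actual number toward the fully utilized one.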
Why does my cost model keep drifting from actual spend month to month?
The three most common causes of cost-model drift are: (1) request mix shift: the proportion of expensive vs. cheap request types changes as users adopt new features or change behavior, which changes your blended cost even if per-type costs are stable; (2) token consumption creep: prompts get longer as your team adds instructions, guardrails, or examples, silently increasing input tokens by 20-50% over a few months; (3) hidden retry and error costs: failed requests that consume tokens before failing aren't always captured in your cost model. Fix this by tracking actual per-type token distributions monthly, comparing prompt lengths against your baseline, and adding a retry rate multiplier based on your error monitoring.
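Two of those fixes reduce to small calculations, sketched below. The function names are hypothetical, the numbers are invented, and the retry multiplier assumes each retried request adds one extra attempt whose billed share is given by `billed_fraction` (1.0 means the failed attempt is billed in full).

```python
def prompt_growth(baseline_input_tokens: float, current_input_tokens: float) -> float:
    """Relative growth of average input tokens against the baseline (0.30 means +30%)."""
    return (current_input_tokens - baseline_input_tokens) / baseline_input_tokens


def cost_with_retries(base_cost: float,
                      retry_rate: float,
                      billed_fraction: float = 1.0) -> float:
    """Expected cost per request once billed retries are included.

    retry_rate is the share of requests retried once (from error monitoring);
    billed_fraction is an assumption about how much of a failed attempt is billed.
    """
    return base_cost * (1.0 + retry_rate * billed_fraction)


growth = prompt_growth(baseline_input_tokens=1_200, current_input_tokens=1_560)
adjusted = cost_with_retries(base_cost=0.020, retry_rate=0.05)

print(f"input tokens grew {growth:.0%} since the baseline")
print(f"cost per request including retries: ${adjusted:.4f}")
```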
How do I account for costs that only apply at scale, like rate limiting overhead and queueing infrastructure?
Costs that emerge only at scale—queue management systems, rate limiting infrastructure, load balancing across multiple API keys or model endpoints, and burst pricing from providers—should be modeled as a step function, not a smooth curve. Below certain volume thresholds, these costs are zero. Above them, they can be significant. Build a 'scaling triggers' section in your cost model that lists each infrastructure component, the volume threshold at which it becomes necessary, and its cost. For example: 'At >500 requests/second sustained, we need a dedicated queue service ($200/month) and a second API key with separate rate limits.' Include these in your projected volume scenarios so you're not surprised when you cross a threshold.
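A 'scaling triggers' section can be represented as a small table plus a step function. In this sketch the first trigger uses the figures from the example above; the second trigger, the `ScalingTrigger` structure, and the volume scenarios are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingTrigger:
    component: str
    threshold_rps: float  # sustained requests/second above which the component is needed
    monthly_cost: float   # USD per month once the component is in place


def scaling_cost(sustained_rps: float, triggers: list[ScalingTrigger]) -> float:
    """Step function: a component costs nothing until its threshold is crossed."""
    return sum(t.monthly_cost for t in triggers if sustained_rps > t.threshold_rps)


triggers = [
    ScalingTrigger("dedicated queue service", threshold_rps=500, monthly_cost=200.0),
    ScalingTrigger("multi-endpoint load balancer", threshold_rps=2_000, monthly_cost=500.0),  # hypothetical
]

for rps in (100, 600, 2_500):
    print(f"{rps:>5} req/s sustained -> ${scaling_cost(rps, triggers):.0f}/month in scale-only costs")
```

Adding the result to the fixed-cost line of each projected volume scenario shows where the cost curve jumps.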
What target gross margin should I aim for when building machine learning pricing models?
Traditional SaaS targets 75-85% gross margins. AI-powered products typically operate at 50-70% in their first year, improving toward 65-80% as they optimize prompts, implement caching, shift to cheaper models, and scale volume to amortize fixed costs. If your cost model shows margins below 50%, investigate before launching: you likely need prompt optimization, model downgrades for simpler tasks, or caching strategies before your pricing can work. If margins are above 80%, you may be under-investing in AI quality or have room to lower prices aggressively to capture market share. The target depends on your business model—pure API products need higher margins than products where AI is one feature among many.
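Translating a target gross margin into a price floor follows directly from the margin definition, margin = (price - cost) / price, which rearranges to price = cost / (1 - margin). The sketch below uses an invented cost-to-serve of $0.03 per request.

```python
def price_floor(cost_per_request: float, target_gross_margin: float) -> float:
    """Minimum price per request that achieves the target gross margin."""
    if not 0.0 <= target_gross_margin < 1.0:
        raise ValueError("target_gross_margin must be in [0, 1)")
    return cost_per_request / (1.0 - target_gross_margin)


def gross_margin(price: float, cost: float) -> float:
    return (price - cost) / price


COST_PER_REQUEST = 0.03  # USD, illustrative fully loaded cost-to-serve

for target in (0.50, 0.70, 0.80):
    floor = price_floor(COST_PER_REQUEST, target)
    print(f"target margin {target:.0%}: price floor ${floor:.3f} per request")
```

Note how quickly the floor rises with the target: moving from 50% to 80% margin multiplies the required price by 2.5 at the same cost.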