Modeling Token Cost Pass-Through and Markup Strategy
Teaches you to build a financial model that translates raw LLM token costs into customer-facing prices with sustainable markups, and to forecast how margin shifts when token prices or usage volumes change.
Start by mapping every LLM call in your product to its token consumption (input and output tokens separately), then apply your provider's pricing to calculate per-request cost. Layer a markup of 3–8× on raw token cost to cover infrastructure, R&D, and margin. Build a scenario model that stress-tests your margin at 50% and 80% token price drops, and at 2×–10× usage growth. Update the model quarterly as provider pricing changes.
Outcome: A working financial model that tells you exactly what markup to charge per AI action, how your gross margin responds to token price changes and usage spikes, and when you need to renegotiate provider contracts or adjust customer pricing.
Prerequisites
- Basic spreadsheet or financial modeling skills (pivot tables, scenario tables, named ranges)
- Understanding of your product's LLM call patterns — which features call which models and roughly how many tokens per request
- Familiarity with your LLM provider's pricing page (OpenAI, Anthropic, Google, or self-hosted inference costs)
- Completion of or familiarity with the sibling skill: Calculating AI Inference Unit Economics
Overview
Every AI-powered product sits on top of a volatile cost layer: LLM inference. Unlike a traditional SaaS product where compute costs are relatively fixed per user, an AI feature's marginal cost scales with every request, and the unit price of that cost — the token — changes unpredictably as providers compete on price. If you don't model this cost pass-through explicitly, you end up either overcharging (losing deals to competitors who price more aggressively) or undercharging (watching margins erode as adoption grows). This skill teaches you to build the financial model that prevents both outcomes.
Inside the AI Pricing Playbook: Unit Economics & Tiering, token cost pass-through modeling is the bridge between raw unit economics (what each AI call actually costs you) and customer-facing pricing (what you charge). The sibling skill Calculating AI Inference Unit Economics tells you what a single request costs today. This skill extends that number forward in time and across scenarios: what happens to your margin when OpenAI drops prices 50% next quarter? What happens when your power users send 10× the tokens you forecasted? What markup absorbs those shocks without requiring an emergency pricing change?
The concrete artifact you'll produce is a multi-tab spreadsheet (or equivalent model) with four components: a token cost inventory mapping every AI feature to its token consumption, a markup and pricing calculator that translates cost into customer price, a scenario engine that stress-tests margin under different token price and usage volume assumptions, and a trigger dashboard that flags when margin crosses thresholds requiring action. A well-built model takes 2–4 hours the first time and becomes the single source of truth your product, finance, and engineering teams reference every time a pricing decision is on the table.
This is a core component of any serious AI pricing strategy because it forces you to separate the things you can control (markup, tier design, rate limits) from the things you can't (provider pricing, customer usage patterns) and to build explicit plans for responding to both. Without it, your pricing is a guess that gets staler every quarter.
How It Works
The mental model behind token cost pass-through is a three-layer stack: the provider layer (what you pay per token), the infrastructure layer (your overhead on top of tokens), and the value layer (what the customer perceives and pays). Your financial model must capture all three, because your markup isn't just a multiplier on token cost — it also needs to cover non-token expenses like prompt engineering, caching infrastructure, monitoring, and the R&D that makes your AI features useful in the first place.
Why a simple percentage markup fails. Many teams start by applying a flat 3× or 5× markup on token cost and calling it done. This breaks in two directions. First, when token prices drop (as they reliably have — GPT-4 class models have seen 90%+ price reductions in 18 months), your revenue per request drops proportionally even though your non-token costs stay fixed. Second, when usage spikes, your absolute cost grows but your infrastructure costs grow sublinearly, meaning a flat markup leaves money on the table at scale and undercharges at low volume.
The right model separates fixed and variable cost layers. Token cost is purely variable — it scales linearly with usage. Infrastructure cost (caching, monitoring, orchestration) is semi-variable — it grows in steps. R&D cost is fixed — your prompt engineering team's salary doesn't change with request volume. Your markup formula should therefore be: customer_price_per_request = (token_cost × variable_markup) + (fixed_cost_allocation_per_request). The variable markup covers token cost plus margin. The fixed cost allocation amortizes your infrastructure and R&D across expected request volume. This structure means that when token prices drop, your variable component shrinks but your fixed allocation holds, protecting overall margin.
Scenario modeling is the real value. The static model tells you today's margin. The scenario engine tells you whether your pricing survives contact with reality. You need to stress-test at least four scenarios: token price drops 50% (you keep current customer pricing — margin expands), token price drops 50% and you pass savings to customers (margin holds, volume grows), usage per customer doubles (variable costs spike, fixed allocation dilutes), and a new model generation forces migration (temporary cost spike during transition). Each scenario produces a margin forecast and a trigger point — the moment at which you must act.
Input and output tokens are separate cost lines. This is a detail that many models get wrong. LLM providers charge different rates for input (prompt) tokens and output (completion) tokens, and the ratio between them varies wildly by feature. A summarization feature might use 2,000 input tokens and 200 output tokens. A content generation feature might use 200 input tokens and 2,000 output tokens. Since output tokens typically cost 2–4× more per token than input tokens, a model that uses a blended average will systematically misprice features with skewed ratios. Always model input and output token costs separately for each feature.
The AI Pricing Playbook recommends rebuilding this model quarterly — not because the math is complex, but because the inputs change fast. Provider pricing shifts, your product adds new AI features, usage patterns evolve, and your infrastructure costs change as you optimize caching and prompt engineering. A quarterly cadence catches drift before it becomes a margin crisis.
Step-by-Step
Step 1: Inventory Every AI Feature and Its Token Profile
Create a table listing every feature in your product that makes an LLM call. For each feature, record: the model used (e.g., GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro), the average input token count per request, the average output token count per request, the P95 (95th percentile) input and output token counts, and the current monthly request volume. Get these numbers from your LLM provider dashboard, your logging infrastructure, or by sampling 100–500 requests per feature if you don't have aggregate data. If the feature is pre-launch, estimate by running 50 representative prompts through a token counter and logging the results. Separate system prompts, user input, and retrieved context (RAG) in your input token count — system prompts are fixed cost per request, while user input and context vary. This inventory is the foundation of every downstream calculation, so precision here saves you from compounding errors later.
Tip: Most teams undercount input tokens by 30-50% because they forget to include system prompts and RAG context. Run a trace on 10 live requests per feature and compare actual token usage to your estimates — the gap is usually sobering.
Step 2: Build the Per-Request Cost Calculator
For each feature in your inventory, calculate the fully-loaded cost per request. Start with the token cost:
(input_tokens × input_price_per_token) + (output_tokens × output_price_per_token). Use your provider's current pricing, and if you have a negotiated rate, use that instead of list price. Then add non-token variable costs: API gateway fees, logging/monitoring cost per request (typically $0.0001–$0.001), and any embedding or retrieval costs for RAG features. Finally, calculate a fixed cost allocation per request by dividing your monthly fixed AI infrastructure costs (prompt engineering salaries, caching infrastructure, model evaluation tools) by your total monthly request volume across all features. The output of this step is a cost-per-request for each feature broken into three components: token cost, variable overhead, and fixed allocation. Keep these separate — collapsing them into one number makes scenario modeling impossible.Tip: If you use caching (semantic or exact match), calculate cost-per-request as a weighted average: `(cache_hit_rate × cached_cost) + ((1 - cache_hit_rate) × uncached_cost)`. A 40% cache hit rate on a feature with $0.02 uncached cost and $0.001 cached cost drops effective cost to $0.0124 — a 38% reduction that directly impacts your markup math.
Step 3: Determine Your Target Gross Margin Range
Set the gross margin band you need to sustain. For venture-backed SaaS, the standard target is 70–80% gross margin. AI-native products often run lower — 50–65% is common in early stages, with a roadmap to 70%+ as token costs decline and efficiency improves. Talk to your finance team or investors about the acceptable range. Define three thresholds: the target margin (where you want to be, e.g., 65%), the floor margin (below which you must act immediately, e.g., 45%), and the ceiling margin (above which you're likely overcharging and leaving growth on the table, e.g., 80%). These thresholds become the trigger points in your scenario model. Document why you chose each number — the reasoning matters when you're defending pricing decisions to the board or renegotiating with providers.
Tip: Don't benchmark against traditional SaaS margins without adjusting for AI cost structure. A 60% gross margin on an AI product with declining input costs is strategically healthier than a 75% margin on a legacy product with flat costs, because your margin naturally expands over time as token prices fall.
Step 4: Calculate the Required Markup Per Feature
For each feature, work backward from your target gross margin to the required customer price. The formula is:
customer_price = total_cost_per_request / (1 - target_gross_margin). If a request costs $0.015 fully loaded and your target margin is 65%, the customer price is $0.015 / 0.35 = $0.043 per request. Express this as a markup multiple as well: $0.043 / $0.015 = 2.86×. Calculate the markup at all three thresholds (target, floor, ceiling) so you know your pricing band. Compare the resulting customer prices to what competitors charge for similar capabilities — if your 65%-margin price is 3× what competitors charge, either your cost structure is wrong or your margin target is unrealistic. Also compare the markup multiples across features. If your summarization feature requires a 2.5× markup but your content generation feature requires a 6× markup to hit the same margin, you may need feature-specific pricing rather than a single per-request rate.Tip: When markup multiples vary widely across features (more than 2× difference between lowest and highest), consider bundling high-margin and low-margin features into composite pricing units rather than pricing each call individually. This simplifies customer communication while averaging margin across the portfolio.
Step 5: Build the Scenario Engine
Create a scenario table that stress-tests your pricing under different conditions. Build at minimum six scenarios: (1) token prices drop 30%, you hold customer pricing — shows margin expansion; (2) token prices drop 50%, you pass 50% of savings to customers — shows margin hold with volume upside; (3) usage per customer doubles with current pricing — shows whether fixed cost allocation dilutes or concentrates; (4) you migrate to a newer, cheaper model — shows one-time cost and steady-state improvement; (5) a competitor undercuts your price by 40% — shows whether you can match and survive; (6) token prices increase 20% (unlikely but possible with new model generations) — shows margin compression. For each scenario, calculate: new cost per request, new gross margin at current customer price, new gross margin if you adjust customer price, and the number of months until your margin hits the floor threshold. Present this as a table with conditional formatting — green above target, yellow between target and floor, red below floor.
Tip: Add a 'breakeven usage' row to each scenario showing how many requests per customer per month you need to cover your fixed costs. This number is the most useful signal for your sales team — if a customer's projected usage is below breakeven, the deal is margin-negative regardless of markup.
Step 6: Set Pricing Triggers and Response Playbooks
Define the specific conditions under which you'll change pricing, and document what you'll do when each trigger fires. A pricing trigger has three parts: the metric being watched (e.g., blended gross margin, cost per request, provider list price), the threshold that fires the trigger (e.g., gross margin drops below 50% for two consecutive months), and the response playbook (e.g., increase per-request price by 15%, switch to cheaper model for non-premium tiers, implement stricter caching). Create at least four triggers: margin floor breach, provider price drop greater than 25%, usage per customer exceeding 150% of forecast, and new model availability from your provider. For each trigger, specify who owns the decision (product, finance, engineering), the maximum response time (e.g., 'pricing change within 30 days of trigger'), and the customer communication plan. This step transforms your model from a passive report into an active management system.
Tip: The most commonly missed trigger is the 'good news' trigger: when token prices drop significantly and you DON'T need to pass savings through to customers immediately. Document this scenario explicitly — your default should be to bank the margin improvement for 1–2 quarters before adjusting customer pricing, giving you a buffer for future cost increases.
Step 7: Model the Customer-Facing Price Translation
Your internal cost model uses per-request pricing, but customers rarely want to buy individual requests. Translate your per-request economics into the pricing structure your customers will actually see. If you use credit-based pricing, define what one credit equals in token terms (e.g., 1 credit = up to 1,000 input tokens + 500 output tokens) and price credits with your markup baked in. If you use tier-based pricing, calculate the token budget each tier supports and set the tier price to cover the expected usage at your target margin plus 10–15% headroom for usage variance within the tier. If you use per-seat pricing with AI features included, calculate the average AI usage per seat per month and embed that cost in the seat price, then model what happens when AI-heavy users are 3× the average. Document the translation formula explicitly: anyone on your team should be able to trace from a customer's monthly bill back to the underlying token cost and verify the margin.
Tip: Build a 'customer bill simulator' tab in your model that takes a specific customer's usage pattern and outputs their monthly bill under different pricing structures. Run it against your top 10 customers by usage — you'll immediately see which pricing structure produces the most predictable revenue and which creates outlier bills that will trigger churn.
Step 8: Validate Against Competitive Benchmarks and Customer Willingness to Pay
Your model now tells you what you NEED to charge. This step checks whether you CAN charge it. Gather competitor pricing for equivalent AI capabilities — check their pricing pages, API documentation, and any published case studies with cost examples. Use the sibling skill Benchmarking AI Product Pricing Against Competitors for a structured approach. Create a comparison table showing your calculated price, competitor prices, and the delta. If you're more than 30% above the cheapest competitor for comparable quality, you need to either reduce your cost base (better caching, cheaper models, more efficient prompts) or justify the premium with differentiated value. Also validate against customer willingness-to-pay data: what do sales calls and customer interviews tell you about price sensitivity for AI features? If customers consistently balk at your calculated price, the model is technically correct but commercially useless — you need to find cost reductions rather than pushing a price the market won't bear.
Tip: When comparing to competitors, normalize for output quality. A competitor charging 50% less but delivering noticeably worse AI output isn't a true comparable. Build a simple quality scorecard (accuracy, latency, relevance) and price-adjust: if your output is 2× better on a blind evaluation, a 30% price premium is justifiable and should be communicated to sales.
Step 9: Schedule Quarterly Model Reviews
Lock in a quarterly review cadence by scheduling four recurring meetings for the next year. Each review should take 60–90 minutes and include product, finance, and engineering leads. The agenda is fixed: update the token cost inventory with actual usage data from the past quarter, refresh provider pricing (check for announced or rumored price changes), re-run all six scenarios with current numbers, check each pricing trigger against current metrics, and decide on any pricing adjustments for the next quarter. After each review, update the model file, write a one-page summary of findings and decisions, and distribute to stakeholders. The quarterly cadence is non-negotiable in the current AI pricing environment — provider prices have changed every 3–6 months for the past two years, and a model that's two quarters stale is actively misleading.
Tip: Create a 'model changelog' tab that logs every update with date, what changed, why, and who approved it. This audit trail is invaluable when a pricing decision is questioned six months later and nobody remembers the context.
Examples
Example: B2B SaaS with AI Document Summarization (Series A, 200 Customers)
A document management SaaS with 200 paying customers ($500/month per customer) has added an AI summarization feature that processes uploaded documents. The feature uses GPT-4o with an average of 4,000 input tokens and 400 output tokens per request. Customers make an average of 150 requests/month, but the top 10% of customers average 800 requests/month. Current gross margin across the business is 72%, and the board wants AI features to sustain at least 60% margin.
The team builds the token inventory: one feature, GPT-4o, 4,000 input / 400 output average, 12,000 input / 1,200 output at P95. At GPT-4o pricing ($2.50/1M input, $10/1M output), the average request costs $0.014 in tokens. Adding $0.002 for monitoring, logging, and RAG embedding costs brings the total variable cost to $0.016 per request. Fixed AI infrastructure (one ML engineer at $15k/month plus tooling at $2k/month) divided by 30,000 total monthly requests gives $0.00057 per request fixed allocation — rounding to $0.001. Fully loaded cost: $0.017 per request. At 60% target margin: customer price = $0.017 / 0.40 = $0.043 per request, a 2.5× markup. For the average customer at 150 requests/month, this is $6.45/month — easily absorbed in a $500 subscription. For the top-10% customer at 800 requests, it's $34.40/month. The team decides to include 200 AI requests in the base subscription and charge $0.04 per additional request. The scenario model shows that if GPT-4o prices drop 50%, the per-request cost falls to $0.009 and margin jumps from 60% to 78% at current pricing — the team plans to increase the included allocation from 200 to 350 requests per month after two quarters of sustained lower pricing, effectively passing through savings as increased usage allowance rather than a price cut.
Example: AI Writing Assistant API (Seed Stage, Usage-Based Pricing)
A startup sells an AI writing API to developers. The product supports three tiers of quality: Fast (GPT-4o-mini), Standard (GPT-4o), and Premium (Claude 3.5 Sonnet). Average token usage is 300 input / 1,500 output for all tiers. The company has 50 API customers, burns $80k/month, and needs to reach profitability within 12 months. Current pricing is a flat $0.02 per API call across all tiers.
The model immediately reveals the problem: flat pricing across tiers ignores massive cost differences. GPT-4o-mini costs $0.000045 input + $0.0009 output = $0.00095 per request. GPT-4o costs $0.00075 + $0.015 = $0.01575. Claude 3.5 Sonnet costs $0.0009 + $0.0225 = $0.0234. At $0.02 per call, Fast tier has 95% margin, Standard has −21% margin (losing money on every call), and Premium has −17% margin. Usage data shows 60% of calls go to Standard and Premium — the company is losing money on the majority of its requests. The fix: tier-specific pricing. At 65% target margin: Fast = $0.003/call (rounded from $0.0027), Standard = $0.045/call, Premium = $0.067/call. The team runs competitive benchmarks and finds these prices are 20% below comparable APIs, validating the model. The scenario engine shows that even a 40% drop in Claude pricing only reduces Premium per-call cost to $0.014, keeping margin above 75% at the new pricing. The breakeven calculation shows the company needs 2.1M monthly API calls at the new blended rate to cover $80k monthly burn — currently at 800k calls, so they need 2.6× growth, which the 12-month forecast supports. The team implements the new tiered pricing with a 60-day grandfather period for existing customers.
Example: Enterprise Customer Support Platform (Growth Stage, 2,000 Customers)
An enterprise support platform with 2,000 customers on per-seat pricing ($50–$200/seat/month) has embedded AI across five features: ticket auto-routing (low token), response drafting (high token), sentiment analysis (medium token), knowledge base search (medium token with RAG), and conversation summarization (high token). The platform processes 5M AI requests/month across all features. Engineering wants to switch from GPT-4o to a mix of models (GPT-4o-mini for routing and sentiment, GPT-4o for drafting and summarization, an embedding model for search). The CFO needs to understand margin impact before approving.
The team builds a five-row token inventory with current (all GPT-4o) and proposed (mixed model) costs. Current state: routing costs $0.003/request (low tokens), drafting $0.032/request (high output), sentiment $0.008/request, KB search $0.015/request (including embedding), summarization $0.022/request. Blended average across 5M requests at current usage mix: $0.018/request, total monthly token cost: $90,000. Proposed mixed-model state: routing on GPT-4o-mini drops to $0.0004/request, sentiment to $0.001/request. Drafting and summarization stay on GPT-4o. New blended average: $0.011/request, total monthly token cost: $55,000 — a 39% reduction. With 2,000 customers averaging 2.5 seats at $120/seat (midpoint), monthly revenue is $600,000. Current AI cost ratio: $90k / $600k = 15% of revenue. Proposed: $55k / $600k = 9.2% of revenue. The model also flags a risk: if the model switch degrades quality on routing and sentiment, support ticket escalations increase, which drives more human agent time — a cost not captured in the token model. The team adds a quality-adjusted scenario that models a 10% escalation rate increase, adding $12k/month in agent costs, netting the savings to $23k/month instead of $35k. The CFO approves the switch with a 30-day quality monitoring period and an automatic rollback trigger if escalation rate increases more than 5%.
Example: AI Tutoring Platform (Pre-Revenue, Projecting for Investor Deck)
An edtech startup building an AI tutoring platform needs to project unit economics for a seed round pitch. The product hasn't launched but has 500 beta users. Each tutoring session averages 15 back-and-forth exchanges, each exchange uses approximately 800 input tokens (student question + conversation history) and 600 output tokens (tutor response). The team plans to use Claude 3.5 Sonnet. They're projecting 10,000 paying users at $29/month within 18 months.
The model starts with per-session cost. Each session = 15 exchanges × (800 input + 600 output tokens) = 12,000 input + 9,000 output tokens per session. At Claude 3.5 Sonnet pricing ($3/1M input, $15/1M output): input cost = $0.036, output cost = $0.135, total = $0.171 per session. Beta data shows users average 12 sessions/month, so monthly AI cost per user = $2.05. At $29/month subscription, AI cost is 7.1% of revenue — well within healthy margins. But the model flags two risks. First, the input token count grows with conversation history: by exchange 15, the context window contains all previous exchanges, so the actual input token count is not 800 × 15 = 12,000 but closer to 800 + 1,600 + 2,400 + ... = approximately 60,000 input tokens total for the session. Recalculated: input cost = $0.18, output cost = $0.135, total = $0.315 per session, monthly cost = $3.78 per user (13% of revenue). Second, power users (top 10% in beta) average 35 sessions/month, pushing their monthly cost to $11.03 — 38% of the $29 subscription. The team adds a session limit of 20/month on the base plan with a $49/month unlimited plan. The scenario model shows that at projected 10,000 users with 70% on base and 30% on unlimited, blended margin is 72% at current Anthropic pricing and improves to 81% if pricing drops 30% (which the 18-month timeline makes likely). The investor deck now shows realistic unit economics with identified risks and mitigation strategies.
Best Practices
Always model input and output tokens as separate cost lines, because LLM providers price them differently (output tokens often cost 2–4× more) and your features have varying input/output ratios. Blending them into an average token cost introduces systematic pricing errors that compound across your feature portfolio, leading to cross-subsidization where some features secretly subsidize others.
Include non-token variable costs in every per-request calculation — API gateway fees, logging, monitoring, embedding generation for RAG, and any third-party API calls triggered by the LLM response. Teams that model only token cost typically understate true per-request cost by 15–30%, which translates directly into margin shortfalls once you scale past a few hundred customers.
Build the model with named variables and clear cell references rather than hardcoded numbers. When OpenAI announces a price change at 9 AM, you should be able to update one cell (the token price) and immediately see the impact on every feature's margin, every scenario, and every trigger. Hardcoded numbers require a full audit every time an input changes, which means the model either drifts out of date or someone introduces an update error.
Price in bands, not points. Set a target margin with a floor and ceiling rather than optimizing for a single number. A target of 65% with a floor of 50% and ceiling of 80% gives you room to absorb cost fluctuations without triggering an emergency pricing change every quarter. Point targets create false precision — your usage forecasts aren't accurate enough to justify optimizing for 65% versus 63%.
Separate your pass-through model from your customer-facing pricing model. The pass-through model is an internal cost tool; the customer-facing model translates those costs into credits, tiers, or seat prices. Conflating them leads to pricing structures that are economically sound but commercially incomprehensible — customers don't care about your token cost structure, they care about predictable bills.
Stress-test every pricing decision against your highest-usage customer, not your average customer. The average customer is a statistical fiction — your actual margin is determined by the distribution of usage, and power users (top 5%) typically account for 40–60% of your total token spend. If your pricing doesn't work for them, your blended margin will miss target even if the average-customer math looks perfect.
Document your markup rationale in writing, not just in spreadsheet formulas. When a new VP of Finance or a board member asks why you're charging 4× token cost, you need a narrative that explains the fixed cost allocation, the margin buffer for price volatility, and the competitive positioning. A spreadsheet without context is an argument waiting to happen.
Track the actual margin per customer cohort monthly, not just the modeled margin. Your model is a forecast; reality will differ because of usage variance, caching hit rates, prompt efficiency changes, and provider pricing shifts. A monthly actual-vs-modeled comparison is the fastest way to catch model drift before it becomes a margin problem.
Common Mistakes
Using blended average token cost instead of separating input and output tokens
Correction
This mistake happens because LLM pricing pages often lead with a single headline number or teams average for simplicity. In practice, a content generation feature might have a 1:10 input-to-output ratio, meaning output tokens dominate cost — and output tokens cost 2–4× more. You'll catch this by comparing your modeled cost per request against actual provider invoices; if the model consistently underestimates cost for generation-heavy features and overestimates for analysis-heavy features, you have a blending problem. Fix it by adding separate input_tokens and output_tokens columns for every feature in your inventory, each multiplied by its specific per-token price.
Setting markup as a fixed multiple and never adjusting it as token prices change
Correction
Teams set a 5× markup at launch and then treat it as permanent policy. When token prices drop 50%, the customer price drops proportionally — and the fixed cost allocation (R&D, infrastructure) that the markup was supposed to cover no longer does. The symptom is margin compression despite falling input costs, which seems paradoxical until you realize the fixed cost layer didn't shrink with tokens. Prevent this by separating your markup into a variable component (applied to token cost) and a fixed component (per-request allocation of infrastructure and R&D), and by reviewing the split quarterly.
Modeling only the average request and ignoring the distribution of token usage across users
Correction
This mistake comes from using aggregate data (total tokens / total requests = average) without examining the spread. AI usage is almost always heavy-tailed: a small percentage of users or features consume a disproportionate share of tokens. If your pricing is calibrated to the average but your top 10% of users consume 5× the average, those users are margin-negative and growing faster than your other segments. Detect this by adding P50, P90, and P99 columns to your token inventory alongside the mean. If P99 is more than 3× the mean, you need either per-request pricing, usage caps, or tiered pricing that charges more at higher volumes.
Ignoring caching impact on effective cost
Correction
Many teams build their cost model on uncached request costs and then implement caching as an afterthought. This means your model overstates cost — sometimes by 30–50% if you have good semantic or exact-match caching on common queries. The downstream effect is overpricing: your customer-facing price has headroom you don't realize you have, making you look expensive against competitors who've factored caching into their pricing. Add a cache_hit_rate column to each feature and calculate effective cost as the weighted average of cached and uncached costs. Update cache hit rates monthly as your caching infrastructure improves.
Building the model once and not scheduling quarterly reviews
Correction
The initial model gets built during a pricing project, produces great outputs, and then sits untouched for 6–12 months. In that time, the LLM provider has changed pricing twice, three new features have shipped without being added to the token inventory, and actual usage patterns have diverged 40% from the original forecast. The model becomes a historical artifact rather than a management tool. The fix is mechanical: schedule four recurring quarterly reviews on the day you complete the initial model. Put them on the calendar of the product lead, finance lead, and engineering lead. No agenda flexibility — the review happens even if 'nothing has changed,' because that assumption is almost always wrong.
Passing through 100% of token cost savings to customers immediately when provider prices drop
Correction
When a provider announces a 50% price cut, the instinct is to immediately lower customer prices to stay competitive and drive adoption. This destroys the margin buffer you need for the inevitable quarter when costs go up (new model generation, usage spike, or provider pricing correction). Instead, adopt a delayed pass-through policy: bank the margin improvement for 1–2 quarters, use the period to validate that the cost reduction is durable, and then pass through 50–70% of the savings while retaining the rest as permanent margin improvement. Announce the price reduction as a 'price improvement' rather than a cost adjustment — customers see it as a benefit, not an admission that you were overcharging.
Other Skills in This Method
Designing Usage-Based Pricing Tiers for AI Products
How to structure tiered pricing plans around usage metrics like API calls, tokens, or seats that align customer value with your cost structure.
Choosing Between AI Pricing Models: Seat vs. Usage vs. Outcome
A decision framework for selecting the right pricing model—per-seat, per-token, per-outcome, or hybrid—based on your AI product's value delivery and cost profile.
Calculating AI Inference Unit Economics
How to measure and model the per-request cost of AI inference including token consumption, GPU compute, and API call expenses to establish your true cost-to-serve.
Managing Gross Margins on AI-Powered Features
Techniques for monitoring, protecting, and improving gross margins when variable AI compute costs threaten profitability at scale.
Benchmarking AI Product Pricing Against Competitors
A systematic approach to researching, comparing, and positioning your AI product's pricing relative to competitors and market expectations.
Migrating from Flat Subscription to Usage-Based AI Pricing
A step-by-step playbook for transitioning existing customers from fixed subscription plans to usage-based or hybrid pricing without excessive churn.
Setting Rate Limits and Overage Pricing for AI APIs
How to define usage caps, throttling policies, and overage charges that protect margins while preserving a positive customer experience.
Frequently Asked Questions
How do I model token cost pass-through when my product uses multiple LLM providers?
Add a provider column to your token inventory and use each provider's specific pricing for cost calculations. Calculate a weighted average cost per feature based on how traffic is distributed across providers (e.g., 70% OpenAI, 30% Anthropic). In your scenario engine, model each provider's pricing independently — a price drop from OpenAI doesn't affect your Anthropic costs. If you use a router that dynamically selects providers based on cost or quality, model the router's allocation logic as a variable and stress-test what happens when the optimal routing changes.
What markup multiple should I target for AI features?
There's no universal number, but observed ranges are 2.5–8× on raw token cost depending on the value of the output and the non-token cost layer. Low-value, high-volume features (auto-tagging, sentiment analysis) typically sustain 2.5–4× because customers are price-sensitive and alternatives are abundant. High-value, low-volume features (document drafting, code generation, complex analysis) can sustain 5–8× because the output replaces expensive human labor. Start by calculating the markup required to hit your target gross margin, then validate against competitive benchmarks and customer willingness to pay.
Should I model token cost pass-through before or after designing my pricing tiers?
Build the pass-through model first. Your pricing tiers are a customer-facing abstraction on top of the underlying economics, and you can't design a sustainable tier structure without knowing what each tier costs to serve. The pass-through model tells you the cost floor for each tier; the tier design (covered in [Designing Usage-Based Pricing Tiers for AI Products](/skills/designing-usage-based-pricing-tiers)) then layers in customer value perception, competitive positioning, and packaging strategy. Doing it in reverse order means you'll design tiers that feel right commercially but may be margin-negative on your highest-usage tier.
How do I handle token cost pass-through when I'm self-hosting open-source models instead of using API providers?
Replace per-token API pricing with your fully-loaded inference cost per token. Calculate this by summing GPU instance costs, networking, storage, model serving infrastructure (vLLM, TGI), and ops team time, then dividing by total tokens processed per month. The per-token cost will be lower than API pricing at high utilization (typically above 60–70% GPU utilization) but higher at low utilization because you're paying for idle capacity. Your model needs a utilization variable that adjusts effective per-token cost as usage scales, and your scenario engine should include a 'utilization drop' scenario showing what happens to margin if usage dips below your breakeven utilization rate.
Why does my modeled margin keep drifting from actual margin month over month?
The three most common causes are: (1) your average tokens-per-request estimate is stale — actual usage patterns shift as customers discover new ways to use the product, typically toward higher token consumption; (2) your cache hit rate has changed but isn't reflected in the model — a product change or new user cohort can drop cache effectiveness significantly; (3) your usage mix across features has shifted — if high-cost features grow faster than low-cost features, blended cost per request increases even if individual feature costs are stable. Fix this by adding a monthly reconciliation step that compares modeled vs. actual cost per feature, identifies the top three variance drivers, and updates model inputs accordingly.
How long should the initial token cost pass-through model take to build?
For a product with 3–5 AI features and one LLM provider, expect 2–4 hours for the initial build if you already have token usage data accessible. If you need to instrument logging to capture token counts, add 1–2 days for instrumentation and a week of data collection before you can build a reliable model. The scenario engine adds another 1–2 hours. Quarterly updates should take 30–60 minutes once the model structure is stable. The biggest time sink on the first build is usually getting accurate token usage data — most teams don't log input and output tokens separately at the feature level until they need this model.
How do I account for prompt engineering improvements that reduce token usage over time?
Add a 'prompt efficiency' variable to each feature in your model, starting at 1.0 (baseline). As your prompt engineering team optimizes prompts — reducing system prompt length, improving few-shot examples, or implementing chain-of-thought compression — update this multiplier downward (e.g., 0.85 after a 15% token reduction). Track this variable over time to quantify the ROI of prompt engineering investment. In your scenario engine, include an 'optimization roadmap' scenario that models planned efficiency improvements and their cumulative margin impact. This also helps justify prompt engineering headcount to finance — you can show exactly how much margin each optimization wave produces.