Blog/analysis

The inference economics that guarantee your AI tool will eventually nerf you

2026-05-17·13 min read·by gotnerfed

# The inference economics that guarantee your AI tool will eventually nerf you

Every receipt in the gotnerfed index ultimately traces back to the same question, which is what an inference token costs to generate versus what the user is paying for it. The framing of any individual nerf event tends to focus on what the vendor did and how they did it. The deeper question is why the vendor had to do anything at all.

The answer is that consumer-priced AI subscriptions, the $10 to $20 to $30 monthly plans that dominate the prosumer market, sit on top of a cost structure that doesn’t support them for heavy users. The vendors who avoid nerfs for the longest are the ones who manage to grow their base of light users fast enough to subsidize the heavy ones. The vendors who hit the cliff are the ones whose user base shifts toward heavy usage faster than their inference cost per token drops.

This is a structural problem, not a vendor-specific one. Once you understand the math, the receipts stop looking like betrayals and start looking like the inevitable consequence of a business model that was never going to hold.

What a single GPU hour costs in 2026

Pricing for the latest generation of NVIDIA datacenter GPUs is the cleanest place to start. An H100 on the spot market in May 2026 runs around $2.50 per hour in the US, depending on the provider. Reserved capacity comes in at about $1.80. A B200, the newer generation, runs roughly $4 per hour on spot and $3 on reserved.

Hyperscale cloud providers (AWS, GCP, Azure, OCI) have their own internal cost, which is somewhere between half and two-thirds of what they charge externally. So the all-in cost to a vendor running on their own reserved capacity at scale is roughly $1.50 to $2.50 per H100-hour and $2.50 to $3.50 per B200-hour, depending on power, cooling, network, and capex spread across multiple years.

Now translate that into tokens. A single H100 running a 70B parameter model at int8 quantization can output something in the range of 800 to 1,500 tokens per second, depending on the model architecture and the batch size. Call it 1,000 tokens per second for round numbers. That’s 3.6 million tokens per H100-hour. At $2 per H100-hour, the marginal cost is about $0.55 per million output tokens, in the abstract case where the GPU is 100% utilized and only producing output.

Real-world utilization is lower because of context loading, attention overhead, KV cache pressure, and idle time between requests. A more realistic cost for a vendor running a 70B-class model at production scale is probably $1.50 to $3 per million output tokens. For larger models like the 200B-plus frontier class, the cost rises roughly with parameter count and context length. For Claude Sonnet 4 or GPT-5 class models, the all-in cost per million output tokens at the vendor’s own infrastructure is likely in the $3 to $6 range.

Anthropic charges $15 per million output tokens for Sonnet 4 on the API. That gives them a roughly 3x to 5x markup over their inference cost. Sounds healthy. It looks less healthy when you factor in research, training, model improvement, infrastructure overhead, and the fact that the cost benchmark above is the marginal cost, not the fully loaded cost.

How those numbers translate to a $20 subscription

Take an active Claude Pro user who runs 4 to 6 serious work sessions per day through Claude Code, with each session burning between 200,000 and 800,000 tokens of context and output across the agentic loop. Pick a midpoint of 500,000 tokens per session and 5 sessions per day. That’s 2.5 million tokens per day, or roughly 75 million tokens per month.

At the API rate of $15 per million output tokens (and ignoring input cost for simplicity, which is conservative), 75 million tokens is about $1,125 of API consumption per month. The user is paying $20 for that. The vendor is losing the gap.

Now adjust for the fact that the vendor’s cost is lower than the API rate. At a 3x markup over marginal cost, the inference cost to Anthropic is about $375 for that user’s monthly consumption. Anthropic is still losing $355 a month on this single user, before counting any of their fixed costs.

This is not a niche case. Power users running Claude Code in this volume aren’t exotic; they’re common in the developer community, and they’re the most vocal segment of the customer base. The math holds for OpenAI on their $20 Plus tier when a user is doing serious work in ChatGPT. It holds for Cursor on $20 Pro for an active developer. It held for GitHub Copilot on the $19 individual plan, which is part of why the June 1 move to usage-based billing happened.

The cost gap is the structural force behind every receipt in the index. The vendor is losing money on the top 5% to 10% of subscribers, and the question is just how long they can keep losing it.

Why “compute is getting cheaper” doesn’t save the math

The standard counterargument is that GPU compute is getting cheaper at roughly 30% per year, and frontier model inference costs are dropping at a faster rate as architectures improve. So the math will get easier over time, and current losses will turn into future profits.

This argument has been correct in narrow time windows and wrong in the aggregate. The reason is that as inference cost drops, agentic behaviors expand to consume the savings. A 2024-era Copilot session might have used 50,000 tokens. A 2025-era Cursor session used 250,000. A 2026-era Claude Code session uses 800,000 to 2 million. The cost per token has dropped by roughly 4x over that period. The token consumption per active session has increased by 16x to 40x.

This is a Jevons paradox in plain form. When inference gets cheaper, products use more inference. The vendor’s per-user cost doesn’t fall; it stays flat or rises, while their willingness to charge the user more doesn’t keep up.

Open-weight models complicate the picture but don’t reverse it. Running Llama 4 or Qwen-3 Coder on your own hardware is much cheaper per token than the frontier API rates, but the quality gap on hard agentic tasks remains meaningful, and the vendor offering the frontier model has to keep up with the next frontier model. Their CAPEX cycle doesn’t slow down because some users can route to cheaper alternatives. If anything, it accelerates, because the only defense against open-weight competition is being measurably better.

The training cost is the wrong number

Most public discussion of AI economics fixates on training cost. Frontier model training runs are in the hundreds of millions of dollars, sometimes the low billions. Those numbers make headlines.

Training is mostly a sunk cost. Once the model is trained, the marginal cost of one more inference is the GPU time it takes to run, not a fraction of the training run. The big cost driver for an AI vendor in steady state isn’t the next training run. It’s the next million inference tokens.

This is why the gotnerfed receipts cluster around inference-related changes (rate limits, model swaps, quota cuts, usage-based billing) rather than around any training-related events. Training cost gets paid once and spread across the model’s lifetime. Inference cost gets paid every time a user uses the product. The ongoing exposure is on the inference side, and that’s where the cost pressure shows up as nerfs.

Anthropic, OpenAI, Google, and Meta can all afford their next training run. None of them can afford to keep eating losses on their top decile of subscribers indefinitely. The receipts are downstream of that asymmetry.

What this means for the long-term shape of the market

If you accept the math, a few things follow that are worth thinking through.

Flat-rate consumer AI subscriptions are not a stable equilibrium. They’re a customer acquisition strategy. The endgame is some combination of usage-based billing for power users and lower-priced flat-rate tiers for light users, with frontier capability gated behind premium tiers that price closer to the underlying inference cost. The Anthropic Pro Max move, the GitHub Copilot usage-based shift, the Cursor Ultra tier launch, and the OpenAI Codex credit system are all steps in this direction.

The “race to the bottom” framing of AI pricing is mostly wrong. Vendors compete on capability and on developer experience, but the price floor is set by inference cost, and inference cost isn’t dropping fast enough to support an indefinite price war. The receipts since early 2025 show a market that’s moving upward in price for active users, not downward, even as headline prices stay nominally similar.

Light users will continue to get cheap access to AI. Heavy users will pay closer to what their usage costs the vendor, either through usage-based billing or through tier upgrades. The middle disappears. This is the standard shape of mature SaaS pricing in any compute-heavy category, and AI is just catching up to that pattern faster than most.

Vendor lock-in becomes more financially meaningful over time. Once a heavy user’s workflow is wired into a specific vendor’s tooling, switching costs are non-trivial, and the vendor’s pricing power increases. This is why the receipts will keep coming. The user base for any given AI tool stratifies over time into a small group of high-cost power users who can’t easily leave, and the vendor monetizes that group via the pricing moves the index keeps documenting.

What to do with this if you’re a paying user

A few practical implications fall out of the math.

If you’re a light user, you’re getting a great deal. The economics work in your favor. The receipts mostly don’t affect you, because the changes are typically calibrated to extract more from heavy users without disturbing the bottom of the curve. The defensive move is to stay light, which most light users do by accident anyway.

If you’re a heavy user paying a flat subscription, you should assume your current pricing is temporary. The question is whether you have 6 months or 18 months before the structural move arrives. The signals in the receipts give you a rough window: when the vendor adds a higher tier, when the marketing softens on usage claims, when the model routing changes silently, when the API price moves. Track those signals, run the math on your own usage at API rates, and have a migration plan ready before you need one.

If you’re a heavy user already on usage-based billing, the math is more honest, but you’re now exposed to per-unit price increases that the vendor can implement with 30 days notice. Your hedge here is to maintain familiarity with at least one cheaper alternative, even if you don’t actively use it, so the cost of switching is operational rather than skill-based.

A fifth point that’s worth its own line is that the math gets uglier for the vendor every time the frontier moves. When a new model drops with bigger context windows, more capable tool use, longer agentic horizons, the cost per useful task goes up even if the cost per token goes down. This is what makes any vendor’s “we just shipped a major capability upgrade” announcement a leading indicator for a pricing change. The capability upgrade increased the vendor’s average cost per user. The pricing change is the lagging response. The gotnerfed receipts that follow major capability launches by 6 to 14 weeks are documenting this exact lag, repeatedly, across multiple vendors.

There’s a fourth implication that doesn’t sort cleanly into the three above, which is about how you think about your AI tooling more generally. Treat AI vendor relationships the way you’d treat a hosting provider relationship from 2010, which is to say with the assumption that the vendor will eventually make a decision that breaks your workflow, and that the cost of switching is something you should be paying down a little bit at a time rather than letting accumulate. The receipts are public. The math is public. The only variable left is whether your own setup is positioned for the moves that the math makes inevitable.

The index will keep documenting the events. The events will keep happening for the same reason they’ve been happening. And the developers who treat each receipt as a one-off betrayal will keep being surprised, while the ones who treat them as data points in a structural cycle will mostly be fine.