How we score, step by step.
Every receipt gets a deterministic Nerf Score (0-100) computed from a public rubric - no LLM in the scoring loop, ever. Receipts at major severity or above also get a second, independent layer: a separate multi-LLM jury that issues a public verdict and gets Brier-scored against subsequent reality.
The score is the rigorous thing. The jury is the unusual thing. Skip straight to a worked example or read top-to-bottom.
Three frontier models, grading us in public.
For every receipt at major severity or above, three independent LLMs from three different vendors read the change diff, the sources, and the impact context. Each issues a private verdict - harmful, neutral, beneficial, or unclear - plus a 0-100 score and a short rationale. Verdicts are sealed, published, and immutable.
What the jury does:
- Render a public verdict per receipt, signed by model
- Provide a short rationale you can argue with
- Get Brier-scored against subsequent reality - accuracy public
- Disagree with each other; disagreement is the data

What the jury does not do:
- Influence the deterministic Nerf Score in any way
- Decide which receipts get filed (humans + scanner do that)
- Get rerun if a model “changes its mind” later - verdicts are sealed
- Speak for gotnerfed - they speak for themselves
Why three models from three vendors? Because no single LLM can be trusted to grade plan changes from its own parent company without conflict. When Anthropic gets nerfed, claude-haiku-4-5 still votes - and we log how often it votes differently from the other two.
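The page does not spell out how a verdict becomes a Brier score, so the following is only a sketch under stated assumptions: each juror's 0-100 score is read as the probability (score / 100) that the change turns out harmful, and "subsequent reality" is reduced to a yes/no outcome. The names `JurorVerdict`, `brierScore`, and `dissentRate` are hypothetical and not taken from the gotnerfed codebase.

```typescript
// Hypothetical shapes for illustration; not from the gotnerfed codebase.
type Label = "harmful" | "neutral" | "beneficial" | "unclear";

interface JurorVerdict {
  model: string; // e.g. "claude-haiku-4-5"
  label: Label;
  score: number; // 0-100, as issued by the juror
}

// Brier score for one juror: mean squared error between the forecast
// probability and the 0/1 outcome. Lower is better; 0 is a perfect record.
// ASSUMPTION: score / 100 is treated as P(change turns out harmful).
function brierScore(forecasts: { score: number; harmed: boolean }[]): number {
  if (forecasts.length === 0) throw new Error("no forecasts to score");
  let total = 0;
  for (const f of forecasts) {
    const p = f.score / 100;
    const outcome = f.harmed ? 1 : 0;
    total += (p - outcome) * (p - outcome);
  }
  return total / forecasts.length;
}

// Share of panels in which `model` picked a label that none of the
// other jurors on the same panel picked.
function dissentRate(panels: JurorVerdict[][], model: string): number {
  let seated = 0;
  let dissents = 0;
  for (const panel of panels) {
    const mine = panel.find((v) => v.model === model);
    if (!mine) continue;
    seated += 1;
    const others = panel.filter((v) => v.model !== model);
    if (others.length > 0 && others.every((v) => v.label !== mine.label)) {
      dissents += 1;
    }
  }
  return seated === 0 ? 0 : dissents / seated;
}
```

Under these assumptions, a juror that scored a change at 88 and was later proven right contributes (0.88 - 1)² = 0.0144 to its Brier score, while the same score on a change that turned out harmless would contribute 0.7744.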
Score = kind_weight × severity_multiplier
Each receipt is classified by kind (rate-limit cut, billing model shift, etc.) and severity (critical / major / minor / info). Multiply the two, clamp to 0-100. Every coefficient is in lib/score.ts on GitHub.
| CHANGE KIND | WEIGHT | WHY |
|---|---|---|
| tier-removed | +28 | Bought-thing is gone (Claude Code Apr 2026) |
| billing-model-shift | +26 | Surprise-bill territory (Cursor Jun 2025) |
| rate-limit-cut | +24 | Same price, less throughput |
| free-tier-nerf | +22 | Funnel collapse, trust hit |
| model-swap | +20 | Silent quality regression |
| feature-gated | +18 | Used to have it, now upsell |
| price-increase | +16 | Direct cost increase |
| tos-change | +12 | Slow-burn rights creep |
| feature-ungated | −6 | Positive: gated thing now free |
| tier-added | −4 | Positive (unless it cannibalizes existing tier) |
| rate-limit-raise | −8 | Positive: more throughput |
| price-decrease | −10 | Positive: direct cost reduction |
How bad is bad?
Each severity level (critical / major / minor / info) carries a multiplier. Bars show relative harm; values multiply kind weight before the 0–100 clamp.
One real receipt, walked through the rubric.
On April 21, 2026, Anthropic removed Claude Code from the $20 Pro tier, requiring Max ($100+/mo) for continued use. The receipt is here. The Nerf Score came out at 73. Here’s the math.
```ts
export function eventScore(e: Pick<ChangelogEntry, "kind" | "severity">): number {
  // 1. Look up the kind weight (signed integer, may be negative for positive changes)
  const base = KIND_WEIGHTS[e.kind] ?? 8;

  // 2. Look up the severity multiplier (critical = 2.6, info = 0.5, etc.)
  const mult = SEVERITY_MULTIPLIER[e.severity] ?? 1;

  // 3. Multiply, clamp to [-30, 100], max with 0 → only harm goes public
  const raw = base * mult;
  const score = Math.max(-30, Math.min(100, raw));
  return Math.max(0, Math.round(score));
}
```
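To make the arithmetic concrete, here is a self-contained sketch that mirrors the function above. The kind weights are copied from the table; only the two severity multipliers quoted in the code comment (critical = 2.6, info = 0.5) are filled in, and any other severity falls back to 1 exactly as the `??` default does. The name `scoreSketch` is hypothetical, and treating the Claude Code receipt as critical severity is an assumption, though it is the one that reproduces the published 73.

```typescript
// Kind weights copied from the rubric table.
const KIND_WEIGHTS: Readonly<Record<string, number>> = {
  "tier-removed": 28,
  "billing-model-shift": 26,
  "rate-limit-cut": 24,
  "free-tier-nerf": 22,
  "model-swap": 20,
  "feature-gated": 18,
  "price-increase": 16,
  "tos-change": 12,
  "feature-ungated": -6,
  "tier-added": -4,
  "rate-limit-raise": -8,
  "price-decrease": -10,
};

// Only the multipliers quoted in the source comment are known here.
// Major and minor are deliberately left out rather than guessed.
const SEVERITY_MULTIPLIER: Readonly<Record<string, number>> = {
  critical: 2.6,
  info: 0.5,
};

// Hypothetical stand-in for eventScore, taking plain strings.
function scoreSketch(kind: string, severity: string): number {
  const base = KIND_WEIGHTS[kind] ?? 8; // unknown kinds default to 8
  const mult = SEVERITY_MULTIPLIER[severity] ?? 1; // unknown severities default to 1
  const clamped = Math.max(-30, Math.min(100, base * mult));
  return Math.max(0, Math.round(clamped)); // negative (positive-change) scores floor at 0
}

// tier-removed at critical: 28 × 2.6 = 72.8, rounds to 73
console.log(scoreSketch("tier-removed", "critical")); // 73
// price-decrease at info: −10 × 0.5 = −5, floored to 0
console.log(scoreSketch("price-decrease", "info")); // 0
```

Because the final step takes the maximum with 0, a positive change never produces a negative public score; it simply scores 0.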
The jury verdicts on this receipt - 88 / 92 / 89, all harmful, average 90 - are a separate signal entirely. They don’t enter the score. If you disagree with the kind classification, file a PR. If you disagree with the jury, you can wait - they get Brier-scored against subsequent reality.
Recency-weighted aggregate, 0-100.
Aggregate of every receipt for a vendor, weighted by recency. Half-life is 180 days - a year-old nerf still counts, but at roughly a quarter weight (about two half-lives). Mapped onto 0-100 via a logistic squash so a single critical nerf doesn’t max the index; repeated nerfs stack with diminishing returns.
| RANGE | LABEL | INTERPRETATION |
|---|---|---|
| 0–4 | no receipts | Vendor has no logged plan changes - either too new to track or unusually stable |
| 5–24 | clean | A few minor edits, no material harm to users |
| 25–49 | on watch | Pattern of cuts emerging - worth watching, not yet escape-velocity |
| 50–74 | shrinking | Repeated material cuts - users feel the squeeze every renewal |
| 75–100 | predator | Hostile to paid users - migration recommended |
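The aggregation described above can be sketched as follows. The 180-day half-life and the label ranges come from the text and table; everything about the logistic squash (its exact form, `midpoint`, and `scale`) is a placeholder assumption, since the page does not publish those parameters, and all function names are hypothetical.

```typescript
const HALF_LIFE_DAYS = 180; // from the text

// Exponential decay: weight halves every HALF_LIFE_DAYS.
function recencyWeight(ageDays: number): number {
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

// Sum of receipt scores, each discounted by its age.
function decayedTotal(receipts: { score: number; ageDays: number }[]): number {
  return receipts.reduce((sum, r) => sum + r.score * recencyWeight(r.ageDays), 0);
}

// ASSUMPTION: a standard logistic curve scaled to 0-100. The real curve and
// its parameters are not documented on this page; midpoint and scale must be
// supplied by the caller and are purely illustrative.
function logisticSquash(total: number, midpoint: number, scale: number): number {
  return 100 / (1 + Math.exp(-(total - midpoint) / scale));
}

// Label ranges copied from the table above.
function indexLabel(index: number): string {
  if (index < 5) return "no receipts";
  if (index < 25) return "clean";
  if (index < 50) return "on watch";
  if (index < 75) return "shrinking";
  return "predator";
}

console.log(recencyWeight(180)); // 0.5
console.log(recencyWeight(360)); // 0.25
console.log(decayedTotal([{ score: 73, ageDays: 0 }, { score: 40, ageDays: 180 }])); // 93
```

Whatever the real parameters are, a logistic curve flattens as the decayed total grows, which is what produces the diminishing returns for repeated nerfs.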
How we estimate user impact and financial damage.
Every receipt with a delta gets two predicted-impact numbers: an estimate of the percentage of paid users affected, and a 12-month financial-damage range per affected user. Both are heuristic - derived from kind + severity + the breadth of the affected tier - and explicitly labeled as estimates, not measurements.
Heuristic confidence is graded high / medium / low based on how much the receipt’s kind constrains the impact range. A tier removal at critical severity has a well-bounded financial impact (the cost difference between tiers); a model swap has a fuzzier one (depends on how much you used the swapped model). Confidence pills on each receipt make this explicit.
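As an illustration of why a tier removal is well bounded, here is a minimal sketch of the two predicted-impact numbers and the tier-difference arithmetic. The `PredictedImpact` shape, its field names, and `annualTierDelta` are hypothetical; the page does not publish the actual heuristic, so this shows only the cost-difference bound it describes.

```typescript
type Confidence = "high" | "medium" | "low";

// Hypothetical shape for the two estimates attached to a receipt.
interface PredictedImpact {
  pctPaidUsersAffected: number; // heuristic estimate, 0-100
  damage12mo: { lowUsd: number; highUsd: number }; // per affected user
  confidence: Confidence;
}

// For a tier removal, the 12-month cost of moving from the old tier
// to the replacement tier is just the monthly difference times 12.
function annualTierDelta(oldMonthlyUsd: number, newMonthlyUsd: number): number {
  return (newMonthlyUsd - oldMonthlyUsd) * 12;
}

// Claude Code example: $20 Pro to $100 Max.
console.log(annualTierDelta(20, 100)); // 960
```

Since Max is listed as $100+/mo, the $960 figure is a floor for a user who upgrades to keep the feature, not a full damage range.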