Methodology · gotnerfed

The score is the rigorous thing. The jury is the unusual thing. Skip straight to a worked example or read top-to-bottom.

01 · THE AI JURY

Three frontier models, grading us in public.

For every receipt at major severity or above, three independent LLMs from three different vendors read the change diff, the sources, and the impact context. Each issues a private verdict - harmful, neutral, beneficial, or unclear - plus a 0-100 score and a short rationale. Verdicts are sealed, published, and immutable.
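The verdict described above can be pictured as a small record. The sketch below is illustrative only: the field names, `needsJury`, and `sealVerdict` are assumptions, not taken from the gotnerfed codebase, and "sealed" is modeled here as a frozen object.

```typescript
type Severity = "critical" | "major" | "minor" | "info";
type Verdict = "harmful" | "neutral" | "beneficial" | "unclear";

// Hypothetical record shape; every field name here is an assumption.
interface JurorVerdict {
  juror: string;     // model identifier, e.g. "gpt-5-nano"
  vendor: string;    // vendor that ships the model
  verdict: Verdict;  // one of the four labels
  score: number;     // 0-100
  rationale: string; // short explanation
}

// The jury only sits for receipts at major severity or above.
function needsJury(severity: Severity): boolean {
  return severity === "critical" || severity === "major";
}

// Validate the score range, then return a frozen copy so later code
// cannot mutate the verdict in place.
function sealVerdict(v: JurorVerdict): Readonly<JurorVerdict> {
  if (!Number.isFinite(v.score) || v.score < 0 || v.score > 100) {
    throw new RangeError(`score must be within 0-100, got ${v.score}`);
  }
  return Object.freeze({ ...v });
}
```

Freezing only protects the in-memory object; real immutability would also depend on how verdicts are stored and published.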

JUROR · 01

claude-haiku-4-5

Anthropic

JUROR · 02

gpt-5-nano

OpenAI

JUROR · 03

gemini-2-5-flash

Google

WHAT THE JURY DOES · AND DOES NOT

● DOES

● Render a public verdict per receipt, signed by model
● Provide a short rationale you can argue with
● Get Brier-scored against subsequent reality, with accuracy published
● Disagree with each other; disagreement is the data

● DOES NOT

● Influence the deterministic Nerf Score in any way
● Decide which receipts get filed (humans + scanner do that)
● Get rerun if a model “changes its mind” later - verdicts are sealed
● Speak for gotnerfed - they speak for themselves
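For readers unfamiliar with Brier scoring: it is the mean squared gap between a forecast probability and what actually happened (0 is perfect, 1 is worst). The page does not say how a juror's 0-100 score becomes a probability, so the sketch below assumes the simplest mapping, score / 100 as "probability the change proved harmful". `ResolvedVerdict` and `brierScore` are hypothetical names.

```typescript
// Hypothetical shape: a juror's 0-100 score plus the later-observed outcome.
interface ResolvedVerdict {
  score: number;   // juror's 0-100 score
  harmed: boolean; // did the change prove harmful in hindsight?
}

// Brier score = mean of (forecast - outcome)^2 over all resolved verdicts.
// ASSUMPTION: forecast probability = score / 100.
function brierScore(resolved: ResolvedVerdict[]): number {
  if (resolved.length === 0) {
    throw new Error("need at least one resolved verdict");
  }
  let sum = 0;
  for (const r of resolved) {
    const forecast = r.score / 100;
    const outcome = r.harmed ? 1 : 0;
    sum += (forecast - outcome) ** 2;
  }
  return sum / resolved.length;
}
```

Under this mapping, a juror who always answers 50 scores 0.25 no matter what happens, which makes 0.25 a useful "no information" baseline.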

Why three models from three vendors? Because no single LLM can be trusted to grade plan changes from its own parent company without conflict. When Anthropic gets nerfed, claude-haiku-4-5 still votes - and we log how often it votes differently from the other two.
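One way to log "how often it votes differently from the other two" is a per-juror dissent rate. This is a sketch under assumptions: `Ballot` and `dissentRate` are hypothetical names, and a dissent is counted only when the juror's label differs from every other juror's label on that receipt.

```typescript
type Verdict = "harmful" | "neutral" | "beneficial" | "unclear";

// Hypothetical shape: one receipt's verdicts, keyed by juror id.
type Ballot = Partial<Record<string, Verdict>>;

// Fraction of receipts on which `juror` disagreed with all other jurors.
function dissentRate(ballots: Ballot[], juror: string): number {
  let counted = 0;
  let dissents = 0;
  for (const ballot of ballots) {
    const own = ballot[juror];
    if (own === undefined) continue; // juror did not vote on this receipt
    const others = Object.entries(ballot)
      .filter(([id]) => id !== juror)
      .map(([, verdict]) => verdict);
    if (others.length === 0) continue; // nothing to compare against
    counted++;
    if (others.every((verdict) => verdict !== own)) dissents++;
  }
  return counted === 0 ? 0 : dissents / counted;
}
```

Comparing a juror's dissent rate on receipts about its own vendor against its rate on everyone else's is the number that would expose a conflict of interest.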

02 · PER-EVENT NERF SCORE

Score = kind_weight × severity_multiplier

Each receipt is classified by kind (rate-limit cut, billing model shift, etc.) and severity (critical / major / minor / info). Multiply the two, round, and clamp to 0-100; a negative product from a positive change publishes as 0. Every coefficient is in lib/score.ts on GitHub.

KIND WEIGHTS · HIGHER = MORE USER HARM · NEGATIVE = POSITIVE CHANGE

CHANGE KIND	WEIGHT	WHY
tier-removed	+28	Bought-thing is gone (Claude Code Apr 2026)
billing-model-shift	+26	Surprise-bill territory (Cursor Jun 2025)
rate-limit-cut	+24	Same price, less throughput
free-tier-nerf	+22	Funnel collapse, trust hit
model-swap	+20	Silent quality regression
feature-gated	+18	Used to have it, now upsell
price-increase	+16	Direct cost increase
tos-change	+12	Slow-burn rights creep
feature-ungated	−6	Positive: gated thing now free
tier-added	−4	Positive (unless it cannibalizes existing tier)
rate-limit-raise	−8	Positive: more throughput
price-decrease	−10	Positive: direct cost reduction

03 · SEVERITY MULTIPLIERS

How bad is bad?

SEVERITY	MULTIPLIER	MEANING
critical	× 2.6	Materially reduces paid-user value.
major	× 1.8	Substantive user-facing change.
minor	× 1.0	Clarification or copy edit.
info	× 0.5	Additive without removing existing value.

Values multiply the kind weight before the 0–100 clamp.

04 · WORKED EXAMPLE · CLAUDE CODE · APR 2026

One real receipt, walked through the rubric.

On April 21, 2026, Anthropic removed Claude Code from the $20 Pro tier, requiring Max ($100+/mo) for continued use. The receipt is here. The Nerf Score came out at 73. Here’s the math.

STEP-BY-STEP · DETERMINISTIC · NO LLM IN THIS LOOP

STEP	INPUT	VALUE	NOTE
KIND	tier-removed	+28	bought-thing is gone
SEVERITY	critical	× 2.6	paid-user value lost
RAW	28 × 2.6	72.8	unclamped product
NERF SCORE	clamp · round	73 / 100	critical band

lib/score.ts · eventScore()

export function eventScore(e: Pick<ChangelogEntry, "kind" | "severity">): number {
  // 1. Look up the kind weight (signed integer, may be negative for positive changes)
  const base = KIND_WEIGHTS[e.kind] ?? 8;
  // 2. Look up the severity multiplier (critical = 2.6, info = 0.5, etc.)
  const mult = SEVERITY_MULTIPLIER[e.severity] ?? 1;
  // 3. Multiply, clamp to [-30, 100], max with 0 → only harm goes public
  const raw = base * mult;
  const score = Math.max(-30, Math.min(100, raw));
  return Math.max(0, Math.round(score));
}
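
The excerpt relies on two lookup tables and a `ChangelogEntry` type that are not shown. To run it locally, a self-contained sketch could look like the following. The table values are transcribed from sections 02 and 03 of this page; the minimal `ChangelogEntry` shape is an assumption, and the real lib/score.ts may differ in detail.

```typescript
// Kind weights transcribed from section 02 (negative = positive change).
const KIND_WEIGHTS: Record<string, number> = {
  "tier-removed": 28,
  "billing-model-shift": 26,
  "rate-limit-cut": 24,
  "free-tier-nerf": 22,
  "model-swap": 20,
  "feature-gated": 18,
  "price-increase": 16,
  "tos-change": 12,
  "feature-ungated": -6,
  "tier-added": -4,
  "rate-limit-raise": -8,
  "price-decrease": -10,
};

// Severity multipliers transcribed from section 03.
const SEVERITY_MULTIPLIER: Record<string, number> = {
  critical: 2.6,
  major: 1.8,
  minor: 1.0,
  info: 0.5,
};

// ASSUMPTION: minimal stand-in for the real ChangelogEntry type.
interface ChangelogEntry {
  kind: string;
  severity: string;
}

function eventScore(e: Pick<ChangelogEntry, "kind" | "severity">): number {
  const base = KIND_WEIGHTS[e.kind] ?? 8;            // unknown kinds fall back to 8
  const mult = SEVERITY_MULTIPLIER[e.severity] ?? 1; // unknown severities fall back to 1
  const score = Math.max(-30, Math.min(100, base * mult));
  return Math.max(0, Math.round(score));             // negatives publish as 0
}

console.log(eventScore({ kind: "tier-removed", severity: "critical" }));   // 73
console.log(eventScore({ kind: "rate-limit-cut", severity: "major" }));    // 43
console.log(eventScore({ kind: "price-decrease", severity: "critical" })); // 0
```

With the published coefficients the largest possible product is 28 × 2.6 = 72.8, so the upper clamp at 100 only matters if a heavier weight or multiplier is added later.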

The jury verdicts on this receipt - 88 / 92 / 89, all harmful, average 90 - are a separate signal entirely. They don’t enter the score. If you disagree with the kind classification, file a PR. If you disagree with the jury, you can wait - they get Brier-scored against subsequent reality.

05 · PER-VENDOR NERF INDEX

Recency-weighted aggregate, 0-100.

Aggregate of every receipt for a vendor, weighted by recency. Half-life is 180 days - a receipt that is 360 days old (roughly a year) still counts, but at a quarter weight. Mapped onto 0-100 via a logistic squash so a single critical nerf doesn’t max the index; repeated nerfs stack with diminishing returns.

HALF-LIFE

180 days

A 6-month-old receipt counts half as much as today’s.

QUARTER-WEIGHT

360 days

A year-old receipt still counts - at 25% of its original score.
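
The decay follows directly from the 180-day half-life: weight = 0.5^(age / 180). The squash is less pinned down, since the page names a logistic but not its parameters. The sketch below assumes a logistic rescaled so that an empty history maps to 0, with a hypothetical `SQUASH_SCALE` of 100. `DatedScore`, `recencyWeight`, and `vendorIndex` are illustrative names, not the site's.

```typescript
const HALF_LIFE_DAYS = 180; // from the page
const SQUASH_SCALE = 100;   // ASSUMPTION: the real squash parameters are not published here

// Hypothetical shape: a receipt's Nerf Score and its age in days.
interface DatedScore {
  score: number;
  ageDays: number;
}

// Exponential decay: 1 at day 0, 0.5 at 180 days, 0.25 at 360 days.
function recencyWeight(ageDays: number): number {
  return Math.pow(0.5, Math.max(0, ageDays) / HALF_LIFE_DAYS);
}

// Sum the decayed scores, then squash onto 0-100.
// 2 / (1 + e^(-x)) - 1 is a logistic shifted so x = 0 maps to 0 and
// large x approaches 1, which gives the diminishing-returns stacking.
function vendorIndex(receipts: DatedScore[]): number {
  const weighted = receipts.reduce(
    (sum, r) => sum + r.score * recencyWeight(r.ageDays),
    0,
  );
  const squashed = 2 / (1 + Math.exp(-weighted / SQUASH_SCALE)) - 1;
  return Math.max(0, Math.min(100, 100 * squashed));
}
```

Under these assumed parameters a second identical receipt raises the index by less than the first did, and no finite history reaches exactly 100.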

INDEX BANDS · WHAT THE NUMBER MEANS

RANGE	LABEL	INTERPRETATION
0–4	no receipts	Vendor has no logged plan changes - either too new to track or unusually stable
5–24	clean	A few minor edits, no material harm to users
25–49	on watch	Pattern of cuts emerging - worth watching, not yet escape-velocity
50–74	shrinking	Repeated material cuts - users feel the squeeze every renewal
75–100	predator	Hostile to paid users - migration recommended
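
The band table translates directly into a lookup. A minimal sketch, assuming fractional index values are rounded to the nearest integer before banding (the page lists integer ranges only); `bandFor` is a hypothetical name.

```typescript
type Band = "no receipts" | "clean" | "on watch" | "shrinking" | "predator";

// Map a 0-100 Nerf Index onto the labels from the band table.
// ASSUMPTION: non-integer inputs are rounded before lookup.
function bandFor(index: number): Band {
  if (!Number.isFinite(index) || index < 0 || index > 100) {
    throw new RangeError(`index must be within 0-100, got ${index}`);
  }
  const n = Math.round(index);
  if (n <= 4) return "no receipts";
  if (n <= 24) return "clean";
  if (n <= 49) return "on watch";
  if (n <= 74) return "shrinking";
  return "predator";
}
```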

06 · PREDICTED IMPACT

How we estimate user impact and financial damage.

Every receipt with a delta gets two predicted-impact numbers: an estimate of the percentage of paid users affected, and a 12-month financial-damage range per affected user. Both are heuristic - derived from kind + severity + the breadth of the affected tier - and explicitly labeled as estimates, not measurements.

Heuristic confidence is graded high / medium / low based on how much the receipt’s kind constrains the impact range. A tier removal at critical severity has a well-bounded financial impact (the cost difference between tiers); a model swap has a fuzzier one (depends on how much you used the swapped model). Confidence pills on each receipt make this explicit.
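
For the well-bounded case, the arithmetic is short enough to show. The sketch below covers only the tier-removal example, using the prices quoted in the worked example ($20 Pro, $100+ Max). It is not the site's heuristic, which is not shown on this page, and `twelveMonthTierDelta` is a hypothetical name.

```typescript
// 12-month extra cost for a user forced from one monthly tier to another.
// Returns 0 if the new tier is not more expensive.
function twelveMonthTierDelta(oldMonthlyUsd: number, newMonthlyUsd: number): number {
  return Math.max(0, newMonthlyUsd - oldMonthlyUsd) * 12;
}

// Pro at $20/mo to Max at $100/mo: (100 - 20) * 12
console.log(twelveMonthTierDelta(20, 100)); // 960
```

Because Max is quoted as "$100+/mo", 960 is the low end of the range for this receipt; the upper end depends on which Max price point a user lands on.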
