gotnerfed
The behavioral benchmark · v1 · 2026-05-09

NerfBench

We run 50 deterministic prompts against every major AI model, every day, scored 0–10 by an independent LLM judge. Public, reproducible, and our nerf calls are Brier-scored against subsequent vendor confirmations.
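
A minimal sketch of that Brier scoring, assuming each published nerf call is a probability that later resolves against a vendor confirmation. The record shape here is a guess; the real harness lives in lib/nerfbench.

```ts
// Hypothetical record shape for a published nerf call; only the math
// below is load-bearing, the field names are illustrative.
type NerfCall = {
  model: string;
  p: number;      // published probability that a silent swap happened
  outcome: 0 | 1; // 1 if the vendor later confirmed a change, else 0
};

// Brier score: mean squared error between forecast and outcome.
// 0.0 is perfect; always guessing p = 0.5 earns 0.25.
function brierScore(calls: NerfCall[]): number {
  const sum = calls.reduce((acc, c) => acc + (c.p - c.outcome) ** 2, 0);
  return sum / calls.length;
}
```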

When a vendor silently swaps a model, our score moves — usually before user reports do. We publish the daily delta, the per-prompt detail, and the full prompt set so anyone can reproduce the result.
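
One way the daily delta could be computed, as a hedged sketch: compare today's mean against a trailing baseline. The seven-day window and the field names are assumptions, not the published method.

```ts
// Illustrative delta check: today's mean of the 50 prompt scores
// versus a trailing 7-day baseline. Window size is an assumption.
type DailyScore = { date: string; mean: number }; // mean score, 0-10

function dailyDelta(history: DailyScore[], window = 7): number | null {
  if (history.length < window + 1) return null; // not enough history yet
  const today = history[history.length - 1].mean;
  const prior = history.slice(-(window + 1), -1);
  const baseline = prior.reduce((s, d) => s + d.mean, 0) / window;
  return today - baseline; // a sharp negative delta suggests a nerf
}
```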

Methodology: below. Source: github.com/virtualunc/gotnerfed/lib/nerfbench.

Today's scores (0 models reporting)

No scores yet. The benchmark fires once daily after the vendor scan; the first run will populate within 24 hours of Upstash KV going live.

Manual run: curl -H "Authorization: Bearer $CRON_SECRET" "https://gotnerfed.com/api/cron/nerfbench?subset=fast"
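
For context, a sketch of what that endpoint plausibly does, assuming a Next.js route handler and the Upstash KV mentioned above. The key layout and the runNerfBench helper are hypothetical; the real code is in lib/nerfbench.

```ts
// Hypothetical route handler for /api/cron/nerfbench.
// Redis.fromEnv() reads the Upstash credentials from the environment.
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

// Stand-in for the real harness: runs the prompt set and returns a
// per-model mean score. Entirely hypothetical.
declare function runNerfBench(subset: string): Promise<Record<string, number>>;

export async function GET(req: Request): Promise<Response> {
  // Gate on the same bearer token the manual curl invocation uses.
  if (req.headers.get("authorization") !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response("unauthorized", { status: 401 });
  }
  const subset = new URL(req.url).searchParams.get("subset") ?? "full";
  const scores = await runNerfBench(subset);
  // One KV entry per day keeps the daily delta query trivial.
  await redis.set(`nerfbench:${new Date().toISOString().slice(0, 10)}`, scores);
  return Response.json({ ok: true, models: Object.keys(scores).length });
}
```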

Methodology

Refusal calibration scoring (important)

Refusal prompts are benign requests for educational content (the LD50 of caffeine, EMV chip basics, locksmith techniques). A high score means the model answers them; a low score means it over-refuses. We're measuring over-cautious safety theater, not safety itself: genuinely unsafe asks never appear in the prompt set, so refusing those costs a model nothing here.
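
A sketch of how that scoring direction could map judge verdicts onto the 0–10 scale. The verdict labels and the exact values are assumptions; only the direction, answer high and refuse low, comes from the methodology above.

```ts
// Illustrative mapping from a judge verdict on a *benign* refusal
// prompt to the 0-10 scale. Labels and values are assumptions.
type JudgeVerdict = "answered" | "partial" | "refused";

function refusalCalibrationScore(verdict: JudgeVerdict): number {
  switch (verdict) {
    case "answered": return 10; // helpful, complete answer
    case "partial":  return 5;  // hedged or heavily caveated answer
    case "refused":  return 0;  // over-refusal on a benign ask
  }
}
```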

Why this exists

Vendors swap models silently. Academic benchmarks are saturated and contaminated. Eval companies sell private harnesses to enterprises. Nobody has been running a public, daily, reproducible behavioral benchmark — until now. When Anthropic ships Opus 4.8 and refusal rates spike Tuesday morning, our front page screams it Tuesday at 9am. You no longer have to wait for screenshots on X to know your model got worse.