The behavioral benchmark · v1 · 2026-05-09

NerfBench

We run the same 50 prompts against four cheap-tier models from major AI vendors every day. An independent LLM judge scores each response 0–10. Results are public, reproducible, and Brier-scored against subsequent vendor confirmations.

When a vendor silently swaps a model, our score moves — usually before user reports do. We publish the daily delta, the per-prompt detail, and the full prompt set so anyone can reproduce the result.
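
For readers unfamiliar with Brier scoring, the arithmetic can be sketched in TypeScript as below. This is only an illustration: the `Forecast` shape, and the idea that each day's result is expressed as a probability that a vendor changed the model, are assumptions. This page does not specify how NerfBench derives its probabilities.

```typescript
// Hypothetical shape: one forecast per model per day (not from the NerfBench source).
interface Forecast {
  probability: number; // predicted probability (0..1) that the vendor changed the model
  confirmed: boolean;  // whether the vendor later confirmed a change
}

// Brier score: mean squared difference between the forecast probability
// and the outcome (1 if confirmed, 0 if not). Lower is better; 0 is perfect.
function brierScore(forecasts: Forecast[]): number {
  if (forecasts.length === 0) {
    throw new Error("need at least one forecast");
  }
  const total = forecasts.reduce((sum, f) => {
    const outcome = f.confirmed ? 1 : 0;
    return sum + (f.probability - outcome) ** 2;
  }, 0);
  return total / forecasts.length;
}

// Made-up example data, for illustration only.
const exampleForecasts: Forecast[] = [
  { probability: 0.9, confirmed: true },  // confident and right: (0.9 - 1)^2 = 0.01
  { probability: 0.2, confirmed: false }, // mostly right:        (0.2 - 0)^2 = 0.04
  { probability: 0.7, confirmed: false }, // confident and wrong: (0.7 - 0)^2 = 0.49
];

console.log(brierScore(exampleForecasts).toFixed(3)); // prints "0.180"
```

A forecaster that always answers 0.5 earns 0.25, so a score well below that suggests the benchmark's signals carry real information.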

Methodology: below. Source: github.com/virtualunc/gotnerfed/lib/nerfbench.

Today's scores (0 models reporting)

No scores yet. The benchmark runs once daily after the vendor scan; the first results will appear within 24 hours of Upstash KV going live.

Manual run: curl -H "Authorization: Bearer $CRON_SECRET" "https://gotnerfed.com/api/cron/nerfbench?subset=fast"
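
The manual run above can also be triggered from a script. The sketch below builds the same URL and header as the curl command. The helper name `buildManualRun` is hypothetical, and `fast` is the only `subset` value this page confirms.

```typescript
// Endpoint taken from the curl command above.
const NERFBENCH_CRON_URL = "https://gotnerfed.com/api/cron/nerfbench";

interface ManualRun {
  url: string;
  init: { headers: Record<string, string> };
}

// Hypothetical helper: assembles the request without sending it.
function buildManualRun(cronSecret: string, subset: string = "fast"): ManualRun {
  if (cronSecret.length === 0) {
    throw new Error("CRON_SECRET is empty");
  }
  return {
    url: `${NERFBENCH_CRON_URL}?subset=${encodeURIComponent(subset)}`,
    init: { headers: { Authorization: `Bearer ${cronSecret}` } },
  };
}

// Usage (Node 18+ or any runtime with a global fetch):
//   const run = buildManualRun(process.env.CRON_SECRET ?? "");
//   const res = await fetch(run.url, run.init);
//   console.log(res.status);
```

Keeping request construction separate from the network call makes it easy to check the URL and header without hitting the live endpoint.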

Methodology

50 prompts across 4 categories: 12 coding, 13 context recall, 13 instruction-following, 12 refusal calibration.
Models: Claude Haiku 4.5, GPT-5 Nano, Gemini 2.5 Flash, DeepSeek V3 (via OpenRouter). Cheap-tier on purpose: these are the tiers most teams actually pay for.
Judge: Claude Haiku 4.5 with a strict 0–10 rubric. Same prompt + rubric every day. Same temperature (0.1).
Cadence: Once daily, kicked off after the vendor pricing scan. The run is fire-and-forget: the cron handler triggers it without waiting for it to finish, so the cron request itself stays under 60s.
Cost: ~$1.10/day total at v1 scope (~$33/month). Fits in our infra budget without raising prices.
Persistence: Daily aggregate + full per-prompt detail in Upstash KV. 90-day retention.
Reproducibility: Full prompt set + judge rubric in lib/nerfbench/prompts.ts. Run it yourself.
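
As a rough illustration of how per-prompt judge scores could roll up into the daily aggregate and the daily delta, here is a minimal sketch. All type names, field names, and category identifiers are assumptions made for this example; the actual shapes live in lib/nerfbench and may differ.

```typescript
// Hypothetical identifiers for the four categories listed above.
type Category =
  | "coding"
  | "context-recall"
  | "instruction-following"
  | "refusal-calibration";

// Hypothetical per-prompt record: one judged response.
interface PromptResult {
  promptId: string;
  category: Category;
  score: number; // judge score on the 0..10 rubric
}

interface DailyAggregate {
  overall: number; // mean score across all prompts
  byCategory: Partial<Record<Category, number>>; // mean score per category
}

function mean(values: number[]): number {
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Groups results by category and averages each group.
function aggregate(results: PromptResult[]): DailyAggregate {
  if (results.length === 0) {
    throw new Error("no results to aggregate");
  }
  const buckets = new Map<Category, number[]>();
  for (const r of results) {
    const bucket = buckets.get(r.category) ?? [];
    bucket.push(r.score);
    buckets.set(r.category, bucket);
  }
  const byCategory: Partial<Record<Category, number>> = {};
  buckets.forEach((scores, category) => {
    byCategory[category] = mean(scores);
  });
  return { overall: mean(results.map((r) => r.score)), byCategory };
}

// Daily delta: today's overall mean minus yesterday's. Negative means the model got worse.
function dailyDelta(today: DailyAggregate, yesterday: DailyAggregate): number {
  return today.overall - yesterday.overall;
}

// Made-up scores, for illustration only.
const yesterdayAgg = aggregate([
  { promptId: "c1", category: "coding", score: 8 },
  { promptId: "c2", category: "coding", score: 6 },
  { promptId: "r1", category: "refusal-calibration", score: 10 },
]);
const todayAgg = aggregate([
  { promptId: "c1", category: "coding", score: 5 },
  { promptId: "c2", category: "coding", score: 6 },
  { promptId: "r1", category: "refusal-calibration", score: 10 },
]);

console.log(dailyDelta(todayAgg, yesterdayAgg).toFixed(2)); // prints "-1.00"
```

Keeping the per-category means next to the overall mean makes it possible to tell a broad regression from one confined to a single category, such as a jump in over-refusals.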

Refusal calibration scoring (important)

Refusal prompts are benign requests for educational content (LD50 of caffeine, EMV chip basics, locksmith techniques). A high score means the model answers them; a low score means it over-refuses. We're measuring over-cautious safety theater, not safety. The prompt set contains no genuinely unsafe asks, so a model that correctly refuses those is never penalized here.

Why this exists

Vendors swap models silently. Academic benchmarks are saturated and contaminated. Eval companies sell private harnesses to enterprises. Nobody has been running a public, daily, reproducible behavioral benchmark — until now. When Anthropic ships Opus 4.8 and refusal rates spike Tuesday morning, our front page screams it Tuesday at 9am. You no longer have to wait for screenshots on X to know your model got worse.