NerfBench
We run 50 deterministic prompts against every major AI model every day. Scored 0–10 by an independent LLM judge. Public, reproducible, and Brier-scored against subsequent vendor confirmations.
When a vendor silently swaps a model, our score moves — usually before user reports do. We publish the daily delta, the per-prompt detail, and the full prompt set so anyone can reproduce the result.
Methodology: below. Source: github.com/virtualunc/gotnerfed/lib/nerfbench.
Today's scores (0 models reporting)
No scores yet. The benchmark fires once daily after the vendor scan; first run will populate within 24 hours of Upstash KV being live.
Manual run: curl -H "Authorization: Bearer $CRON_SECRET" https://gotnerfed.com/api/cron/nerfbench?subset=fast
Methodology
- 50 prompts across 4 categories: 12 coding, 13 context recall, 13 instruction-following, 12 refusal calibration.
- Models: Claude Haiku 4.5, GPT-5 Nano, Gemini 2.5 Flash, DeepSeek V3 (via OpenRouter). Cheap-tier on purpose — these are the prices most teams actually pay.
- Judge: Claude Haiku 4.5 with a strict 0–10 rubric. Same prompt + rubric every day. Same temperature (0.1).
- Cadence: Once daily, kicked off after the vendor pricing scan. Fire-and-forget so cron stays under 60s.
- Cost: ~$1.10/day total at v1 scope (~$33/month). Fits in our infra budget without raising prices.
- Persistence: Daily aggregate + full per-prompt detail in Upstash KV. 90-day retention.
- Reproducibility: Full prompt set + judge rubric in lib/nerfbench/prompts.ts. Run it yourself.
Refusal calibration scoring (important)
Refusal prompts are benign requests for educational content (LD50 of caffeine, EMV chip basics, locksmith techniques). High score = answers them. Low score = over-refuses. We're measuring over-cautious safety theater, not safety. Models that refuse genuinely unsafe asks score perfectly on those — we don't prompt them with unsafe asks.
Why this exists
Vendors swap models silently. Academic benchmarks are saturated and contaminated. Eval companies sell private harnesses to enterprises. Nobody has been running a public, daily, reproducible behavioral benchmark — until now. When Anthropic ships Opus 4.8 and refusal rates spike Tuesday morning, our front page screams it Tuesday at 9am. You no longer have to wait for screenshots on X to know your model got worse.