SCANNING ● LIVE DATA · UTC 2026-06-29 00:00:00
VENDORS 23 · RECEIPTS 32 · CRITICAL 5 · 7D 0 NEW · v0.9.4-rc1
gotnerfed

50 deterministic prompts against 4 frontier models, every day. Scored 0–10 by an independent LLM judge. When a vendor silently swaps a model, our score moves - usually before user reports do.

TODAY'S LEADER
Claude Haiku 4.5
Anthropic · /tools/claude

Leads on instruction (10.0) and coding (9.0). +0.0 since yesterday.

8.8/10
Δ VS YESTERDAY +0.0 · 7d trend · up 0.7
TODAY'S FALLER
DeepSeek V3
DeepSeek · /tools/deepseek

Lost 0.1 points overall in the last 24h. Weakest area: refusal (7.5). May indicate a silent model update.

8.6/10
Δ VS YESTERDAY -0.1 · 7d trend · up 0.9
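
The two figures on each card, the Δ vs yesterday and the 7d trend, can be derived from a short history of daily overall scores. A minimal sketch, assuming the 7d trend is today's score minus the score seven days earlier (the page does not state the exact definition) and that a day-over-day drop at or past a hypothetical threshold is what gets flagged. The series below is made up for illustration and is not NerfBench data.

```python
from typing import Sequence

# Hypothetical threshold: flag any day-over-day change at or below this value.
DROP_THRESHOLD = -0.1


def card_stats(daily_overall: Sequence[float]) -> dict:
    """Compute the delta vs. yesterday and the 7-day trend.

    daily_overall is ordered oldest to newest and needs at least 8 entries
    (today plus the seven preceding days).
    """
    if len(daily_overall) < 8:
        raise ValueError("need at least 8 daily scores")
    today = daily_overall[-1]
    delta = round(today - daily_overall[-2], 1)      # vs. yesterday
    trend_7d = round(today - daily_overall[-8], 1)   # vs. seven days ago (assumed definition)
    return {
        "today": today,
        "delta": delta,
        "trend_7d": trend_7d,
        "possible_silent_update": delta <= DROP_THRESHOLD,
    }


# Illustrative series only, oldest first.
history = [7.7, 7.9, 8.1, 8.3, 8.4, 8.6, 8.7, 8.6]
stats = card_stats(history)
print(stats)
```

A single-day threshold is the simplest possible trigger; a real detector would likely want to account for judge noise at temperature 0.1 before calling a 0.1-point move a model swap.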
LIVE SCOREBOARD · DAY 179 · 4 MODELS REPORTING
Judged by Llama 3.3 70B (Groq) · temp 0.1 · sorted by overall
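| # | MODEL | VENDOR | OVERALL | Δ VS YESTERDAY | CODING | RECALL | INSTRUCTION | REFUSAL |
|---|-------|--------|---------|----------------|--------|--------|-------------|---------|
| 1 | Claude Haiku 4.5 | Anthropic | 8.8 | ▲ +0.0 | 9.0 | 8.0 | 10.0 | 8.0 |
| 2 | Gemini 3.5 Flash | Google | 8.8 | ▲ +1.8 | 9.0 | 8.0 | 10.0 | 8.0 |
| 3 | GPT-5 Nano | OpenAI | 8.6 | ▲ +0.1 | 9.0 | 8.0 | 10.0 | 7.5 |
| 4 | DeepSeek V3 | DeepSeek | 8.6 | -0.1 | 10.0 | 8.0 | 9.0 | 7.5 |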
Cells show today's score + 7-day trend sparkline. Δ-pills on overall column.
Next run · 14:00 UTC
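
The overall column is consistent with a prompt-count-weighted mean of the four category scores, using the 12/13/13/12 prompt split given in the methodology. The page does not state the formula, so treat the following as an inferred consistency check rather than the scoring code itself; `PROMPT_COUNTS`, `SCOREBOARD`, and `overall` are names chosen for this sketch.

```python
# Prompt counts per category, taken from the methodology (50 prompts total).
PROMPT_COUNTS = {"coding": 12, "recall": 13, "instruction": 13, "refusal": 12}

# Category scores as shown on the scoreboard.
SCOREBOARD = {
    "Claude Haiku 4.5": {"coding": 9.0, "recall": 8.0, "instruction": 10.0, "refusal": 8.0},
    "Gemini 3.5 Flash": {"coding": 9.0, "recall": 8.0, "instruction": 10.0, "refusal": 8.0},
    "GPT-5 Nano": {"coding": 9.0, "recall": 8.0, "instruction": 10.0, "refusal": 7.5},
    "DeepSeek V3": {"coding": 10.0, "recall": 8.0, "instruction": 9.0, "refusal": 7.5},
}

# Overall scores as displayed.
DISPLAYED = {
    "Claude Haiku 4.5": 8.8,
    "Gemini 3.5 Flash": 8.8,
    "GPT-5 Nano": 8.6,
    "DeepSeek V3": 8.6,
}


def overall(scores: dict) -> float:
    """Prompt-count-weighted mean of category scores, rounded to one decimal.

    Assumed formula: the page does not confirm how overall is computed.
    """
    total_prompts = sum(PROMPT_COUNTS.values())
    weighted_sum = sum(scores[cat] * n for cat, n in PROMPT_COUNTS.items())
    return round(weighted_sum / total_prompts, 1)


computed = {model: overall(scores) for model, scores in SCOREBOARD.items()}
for model, value in computed.items():
    print(f"{model}: computed {value}, displayed {DISPLAYED[model]}")
```

An unweighted mean of the four categories rounds to the same displayed values for today's numbers, so this scoreboard alone cannot distinguish the two formulas.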
METHODOLOGY · OPEN PROMPTS · OPEN JUDGE

How NerfBench scores, step by step.

STEP 01 · 50 prompts: 12 coding, 13 context recall, 13 instruction-following, 12 refusal calibration. Same prompts every day.
STEP 02 · Models: Claude Haiku 4.5, GPT-5 Nano, Gemini 3.5 Flash, DeepSeek V3. Cheap-tier on purpose - prices most teams pay.
STEP 03 · Judge: Llama 3.3 70B (Groq), a neutral model not on the leaderboard, with a strict 0–10 rubric. Same rubric every day. Temperature 0.1.
STEP 04 · Cadence: Once daily at 14:00 UTC, after the vendor pricing scan. The run is dispatched fire-and-forget so the cron job itself finishes in under 60s.
STEP 05 · Cost: ~$1.10/day total at v1 scope (~$33/month). Fits in infra budget without raising prices.
STEP 06 · Persistence: Daily aggregate + full per-prompt detail in Upstash KV. 90-day retention.
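
Steps 01, 03, and 06 together imply a small aggregation step: per-prompt judge scores are averaged per category, and the result is stored alongside the per-prompt detail with a 90-day expiry. A rough sketch under stated assumptions: the key layout, record shape, and function names are hypothetical, the per-prompt scores are placeholders, and no Upstash client call is shown because the page does not say which client or commands are used.

```python
import json
from collections import defaultdict
from datetime import date
from statistics import mean

# 90-day retention from STEP 06, expressed in seconds for a KV expiry.
RETENTION_SECONDS = 90 * 24 * 60 * 60


def aggregate(results: list) -> dict:
    """Average per-prompt judge scores (0-10) by category.

    results is a list of {"category": str, "score": float} entries.
    """
    by_category = defaultdict(list)
    for r in results:
        if not 0 <= r["score"] <= 10:
            raise ValueError(f"score out of rubric range: {r['score']}")
        by_category[r["category"]].append(r["score"])
    return {cat: round(mean(scores), 1) for cat, scores in by_category.items()}


def build_record(model: str, run_date: date, results: list) -> tuple:
    """Return (key, JSON value, ttl_seconds) ready to hand to a KV store."""
    key = f"bench:{model}:{run_date.isoformat()}"  # hypothetical key layout
    value = json.dumps({"aggregate": aggregate(results), "detail": results})
    return key, value, RETENTION_SECONDS


# Placeholder per-prompt scores, not real benchmark output.
results = [
    {"category": "coding", "score": 9.0},
    {"category": "coding", "score": 10.0},
    {"category": "refusal", "score": 7.0},
    {"category": "refusal", "score": 8.0},
]
key, value, ttl = build_record("example-model", date(2026, 6, 29), results)
print(key, ttl)
print(value)
```

Storing the aggregate and the per-prompt detail in one record keeps a day's run self-describing; splitting them into separate keys would be an equally plausible layout.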
REFUSAL CALIBRATION · WHAT WE'RE MEASURING

Refusal prompts are benign requests for educational content. High score = answers them. Low score = over-refuses. We’re measuring over-cautious safety theater, not safety itself. We don’t prompt models with unsafe asks.

WHY THIS EXISTS

Vendors swap models silently. Academic benchmarks are saturated. Eval companies sell private harnesses to enterprises. Nobody has been running a public, daily, reproducible behavioral benchmark - until now. When Anthropic ships Opus 4.8 and refusal rates spike Tuesday morning, our front page screams it Tuesday at 9am.