How AI models behave today.
50 deterministic prompts against 4 frontier models, every day. Each response is scored 0–10 by an independent LLM judge. When a vendor silently swaps a model, our score moves, usually before user reports do.
Claude Haiku 4.5: leads on instruction (10.0) and coding (9.0). +0.0 since yesterday.
DeepSeek V3: lost 0.1 points overall in the last 24h. Weakest area: refusal (7.5). May indicate a silent model update.
| RANK | MODEL | VENDOR | OVERALL | CHANGE (24H) | CODING | RECALL | INSTRUCTION | REFUSAL |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Haiku 4.5 | Anthropic | 8.8 | ▲ +0.0 | 9.0 | 8.0 | 10.0 | 8.0 |
| 2 | Gemini 3.5 Flash | Google | 8.8 | ▲ +1.8 | 9.0 | 8.0 | 10.0 | 8.0 |
| 3 | GPT-5 Nano | OpenAI | 8.6 | ▲ +0.1 | 9.0 | 8.0 | 10.0 | 7.5 |
| 4 | DeepSeek V3 | DeepSeek | 8.6 | ▼ -0.1 | 10.0 | 8.0 | 9.0 | 7.5 |
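The overall scores in the table are consistent with an unweighted mean of the four category scores, rounded to one decimal: for example, (9.0 + 8.0 + 10.0 + 8.0) / 4 = 8.75, displayed as 8.8. The sketch below reproduces that arithmetic and flags a day-over-day drop. The averaging rule is inferred from the published numbers, and the function names and the 0.1 alert threshold are hypothetical, not NerfBench's actual implementation.

```python
from decimal import Decimal, ROUND_HALF_UP


def overall_score(categories):
    """Unweighted mean of category scores, rounded half-up to one decimal.

    Assumption: this rule is inferred from the table, not documented.
    """
    total = sum(Decimal(str(v)) for v in categories.values())
    mean = total / Decimal(len(categories))
    return float(mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


def drift_flag(today, yesterday, threshold=0.1):
    """True when the overall score fell by at least `threshold`.

    The threshold is a hypothetical alerting rule for illustration.
    """
    drop = Decimal(str(yesterday)) - Decimal(str(today))
    return drop >= Decimal(str(threshold))


# Category scores copied from the table above.
claude = {"coding": 9.0, "recall": 8.0, "instruction": 10.0, "refusal": 8.0}
deepseek = {"coding": 10.0, "recall": 8.0, "instruction": 9.0, "refusal": 7.5}

print(overall_score(claude))    # 8.8
print(overall_score(deepseek))  # 8.6

# DeepSeek V3 shows 8.6 today with a -0.1 change, which implies 8.7 yesterday.
print(drift_flag(today=8.6, yesterday=8.7))  # True
```

Decimal arithmetic is used so that 8.75 rounds up to 8.8 rather than depending on binary floating-point representation.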
How NerfBench scores, step by step.
Refusal prompts are benign requests for educational content. High score = answers them. Low score = over-refuses. We’re measuring over-cautious safety theater, not safety itself. We don’t prompt models with unsafe asks.
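As a minimal sketch of how a refusal score could be aggregated, assume the judge returns a 0–10 score for each benign prompt, where 10 means the model answered fully and 0 means it refused outright. The per-prompt scale, the mean aggregation, the function name, and the sample scores are all illustrative assumptions, not NerfBench's documented method.

```python
from statistics import mean


def refusal_category_score(judge_scores):
    """Average per-prompt judge scores (0-10) into one category score.

    Higher means the model answered more of the benign prompts;
    lower means it over-refused. Aggregation by mean is an assumption.
    """
    if not judge_scores:
        raise ValueError("need at least one judge score")
    for score in judge_scores:
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
    return round(mean(judge_scores), 1)


# Hypothetical judge scores for four benign prompts:
# three answered in full, one refused.
scores = [10, 10, 0, 10]
print(refusal_category_score(scores))  # 7.5
```

Under these assumptions, a single refusal among four benign prompts is enough to pull the category score from 10.0 down to 7.5.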
Vendors swap models silently. Academic benchmarks are saturated. Eval companies sell private harnesses to enterprises. Nobody has been running a public, daily, reproducible behavioral benchmark, until now. When Anthropic ships Opus 4.8 and refusal rates spike Tuesday morning, our front page screams it Tuesday at 9am.