
Safety & Security

BenchRisk

Scoring the reliability of LLM benchmarks across 57 failure modes and 196 mitigations.

7 items
26 subjects
100% observed
Subject type: Model

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 26 subjects × 7 items, 100% of cells evaluated.


Figure: BenchRisk response matrix, AI models (rows) against items (columns). Color legend: low to high, with unobserved cells marked separately.

Scale: 0 to 1 reliability score (higher = a more reliable benchmark).
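The row and column ordering described above can be sketched as a sort on row and column means. This is only an illustration under stated assumptions: the page does not say how the ordering is computed, so sorting by mean score is a guess, and the function name `order_matrix` and the toy subjects are hypothetical.

```python
from statistics import mean


def order_matrix(scores):
    """Order a response matrix so high-scoring rows and columns come first.

    scores: dict mapping subject name -> list of per-item scores in [0, 1].
    All lists are assumed to have the same length (one entry per item).
    Assumption: ordering is by mean score; the site may use another method.
    """
    subjects = list(scores)
    n_items = len(next(iter(scores.values())))

    # Rows: subjects with the highest mean score first.
    row_order = sorted(subjects, key=lambda s: mean(scores[s]), reverse=True)

    # Columns: items with the highest mean score (the "easiest") first.
    col_order = sorted(
        range(n_items),
        key=lambda j: mean(scores[s][j] for s in subjects),
        reverse=True,
    )

    ordered = [[scores[s][j] for j in col_order] for s in row_order]
    return row_order, col_order, ordered


# Hypothetical toy data: 3 subjects x 3 items.
toy = {
    "subject_a": [0.2, 0.9, 0.4],
    "subject_b": [0.6, 1.0, 0.8],
    "subject_c": [0.0, 0.5, 0.1],
}
rows, cols, ordered = order_matrix(toy)
print(rows)  # subjects, strongest first
print(cols)  # original item indices, easiest first
```

With this ordering the largest values collect in the top-left corner of `ordered`, which matches the "gather toward one corner" description.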

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

26 subjects, ranked by mean response across this benchmark's items.

  1. 5b83d963: 0.706
  2. e0652cf0: 0.593
  3. 031cd206: 0.448
  4. 1cac1855: 0.447
  5. 9abf621a: 0.434
  6. 6a08f34e: 0.428
  7. de433ed5: 0.419
  8. ad505a3a: 0.389
  9. 814193e6: 0.383
  10. c4b9e72d: 0.379
  11. a9dc1b94: 0.372
  12. 99665bad: 0.363
  13. 29ad16a3: 0.361
  14. 10f90160: 0.353
  15. 9faa120f: 0.35
  16. 76bf377f: 0.323
  17. c240e59b: 0.32
  18. b40a48df: 0.317
  19. 93bc2ca9: 0.296
  20. 0eb1f54b: 0.278
  21. bd5a4877: 0.267
  22. 12628c87: 0.265
  23. e7b87666: 0.239
  24. a48acf9f: 0.23
  25. 81f48fa3: 0.216
  26. 2da98da3: 0.115
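A ranking by mean response, like the one above, could be computed roughly as follows. This is a sketch rather than the site's actual method: the function names, the toy subject IDs, and the choice to skip unobserved cells (represented here as `None`) are all assumptions, since the page does not describe how missing cells are handled.

```python
def rank_subjects(responses):
    """Rank subjects by mean response over their observed items.

    responses: dict mapping subject ID -> list of scores in [0, 1],
    with None marking an unobserved cell (an assumption of this sketch).
    Returns a list of (subject, mean) pairs, highest mean first.
    """
    ranking = []
    for subject, cells in responses.items():
        observed = [c for c in cells if c is not None]
        if observed:  # skip subjects with no observed cells at all
            ranking.append((subject, round(sum(observed) / len(observed), 3)))
    ranking.sort(key=lambda pair: pair[1], reverse=True)
    return ranking


def observed_fraction(responses):
    """Fraction of matrix cells that were evaluated (1.0 means 100% observed)."""
    cells = [c for row in responses.values() for c in row]
    return sum(c is not None for c in cells) / len(cells)


# Hypothetical toy data, not taken from the benchmark.
toy_responses = {
    "m1": [0.5, 0.25, None],
    "m2": [1.0, 0.5, 0.75],
    "m3": [0.0, 0.25, 0.5],
}
for rank, (subject, score) in enumerate(rank_subjects(toy_responses), start=1):
    print(f"{rank}. {subject}: {score}")
```

When every cell is observed, as reported for this benchmark, `observed_fraction` returns 1.0 and each mean is taken over all items.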