Reasoning & Knowledge

AbstentionBench

Measures whether LLMs abstain on unanswerable or underspecified questions, across 20 datasets.

31 items
23 subjects
88% observed
Subject type: Model

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 23 subjects × 31 items, 88% of cells evaluated.
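
The page does not state the exact ordering rule, so the following is only a minimal sketch of one plausible approach: sort rows and columns by their mean score over observed cells, and compute the share of cells that were evaluated. The names `observed_mean`, `order_matrix`, `coverage`, and the `toy` data are hypothetical, and unobserved cells are represented as `None` by assumption.

```python
from typing import Optional

# A score matrix: rows are subjects, columns are items.
# None marks an unobserved cell (an assumption made for this sketch).
Matrix = list[list[Optional[float]]]


def observed_mean(values: list[Optional[float]]) -> float:
    """Mean over observed cells only; fully unobserved lines sort last."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else float("-inf")


def order_matrix(scores: Matrix) -> tuple[list[int], list[int]]:
    """Return row and column index orders, highest observed mean first."""
    n_rows, n_cols = len(scores), len(scores[0])
    row_order = sorted(
        range(n_rows), key=lambda r: observed_mean(scores[r]), reverse=True
    )
    col_order = sorted(
        range(n_cols),
        key=lambda c: observed_mean([row[c] for row in scores]),
        reverse=True,
    )
    return row_order, col_order


def coverage(scores: Matrix) -> float:
    """Fraction of cells that hold an observed score."""
    cells = [v for row in scores for v in row]
    return sum(v is not None for v in cells) / len(cells)


# Toy 3 x 3 matrix, not data from the benchmark.
toy: Matrix = [
    [0.2, None, 0.6],
    [0.9, 0.7, 1.0],
    [0.5, 0.4, None],
]
rows, cols = order_matrix(toy)
print(rows, cols, round(coverage(toy), 3))
```

Applied to the full matrix, the same coverage calculation would run over 23 × 31 = 713 cells, of which the page reports 88% as evaluated.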

[Figure: AbstentionBench response matrix, AI models (rows) against items (columns). Legend: low, high, unobserved.]

Scale: 0 to 1 per metric (abstention F1, precision, or recall), measuring how reliably a model declines unanswerable or underspecified questions. The authors release these graded scores, not a per-response binary label.
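
For readers unfamiliar with these metrics, here is a minimal sketch of the standard precision, recall, and F1 definitions, treating "abstain" as the positive class. It illustrates the definitions only and is not the benchmark's scoring pipeline: the function name `abstention_prf` and the per-response labels below are hypothetical, since the released data contains graded scores rather than binary labels.

```python
def abstention_prf(
    should_abstain: list[bool], did_abstain: list[bool]
) -> tuple[float, float, float]:
    """Precision, recall and F1 with 'abstain' as the positive class."""
    pairs = list(zip(should_abstain, did_abstain))
    tp = sum(s and d for s, d in pairs)        # abstained when it should
    fp = sum((not s) and d for s, d in pairs)  # abstained on an answerable question
    fn = sum(s and (not d) for s, d in pairs)  # answered an unanswerable question
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels for five responses (not benchmark data).
should = [True, True, True, False, False]
did = [True, True, False, True, False]
precision, recall, f1 = abstention_prf(should, did)
print(precision, recall, f1)
```

In this toy case there are two correct abstentions, one unnecessary abstention, and one missed abstention, so precision, recall, and F1 all come out to 2/3.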

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

23 subjects, ranked by mean response across this benchmark's items.

  1. 0dc06920: 0.813
  2. d1f34a95: 0.797
  3. a2a69af7: 0.79
  4. d1223013: 0.779
  5. 2352da72: 0.779
  6. e7904abf: 0.771
  7. 908fa0fd: 0.77
  8. dbf1d0c0: 0.767
  9. db68274f: 0.766
  10. 77bd1a28: 0.751
  11. 18270ef1: 0.748
  12. 6c6be7db: 0.737
  13. 8aa62b8f: 0.73
  14. 81dd6aa3: 0.725
  15. 1c7739c1: 0.691
  16. 98cfe9c6: 0.691
  17. 23e989bc: 0.666
  18. 4f87d9ec: 0.649
  19. cfe3e638: 0.643
  20. e8b341ee: 0.633
  21. 21f91ce3: 0.616
  22. 5db1c30d: 0.596
  23. c1d71a9e: 0.486