

ResearchCodeBench

212 coding challenges drawn from 20 recent ML papers, testing LLMs on novel research code.

212 items
32 subjects
100% observed
Subject type: Model

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 32 subjects × 212 items, 100% of cells evaluated.
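The ordering described above can be sketched as a sort on row and column means. This is a minimal illustration, assuming the matrix is a plain list of 0/1 rows with every cell observed; the function name `order_matrix` and the toy data are hypothetical, and the site may use a different ordering method.

```python
def order_matrix(matrix):
    """Sort rows by descending row mean and columns by descending column mean.

    `matrix` is a list of equal-length rows of 0/1 scores (hypothetical layout).
    Ties keep their original relative order because sorted() is stable.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Mean score per subject (row) and per item (column).
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # Strongest subjects first, easiest items first.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


# Toy 3 x 3 example, not benchmark data.
toy = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
ordered, row_order, col_order = order_matrix(toy)
for row in ordered:
    print(row)
```

With this ordering the correct cells of the toy matrix collect in the top-left corner, which is the visual effect the description refers to.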


ResearchCodeBench response matrix: AI models (rows) against items (columns)
Legend: Correct (1), Incorrect (0), Unobserved

Scale: 1 = the generated code snippet passes its reference tests · 0 = it fails.
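A short sketch of how the scale and the "observed" figure relate, assuming unobserved cells are stored as `None` (an assumption for illustration; the page does not say how missing cells are represented). The helper name `summarize` and the toy matrix are hypothetical.

```python
def summarize(matrix):
    """Return (coverage, accuracy) for a matrix of 1, 0, or None cells.

    coverage = share of cells that were evaluated at all;
    accuracy = share of evaluated cells scored 1 (tests passed).
    """
    cells = [cell for row in matrix for cell in row]
    observed = [c for c in cells if c is not None]
    coverage = len(observed) / len(cells)
    # Guard against a matrix with no evaluated cells.
    accuracy = sum(observed) / len(observed) if observed else float("nan")
    return coverage, accuracy


# Toy data with one unobserved cell, not benchmark data.
toy = [
    [1, 0, None],
    [1, 1, 0],
]
coverage, accuracy = summarize(toy)
print(f"coverage={coverage:.3f} accuracy={accuracy:.3f}")
```

For this benchmark the coverage is reported as 100%, so accuracy is simply the mean over all 32 × 212 cells.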

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

32 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. 4a938d77: 0.642
  2. 1c618d9d: 0.594
  3. ac78f79a: 0.59
  4. b2436ce8: 0.585
  5. 21a139c2: 0.524
  6. 39edf3fb: 0.519
  7. 8d49a7a1: 0.5
  8. bce3d3ee: 0.486
  9. 1872c134: 0.481
  10. 22e9aca8: 0.458
  11. 6439159a: 0.453
  12. a531ea37: 0.429
  13. b97dccb6: 0.425
  14. efdd1f09: 0.425
  15. 7184624c: 0.41
  16. 8d84404d: 0.377
  17. 92da8381: 0.373
  18. 5a9db67a: 0.354
  19. d9512a7b: 0.335
  20. f89d3f72: 0.311
  21. 4a199d31: 0.307
  22. 42d17e7b: 0.274
  23. 09f4a41b: 0.259
  24. 0e68149d: 0.25
  25. 3f126a3b: 0.231
  26. 9dafe621: 0.222
  27. 2bfd4cd3: 0.198
  28. f346f5a4: 0.184
  29. be16a6de: 0.151
  30. c117f921: 0.123
  31. 0201bf34: 0.08
  32. 9bbf8d46: 0.014
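As a rough sketch of how a ranking like the one above could be computed, the snippet below averages each subject's 0/1 responses and sorts in descending order. It also converts a listed mean back into an approximate count of items passed out of 212. The subject names and toy responses are hypothetical, and because the listed scores are rounded, the recovered counts are estimates.

```python
N_ITEMS = 212  # number of items stated for this benchmark


def rank_subjects(responses):
    """Rank subjects by mean response (accuracy), highest first.

    `responses` maps a subject id to its list of 0/1 item scores.
    """
    means = {sid: sum(scores) / len(scores) for sid, scores in responses.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


def approx_items_passed(mean_accuracy, n_items=N_ITEMS):
    """Estimate how many items a subject passed from its rounded mean."""
    return round(mean_accuracy * n_items)


# Toy responses, not benchmark data.
toy = {
    "subject_a": [1, 0, 0, 1],
    "subject_b": [1, 1, 1, 0],
    "subject_c": [0, 0, 0, 1],
}
ranking = rank_subjects(toy)
print(ranking)

# Highest and lowest listed scores, converted to approximate item counts.
top = approx_items_passed(0.642)
bottom = approx_items_passed(0.014)
print(top, bottom)
```

Read this way, the top-ranked subject passes roughly 136 of the 212 items and the lowest-ranked about 3.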