Coding & Software
ResearchCodeBench
212 coding challenges from 20 recent ML papers testing LLMs on novel research code.
212items
32subjects
100%observed
Modelsubject type
Response matrix
Every model, scored item by item.
Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 32 subjects × 212 items, 100% of cells evaluated.
Fit to width. Hover for subject & item; click a cell for details.

Correct (1)Incorrect (0)Unobserved
Scale: 1 = the generated code snippet passes its reference tests · 0 = it fails.
Sample items
What the questions look like — and how subjects answer.
A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
Subjects
The models, agents, and reward models evaluated.
32 subjects, ranked by mean response (accuracy) across this benchmark's items.
- 14a938d770.642
- 21c618d9d0.594
- 3ac78f79a0.59
- 4b2436ce80.585
- 521a139c20.524
- 639edf3fb0.519
- 78d49a7a10.5
- 8bce3d3ee0.486
- 91872c1340.481
- 1022e9aca80.458
- 116439159a0.453
- 12a531ea370.429
- 13b97dccb60.425
- 14efdd1f090.425
- 157184624c0.41
- 168d84404d0.377
- 1792da83810.373
- 185a9db67a0.354
- 19d9512a7b0.335
- 20f89d3f720.311
- 214a199d310.307
- 2242d17e7b0.274
- 2309f4a41b0.259
- 240e68149d0.25
- 253f126a3b0.231
- 269dafe6210.222
- 272bfd4cd30.198
- 28f346f5a40.184
- 29be16a6de0.151
- 30c117f9210.123
- 310201bf340.08
- 329bbf8d460.014