Coding & Software

ResearchCodeBench

212 coding challenges from 20 recent ML papers testing LLMs on novel research code.

212 items

32 subjects

100% observed

Subject type: Model

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 32 subjects × 212 items, 100% of cells evaluated.

ResearchCodeBench response matrix: AI models (rows) against items (columns). Legend: Correct (1), Incorrect (0), Unobserved.
Scale: 1 = the generated code snippet passes its reference tests · 0 = it fails.
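The ordering described for the matrix can be approximated by sorting rows by mean score and columns by pass rate. The following is a minimal sketch on a hypothetical 4 × 4 toy matrix; the function name `order_matrix` and the data are illustrative assumptions, and the gallery's actual build script may order the matrix differently.

```python
def order_matrix(matrix):
    """Sort rows (subjects) by mean score and columns (items) by pass rate,
    both descending, so strong subjects and easy items meet in the top-left."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Mean of 0/1 scores per subject (row) and per item (column).
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


# Hypothetical toy data: 4 subjects x 4 items, 1 = tests pass, 0 = tests fail.
toy = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
]
ordered, row_order, col_order = order_matrix(toy)
for row in ordered:
    print(row)
```

After sorting, the ones cluster toward the top-left corner and the zeros toward the bottom-right, which is the visual pattern the matrix description refers to.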

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

32 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject   Mean accuracy
1     4a938d77  0.642
2     1c618d9d  0.594
3     ac78f79a  0.59
4     b2436ce8  0.585
5     21a139c2  0.524
6     39edf3fb  0.519
7     8d49a7a1  0.5
8     bce3d3ee  0.486
9     1872c134  0.481
10    22e9aca8  0.458
11    6439159a  0.453
12    a531ea37  0.429
13    b97dccb6  0.425
14    efdd1f09  0.425
15    7184624c  0.41
16    8d84404d  0.377
17    92da8381  0.373
18    5a9db67a  0.354
19    d9512a7b  0.335
20    f89d3f72  0.311
21    4a199d31  0.307
22    42d17e7b  0.274
23    09f4a41b  0.259
24    0e68149d  0.25
25    3f126a3b  0.231
26    9dafe621  0.222
27    2bfd4cd3  0.198
28    f346f5a4  0.184
29    be16a6de  0.151
30    c117f921  0.123
31    0201bf34  0.08
32    9bbf8d46  0.014
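Because every cell is observed and each score is 0 or 1, a subject's mean response equals its number of passed items divided by 212. The sketch below converts a few of the listed accuracies back to pass counts. It assumes the published values are rounded to three decimals, and `passes_from_accuracy` is a hypothetical helper, not part of the benchmark's tooling.

```python
N_ITEMS = 212  # items in the benchmark, all observed for every subject


def passes_from_accuracy(accuracy, n_items=N_ITEMS):
    """Estimate how many items a subject passed from its mean 0/1 score."""
    return round(accuracy * n_items)


# A few accuracies taken from the ranking (top, middle, and bottom subjects).
listed = {"rank 1": 0.642, "rank 7": 0.5, "rank 32": 0.014}
counts = {label: passes_from_accuracy(acc) for label, acc in listed.items()}

for label, n in counts.items():
    print(f"{label}: about {n} of {N_ITEMS} items passed")
```

Under that assumption, the top-ranked subject passes roughly 136 items and the lowest-ranked about 3, which gives a more concrete sense of the spread than the accuracy figures alone.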

Full data on Hugging Face