ML Engineering & Research

MLRC-Bench

7 ML research-competition tasks testing whether language agents invent novel methods.

7 items

7 subjects

100% observed

Subject type: Agent


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 7 subjects × 7 items, 100% of cells evaluated.
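The ordering described above can be sketched in a few lines. This is a minimal illustration, not the gallery's actual build script: it assumes that rows are sorted by their mean score, that a higher column mean marks an easier item, and that both are sorted in descending order. The function name `order_response_matrix` and the toy scores are hypothetical.

```python
def order_response_matrix(scores):
    """Reorder a subjects-by-items matrix by descending row and column means.

    Assumption: a higher mean marks a stronger subject (row) or an easier
    item (column), so the largest values gather toward the top-left corner.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [sum(scores[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[scores[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


# Toy 3 x 3 matrix (made-up values, not MLRC-Bench data).
toy = [
    [0.1, 0.3, 0.0],
    [0.4, 0.9, 0.2],
    [0.2, 0.5, 0.1],
]
ordered, row_order, col_order = order_response_matrix(toy)
```

In the toy matrix, the second subject and the second item have the highest means, so after reordering they occupy the top-left cell.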

MLRC-Bench response matrix: AI models (rows) against items (columns). Legend: low, high, unobserved.
Scale: improvement as a fraction (can be negative or greater than 1), reported either as absolute improvement over the baseline or as improvement relative to the human top solution, each on its own scale.
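One plausible reading of the two scales is sketched below. The page does not give the formulas, so everything here is an assumption: metrics are taken to be higher-is-better, the absolute scale is taken to be a plain difference from the baseline, and the relative scale is taken to be the agent's gain divided by the gap between the human top solution and the baseline. The function names are hypothetical.

```python
def absolute_improvement(agent, baseline):
    """Assumed reading: the agent's score minus the baseline score."""
    return agent - baseline


def relative_improvement(agent, baseline, human_top):
    """Assumed reading: agent gain as a fraction of the human-vs-baseline gap.

    0 means the agent matched the baseline and 1 means it matched the human
    top solution. Values below 0 or above 1 are possible.
    """
    gap = human_top - baseline
    if gap == 0:
        raise ValueError("human top solution equals the baseline; ratio undefined")
    return (agent - baseline) / gap


# Made-up scores, not MLRC-Bench data.
halfway = relative_improvement(agent=60.0, baseline=50.0, human_top=70.0)
worse = relative_improvement(agent=45.0, baseline=50.0, human_top=70.0)
better = relative_improvement(agent=80.0, baseline=50.0, human_top=70.0)
```

Under this reading, an agent that falls below the baseline gets a negative value and one that beats the human top solution gets a value above 1, which matches the range noted in the scale description.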

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

7 subjects, ranked by mean response across this benchmark's items.

Rank  Subject   Mean response
1     6bc8c1a4  0.171
2     c2a34b6e  0.134
3     1b223384  0.13
4     5967999d  0.123
5     d63974f5  0.108
6     7505ef58  0.078
7     159622a4  0.075
