AIMS: AI Measurement Science

Reasoning & Knowledge

ARC-AGI

ARC-AGI v1 + v2 per-model per-task solve data from the ARC Prize Foundation evaluation repos, plus leaderboard aggregate scores.

520 items
73 subjects
83% observed
Subject type: Model
License: Apache-2.0
Domain: reasoning
Modality: grid


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 73 subjects × 520 items, 83% of cells evaluated.

[Figure: ARC-AGI response matrix, AI models (rows) against items (columns). Each cell is Correct (1), Incorrect (0), or Unobserved.]
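The ordering described above can be sketched in a few lines. This is a minimal illustration, not the page's actual build script: it assumes rows are sorted by each subject's mean over observed cells and columns by each item's solve rate over observed cells, with `None` standing in for an unobserved cell. The function names and the toy matrix are hypothetical.

```python
def mean_observed(values):
    """Mean of the observed (non-None) entries, or None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def order_response_matrix(matrix):
    """Reorder rows by descending subject accuracy and columns by descending solve rate.

    matrix is a list of rows; each cell is 1 (correct), 0 (incorrect), or None (unobserved).
    Returns the reordered matrix plus the row and column index orders used.
    """
    n_cols = len(matrix[0])
    row_means = [mean_observed(row) for row in matrix]
    col_means = [mean_observed([row[j] for row in matrix]) for j in range(n_cols)]

    # Fully unobserved rows/columns (mean None) sort last; ties keep original order.
    def key(mean):
        return (mean is None, -(mean or 0.0))

    row_order = sorted(range(len(matrix)), key=lambda i: key(row_means[i]))
    col_order = sorted(range(n_cols), key=lambda j: key(col_means[j]))
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, row_order, col_order


def coverage(matrix):
    """Fraction of cells that were evaluated (the page reports 83% for 73 x 520)."""
    cells = [c for row in matrix for c in row]
    return sum(c is not None for c in cells) / len(cells)


# Toy 3 x 3 matrix, purely illustrative.
toy = [
    [0, None, 1],
    [1, 1, 1],
    [0, 0, None],
]
ordered, row_order, col_order = order_response_matrix(toy)
```

With this ordering, the strongest subject and the easiest item meet in the top-left cell, which matches the "gather toward one corner" description. Other orderings, such as ones based on a fitted item response model, would also fit that description.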

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
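One way to choose such a spread, shown here only as a sketch, is to sort items by solve rate and take evenly spaced positions in that ordering. The page does not say how its sample items were selected, so the selection rule, the function name `pick_spread`, and the item ids below are all assumptions.

```python
def pick_spread(solve_rates, k):
    """Pick k item ids at evenly spaced ranks of the solve-rate ordering, hardest first.

    solve_rates maps a (hypothetical) item id to its solve rate in [0, 1].
    """
    ranked = sorted(solve_rates, key=lambda item: solve_rates[item])
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]  # a single pick: take the median item
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Hypothetical solve rates for seven items.
rates = {"a": 0.0, "b": 0.2, "c": 0.34, "d": 0.5, "e": 0.66, "f": 0.8, "g": 1.0}
sample = pick_spread(rates, 3)
```

Picking by rank rather than by evenly spaced solve-rate values guarantees that every pick is a real item, even when solve rates cluster.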

Item 1: 0% solve rate

Item 1 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

claude-opus-4-5-20251101-thinking-16k incorrect
claude-opus-4-5-20251101-thinking-1k incorrect
claude-opus-4-5-20251101-thinking-32k incorrect

Item 2: 34% solve rate

Item 2 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

claude-opus-4-5-20251101-thinking-32k correct
gpt-5-2-2025-12-11-thinking-medium correct
claude-sonnet-4-5-20250929-thinking-32k incorrect

Item 3: 53% solve rate

Item 3 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

claude-opus-4-5-20251101-thinking-16k correct
public_eval correct
gpt-3-5-turbo incorrect

Item 4: 66% solve rate

Item 4 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

claude-opus-4-5-20251101-thinking-16k correct
gpt-5-4-mini-medium correct
gpt-4-1-mini-2025-04-14 incorrect

Item 5: 79% solve rate

Item 5 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

claude-opus-4-5-20251101-thinking-16k correct
claude-haiku-4-5-20251001 correct
gpt-5-2-2025-12-11-thinking-none incorrect

Item 6: 100% solve rate

Item 6 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

claude-opus-4-5-20251101-thinking-1k correct
glm-5 correct
gpt-5-1-2025-11-13-thinking-low correct

Subjects

The models, agents, and reward models evaluated.

73 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. gpt-5-4-pro-xhigh: 0.983
2. gpt-5-2-pro-2025-12-11-xhigh: 0.977
3. grok-4.20-multi-agent-beta-0309-xhigh: 0.967
4. claude-opus-4-8-max: 0.963
5. gemini-3-1-pro-preview: 0.961
6. gpt-5-4-high: 0.958
7. grok-4.20-beta-0309b-reasoning: 0.955
8. claude-opus-4-6-thinking-120K-max: 0.934
9. claude-opus-4-6-thinking-120K-high: 0.933
10. claude-opus-4-8-high: 0.927
11. gpt-5-4-medium: 0.92
12. claude-opus-4-8-medium: 0.917
13. claude-opus-4-6-thinking-120K-medium: 0.908
14. public_eval: 0.897
15. gpt-5-2-2025-12-11-thinking-xhigh: 0.886
16. claude-opus-4-8-low: 0.862
17. gpt-5-2-pro-2025-12-11-high: 0.857
18. claude-opus-4-6-thinking-120K-low: 0.838
19. gemini-3-deep-think-preview: 0.837
20. gpt-5-2-2025-12-11-thinking-high: 0.802
21. gpt-5-4-low: 0.8
22. gpt-5-2-pro-2025-12-11-medium: 0.792
23. gpt-5-4-mini-xhigh: 0.774
24. gpt-5-pro-2025-10-06: 0.77
25. gemini-3-flash-preview-thinking-high: 0.753
26. claude-opus-4-5-20251101-thinking-32k: 0.744
27. Gemini-3-Pro-Preview: 0.744
28. claude-sonnet-4-5-20250929-thinking-32k: 0.74
29. gpt-5-4-nano-xhigh: 0.738
30. kimi-k2.5: 0.733
31. gpt-5-2-2025-12-11-thinking-medium: 0.698
32. claude-opus-4-5-20251101-thinking-64k: 0.695
33. gpt-5-4-mini-high: 0.665
34. gpt-5-1-2025-11-13-thinking-high: 0.646
35. claude-sonnet-4-5-20250929-thinking-16k: 0.637
36. claude-haiku-4-5-20251001-thinking-32k: 0.632

+ 37 more subjects evaluated.
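A ranking like the one above can be reproduced from per-item outcomes along the following lines. This is a sketch under stated assumptions: it takes "mean response" to be the average over the items a subject was actually evaluated on (unobserved cells, shown as `None`, are skipped), and it breaks ties alphabetically, which the page does not specify. The subject names and outcomes in the example are hypothetical.

```python
def rank_subjects(responses):
    """Rank subjects by mean response over the items each one was evaluated on.

    responses maps subject name -> list of 1 / 0 / None (None = unobserved).
    Returns (rank, subject, mean) tuples, best first. Subjects with no
    observed items are skipped rather than scored as zero.
    """
    means = {}
    for subject, cells in responses.items():
        seen = [c for c in cells if c is not None]
        if seen:
            means[subject] = sum(seen) / len(seen)
    # Highest mean first; ties broken by name (an arbitrary, assumed rule).
    ordered = sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, name, round(mean, 3))
            for rank, (name, mean) in enumerate(ordered, start=1)]


# Hypothetical outcomes for three subjects on four items.
toy_responses = {
    "model-a": [1, 1, 0, None],
    "model-b": [1, 1, 1, 1],
    "model-c": [0, None, None, 1],
}
ranking = rank_subjects(toy_responses)
```

Because each mean is taken over a different set of observed items, two subjects with the same score are not necessarily comparable unless they were evaluated on the same items.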

Full data on Hugging Face.