Reasoning & Knowledge

ARC-AGI

Per-model, per-task solve data for ARC-AGI v1 and v2, drawn from the ARC Prize Foundation evaluation repos, plus aggregate leaderboard scores.

Items: 520
Subjects: 73
Observed: 83%
Subject type: Model
License: Apache-2.0
Domain: reasoning
Modality: grid

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 73 subjects × 520 items, 83% of cells evaluated.
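The page does not say how this ordering is computed. One plausible approach, sketched here under that assumption, sorts rows by each subject's mean over its observed cells and columns by each item's solve rate, with unobserved cells stored as `None`. The names `mean_observed` and `order_matrix` and the toy data are hypothetical, not part of the benchmark.

```python
from typing import Optional

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def mean_observed(cells: list[Cell]) -> float:
    """Mean of the observed cells; 0.0 if nothing was observed."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else 0.0


def order_matrix(rows: dict[str, list[Cell]]) -> tuple[list[str], list[int], float]:
    """Return (subjects strongest first, item indices easiest first, coverage)."""
    n_items = len(next(iter(rows.values())))
    # Rows: highest mean accuracy first.
    subjects = sorted(rows, key=lambda s: mean_observed(rows[s]), reverse=True)
    # Columns: highest solve rate (easiest item) first.
    columns = [[rows[s][j] for s in rows] for j in range(n_items)]
    items = sorted(range(n_items), key=lambda j: mean_observed(columns[j]), reverse=True)
    # Coverage: share of cells that hold an observed outcome.
    observed = sum(c is not None for cells in rows.values() for c in cells)
    return subjects, items, observed / (len(rows) * n_items)


# Toy data, not taken from the benchmark.
toy = {
    "model_a": [1, 0, None, 1],
    "model_b": [1, 1, 1, 1],
    "model_c": [0, 0, None, 1],
}
subjects, items, coverage = order_matrix(toy)
print(subjects, items, round(coverage, 2))
```

Applied to the real matrix, the coverage value would correspond to the 83% figure quoted above, since it counts evaluated cells out of 73 × 520.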

Figure: ARC-AGI response matrix, AI models (rows) against items (columns).
Legend: correct (1) · incorrect (0) · unobserved

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
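How the sample items are chosen is not stated. As a minimal sketch, assuming a solve rate is the fraction of observed outcomes that are correct, one way to get a spread is to rank items by solve rate and take evenly spaced positions. The helpers `solve_rate` and `spread_sample` and the rates below are hypothetical.

```python
def solve_rate(column):
    """Fraction of observed outcomes (1/0) that are correct; None = unobserved."""
    seen = [c for c in column if c is not None]
    return sum(seen) / len(seen) if seen else None


def spread_sample(rates, k):
    """Pick k item ids evenly spaced from hardest to easiest."""
    ranked = sorted((r, item) for item, r in rates.items() if r is not None)
    if k >= len(ranked):
        return [item for _, item in ranked]
    if k == 1:
        return [ranked[0][1]]
    # Evenly spaced positions across the ranked list, endpoints included.
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)][1] for i in range(k)]


# Toy solve rates, not taken from the benchmark.
rates = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3, "e": 0.7}
picked = spread_sample(rates, 3)
print(picked)
```

Including both endpoints means the hardest and easiest items always appear in the sample.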

Item 1: 0% solve rate
Item 1 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

  • claude-opus-4-5-20251101-thinking-16k incorrect
  • claude-opus-4-5-20251101-thinking-1k incorrect
  • claude-opus-4-5-20251101-thinking-32k incorrect

Item 2: 34% solve rate
Item 2 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

  • claude-opus-4-5-20251101-thinking-32k correct
  • gpt-5-2-2025-12-11-thinking-medium correct
  • claude-sonnet-4-5-20250929-thinking-32k incorrect

Item 3: 53% solve rate
Item 3 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

  • claude-opus-4-5-20251101-thinking-16k correct
  • public_eval correct
  • gpt-3-5-turbo incorrect

Item 4: 66% solve rate
Item 4 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

  • claude-opus-4-5-20251101-thinking-16k correct
  • gpt-5-4-mini-medium correct
  • gpt-4-1-mini-2025-04-14 incorrect

Item 5: 79% solve rate
Item 5 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

  • claude-opus-4-5-20251101-thinking-16k correct
  • claude-haiku-4-5-20251001 correct
  • gpt-5-2-2025-12-11-thinking-none incorrect

Item 6: 100% solve rate
Item 6 figure

ARC-AGI puzzle — infer the transformation rule from the training input/output grids, then predict the output for the test input.

Subject outcomes

  • claude-opus-4-5-20251101-thinking-1k correct
  • glm-5 correct
  • gpt-5-1-2025-11-13-thinking-low correct

Subjects

The models, agents, and reward models evaluated.

73 subjects, ranked by mean response (accuracy) across this benchmark's items.
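A minimal sketch of such a ranking, assuming the mean is taken over each subject's observed items only (the page does not confirm this). Because only 83% of cells are evaluated, two subjects may be averaged over different item sets, so the sketch also reports how many items each score rests on. The function `leaderboard` and the toy data are hypothetical.

```python
def leaderboard(rows):
    """Rank subjects by mean outcome over observed cells (None = unobserved)."""
    table = []
    for subject, cells in rows.items():
        seen = [c for c in cells if c is not None]
        if seen:  # skip subjects with no observed items
            table.append((subject, sum(seen) / len(seen), len(seen)))
    # Sort by accuracy, breaking ties by the number of observed items.
    table.sort(key=lambda t: (t[1], t[2]), reverse=True)
    return table


# Toy data, not taken from the benchmark.
toy = {
    "model_x": [1, 1, 0, None],
    "model_y": [1, None, None, None],
    "model_z": [1, 1, 1, 0],
}
for rank, (subject, acc, n) in enumerate(leaderboard(toy), start=1):
    print(f"{rank}. {subject} {acc:.3f} ({n} items observed)")
```

In the toy data, `model_y` ranks first on a single observed item, which shows why the observed count is worth reading alongside the score.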

  1. gpt-5-4-pro-xhigh (0.983)
  2. gpt-5-2-pro-2025-12-11-xhigh (0.977)
  3. grok-4.20-multi-agent-beta-0309-xhigh (0.967)
  4. claude-opus-4-8-max (0.963)
  5. gemini-3-1-pro-preview (0.961)
  6. gpt-5-4-high (0.958)
  7. grok-4.20-beta-0309b-reasoning (0.955)
  8. claude-opus-4-6-thinking-120K-max (0.934)
  9. claude-opus-4-6-thinking-120K-high (0.933)
  10. claude-opus-4-8-high (0.927)
  11. gpt-5-4-medium (0.92)
  12. claude-opus-4-8-medium (0.917)
  13. claude-opus-4-6-thinking-120K-medium (0.908)
  14. public_eval (0.897)
  15. gpt-5-2-2025-12-11-thinking-xhigh (0.886)
  16. claude-opus-4-8-low (0.862)
  17. gpt-5-2-pro-2025-12-11-high (0.857)
  18. claude-opus-4-6-thinking-120K-low (0.838)
  19. gemini-3-deep-think-preview (0.837)
  20. gpt-5-2-2025-12-11-thinking-high (0.802)
  21. gpt-5-4-low (0.8)
  22. gpt-5-2-pro-2025-12-11-medium (0.792)
  23. gpt-5-4-mini-xhigh (0.774)
  24. gpt-5-pro-2025-10-06 (0.77)
  25. gemini-3-flash-preview-thinking-high (0.753)
  26. claude-opus-4-5-20251101-thinking-32k (0.744)
  27. Gemini-3-Pro-Preview (0.744)
  28. claude-sonnet-4-5-20250929-thinking-32k (0.74)
  29. gpt-5-4-nano-xhigh (0.738)
  30. kimi-k2.5 (0.733)
  31. gpt-5-2-2025-12-11-thinking-medium (0.698)
  32. claude-opus-4-5-20251101-thinking-64k (0.695)
  33. gpt-5-4-mini-high (0.665)
  34. gpt-5-1-2025-11-13-thinking-high (0.646)
  35. claude-sonnet-4-5-20250929-thinking-16k (0.637)
  36. claude-haiku-4-5-20251001-thinking-32k (0.632)

+ 37 more subjects evaluated.