
Mathematics

MathArena

MathArena covers uncontaminated math competitions (27 datasets). It records final-answer accuracy and proof-judge per-criterion rubric scores, each across multiple attempts per problem.

448 items
97 subjects
32% observed
Subject type: Model
License: MIT
Domain: mathematics
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 97 subjects × 448 items, 32% of cells evaluated. The heatmap shows a representative 291 of 448 items — evenly sampled across difficulty — so each cell stays square and legible.
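The ordering described above can be sketched as follows. This is a minimal illustration, not the site's actual method: it assumes unobserved cells are stored as `None` and that rows and columns are sorted by their mean over observed cells only. The names `observed_mean`, `order_matrix`, and `coverage`, and the toy data, are hypothetical.

```python
from typing import Optional

Cell = Optional[float]  # None marks an unobserved (model, item) pair


def observed_mean(cells: list[Cell]) -> float:
    """Mean over observed cells; -1.0 if nothing was observed (sorts last)."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else -1.0


def order_matrix(matrix: list[list[Cell]]) -> tuple[list[int], list[int]]:
    """Return row and column orders: strongest models and easiest items first."""
    n_cols = len(matrix[0])
    row_order = sorted(range(len(matrix)), key=lambda r: -observed_mean(matrix[r]))
    col_order = sorted(
        range(n_cols),
        key=lambda c: -observed_mean([row[c] for row in matrix]),
    )
    return row_order, col_order


def coverage(matrix: list[list[Cell]]) -> float:
    """Fraction of cells that were actually evaluated."""
    total = sum(len(row) for row in matrix)
    seen = sum(c is not None for row in matrix for c in row)
    return seen / total


# Toy data (hypothetical): 3 models x 4 items, scores in [0, 1].
toy = [
    [0.0, None, 1.0, 0.0],
    [1.0, 1.0, 1.0, None],
    [None, 0.0, 1.0, 0.5],
]
rows, cols = order_matrix(toy)
print(rows, cols, coverage(toy))
```

Because each mean is taken over a different set of observed items, two models with similar means may not have been evaluated on comparable problems, which is worth keeping in mind when only 32% of cells are filled.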

[Figure: MathArena response matrix, AI models (rows) against items (columns). Legend: low, high, unobserved.]

Scale: 0 to 1 — exact match (AIME) or fraction of rubric points (USAMO / IMO / …)
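One way to picture how both kinds of score land on the same 0 to 1 scale is the sketch below. It rests on assumptions not stated on this page: that a cell averages per-attempt scores, and that final answers are compared after simple whitespace normalization (a real grader presumably checks mathematical equivalence rather than string equality). The function names are hypothetical.

```python
def exact_match(predicted: str, gold: str) -> float:
    """1.0 if the stripped strings agree, else 0.0 (stand-in matching rule)."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def rubric_fraction(points_awarded: float, points_possible: float) -> float:
    """Share of rubric points earned on a proof problem."""
    return points_awarded / points_possible


def cell_score(attempt_scores: list[float]) -> float:
    """Assumed aggregation: plain average over a model's attempts on one item."""
    return sum(attempt_scores) / len(attempt_scores)


# Final-answer item: four attempts, three of them correct.
attempts = [exact_match(a, "89") for a in ["89", "88", " 89 ", "89"]]
print(cell_score(attempts))

# Proof item graded out of 7 points: 5 points earned.
print(rubric_fraction(5, 7))
```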

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: 2112
Consider a $2025 \times 2025$ grid of unit squares. Matilda wishes to place on the grid some rectangular tiles, possibly of different sizes, such that each side of every tile lies on a grid line and every unit square is covered by at most one tile.
Determine the minimum number of tiles Matilda needs to place so that each row and each column of the grid has exactly one unit square that is not covered by any tile.

How subjects answered

  • Kimi K2 Thinking (score 0)

    3037

  • Step 3.5 Flash (score 0)

    3037

  • GPT OSS 120B (high) (score 0)

    4048

Item 2 · 38% solve rate · answer: $\frac{97-3\sqrt{1001}}{2}$
Let $\omega$ be a circle with diameter $\overline{A B}$, center $O$, and cyclic quadrilateral $A B C D$ inscribed in it, with $C$ and $D$ on the same side of $\overline{A B}$. Let $A B=20, B C=13, A D=7$. Let $\overleftrightarrow{B C}$ and $\overleftrightarrow{A D}$ intersect at $E$. Let the $E$-excircle of $E C D$ have its center at $L$. Find $O L$.

How subjects answered

  • Kimi K2 Thinking (score 1)

    97/2 - 3*sqrt(1001)/2

  • O4-Mini (High) (score 1)

    1.0408329997330663

  • O4-Mini (Medium) (score 0)

    OL

Item 3 · 69% solve rate · answer: $12\pi$
Let $(X,\omega)$ be a compact Kähler surface (complex dimension $2$), and let $g_\omega$ be the associated Riemannian metric. Let $S(\omega):X\to\mathbb R$ denote the Chern scalar curvature of $\omega$ (so that $\operatorname{scal}(g_\omega)=2S(\omega)$), and assume $S(\omega)(x)>0$ for all $x\in X$. Define
\[
\sys_2(\omega):=\inf\{\operatorname{Mass}_{g_\omega}(Z):\ Z\text{ is an integral }2\text{-cycle with }[Z]\neq 0\in H_2(X;\mathbb Z)\}.
\]
Let $m:=\min_{x\in X} S(\omega)(x)$. Determine the smallest universal constant $C\in\mathbb R$ such that for every such $(X,\omega)$ one has $m\,\sys_2(\omega)\le C$.

How subjects answered

  • Kimi K2.5 (Think) (score 1)

    37.69911184307752

  • GPT-5.2 (high) (score 1)

    37.69911184307752

  • Qwen3-4B-2507-Think (score 0)

    2

Item 4 · 92% solve rate · answer: 89
Define the function $f$ on positive integers

$$
f(n)= \begin{cases}\frac{n}{2} & \text { if } n \text { is even } \\ n+1 & \text { if } n \text { is odd }\end{cases}
$$

Let $S(n)$ equal the smallest positive integer $k$ such that $f^{k}(n)=1$. How many positive integers satisfy $S(n)=11$ ?

How subjects answered

  • Kimi K2 Thinking (score 1)

    89

  • GLM 4.6 (score 1)

    89

  • DeepSeek-R1-Distill-14B (score 0)

    89
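The reference answer for Item 4 can be checked by brute force. The sketch below is an independent verification, not the benchmark's grader; it relies on the observation that $f(n) \ge n/2$, so $S(n)=11$ forces $n \le 2^{11}$. The helper name `steps_to_one` is illustrative.

```python
def f(n: int) -> int:
    """The map from the problem statement: halve evens, add one to odds."""
    return n // 2 if n % 2 == 0 else n + 1


def steps_to_one(n: int) -> int:
    """S(n): the smallest k >= 1 with f^k(n) == 1."""
    k = 0
    while True:
        n = f(n)
        k += 1
        if n == 1:
            return k


TARGET = 11
# Each step at most halves n, so f^k(n) >= n / 2**k and S(n) = 11 needs n <= 2**11.
count = sum(1 for n in range(1, 2**TARGET + 1) if steps_to_one(n) == TARGET)
print(count)
```

The counts for $S(n)=3,4,5,\dots$ follow a Fibonacci-style recurrence (2, 3, 5, 8, ...), which is why the result at $k=11$ is a Fibonacci number.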

Item 5 · 99% solve rate · answer: 12
Suppose $n$ integers are placed in a circle such that each of the following conditions is satisfied:
\begin{itemize}
\item at least one of the integers is $0$;
\item each pair of adjacent integers differs by exactly $1$; and
\item the sum of the integers is exactly $24$.
\end{itemize}
Compute the smallest value of $n$ for which this is possible.

How subjects answered

  • Kimi K2 Thinking (score 1)

    12

  • Grok 4 (score 1)

    12

  • Gemini 2.5 Pro (score 0)

    24
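Item 5 is small enough to verify exhaustively. The sketch below assumes the circle is rotated so that a 0 sits first, then treats the arrangement as a closed walk of +1 and -1 steps; the function name `achievable` and the search bound of 14 are choices made here for illustration.

```python
from itertools import product


def achievable(n: int, target: int) -> bool:
    """True if some circle of n integers meets all three conditions."""
    # Rotate the circle so a 0 comes first, then pick each step as +1 or -1.
    for steps in product((1, -1), repeat=n):
        if sum(steps) != 0:  # the walk must close up at the starting 0
            continue
        value, total = 0, 0
        for s in steps:
            total += value  # add the integer at this position
            value += s      # move to the next position on the circle
        if total == target:
            return True
    return False


smallest = next(n for n in range(1, 15) if achievable(n, 24))
print(smallest)
```

The search also shows why $n=10$ fails even though the peak arrangement $0,1,\dots,5,\dots,1$ sums to 25: with ten entries alternating in parity, five of them are odd, so the total is always odd.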

Item 6 · 100% solve rate · answer: 178
Find the sum of the $10$th terms of all arithmetic sequences of integers that have first term equal to $4$ and include both $24$ and $34$ as terms.

How subjects answered

  • Kimi K2.5 (Think) (score 1)

    178

  • QED-Nano (score 1)

    178

  • Grok 4.1 Fast (Reasoning) (score 1)

    178
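The arithmetic behind Item 6 is short enough to enumerate directly. In this sketch (variable names are illustrative), a common difference $d$ qualifies when it divides both $24-4$ and $34-4$; since both required terms exceed the first term, $d$ must be positive.

```python
FIRST, REQUIRED, INDEX = 4, (24, 34), 10

tenth_terms = []
# 24 and 34 both exceed the first term, so d > 0, and d cannot exceed 24 - 4 = 20.
for d in range(1, 21):
    if all((t - FIRST) % d == 0 for t in REQUIRED):
        tenth_terms.append(FIRST + (INDEX - 1) * d)

print(tenth_terms, sum(tenth_terms))
```

The qualifying differences are the divisors of $\gcd(20, 30) = 10$, namely 1, 2, 5, and 10, giving tenth terms 13, 22, 49, and 94.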

Subjects

The models, agents, and reward models evaluated.

97 subjects, ranked by mean response across this benchmark's items.

  1. Grok 4 Fast (Reasoning) (Selfcheck Agent): 1
  2. GPT-5.2 (xhigh): 0.991
  3. GLM 4.7: 0.983
  4. Gemini 2.5 Pro (agent): 0.968
  5. Gemini IMO Deep Think: 0.953
  6. Gemini 2.5 Pro (best-of-32): 0.889
  7. gemini-2.5-pro*: 0.875
  8. DeepSeek-v3.2-Speciale Agent: 0.858
  9. Gemini 2.5 Pro (05-06): 0.844
  10. Gemini Deep Think 3 (12/25): 0.833
  11. Gemini 3.1 Pro Preview: 0.818
  12. GPT OSS 20B (high): 0.813
  13. GPT-5.4 (xhigh): 0.808
  14. GPT-5.2 (high): 0.786
  15. Gemini 3 Pro (preview): 0.781
  16. GLM 4.5 Air: 0.776
  17. Gemini 3 Flash: 0.776
  18. Grok 3 Mini (high): 0.775
  19. DeepSeek-v3.2-Speciale: 0.769
  20. Qwen3-235B-A22B: 0.767
  21. Kimi K2.5 (Think): 0.762
  22. O4-Mini (Medium): 0.762
  23. K2-Think: 0.754
  24. GPT-5.1 (high): 0.743
  25. Step 3.5 Flash: 0.741
  26. GPT-5 (high): 0.739
  27. Qwen3-VL-235B Instruct: 0.735
  28. GPT-5.4-Pro (xhigh): 0.732
  29. GPT-5-Pro: 0.732
  30. GLM 5: 0.726
  31. GLM 4.6: 0.715
  32. GPT-5-mini (high): 0.71
  33. gemini-2.5-flash (think): 0.708
  34. O3 (High): 0.707
  35. Claude-Opus-4.0 (Think): 0.706
  36. Kimi K2 Thinking: 0.706

+ 61 more subjects evaluated.