
Reasoning & Knowledge

MMLU-Pro

MMLU-Pro: a 12K-question challenge variant of MMLU spanning 14 subject areas, with per-question binary correctness and per-category leaderboard aggregates.

Items: 13,542
Subjects: 48
Observed: 86%
Subject type: Model
License: MIT
Domains: general, reasoning
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 48 subjects × 13,542 items, 86% of cells evaluated. The heatmap shows a representative 144 of 13,542 items — evenly sampled across difficulty — so each cell stays square and legible.
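The ordering and sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function names, the toy matrix, the use of `None` for unobserved cells, and the choice to sort by mean over observed cells are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def mean_observed(cells: Sequence[Cell]) -> float:
    """Mean over observed cells only; 0.0 if nothing was observed (arbitrary choice)."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else 0.0


def observed_fraction(matrix: Sequence[Sequence[Cell]]) -> float:
    """Share of cells that hold a 0 or 1 rather than None."""
    cells = [c for row in matrix for c in row]
    return sum(c is not None for c in cells) / len(cells)


def order_matrix(matrix: Sequence[Sequence[Cell]]) -> Tuple[List[int], List[int]]:
    """Return row and column orders: strongest subjects first, easiest items first."""
    n_cols = len(matrix[0])
    accuracy = [mean_observed(row) for row in matrix]
    solve_rate = [mean_observed([row[j] for row in matrix]) for j in range(n_cols)]
    row_order = sorted(range(len(matrix)), key=lambda i: -accuracy[i])
    col_order = sorted(range(n_cols), key=lambda j: -solve_rate[j])
    return row_order, col_order


def sample_evenly(n_items: int, k: int) -> List[int]:
    """Pick k positions evenly spaced along the difficulty-sorted columns."""
    if k >= n_items:
        return list(range(n_items))
    if k == 1:
        return [0]
    step = (n_items - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


# Hypothetical toy matrix: 3 subjects x 4 items.
toy = [
    [0, 1, None, 0],
    [1, 1, 1, 0],
    [0, 1, 0, None],
]
row_order, col_order = order_matrix(toy)
shown = sample_evenly(13542, 144)  # positions within the sorted columns
```

Sorting on means over observed cells keeps a partially evaluated subject from being penalised for items it never saw, though it also means two subjects may be ranked on different item subsets.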

MMLU-Pro response matrix: AI models (rows) against items (columns)
Legend: correct (1) · incorrect (0) · unobserved

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
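The answers shown for each item range from a bare letter to a full chain of reasoning ending in "The answer is (I)." A rough sketch of how such responses could be reduced to the 1/0 scale follows. It is not the benchmark's official grader: the regular expression, the function names, and the rule that an unparseable response scores 0 are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed convention: free-form responses end with "answer is (X)" or "answer is X".
_FINAL_ANSWER = re.compile(r"[Aa]nswer is \(?([A-J])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option letter (A-J), or None if none can be found."""
    text = response.strip()
    if re.fullmatch(r"[A-J]", text):
        return text  # the response is just a letter
    matches = _FINAL_ANSWER.findall(text)
    return matches[-1] if matches else None  # prefer the last stated answer


def score(response: str, gold: str) -> int:
    """1 if the extracted choice matches the gold letter, else 0."""
    return int(extract_choice(response) == gold)


# Examples modelled on the sample items on this page.
print(score("F", "F"))
print(score("... the magnification is 5 times. The answer is (I).", "I"))
print(score("H", "I"))
```

Taking the last match rather than the first is a design choice: a response that reconsiders midway is graded on its final stated answer.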

Item 1 · 0% solve rate · answer: A
The Law of Diminishing Marginal Returns is responsible for

A: MC that first falls, but eventually rises, as output increases.
B: AP that first falls, but eventually rises, as output increases.
C: TC that first falls, but eventually rises, as output increases.
D: AVC that first rises, but eventually falls, as output increases.
E: MP that first falls, but eventually rises, as output increases.
F: AFC that first rises, but eventually falls, as output increases.
G: MP that first rises, but eventually falls, as output increases.
H: TP that first rises, but eventually falls, as output increases.
I: AFC that first falls, but eventually rises, as output increases.
J: AVC that first falls, but eventually rises, as output increases.

How subjects answered

  • Meta-Llama-3-70B-Instruct incorrect

    G

Item 2 · 16% solve rate · answer: F
Kirkwood gaps are observed in the main asteroid belt including at the position(s) where:

A: asteroids would orbit with a period that is three times that of Jupiter
B: asteroids would orbit with a period three times that of Mars
C: asteroids would orbit with a period twice that of Mars
D: asteroids would orbit with a period twice that of Jupiter
E: A and B
F: asteroids would orbit with a period half that of Jupiter
G: B and C
H: asteroids would orbit with a period that is equal to that of Mars
I: asteroids would orbit with a period half that of Mars
J: asteroids would orbit with a period that is equal to that of Jupiter

How subjects answered

  • deepseek correct

    F

  • arx_0314 correct

    {"response": "Let's think through this step-by-step:\n\n1. Kirkwood gaps are regions in the asteroid belt where there are fewer asteroids.\n\n2. These gaps are caused by orbital resonances with Jupiter.\n\n3. Orbital resonances occur when the orbital period of an asteroid is a simple fraction or multiple of Jupiter's orbital period.\n\n4. The most significant Kirkwood gaps correspond to 1:3, 2:5, and 1:2 resonances with Jupiter.\n\n5. A 1:2 resonance means an asteroid would orbit twice for every one orbit of Jupiter, or in other words, its orbital period would be half that of Jupiter.\n\n6. A …

  • Qwen1.5-14B-Chat incorrect

    D

Item 3 · 38% solve rate · answer: E
A plaintiff sued the insurer of her home after the insurer denied coverage for water damage to the home allegedly caused by a frozen plastic pipe that burst. At trial, the insurer called as an expert witness an engineer, who testified that the pipe had burst because of age rather than freezing. On cross-examination, the engineer admitted that five years earlier, he had been convicted of tax fraud, even though he had asserted that it was his accountant's error. In response, the insurer calls a witness, who is well acquainted with the engineer and his reputation, to testify that (1) in the witness's opinion, the engineer is a truthful person, and (2) the engineer's neighbors all describe him as a truthful person. How much, if any, of the witness's testimony is admissible?

A: None of the testimony is admissible, because it is collateral, having no bearing on the engineer's qualifications as an expert.
B: Only the portion concerning the engineer's reputation is admissible, because where both opinion and reputation evidence are available, only the latter is admissible under a rule of pre …

How subjects answered

  • Yi-34B correct

    E

  • Mistral-7B-v0.1 correct

    E

  • Mistral-7B-Instruct-v0.2 incorrect

    J

Item 4 · 59% solve rate · answer: I
A telescope is made from two lenses, having focal lengths of 5 and 1 cm. a) What should be the distance between lenses if the initial object and the final virtual image are both very far away? b) What is the magnification under these circumstances?

A: 2 cm, -1 time (indicating an inverted image)
B: 4 cm, 2 times
C: 8 cm, 3 times
D: 10 cm, 10 times
E: 5 cm, 6 times
F: 7 cm, 4 times
G: 3 cm, 1 time
H: 11 cm, 0 times (indicating no magnification)
I: 6 cm, 5 times
J: 9 cm, 7 times

How subjects answered

  • deepseek-coder-v2 correct

    a) The distance between the lenses should be the sum of their focal lengths if both the initial object and the final virtual image are very far away. Therefore, the distance between the lenses is 5 cm + 1 cm = 6 cm. b) The magnification under these circumstances is given by the ratio of the focal lengths of the two lenses. Since the focal length of the objective lens is 5 cm and the focal length of the eyepiece lens is 1 cm, the magnification is 5 cm / 1 cm = 5 times. The answer is (I).

  • Qwen1.5-14B-Chat correct

    I

  • Yi-6B-Chat incorrect

    H

Item 5 · 81% solve rate · answer: C
Which of the following activities is not an element of a CPA firm's quality control system to be considered in establishing quality control policies and procedures?

A: Deciding whether to accept or continue a client relationship.
B: Selecting personnel for advancement who have the necessary qualifications.
C: Assessing a client's ability to establish effective internal controls.
D: Monitoring the effectiveness of professional development activities.

How subjects answered

  • deepseek-coder-v2 correct

    We refer to Wikipedia articles on accounting for help. A CPA firm's quality control system involves various aspects to ensure the quality of their services. Option A involves risk assessment and client acceptance, which is a standard practice. Option B focuses on personnel selection, which is crucial for maintaining quality. Option D involves continuous professional development, which is also a standard practice in maintaining quality. Option C, however, pertains to assessing a client's internal controls, which is not an element of a CPA firm's quality control system but rather a part of the a …

  • gemini-1.5-flash-002 correct

    Let's think step by step. A CPA firm's quality control system focuses on the firm's internal processes and procedures to ensure the quality of its audits and other services. Options A, B, and D all relate directly to internal processes: client selection, personnel management, and professional development. Option C, assessing a client's internal controls, is part of the audit process itself, not the firm's overall quality control system. The quality control system ensures the firm *has* the processes to properly assess client internal controls, but the assessment itself is not an element of t …

  • Yi-6B incorrect

Item 6 · 100% solve rate · answer: E
An electret has a permanent electric dipole moment in the absence of free charges. Given an electret sphere of radius R, with electric polarisation vector P = r̂ Po r. Determine the bound charge density \rho', electric displacement vector D, and the electric field E as functions of r.

A: ρ' = P_0 / 3, D = ε_0r̂P_0, E = r̂(3P_0 / ε_0) for r < R and E(r > R) = r̂(P_0 / ε_0)
B: ρ' = -P_0 / r^2, D = ε_0r̂P_0r, E = -r̂(P_0 / ε_0r^2) for r < R and E(r > R) = -r̂(P_0 / ε_0R^2)
C: ρ' = 2P_0, D = ε_0r̂P_0r, E = r̂(2P_0r / ε_0) for r < R and E(r > R) = r̂(2P_0R / ε_0)
D: ho' = - 3P_0, D = 0, E = - r̂(P_0r / \epsilon_0) for r > R and E(r < R) = 0
E: ho' = - 3P_0, D = 0, E = - r̂(P_0r / \epsilon_0) for r < R and E(r > R) = 0
F: ρ' = 0, D = ε_0r̂P_0, E = r̂(P_0r^2 / ε_0) for r < R and E(r > R) = 0
G: ρ' = P_0r, D = ε_0r̂(P_0r^2), E = r̂(P_0 / ε_0) for r < R and E(r > R) = r̂(P_0R / ε_0)
H: ρ' = -2P_0/r, D = ε_0r̂(P_0r), E = -r̂(2P_0 / ε_0r) for r < R and E(r > R) = -r̂(2P_0 / ε_0R)
I: ho' = 3P_0, D = 1, E = r̂(P_0r / \epsilon_0) for r < R and E(r > R) = 0
J: ho' = - 3P_0, D = 1, E = - r̂(P_0r / \epsilon_0) fo …

How subjects answered

  • gemini-3.1-pro correct

    The bound volume charge density $\rho'$ (written as ho' in the options) is given by the negative divergence of the polarization vector $\mathbf{P}$: $$\rho' = -\nabla \cdot \mathbf{P}$$ In spherical coordinates, the divergence of a purely radial vector field $\mathbf{P} = P_0 r \mathbf{\hat{r}}$ is: $$\nabla \cdot \mathbf{P} = \frac{1}{r^2} \frac{\partial}{\partial r} (r^2 P_r) = \frac{1}{r^2} \frac{\partial}{\partial r} (r^2 \cdot P_0 r) = \frac{1}{r^2} \frac{\partial}{\partial r} (P_0 r^3) = \frac{1}{r^2} (3 P_0 r^2) = 3 P_0$$ Therefore, the bound charge density is: $$\rho' = -3 P_0$$ Next, …

Subjects

The models, agents, and reward models evaluated.

48 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gemini-3.1-pro: 0.873
  2. arx_0314: 0.83
  3. iask_pro: 0.812
  4. arx_3: 0.782
  5. claude-3-5-sonnet-20241022: 0.776
  6. claude-3.5-sonnet: 0.761
  7. gpt-4o-2024-08-06: 0.747
  8. sonnet-3.5: 0.728
  9. gpt4o(2024-05-13): 0.721
  10. gemini-1.5-pro-002: 0.702
  11. opus: 0.685
  12. deepseek-chat-v2_5: 0.658
  13. gemini-1.5-flash-002: 0.641
  14. deepseek-coder-v2: 0.636
  15. gpt-4o-mini: 0.631
  16. Meta-Llama-3_1-70B-Instruct: 0.628
  17. claude-3-5-haiku-20241022: 0.621
  18. flash: 0.591
  19. Meta-Llama-3-70B-Instruct: 0.556
  20. sonnet: 0.551
  21. deepseek: 0.525
  22. Meta-Llama-3_1-70B: 0.525
  23. Meta-Llama-3-70B: 0.52
  24. jamba-1.5-large: 0.495
  25. Qwen1.5-110B: 0.492
  26. qwen1.5-72b-chat: 0.471
  27. Meta-Llama-3_1-8B-Instruct: 0.442
  28. Yi-34B: 0.421
  29. mathstral-7B: 0.42
  30. Mixtral-8x7B-Instruct-v0.1: 0.419
  31. Phi-3-mini-4k-instruct: 0.415
  32. Meta-Llama-3-8B-Instruct: 0.401
  33. Mixtral-8x7B-v0.1: 0.399
  34. c4ai-command-r-v01: 0.366
  35. Meta-Llama-3_1-8B: 0.365
  36. Llama-2-70b-hf: 0.363

+ 12 more subjects evaluated.