Correlated Errors in LLMs (HELM MMLU)

Correlated Errors in LLMs: HELM MMLU per-(model × question) responses. 71 LLMs answer ~14K MMLU multiple-choice questions, and each response is scored for binary correctness.

13,868 items
71 subjects
100% observed
MIT license
general domain
reasoning domain
text modality

Response matrix

[Figure: Correlated Errors in LLMs (HELM MMLU) response matrix, AI models (rows) against items (columns).]

Legend: Correct (1) · Incorrect (0) · Unobserved
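
A response matrix of this shape can be assembled from long-format records. The sketch below is illustrative only: the record layout, the toy subject and item names, and the helper functions are assumptions, not the dataset's actual file format. It shows how the observed fraction, per-item solve rate, and per-subject mean correctness would be derived from binary cells.

```python
# Hypothetical long-format records: (subject, item_id, correct), correct in {0, 1}.
records = [
    ("model-a", "q1", 1), ("model-a", "q2", 0), ("model-a", "q3", 1),
    ("model-b", "q1", 1), ("model-b", "q2", 0), ("model-b", "q3", 0),
]


def build_matrix(records):
    """Return (subjects, items, matrix) with subjects as rows and items as columns."""
    subjects = sorted({s for s, _, _ in records})
    items = sorted({i for _, i, _ in records})
    cell = {(s, i): c for s, i, c in records}
    # None marks an unobserved (subject, item) cell.
    matrix = [[cell.get((s, i)) for i in items] for s in subjects]
    return subjects, items, matrix


def mean_observed(values):
    """Mean over observed cells only; None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def observed_fraction(matrix):
    """Share of cells that hold a response."""
    cells = [c for row in matrix for c in row]
    return sum(c is not None for c in cells) / len(cells)


def solve_rates(matrix):
    """Per-item solve rate: column means of the matrix."""
    return [mean_observed(col) for col in zip(*matrix)]


def subject_accuracy(matrix):
    """Per-subject mean correctness: row means of the matrix."""
    return [mean_observed(row) for row in matrix]


subjects, items, matrix = build_matrix(records)
print(observed_fraction(matrix))  # fraction of filled cells
print(dict(zip(items, solve_rates(matrix))))
print(dict(zip(subjects, subject_accuracy(matrix))))
```

Under these assumptions, a fully observed matrix gives an observed fraction of 1.0, and sorting items by solve rate gives an easiest-to-hardest ordering.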

Sample items

Item 1 · 0% solve rate · answer: D

A resident announced his candidacy for state representative. A law in the state requires new political entrants, regardless of party affiliation, to obtain five times the number of signatures as other candidates who have run for office previously. The resident, however, failed to obtain the necessary number of authenticating signatures to have his name placed on the ballot. On the substantive constitutional issue regarding the validity of the authenticating requirement, the most probable judicial resolution will be to

How subjects answered

  • anthropic/claude-3-haiku-20240307 incorrect

    B

  • ai21/jamba-instruct incorrect

    B

  • anthropic/claude-instant-1.2 incorrect

    B

  • qwen/qwen2.5-7b-instruct-turbo incorrect

    A

  • writer/palmyra-x-004 incorrect

    B

  • qwen/qwen2.5-72b-instruct-turbo incorrect

    B

Item 2 · 23% solve rate · answer: C

The most massive subatomic particle is the

How subjects answered

  • ai21/jamba-instruct correct

    C

  • meta/llama-3.1-70b-instruct-turbo correct

    C

  • 01-ai/yi-6b correct

    C

  • qwen/qwen2-72b-instruct incorrect

    B

  • writer/palmyra-x-v3 incorrect

    B

  • qwen/qwen2.5-72b-instruct-turbo incorrect

    A

Item 3 · 39% solve rate · answer: A

Which of these does not form part of the key streams identified by March (1988) in decision making in highly ambiguous environments?

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    A

  • ai21/jamba-1.5-large correct

    A

  • anthropic/claude-3-opus-20240229 correct

    A

  • openai/gpt-4o-mini-2024-07-18 incorrect

    D

  • qwen/qwen1.5-72b incorrect

    D

  • qwen/qwen1.5-110b-chat incorrect

    D

Item 4 · 54% solve rate · answer: B

The fallacy of extension consists of

How subjects answered

  • ai21/jamba-instruct correct

    B

  • anthropic/claude-instant-1.2 correct

    B

  • anthropic/claude-3-opus-20240229 correct

    B

  • writer/palmyra-x-v3 incorrect

    D

  • qwen/qwen2.5-7b-instruct-turbo incorrect

    D

  • writer/palmyra-x-004 incorrect

    D

Item 5 · 65% solve rate · answer: D

For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020? Scenario 1 | I hid my diabetic children's presents from them until Christmas morning. Scenario 2 | I ate a slice of my friends tasty pork shoulder.

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    D

  • ai21/jamba-instruct correct

    D

  • anthropic/claude-instant-1.2 correct

    D

  • snowflake/snowflake-arctic-instruct incorrect

    B

  • qwen/qwen2.5-7b-instruct-turbo incorrect

    B

  • qwen/qwen2.5-72b-instruct-turbo incorrect

    C

Item 6 · 76% solve rate · answer: A

An auditor of a nonissuer may not issue a qualified opinion when

How subjects answered

  • anthropic/claude-3-opus-20240229 correct

    A

  • ai21/jamba-1.5-large correct

    A

  • anthropic/claude-3-5-sonnet-20241022 correct

    A

  • openai/gpt-3.5-turbo-0613 incorrect

    C

  • qwen/qwen1.5-32b incorrect

    C

  • qwen/qwen1.5-72b incorrect

    B

Item 7 · 85% solve rate · answer: A

Which of the following statements is incorrect regarding fatty acid β-oxidation?

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    A

  • anthropic/claude-instant-1.2 correct

    A

  • anthropic/claude-3-opus-20240229 correct

    A

  • allenai/olmo-7b incorrect

    D

  • allenai/olmo-1.7-7b incorrect

    D

  • mistralai/mistral-7b-v0.1 incorrect

    B

Item 8 · 89% solve rate · answer: A

Which coefficient indicates the strongest correlation?

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    A

  • ai21/jamba-instruct correct

    A

  • anthropic/claude-instant-1.2 correct

    A

  • 01-ai/yi-large-preview incorrect

    D

  • allenai/olmo-7b incorrect

    D

  • allenai/olmo-1.7-7b incorrect

    D

Item 9 · 93% solve rate · answer: C

Judaism regards sexuality as

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    C

  • ai21/jamba-instruct correct

    C

  • anthropic/claude-instant-1.2 correct

    C

  • google/gemini-1.5-flash-001 incorrect

  • google/gemini-1.5-pro-001 incorrect

  • allenai/olmo-7b incorrect

    B

Item 10 · 96% solve rate · answer: C

Which of the following is true of facial nerve palsy?

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    C

  • ai21/jamba-instruct correct

    C

  • anthropic/claude-instant-1.2 correct

    C

  • deepseek-ai/deepseek-llm-67b-chat incorrect

    B

  • meta/llama-2-7b incorrect

    A

  • allenai/olmo-7b incorrect

    B

Item 11 · 97% solve rate · answer: B

A continuous, projecting ridge of bone above and across the eye orbits is called the:

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    B

  • ai21/jamba-instruct correct

    B

  • anthropic/claude-instant-1.2 correct

    B

  • qwen/qwen2.5-72b-instruct-turbo correct

    B

  • microsoft/phi-2 incorrect

    A

  • meta/llama-2-7b incorrect

    A

Item 12 · 99% solve rate · answer: C

Which is true of monopolistic competition?

How subjects answered

  • anthropic/claude-3-haiku-20240307 correct

    C

  • ai21/jamba-instruct correct

    C

  • anthropic/claude-instant-1.2 correct

    C

  • qwen/qwen2.5-7b-instruct-turbo correct

    C

  • qwen/qwen2.5-72b-instruct-turbo correct

    C

  • allenai/olmo-7b incorrect

    B
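
The sample items above record which option each model chose, not just whether it was right, so they can be used to check whether two models make the same mistake. The sketch below is one simple way to measure that, not necessarily how the dataset authors do it: for a pair of models, count the items both answered incorrectly, and how many of those share the same wrong option. The answers are copied from sample items 1, 4 and 5. The `SAMPLE` layout and the `error_agreement` helper are hypothetical names introduced for illustration.

```python
# item number -> (answer key, {model: chosen option}), copied from the sample items.
SAMPLE = {
    1: ("D", {
        "anthropic/claude-3-haiku-20240307": "B",
        "ai21/jamba-instruct": "B",
        "anthropic/claude-instant-1.2": "B",
        "qwen/qwen2.5-7b-instruct-turbo": "A",
        "writer/palmyra-x-004": "B",
        "qwen/qwen2.5-72b-instruct-turbo": "B",
    }),
    4: ("B", {
        "ai21/jamba-instruct": "B",
        "anthropic/claude-instant-1.2": "B",
        "anthropic/claude-3-opus-20240229": "B",
        "writer/palmyra-x-v3": "D",
        "qwen/qwen2.5-7b-instruct-turbo": "D",
        "writer/palmyra-x-004": "D",
    }),
    5: ("D", {
        "anthropic/claude-3-haiku-20240307": "D",
        "ai21/jamba-instruct": "D",
        "anthropic/claude-instant-1.2": "D",
        "snowflake/snowflake-arctic-instruct": "B",
        "qwen/qwen2.5-7b-instruct-turbo": "B",
        "qwen/qwen2.5-72b-instruct-turbo": "C",
    }),
}


def error_agreement(sample, model_a, model_b):
    """Return (same_wrong, both_wrong) over items where both models have an answer."""
    same_wrong = both_wrong = 0
    for key, choices in sample.values():
        a, b = choices.get(model_a), choices.get(model_b)
        if a is None or b is None:
            continue  # at least one model is not listed for this item
        if a != key and b != key:
            both_wrong += 1
            if a == b:
                same_wrong += 1
    return same_wrong, both_wrong


same, both = error_agreement(
    SAMPLE, "qwen/qwen2.5-7b-instruct-turbo", "writer/palmyra-x-004"
)
print(f"{same} of {both} shared errors use the same wrong option")
```

Three items are far too few to draw conclusions. On the full matrix, the same ratio would be computed over every item the two models both got wrong.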

Subjects

  1. anthropic/claude-3-5-sonnet-20241022 · 0.8764
  2. anthropic/claude-3-5-sonnet-20240620 · 0.872
  3. google/gemini-1.5-pro-002 · 0.8641
  4. anthropic/claude-3-opus-20240229 · 0.8504
  5. openai/gpt-4o-2024-08-06 · 0.8495
  6. meta/llama-3.1-405b-instruct-turbo · 0.849
  7. openai/gpt-4o-2024-05-13 · 0.8482
  8. openai/gpt-4-0613 · 0.8401
  9. qwen/qwen2.5-72b-instruct-turbo · 0.8331
  10. google/gemini-1.5-pro-001 · 0.8274
  11. qwen/qwen2-72b-instruct · 0.8274
  12. openai/gpt-4-turbo-2024-04-09 · 0.8194
  13. writer/palmyra-x-004 · 0.8176
  14. meta/llama-3.2-90b-vision-instruct-turbo · 0.8109
  15. openai/gpt-4-1106-preview · 0.8101
  16. meta/llama-3.1-70b-instruct-turbo · 0.8091
  17. google/gemini-1.5-pro-preview-0409 · 0.8086
  18. mistralai/mistral-large-2407 · 0.8066
  19. 01-ai/yi-large-preview · 0.8
  20. meta/llama-3-70b · 0.7867
  21. ai21/jamba-1.5-large · 0.7824
  22. google/text-unicorn@001 · 0.7799
  23. writer/palmyra-x-v3 · 0.7799
  24. google/gemini-1.5-flash-001 · 0.7755
  25. google/gemini-1.5-flash-preview-0514 · 0.7746
  26. qwen/qwen1.5-110b-chat · 0.7744
  27. qwen/qwen1.5-72b · 0.774
  28. microsoft/phi-3-medium-4k-instruct · 0.7738
  29. mistralai/mixtral-8x22b · 0.7734
  30. anthropic/claude-3-sonnet-20240229 · 0.762
  31. 01-ai/yi-34b · 0.7616
  32. microsoft/phi-3-small-8k-instruct · 0.7613
  33. openai/gpt-4o-mini-2024-07-18 · 0.7577
  34. google/gemma-2-27b · 0.7481
  35. google/gemini-1.5-flash-002 · 0.7402
  36. qwen/qwen1.5-32b · 0.7394