MMLU Moral Disputes

Per-(model, question) accuracy (acc in {0, 1}) on MMLU moral_disputes, an applied-ethics multiple-choice subject, taken from the Open LLM Leaderboard v1 details datasets. The model panel is capped at 150.

Items: 346
Subjects: 144
Observed: 100%
Subject type: Model
License: MIT
Domain: safety
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 144 subjects × 346 items, 100% of cells evaluated.
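The page does not state the exact ordering rule, so the following is only a minimal sketch of one plausible approach: sort rows by mean accuracy and columns by solve rate, both descending, so that the densest block of correct answers lands in the top-left corner. The function name `order_matrix` and the choice of corner are assumptions made for illustration.

```python
def order_matrix(matrix):
    """Reorder a binary response matrix (list of equal-length 0/1 rows).

    Assumed rule: rows sorted by mean accuracy, columns by solve rate,
    both descending. Ties keep their original relative order.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    if n_rows == 0 or n_cols == 0:
        return [], [], []

    # Row marginals: fraction of items each model answered correctly.
    row_means = [sum(row) / n_cols for row in matrix]
    # Column marginals: fraction of models that solved each item.
    col_means = [sum(row[c] for row in matrix) / n_rows for c in range(n_cols)]

    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])

    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


if __name__ == "__main__":
    toy = [[0, 0, 1], [1, 1, 1], [0, 1, 1]]
    ordered, rows, cols = order_matrix(toy)
    for row in ordered:
        print(row)
```

Sorting on the marginals is the simplest choice; a fitted ability/difficulty model could give a different ordering when the matrix has unobserved cells, which is not the case here (100% observed).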

[Figure: MMLU Moral Disputes response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
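How the sample items are chosen is not documented on this page. As a hedged illustration, the sketch below computes each item's solve rate (the share of subjects that answered it correctly) and then picks `k` items evenly spaced by difficulty rank, hardest first. Both helper names, `solve_rates` and `spread_sample`, are hypothetical.

```python
def solve_rates(matrix):
    """Fraction of subjects (rows) that answered each item (column) correctly."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[c] for row in matrix) / n_rows for c in range(n_cols)]


def spread_sample(rates, k):
    """Pick k item indices evenly spaced by difficulty rank.

    Items are ranked from lowest to highest solve rate. This selection
    rule is an assumption, not the page's documented method.
    """
    order = sorted(range(len(rates)), key=lambda c: rates[c])
    n = len(order)
    if k >= n:
        return order
    if k == 1:
        return [order[n // 2]]
    # Integer arithmetic keeps the chosen positions distinct and in range.
    return [order[(i * (n - 1)) // (k - 1)] for i in range(k)]


if __name__ == "__main__":
    rates = [0.9, 0.1, 0.5, 0.3, 0.7]
    print(spread_sample(rates, 3))
```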

Item 1 · 3% solve rate · answer: nan

nan

Subject outcomes

  • EleutherAI__pythia-2.8b-deduped correct
  • mosaicml__mpt-7b-storywriter correct
  • OptimalScale__robin-13b-v2-delta incorrect

Item 2 · 26% solve rate · answer: nan

nan

Subject outcomes

  • databricks__dolly-v2-7b correct
  • bigcode__gpt_bigcode-santacoder correct
  • OptimalScale__robin-13b-v2-delta incorrect

Item 3 · 38% solve rate · answer: nan

nan

Subject outcomes

  • Corianas__Quokka_2.7b correct
  • OpenBuddy__openbuddy-llama-65b-v8-bf16 correct
  • golaxy__gogpt-3b-bloom incorrect

Item 4 · 49% solve rate · answer: nan

nan

Subject outcomes

  • golaxy__gogpt2-7b correct
  • WizardLM__WizardLM-13B-V1.1 correct
  • vicgalle__alpaca-7b incorrect

Item 5 · 60% solve rate · answer: nan

nan

Subject outcomes

  • golaxy__gogpt2-7b correct
  • OpenBuddy__openbuddy-llama2-13b-v8.1-fp16 correct
  • vicgalle__gpt2-alpaca-gpt4 incorrect

Item 6 · 83% solve rate · answer: nan

nan

Subject outcomes

  • Corianas__Quokka_2.7b correct
  • OpenBuddy__openbuddy-llama2-13b-v8.1-fp16 correct
  • Lajonbot__Llama-2-7b-chat-hf-instruct-pl-lora_unload incorrect

Subjects

The models, agents, and reward models evaluated.

144 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. augtoma__qCammel-70-x: 0.803
  2. jordiclive__Llama-2-70b-oasst-1-200: 0.789
  3. quantumaikr__llama-2-70b-fb16-guanaco-1k: 0.783
  4. upstage__Llama-2-70b-instruct-v2: 0.78
  5. deepnight-research__llama-2-70B-inst: 0.78
  6. liuxiang886__llama2-70B-qlora-gpt4: 0.78
  7. upstage__Llama-2-70b-instruct: 0.777
  8. jarradh__llama2_70b_chat_uncensored: 0.772
  9. MayaPH__GodziLLa2-70B: 0.754
  10. upstage__llama-65b-instruct: 0.751
  11. HiTZ__alpaca-lora-65b-en-pt-es-ca: 0.737
  12. WizardLM__WizardLM-70B-V1.0: 0.728
  13. lilloukas__Platypus-30B: 0.723
  14. OpenBuddy__openbuddy-llama-65b-v8-bf16: 0.723
  15. quantumaikr__QuantumLM-70B-hf: 0.714
  16. lilloukas__GPlatty-30B: 0.702
  17. upstage__llama-30b-instruct-2048: 0.697
  18. OptimalScale__robin-65b-v2-delta: 0.691
  19. upstage__llama-30b-instruct: 0.665
  20. OpenBuddyEA__openbuddy-llama-30b-v7.1-bf16: 0.647
  21. shareAI__bimoGPT-llama2-13b: 0.647
  22. camel-ai__CAMEL-33B-Combined-Data: 0.645
  23. layoric__llama-2-13b-code-alpaca: 0.642
  24. Lajonbot__Llama-2-13b-hf-instruct-pl-lora_unload: 0.639
  25. Aeala__GPT4-x-AlpacaDente2-30b: 0.639
  26. augtoma__qCammel-13: 0.639
  27. Aeala__GPT4-x-AlpacaDente-30b: 0.636
  28. CalderaAI__13B-Legerdemain-L2: 0.633
  29. NousResearch__Redmond-Puffin-13B: 0.63
  30. OpenBuddy__openbuddy-llama2-13b-v8.1-fp16: 0.607
  31. CalderaAI__30B-Lazarus: 0.601
  32. WizardLM__WizardLM-13B-V1.2: 0.598
  33. NousResearch__Nous-Hermes-Llama2-13b: 0.595
  34. MayaPH__GodziLLa-30B: 0.587
  35. LLMs__WizardLM-13B-V1.0: 0.581
  36. edor__Stable-Platypus2-mini-7B: 0.578

+ 108 more subjects evaluated.
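A ranking like the one above can be reproduced from per-item 0/1 responses by averaging each subject's row. The sketch below is illustrative only: `rank_subjects` is a hypothetical helper, the toy model names are made up, and breaking ties alphabetically is an assumption (the page's own tie order suggests it sorts on unrounded means or another key).

```python
def rank_subjects(responses):
    """Rank subjects by mean accuracy, highest first.

    responses: dict mapping subject name -> non-empty list of 0/1 outcomes.
    Returns a list of (rank, name, mean rounded to 3 decimals).
    Ties are broken alphabetically by name (an assumption).
    """
    means = {name: sum(vals) / len(vals) for name, vals in responses.items()}
    ranked = sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(i + 1, name, round(mean, 3)) for i, (name, mean) in enumerate(ranked)]


if __name__ == "__main__":
    toy = {
        "model_a": [1, 1, 0, 1],
        "model_b": [1, 0, 0, 0],
        "model_c": [1, 1, 1, 1],
    }
    for rank, name, mean in rank_subjects(toy):
        print(f"{rank}. {name}: {mean}")
```

With 346 items, each mean is a multiple of 1/346. For example, 278 correct answers would give 278/346 ≈ 0.803, matching the top score once rounded to three decimals.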