Safety & Security

MMLU Moral Disputes

Per-(model, question) accuracy (acc in {0, 1}) on MMLU moral_disputes, an applied-ethics multiple-choice subject, taken from the Open LLM Leaderboard v1 details datasets. The model panel is capped at 150 models.

Items: 346
Subjects: 144
Observed: 100%
Subject type: Model
License: MIT
Domain: safety
Modality: text


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 144 subjects × 346 items, 100% of cells evaluated.
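
The ordering described above can be sketched in a few lines. The page does not state its exact sorting rule, so this sketch assumes a simple one: rows sorted by descending mean accuracy and columns by descending solve rate, which pushes correct cells toward the top-left corner. The function name `order_response_matrix` and the toy subject names are hypothetical.

```python
def order_response_matrix(matrix, subjects):
    """Reorder a 0/1 response matrix (rows = subjects, columns = items).

    Assumed rule (not confirmed by the source): rows by descending mean
    accuracy, columns by descending solve rate. Ties keep input order.
    """
    if not matrix or not matrix[0]:
        return [], list(subjects), []
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[c] for row in matrix) / n_rows for c in range(n_cols)]
    # sorted() is stable, so equal means preserve their original order.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, [subjects[r] for r in row_order], col_order


# Toy data: 3 hypothetical subjects x 3 items.
toy = [
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
]
ordered, names, cols = order_response_matrix(toy, ["model_a", "model_b", "model_c"])
print(ordered)  # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
print(names)    # ['model_b', 'model_c', 'model_a']
print(cols)     # [2, 1, 0]
```

With every cell observed, as here, the marginal means need no special handling for missing values; a partially observed matrix would need means taken over observed cells only.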

[Figure: MMLU Moral Disputes response matrix, AI models (rows) against items (columns). Legend: Correct (1), Incorrect (0), Unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 3% solve rate · answer: nan

nan

Subject outcomes

EleutherAI__pythia-2.8b-deduped correct
mosaicml__mpt-7b-storywriter correct
OptimalScale__robin-13b-v2-delta incorrect

Item 2 · 26% solve rate · answer: nan

nan

Subject outcomes

databricks__dolly-v2-7b correct
bigcode__gpt_bigcode-santacoder correct
OptimalScale__robin-13b-v2-delta incorrect

Item 3 · 38% solve rate · answer: nan

nan

Subject outcomes

Corianas__Quokka_2.7b correct
OpenBuddy__openbuddy-llama-65b-v8-bf16 correct
golaxy__gogpt-3b-bloom incorrect

Item 4 · 49% solve rate · answer: nan

nan

Subject outcomes

golaxy__gogpt2-7b correct
WizardLM__WizardLM-13B-V1.1 correct
vicgalle__alpaca-7b incorrect

Item 5 · 60% solve rate · answer: nan

nan

Subject outcomes

golaxy__gogpt2-7b correct
OpenBuddy__openbuddy-llama2-13b-v8.1-fp16 correct
vicgalle__gpt2-alpaca-gpt4 incorrect

Item 6 · 83% solve rate · answer: nan

nan

Subject outcomes

Corianas__Quokka_2.7b correct
OpenBuddy__openbuddy-llama2-13b-v8.1-fp16 correct
Lajonbot__Llama-2-7b-chat-hf-instruct-pl-lora_unload incorrect
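
The solve rates shown for the sample items above are, presumably, the fraction of subjects that answered each item correctly. A minimal sketch of that computation follows, together with one possible way to pick items spread across the difficulty range. The selection rule (evenly spaced positions in the difficulty ranking) is an assumption, not the page's documented method, and both function names are hypothetical.

```python
def solve_rates(matrix):
    """Fraction of subjects (rows) that answered each item (column) correctly."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[c] for row in matrix) / n_rows for c in range(n_cols)]


def pick_spread(rates, k):
    """Pick k item indices evenly spaced across the ranking, hardest first.

    Assumed selection rule; the source does not say how samples are chosen.
    """
    ranked = sorted(range(len(rates)), key=lambda c: rates[c])
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Toy data: 4 hypothetical subjects x 5 items.
toy = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
rates = solve_rates(toy)
print(rates)                  # [1.0, 0.0, 0.5, 0.25, 0.75]
print(pick_spread(rates, 3))  # [1, 2, 0]
```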

Subjects

The models, agents, and reward models evaluated.

144 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank · Subject · Mean accuracy
1 · augtoma__qCammel-70-x · 0.803
2 · jordiclive__Llama-2-70b-oasst-1-200 · 0.789
3 · quantumaikr__llama-2-70b-fb16-guanaco-1k · 0.783
4 · upstage__Llama-2-70b-instruct-v2 · 0.78
5 · deepnight-research__llama-2-70B-inst · 0.78
6 · liuxiang886__llama2-70B-qlora-gpt4 · 0.78
7 · upstage__Llama-2-70b-instruct · 0.777
8 · jarradh__llama2_70b_chat_uncensored · 0.772
9 · MayaPH__GodziLLa2-70B · 0.754
10 · upstage__llama-65b-instruct · 0.751
11 · HiTZ__alpaca-lora-65b-en-pt-es-ca · 0.737
12 · WizardLM__WizardLM-70B-V1.0 · 0.728
13 · lilloukas__Platypus-30B · 0.723
14 · OpenBuddy__openbuddy-llama-65b-v8-bf16 · 0.723
15 · quantumaikr__QuantumLM-70B-hf · 0.714
16 · lilloukas__GPlatty-30B · 0.702
17 · upstage__llama-30b-instruct-2048 · 0.697
18 · OptimalScale__robin-65b-v2-delta · 0.691
19 · upstage__llama-30b-instruct · 0.665
20 · OpenBuddyEA__openbuddy-llama-30b-v7.1-bf16 · 0.647
21 · shareAI__bimoGPT-llama2-13b · 0.647
22 · camel-ai__CAMEL-33B-Combined-Data · 0.645
23 · layoric__llama-2-13b-code-alpaca · 0.642
24 · Lajonbot__Llama-2-13b-hf-instruct-pl-lora_unload · 0.639
25 · Aeala__GPT4-x-AlpacaDente2-30b · 0.639
26 · augtoma__qCammel-13 · 0.639
27 · Aeala__GPT4-x-AlpacaDente-30b · 0.636
28 · CalderaAI__13B-Legerdemain-L2 · 0.633
29 · NousResearch__Redmond-Puffin-13B · 0.63
30 · OpenBuddy__openbuddy-llama2-13b-v8.1-fp16 · 0.607
31 · CalderaAI__30B-Lazarus · 0.601
32 · WizardLM__WizardLM-13B-V1.2 · 0.598
33 · NousResearch__Nous-Hermes-Llama2-13b · 0.595
34 · MayaPH__GodziLLa-30B · 0.587
35 · LLMs__WizardLM-13B-V1.0 · 0.581
36 · edor__Stable-Platypus2-mini-7B · 0.578

+ 108 more subjects evaluated.
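
A ranking by mean response, as used for the subject list above, can be sketched as follows. This is an illustration on toy data, not the page's build script; `rank_subjects` and the subject names are hypothetical, and rounding to three decimals is assumed from the scores shown.

```python
def rank_subjects(matrix, subjects):
    """Rank subjects by mean accuracy over all items, best first.

    Returns (rank, name, mean accuracy rounded to 3 decimals) tuples.
    Ties keep their input order because list.sort() is stable.
    """
    scored = [(name, sum(row) / len(row)) for name, row in zip(subjects, matrix)]
    scored.sort(key=lambda pair: -pair[1])
    return [
        (rank, name, round(acc, 3))
        for rank, (name, acc) in enumerate(scored, start=1)
    ]


# Toy data: 3 hypothetical subjects x 4 items.
toy = [
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
board = rank_subjects(toy, ["model_a", "model_b", "model_c"])
print(board)  # [(1, 'model_b', 0.75), (2, 'model_c', 0.5), (3, 'model_a', 0.25)]
```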
