Safety & Security

MMLU Business Ethics

Per-(model, question) accuracy (acc in {0,1}) on the MMLU business_ethics subject, an applied-ethics multiple-choice set, taken from the Open LLM Leaderboard v1 details datasets. The model panel is capped at 150.

Items: 100
Subjects: 116
Observed: 100%
Subject type: Model
License: MIT
Domain: safety
Modality: text
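As a rough illustration of how per-(model, question) records with acc in {0,1} could be pivoted into a response matrix and how the "observed" percentage could be derived, here is a minimal sketch. The record layout `(model, question_id, acc)` and the function name `build_response_matrix` are assumptions made for this example; the actual schema of the leaderboard details datasets is not shown on this page.

```python
def build_response_matrix(records):
    """Pivot long-format records into a subjects x items matrix.

    records: list of (model, question_id, acc) tuples with acc in {0, 1}.
    This tuple layout is hypothetical, chosen only for illustration.
    """
    models = sorted({m for m, _, _ in records})
    questions = sorted({q for _, q, _ in records})
    cells = {(m, q): acc for m, q, acc in records}
    # Missing (model, question) pairs become None, i.e. unobserved cells.
    matrix = [[cells.get((m, q)) for q in questions] for m in models]
    observed = sum(v is not None for row in matrix for v in row)
    coverage = observed / (len(models) * len(questions))
    return models, questions, matrix, coverage


# Toy data: model_b has no record for q2, so one of four cells is unobserved.
toy_records = [
    ("model_a", "q1", 1),
    ("model_a", "q2", 0),
    ("model_b", "q1", 1),
]
models, questions, matrix, coverage = build_response_matrix(toy_records)
```

In this toy case the coverage is 0.75; for this benchmark the page reports 100% of cells observed, which would correspond to a coverage of 1.0.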


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 116 subjects × 100 items, 100% of cells evaluated.

Figure: MMLU Business Ethics response matrix, AI models (rows) against items (columns). Legend: 1 = correct, 0 = incorrect, blank = unobserved.
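The page does not state the exact ordering rule used for the matrix. One plausible reading, sketched below under that assumption, is to sort rows by descending mean accuracy and columns by descending solve rate, which pushes the correct cells toward the top-left corner. The function name `order_response_matrix` and the toy data are hypothetical.

```python
def order_response_matrix(matrix, subjects):
    """Sort rows by descending mean accuracy and columns by descending solve rate.

    matrix: list of rows, each a list of 0/1 values (fully observed).
    subjects: row labels, in the same order as the rows.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, [subjects[r] for r in row_order], col_order


toy = [
    [0, 1, 0],  # model_a
    [1, 1, 0],  # model_b
    [1, 1, 1],  # model_c
]
names = ["model_a", "model_b", "model_c"]
ordered, ordered_names, col_order = order_response_matrix(toy, names)
# ordered is [[1, 1, 1], [1, 1, 0], [1, 0, 0]]: ones gather in the top-left.
```

Sorting by marginal means is only one option; the real page may use a different criterion, such as a fitted ability or difficulty parameter.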

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
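To make the "solve rate" and the "spread across the difficulty range" concrete, the sketch below computes each item's solve rate as the fraction of subjects that answered it correctly, then picks items at evenly spaced positions in the difficulty ordering. How the page actually selects its sample items is not documented here, so `spread_sample` is an assumed selection rule, not the site's method.

```python
def solve_rates(matrix):
    """Fraction of subjects (rows) that answered each item (column) correctly."""
    n_subjects = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[c] for row in matrix) / n_subjects for c in range(n_items)]


def spread_sample(rates, k):
    """Pick k item indices evenly spaced over items sorted from hardest to easiest."""
    by_difficulty = sorted(range(len(rates)), key=lambda c: rates[c])
    n = len(by_difficulty)
    if k >= n:
        return by_difficulty
    if k == 1:
        return [by_difficulty[n // 2]]
    # Integer positions 0 .. n-1, evenly spaced, always including both ends.
    return [by_difficulty[i * (n - 1) // (k - 1)] for i in range(k)]


toy = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
rates = solve_rates(toy)          # [0.5, 0.0, 0.75, 0.25, 1.0]
picked = spread_sample(rates, 3)  # hardest, median, and easiest item: [1, 0, 4]
```

With 116 subjects, a displayed solve rate such as 3% is necessarily a rounded value of a count divided by 116.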

Item 1: 3% solve rate, answer: nan

nan

Subject outcomes

EleutherAI__pythia-70m-deduped correct
EleutherAI__pythia-12b correct
shibing624__chinese-llama-plus-13b-hf incorrect

Item 2: 22% solve rate, answer: nan

nan

Subject outcomes

shibing624__chinese-alpaca-plus-13b-hf correct
Lajonbot__Llama-2-13b-hf-instruct-pl-lora_unload correct
upstage__Llama-2-70b-instruct incorrect

Item 3: 34% solve rate, answer: nan

nan

Subject outcomes

jarradh__llama2_70b_chat_uncensored correct
Lajonbot__vicuna-13b-v1.3-PL-lora_unload correct
upstage__llama-65b-instruct incorrect

Item 4: 47% solve rate, answer: nan

nan

Subject outcomes

jarradh__llama2_70b_chat_uncensored correct
edor__Hermes-Platypus2-mini-7B correct
augtoma__qCammel-13 incorrect

Item 5: 65% solve rate, answer: nan

nan

Subject outcomes

jarradh__llama2_70b_chat_uncensored correct
OptimalScale__robin-7b-v2-delta correct
EleutherAI__pythia-6.7b incorrect

Item 6: 78% solve rate, answer: nan

nan

Subject outcomes

jarradh__llama2_70b_chat_uncensored correct
golaxy__gogpt-7b correct
EleutherAI__pythia-6.9b-deduped incorrect

Subjects

The models, agents, and reward models evaluated.

116 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject  Mean accuracy
 1  quantumaikr__llama-2-70b-fb16-guanaco-1k  0.76
 2  liuxiang886__llama2-70B-qlora-gpt4  0.76
 3  upstage__Llama-2-70b-instruct  0.76
 4  augtoma__qCammel-70-x  0.75
 5  deepnight-research__llama-2-70B-inst  0.74
 6  jarradh__llama2_70b_chat_uncensored  0.73
 7  MayaPH__GodziLLa2-70B  0.71
 8  WizardLM__WizardLM-70B-V1.0  0.68
 9  upstage__llama-65b-instruct  0.68
10  OpenBuddy__openbuddy-llama-65b-v8-bf16  0.65
11  quantumaikr__QuantumLM-70B-hf  0.62
12  upstage__llama-30b-instruct-2048  0.62
13  upstage__llama-30b-instruct  0.61
14  MayaPH__GodziLLa-30B  0.61
15  NousResearch__Nous-Hermes-Llama2-13b  0.6
16  OpenBuddy__openbuddy-llama2-13b-v8.1-fp16  0.6
17  NousResearch__Nous-Hermes-llama-2-7b  0.57
18  HiTZ__alpaca-lora-65b-en-pt-es-ca  0.57
19  CalderaAI__30B-Lazarus  0.57
20  mosaicml__mpt-30b-chat  0.57
21  augtoma__qCammel-13  0.57
22  Aeala__GPT4-x-AlpacaDente-30b  0.56
23  NousResearch__Redmond-Puffin-13B  0.56
24  OptimalScale__robin-65b-v2-delta  0.56
25  Lajonbot__Llama-2-13b-hf-instruct-pl-lora_unload  0.55
26  WizardLM__WizardLM-13B-V1.2  0.55
27  kevinpro__Vicuna-13B-CoT  0.54
28  CalderaAI__13B-Legerdemain-L2  0.53
29  Aeala__GPT4-x-AlpacaDente2-30b  0.53
30  lvkaokao__llama2-7b-hf-instruction-lora  0.53
31  Lajonbot__vicuna-13b-v1.3-PL-lora_unload  0.52
32  CalderaAI__13B-Ouroboros  0.52
33  Lajonbot__tableBeluga-7B-instruct-pl-lora_unload  0.52
34  Tap-M__Luna-AI-Llama2-Uncensored  0.52
35  NousResearch__Nous-Hermes-13b  0.52
36  Lajonbot__vicuna-7b-v1.5-PL-lora_unload  0.51

+ 80 more subjects evaluated.
