ML Engineering & Research

MoE-CAP

MoE-CAP is a benchmark and framework for sparse Mixture-of-Experts (MoE) LLM serving systems. It characterizes the Cost/Accuracy/Performance trade-off using sparsity-aware utilization metrics (S-MBU, S-MFU). This ingestion captures the released lm-eval-harness per-item correctness: each MoE model under a given serving framework is graded on individual GSM8K and MMLU questions.

Items: 7,736
Subjects: 14
Observed: 53%
Subject type: model
License: unknown
Domains: mathematics, general
Modality: text


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 14 subjects × 7,736 items, 53% of cells evaluated.

[Figure: MoE-CAP response matrix, AI models (rows) against items (columns). Legend: 1 = correct, 0 = incorrect, blank = unobserved.]
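The ordering described above can be sketched in a few lines. This is a minimal illustration, not the gallery's actual build script: it assumes the matrix is held as a NumPy array with NaN marking unobserved cells, and the function name `order_response_matrix` is hypothetical. Rows and columns are sorted by their mean over observed cells only, which is one plausible way to make strong models and easy items gather toward one corner.

```python
import numpy as np

def order_response_matrix(R):
    """Sort rows (subjects) and columns (items) by mean correctness.

    R: 2-D array, 1.0 = correct, 0.0 = incorrect, NaN = unobserved (assumed encoding).
    Returns the reordered matrix, the row order, the column order,
    and the fraction of cells that were observed.
    """
    R = np.asarray(R, dtype=float)
    observed = ~np.isnan(R)
    n_row = observed.sum(axis=1)
    n_col = observed.sum(axis=0)
    # Means over observed cells only; fully unobserved rows/columns become NaN.
    row_mean = np.where(n_row > 0, np.nansum(R, axis=1) / np.maximum(n_row, 1), np.nan)
    col_mean = np.where(n_col > 0, np.nansum(R, axis=0) / np.maximum(n_col, 1), np.nan)
    # Descending sort; NaN means are pushed to the end.
    row_order = np.argsort(-np.nan_to_num(row_mean, nan=-1.0), kind="stable")
    col_order = np.argsort(-np.nan_to_num(col_mean, nan=-1.0), kind="stable")
    ordered = R[np.ix_(row_order, col_order)]
    return ordered, row_order, col_order, float(observed.mean())

# Toy 3 x 3 example (made-up values, not benchmark data).
toy = [[1, 0, np.nan],
       [1, 1, 1],
       [0, np.nan, np.nan]]
ordered, row_order, col_order, frac_observed = order_response_matrix(toy)
```

On the real data the same call would take a 14 × 7,736 array, and the last return value would correspond to the 53% figure quoted above.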

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
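As a rough sketch of how such a spread could be chosen, the snippet below computes each item's solve rate and then picks items evenly spaced along the difficulty ordering. It assumes the solve rate is the share of correct answers among subjects that were actually evaluated on the item; the record layout `(subject, item_id, correct)` and both function names are hypothetical, not taken from the released data.

```python
from collections import defaultdict

def solve_rates(records):
    """records: iterable of (subject, item_id, correct), correct in {0, 1}.

    Only observed (subject, item) pairs appear in records, so the
    denominator is the number of subjects evaluated on that item.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _subject, item_id, correct in records:
        totals[item_id] += 1
        hits[item_id] += int(correct)
    return {item: hits[item] / totals[item] for item in totals}

def pick_spread(rates, k):
    """Pick up to k item ids evenly spaced from hardest to easiest."""
    ranked = sorted(rates, key=lambda item: (rates[item], item))
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[0]]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]

# Made-up records for illustration only.
records = [("a", "q1", 1), ("b", "q1", 0),
           ("a", "q2", 1), ("b", "q2", 1),
           ("a", "q3", 0)]
rates = solve_rates(records)
sample = pick_spread(rates, 2)
```

Because the denominator counts only observed responses, an item answered by a single subject can show a 0% or 100% solve rate.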

Item 1: 0% solve rate

The sky is blue because

A. the Sun mainly emits blue light.
B. the atmosphere absorbs mostly blue light.
C. molecules scatter red light more effectively than blue light.
D. molecules scatter blue light more effectively than red light.

How subjects answered

mistralai/Mixtral-8x7B-Instruct-v0.1 (hf-chat) incorrect

Item 2: 0% solve rate

For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020? Scenario 1 | My son is showing anomalous symptoms, so I figured I should take him to the doctor to get a medication prescription. Scenario 2 | I called the cops on my mom for loving me

A. Wrong, Wrong
B. Wrong, Not wrong
C. Not wrong, Wrong
D. Not wrong, Not wrong

How subjects answered

mistralai/Mixtral-8x22B-Instruct-v0.1 (hf-chat) incorrect

Item 3: 56% solve rate

Consider the following code segment, which uses the variables r, s, and t.

    r ← 1
    s ← 2
    t ← 3
    r ← s
    s ← t
    DISPLAY (r)
    DISPLAY (s)

What is displayed as a result of running the code segment?

A. 1 1
B. 1 2
C. 2 3
D. 3 2

How subjects answered

Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs) incorrect

Item 4: 82% solve rate

The latitude and longitude coordinates of a place are an example of

A. a formal region.
B. a functional region.
C. an absolute location.
D. a relative location.

How subjects answered

Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
databricks/dbrx-instruct (hf-chat) incorrect

Item 5: 100% solve rate

The competitive market for gasoline, a normal good, is currently in a state of equilibrium. Which of the following would most likely increase the price of gasoline?

A. Household income falls.
B. Technology used to produce gasoline improves.
C. The price of subway tickets and other public transportation falls.
D. The price of crude oil, a raw material for gasoline, rises.

How subjects answered

Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs) correct

Item 6: 100% solve rate

Freud said that the penis first becomes the major focus of pleasure-seeking at:

A. birth
B. about age 1
C. between the ages of 3 and 6
D. puberty

How subjects answered

Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs) correct

Subjects

The models, agents, and reward models evaluated.

14 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. Qwen/Qwen3-30B-A3B (sglang): 0.93
2. mistralai/Mixtral-8x22B-Instruct-v0.1 (vllm_moe): 0.771
3. mistralai/Mixtral-8x22B-Instruct-v0.1 (vllm_moe_fixbs): 0.771
4. mistralai/Mixtral-8x22B-Instruct-v0.1 (tensorrt_llm): 0.762
5. databricks/dbrx-instruct (vllm_moe_fixbs): 0.725
6. databricks/dbrx-instruct (vllm_moe): 0.724
7. mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs): 0.706
8. mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe): 0.705
9. Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe): 0.596
10. Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat): 0.593
11. databricks/dbrx-instruct (tensorrt_llm): 0.303
12. mistralai/Mixtral-8x22B-Instruct-v0.1 (hf-chat): 0.296
13. mistralai/Mixtral-8x7B-Instruct-v0.1 (hf-chat): 0.287
14. databricks/dbrx-instruct (hf-chat): 0.28
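Each subject label pairs a model with a serving framework, so the ranking can be regrouped per model to see how far the mean accuracy moves across frameworks. The sketch below does this with the scores listed above; the parsing pattern and the function names `by_model` and `framework_spread` are illustrative assumptions. Since only 53% of cells are observed, subjects may have been scored on different item subsets, so a large spread should not be read as a framework effect without checking item coverage.

```python
import re
from collections import defaultdict

# (label, mean accuracy) pairs copied from the ranking above.
LEADERBOARD = [
    ("Qwen/Qwen3-30B-A3B (sglang)", 0.93),
    ("mistralai/Mixtral-8x22B-Instruct-v0.1 (vllm_moe)", 0.771),
    ("mistralai/Mixtral-8x22B-Instruct-v0.1 (vllm_moe_fixbs)", 0.771),
    ("mistralai/Mixtral-8x22B-Instruct-v0.1 (tensorrt_llm)", 0.762),
    ("databricks/dbrx-instruct (vllm_moe_fixbs)", 0.725),
    ("databricks/dbrx-instruct (vllm_moe)", 0.724),
    ("mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs)", 0.706),
    ("mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe)", 0.705),
    ("Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe)", 0.596),
    ("Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat)", 0.593),
    ("databricks/dbrx-instruct (tensorrt_llm)", 0.303),
    ("mistralai/Mixtral-8x22B-Instruct-v0.1 (hf-chat)", 0.296),
    ("mistralai/Mixtral-8x7B-Instruct-v0.1 (hf-chat)", 0.287),
    ("databricks/dbrx-instruct (hf-chat)", 0.28),
]

# Assumed label shape: "<model id> (<framework>)".
LABEL = re.compile(r"^(?P<model>\S+) \((?P<framework>[^)]+)\)$")

def by_model(rows):
    """Group mean accuracies as {model: {framework: accuracy}}."""
    grouped = defaultdict(dict)
    for label, accuracy in rows:
        match = LABEL.match(label)
        if match is None:
            raise ValueError(f"unexpected subject label: {label!r}")
        grouped[match["model"]][match["framework"]] = accuracy
    return dict(grouped)

def framework_spread(grouped):
    """Best minus worst mean accuracy per model, rounded to 3 decimals."""
    return {model: round(max(scores.values()) - min(scores.values()), 3)
            for model, scores in grouped.items()}

grouped = by_model(LEADERBOARD)
spread = framework_spread(grouped)
```

With the listed scores, the spread is smallest for Qwen/Qwen1.5-MoE-A2.7B-Chat and largest for mistralai/Mixtral-8x22B-Instruct-v0.1.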
