MoE-CAP

MoE-CAP is a benchmark and framework for sparse Mixture-of-Experts (MoE) LLM serving systems. It characterizes the Cost/Accuracy/Performance trade-off using sparsity-aware utilization metrics (S-MBU, S-MFU). This ingestion captures the released per-item correctness from lm-eval-harness: each MoE model, run under a given serving framework, is graded on individual GSM8K and MMLU questions.

7,736 items
14 subjects
53% observed
Subject type: model
License: unknown
Domains: mathematics, general
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 14 subjects × 7,736 items, 53% of cells evaluated.
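
The ordering described above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: it assumes the matrix is a NumPy array with 1 for correct, 0 for incorrect, and NaN for unobserved cells, and it assumes rows and columns are sorted by their mean over observed cells. The function name `order_response_matrix` is hypothetical.

```python
import numpy as np

def order_response_matrix(R):
    """Sort rows (subjects) and columns (items) by descending mean score.

    Assumption: R holds 1 (correct), 0 (incorrect), NaN (unobserved).
    Means are taken over observed cells only; fully unobserved rows or
    columns sort last.
    """
    R = np.asarray(R, dtype=float)
    observed = ~np.isnan(R)
    filled = np.where(observed, R, 0.0)

    row_n = observed.sum(axis=1)
    col_n = observed.sum(axis=0)
    row_mean = np.where(row_n > 0, filled.sum(axis=1) / np.maximum(row_n, 1), -np.inf)
    col_mean = np.where(col_n > 0, filled.sum(axis=0) / np.maximum(col_n, 1), -np.inf)

    # Stable sort keeps the original order among ties.
    row_order = np.argsort(-row_mean, kind="stable")
    col_order = np.argsort(-col_mean, kind="stable")

    ordered = R[np.ix_(row_order, col_order)]
    return ordered, row_order, col_order, observed.mean()

# Toy 3 x 3 matrix (made-up values, not benchmark data).
R = np.array([[1, 0, np.nan],
              [1, 1, 1],
              [np.nan, 0, 1]])
ordered, row_order, col_order, observed_fraction = order_response_matrix(R)
```

With both axes sorted in descending order, the strongest subjects and easiest items land in the top-left corner. `observed_fraction` corresponds to the "% of cells evaluated" figure.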

[Figure: MoE-CAP response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · blank = unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
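
As a rough sketch of what a per-item solve rate and a "spread across the difficulty range" could mean in code: the snippet below assumes the solve rate is the fraction of observed subjects that answered an item correctly, and it picks items evenly spaced along the ranking by solve rate. How this page actually selects its sample items is not stated, so both `solve_rates` and `spread_sample` are hypothetical helpers.

```python
import numpy as np

def solve_rates(R):
    """Per-item fraction of observed subjects that answered correctly.

    Assumption: R is subjects x items with 1, 0, or NaN (unobserved).
    Items with no observations get NaN.
    """
    R = np.asarray(R, dtype=float)
    observed = ~np.isnan(R)
    n = observed.sum(axis=0)
    correct = np.where(observed, R, 0.0).sum(axis=0)
    return np.where(n > 0, correct / np.maximum(n, 1), np.nan)

def spread_sample(rates, k):
    """Pick up to k item indices evenly spaced from hardest to easiest."""
    rates = np.asarray(rates, dtype=float)
    valid = np.flatnonzero(~np.isnan(rates))
    if len(valid) == 0 or k <= 0:
        return []
    ranked = valid[np.argsort(rates[valid], kind="stable")]
    k = min(k, len(ranked))
    positions = np.linspace(0, len(ranked) - 1, k).round().astype(int)
    return ranked[positions].tolist()

# Toy matrix (made-up values, not benchmark data).
R = np.array([[1, 0, np.nan, 1],
              [0, 0, 1, 1],
              [np.nan, 0, 1, 1]])
rates = solve_rates(R)
picked = spread_sample(rates, 2)
```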

Item 1 · 0% solve rate

The sky is blue because

  • A. the Sun mainly emits blue light.
  • B. the atmosphere absorbs mostly blue light.
  • C. molecules scatter red light more effectively than blue light.
  • D. molecules scatter blue light more effectively than red light.

How subjects answered

  • mistralai/Mixtral-8x7B-Instruct-v0.1 (hf-chat) incorrect

Item 2 · 0% solve rate

For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020? Scenario 1 | My son is showing anomalous symptoms, so I figured I should take him to the doctor to get a medication prescription. Scenario 2 | I called the cops on my mom for loving me

  • A. Wrong, Wrong
  • B. Wrong, Not wrong
  • C. Not wrong, Wrong
  • D. Not wrong, Not wrong

How subjects answered

  • mistralai/Mixtral-8x22B-Instruct-v0.1 (hf-chat) incorrect

Item 3 · 56% solve rate

Consider the following code segment, which uses the variables r, s, and t. r ← 1 s ← 2 t ← 3 r ← s s ← t DISPLAY (r) DISPLAY (s)

What is displayed as a result of running the code segment?

  • A. 1 1
  • B. 1 2
  • C. 2 3
  • D. 3 2

How subjects answered

  • Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
  • Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
  • mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs) incorrect

Item 4 · 82% solve rate

The latitude and longitude coordinates of a place are an example of

  • A. a formal region.
  • B. a functional region.
  • C. an absolute location.
  • D. a relative location.

How subjects answered

  • Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
  • Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
  • databricks/dbrx-instruct (hf-chat) incorrect

Item 5 · 100% solve rate

The competitive market for gasoline, a normal good, is currently in a state of equilibrium. Which of the following would most likely increase the price of gasoline?

  • A. Household income falls.
  • B. Technology used to produce gasoline improves.
  • C. The price of subway tickets and other public transportation falls.
  • D. The price of crude oil, a raw material for gasoline, rises.

How subjects answered

  • Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
  • Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
  • mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs) correct

Item 6 · 100% solve rate

Freud said that the penis first becomes the major focus of pleasure-seeking at:

  • A. birth
  • B. about age 1
  • C. between the ages of 3 and 6
  • D. puberty

How subjects answered

  • Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe) correct
  • Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat) correct
  • mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs) correct

Subjects

The models, agents, and reward models evaluated.

14 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. Qwen/Qwen3-30B-A3B (sglang): 0.93
  2. mistralai/Mixtral-8x22B-Instruct-v0.1 (vllm_moe): 0.771
  3. mistralai/Mixtral-8x22B-Instruct-v0.1 (vllm_moe_fixbs): 0.771
  4. mistralai/Mixtral-8x22B-Instruct-v0.1 (tensorrt_llm): 0.762
  5. databricks/dbrx-instruct (vllm_moe_fixbs): 0.725
  6. databricks/dbrx-instruct (vllm_moe): 0.724
  7. mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe_fixbs): 0.706
  8. mistralai/Mixtral-8x7B-Instruct-v0.1 (vllm_moe): 0.705
  9. Qwen/Qwen1.5-MoE-A2.7B-Chat (vllm_moe): 0.596
  10. Qwen/Qwen1.5-MoE-A2.7B-Chat (hf-chat): 0.593
  11. databricks/dbrx-instruct (tensorrt_llm): 0.303
  12. mistralai/Mixtral-8x22B-Instruct-v0.1 (hf-chat): 0.296
  13. mistralai/Mixtral-8x7B-Instruct-v0.1 (hf-chat): 0.287
  14. databricks/dbrx-instruct (hf-chat): 0.28
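
A ranking by mean response like the one above could be computed roughly as follows. This sketch assumes the per-item results are available as `(subject, item_id, score)` records with a score of 1 or 0, and that unobserved cells are simply absent. The record layout and the `rank_subjects` name are assumptions for illustration, not the released data format.

```python
from collections import defaultdict

def rank_subjects(records):
    """Rank subjects by mean score over the items each one was graded on.

    records: iterable of (subject, item_id, score) with score in {0, 1}.
    Returns a list of (subject, mean) sorted by descending mean, with ties
    broken alphabetically by subject name.
    """
    totals = defaultdict(lambda: [0, 0])  # subject -> [score sum, item count]
    for subject, _item_id, score in records:
        totals[subject][0] += score
        totals[subject][1] += 1
    means = {s: total / count for s, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up records for two hypothetical subjects.
records = [
    ("model-a (framework-x)", "q1", 1),
    ("model-a (framework-x)", "q2", 0),
    ("model-b (framework-y)", "q1", 1),
]
ranking = rank_subjects(records)
```

Since only about half of the cells are observed, each mean is taken over the items that subject was actually graded on. If subjects cover different item subsets, their means may not be directly comparable.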