Medical

MedAgentBoard

Multi-agent collaboration vs. single-LLM and conventional methods on medical tasks.

1,606 items

13 subjects

53% observed

Subject type: Model


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 13 subjects × 1,606 items, 53% of cells evaluated.
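The page does not say how the ordering is computed. As a minimal sketch, assume rows and columns are each sorted by their mean over observed cells, with unobserved cells stored as `None`. The function names and the toy matrix below are hypothetical and are not taken from the benchmark's build script.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[float]
Matrix = List[List[Cell]]


def observed_mean(values: Sequence[Cell]) -> Optional[float]:
    """Mean of the observed (non-None) entries, or None if none were observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def order_matrix(matrix: Matrix) -> Tuple[Matrix, List[int], List[int]]:
    """Sort rows and columns by descending observed mean (assumed ordering rule)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [observed_mean(row) for row in matrix]
    col_means = [
        observed_mean([matrix[r][c] for r in range(n_rows)]) for c in range(n_cols)
    ]

    def sort_key(mean: Optional[float]) -> float:
        # Fully unobserved rows or columns sort last.
        return -1.0 if mean is None else mean

    row_order = sorted(range(n_rows), key=lambda r: sort_key(row_means[r]), reverse=True)
    col_order = sorted(range(n_cols), key=lambda c: sort_key(col_means[c]), reverse=True)
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


def coverage(matrix: Matrix) -> float:
    """Fraction of cells that were evaluated."""
    cells = [v for row in matrix for v in row]
    return sum(v is not None for v in cells) / len(cells)


# Toy 3 x 3 matrix: rows are subjects, columns are items, None is unobserved.
toy: Matrix = [
    [0.0, None, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, None],
]
ordered, row_order, col_order = order_matrix(toy)
```

With this rule the highest-scoring subject and the easiest item end up in the top-left corner, and `coverage(toy)` plays the role of the "53% of cells evaluated" figure for the toy data.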

Figure: MedAgentBoard response matrix, AI models (rows) against items (columns). Legend: low, high, unobserved.
Scale: 0 to 1, by task: medical-QA accuracy (0/1), lay-summary ROUGE-L F1, or predicted in-hospital-mortality probability. Mixed metrics, each already on a 0–1 scale.
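Of the three metrics, ROUGE-L F1 is the least self-explanatory. The sketch below computes it from the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization, which may differ from the tokenizer and ROUGE implementation the benchmark actually uses. The function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on a 0-1 scale (assumed tokenization: lowercase, whitespace)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Because precision and recall both lie in [0, 1], the score is already on the same 0 to 1 scale as accuracy and predicted probability, which is what lets the three metrics share one color scale.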

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

13 subjects, ranked by mean response across this benchmark's items.

Rank  Subject   Mean response
1     b3ba6213  0.834
2     bf4b4ba9  0.798
3     329114c3  0.783
4     668ea560  0.781
5     29bb3017  0.777
6     e2c30d59  0.763
7     98a84b58  0.739
8     a28d8437  0.738
9     641850fe  0.737
10    db87218d  0.588
11    fbb7b5bf  0.316
12    d863de04  0.314
13    e309757b  0.307

Full data on Hugging Face