Multimodal
MMDocRAG
4,055 expert-annotated QA pairs with multi-page, cross-modal evidence for document RAG.
2,000 items
66 subjects
99% observed
Model subject type
Response matrix
Every model, scored item by item.
Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 66 subjects × 2,000 items, 99% of cells evaluated.
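The page does not say how the ordering is computed. A minimal sketch of one plausible approach is to sort rows and columns by their mean over observed cells, so that high-scoring subjects and high-scoring (easy) items end up in the same corner. The names `observed_mean`, `order_matrix`, and `coverage`, the use of `None` for unobserved cells, and the toy data are all assumptions for illustration.

```python
from statistics import mean

def observed_mean(values):
    """Mean of the observed (non-None) entries; None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return mean(seen) if seen else None

def order_matrix(matrix):
    """Reorder rows (subjects) and columns (items) by descending observed mean."""
    n_cols = len(matrix[0])
    row_means = [observed_mean(row) for row in matrix]
    col_means = [observed_mean([row[j] for row in matrix]) for j in range(n_cols)]

    # Rows or columns with no observed cells get a sentinel so they sort last.
    def sort_key(m):
        return -1.0 if m is None else m

    row_order = sorted(range(len(matrix)), key=lambda i: sort_key(row_means[i]), reverse=True)
    col_order = sorted(range(n_cols), key=lambda j: sort_key(col_means[j]), reverse=True)
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, row_order, col_order

def coverage(matrix):
    """Fraction of cells that were actually evaluated."""
    total = sum(len(row) for row in matrix)
    seen = sum(v is not None for row in matrix for v in row)
    return seen / total

# Toy 3 subjects x 3 items matrix (hypothetical data); None marks an unobserved cell.
scores = [
    [0.2, None, 0.6],
    [0.9, 0.5, 1.0],
    [0.4, 0.1, 0.8],
]
ordered, row_order, col_order = order_matrix(scores)
observed_fraction = coverage(scores)
```

In this toy case the strongest subject and the easiest item land in the top-left cell, and the single unobserved cell drifts to the opposite corner. Taking means over observed cells only is itself an assumption; a site with 99% coverage could equally impute or drop the missing cells.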

Legend: low, high, unobserved.
Scale: 0 to 1. Each cell holds either an answer F1 score (divided by 100) or a 1–5 LLM-judge answer-quality score (divided by 5) on multimodal document QA.
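The two rescaling rules can be sketched as follows. This is an illustration under the assumption that the divisions are applied exactly as stated; the function names and the range checks are hypothetical, not taken from the benchmark's code.

```python
def normalize_f1(f1_percent):
    """Map an answer F1 reported on a 0-100 scale onto 0-1."""
    if not 0 <= f1_percent <= 100:
        raise ValueError(f"F1 out of range: {f1_percent}")
    return f1_percent / 100

def normalize_judge(score):
    """Map a 1-5 LLM-judge score onto 0-1 by dividing by 5."""
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return score / 5

# Hypothetical raw values for illustration.
f1_cell = normalize_f1(75.0)
worst_judge_cell = normalize_judge(1)
best_judge_cell = normalize_judge(5)
```

One consequence of dividing by 5 rather than rescaling from the 1–5 range: the lowest possible judge score maps to 0.2, not 0, so judge-scored cells occupy 0.2 to 1 while F1-scored cells can reach 0.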
Sample items
What the questions look like — and how subjects answer.
A spread of items across the difficulty range, each shown with a few subjects' actual answers.
Subjects
The models, agents, and reward models evaluated.
66 subjects, ranked by mean response across this benchmark's items.
- 1. 5c51a66b: 0.821
- 2. 253e70ce: 0.794
- 3. 30414bb2: 0.788
- 4. 13a647d4: 0.784
- 5. 95819795: 0.774
- 6. 6d734050: 0.773
- 7. afae7433: 0.768
- 8. 4d43a29a: 0.754
- 9. b4441d4f: 0.752
- 10. 6bc8dafd: 0.751
- 11. 9d12a9fe: 0.75
- 12. d73173a2: 0.748
- 13. a2a69af7: 0.746
- 14. 8a434280: 0.731
- 15. 01f1332c: 0.73
- 16. de40daba: 0.73
- 17. Qwen3-14B: 0.73
- 18. 3164e2e3: 0.728
- 19. bc926377: 0.727
- 20. e906d961: 0.727
- 21. 148a2a89: 0.726
- 22. 83a2f822: 0.726
- 23. 24574482: 0.725
- 24. c7037e3b: 0.724
- 25. 41a4ee31: 0.722
- 26. dafdd2a2: 0.718
- 27. 3b6887a8: 0.717
- 28. 4eb35bd4: 0.71
- 29. 33752132: 0.708
- 30. fd402b8d: 0.707
- 31. d726e2c8: 0.705
- 32. f9b30c0d: 0.702
- 33. 0feeb53c: 0.7
- 34. a478891e: 0.699
- 35. 28616143: 0.699
- 36. Qwen2.5-VL-72B: 0.695
+ 30 more subjects evaluated.