Multimodal

MMDocRAG

4,055 expert-annotated QA pairs with multi-page, cross-modal evidence for document RAG.

2,000 items

66 subjects

99% observed

Subject type: Model


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 66 subjects × 2,000 items, 99% of cells evaluated.
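The page does not say how the ordering is computed. One plausible reading is a sort of rows and columns by their mean observed score, which can be sketched as below. The function names, the use of `None` for unobserved cells, and the toy data are all assumptions made for illustration, not the gallery's actual build script.

```python
from typing import Optional

Matrix = list[list[Optional[float]]]  # None marks an unobserved cell (assumed encoding)


def mean_observed(values: list[Optional[float]]) -> Optional[float]:
    """Mean of the observed entries, or None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def order_matrix(matrix: Matrix, subjects: list[str], items: list[str]):
    """Sort rows by descending subject mean and columns by descending item mean."""
    row_means = [mean_observed(row) for row in matrix]
    col_means = [mean_observed([row[j] for row in matrix]) for j in range(len(items))]

    def sort_key(mean: Optional[float]) -> float:
        # Rows or columns with no observations sort last.
        return -1.0 if mean is None else mean

    row_order = sorted(range(len(subjects)), key=lambda i: sort_key(row_means[i]), reverse=True)
    col_order = sorted(range(len(items)), key=lambda j: sort_key(col_means[j]), reverse=True)
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], [items[j] for j in col_order]


def coverage(matrix: Matrix) -> float:
    """Fraction of cells that were evaluated."""
    total = sum(len(row) for row in matrix)
    observed = sum(v is not None for row in matrix for v in row)
    return observed / total if total else 0.0


# Toy data: three hypothetical subjects and three hypothetical items.
toy_subjects = ["weak", "strong", "mid"]
toy_items = ["q1", "q2", "q3"]
toy_matrix = [
    [0.2, None, 0.4],
    [0.9, 0.5, 1.0],
    [0.6, 0.3, 0.8],
]
ordered, ordered_subjects, ordered_items = order_matrix(toy_matrix, toy_subjects, toy_items)
```

With this sort, the strongest subject and the easiest item end up in the top-left corner, and `coverage` gives the share of evaluated cells that the page reports as 99%.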

Figure: MMDocRAG response matrix, AI models (rows) against items (columns). Legend: low, high, unobserved.
Scale: 0 to 1. Each cell is either an answer F1 (divided by 100) or a 1–5 LLM-judge answer-quality score (divided by 5) on multimodal document QA.
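The two rescalings described above can be written as a small helper. This is a minimal sketch: the function name `normalize_score` and the metric labels `"f1"` and `"judge"` are hypothetical, and the range checks are an added safeguard rather than something the page describes.

```python
def normalize_score(raw: float, metric: str) -> float:
    """Map a raw score onto the 0-to-1 scale used in the response matrix.

    metric is a hypothetical label: "f1" for answer F1 on a 0-100 scale,
    "judge" for a 1-5 LLM-judge answer-quality score.
    """
    if metric == "f1":
        low, high, scaled = 0.0, 100.0, raw / 100.0
    elif metric == "judge":
        low, high, scaled = 1.0, 5.0, raw / 5.0
    else:
        raise ValueError(f"unknown metric: {metric!r}")
    if not low <= raw <= high:
        raise ValueError(f"{metric} score {raw} is outside [{low}, {high}]")
    return scaled


f1_cell = normalize_score(62.5, "f1")        # 0.625
judge_cell = normalize_score(4, "judge")     # 0.8
judge_floor = normalize_score(1, "judge")    # 0.2
```

Dividing a 1–5 score by 5 gives values between 0.2 and 1.0, so judge-scored cells never reach 0, while F1-scored cells can.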

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Subjects

The models, agents, and reward models evaluated.

66 subjects, ranked by mean response across this benchmark's items.

Rank  Subject          Mean response
1     5c51a66b         0.821
2     253e70ce         0.794
3     30414bb2         0.788
4     13a647d4         0.784
5     95819795         0.774
6     6d734050         0.773
7     afae7433         0.768
8     4d43a29a         0.754
9     b4441d4f         0.752
10    6bc8dafd         0.751
11    9d12a9fe         0.75
12    d73173a2         0.748
13    a2a69af7         0.746
14    8a434280         0.731
15    01f1332c         0.73
16    de40daba         0.73
17    Qwen3-14B        0.73
18    3164e2e3         0.728
19    bc926377         0.727
20    e906d961         0.727
21    148a2a89         0.726
22    83a2f822         0.726
23    24574482         0.725
24    c7037e3b         0.724
25    41a4ee31         0.722
26    dafdd2a2         0.718
27    3b6887a8         0.717
28    4eb35bd4         0.71
29    33752132         0.708
30    fd402b8d         0.707
31    d726e2c8         0.705
32    f9b30c0d         0.702
33    0feeb53c         0.7
34    a478891e         0.699
35    28616143         0.699
36    Qwen2.5-VL-72B   0.695

+ 30 more subjects evaluated.
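A ranking by mean response can be sketched as follows. The function name, the dictionary layout, and the alphabetical tie-break are assumptions for illustration. The page does not state how ties are broken, and subjects that show the same rounded score (for example 0.73 at ranks 15 to 17) may differ at full precision.

```python
from typing import Optional


def rank_subjects(responses: dict[str, list[Optional[float]]]) -> list[tuple[int, str, float]]:
    """Rank subjects by mean observed response, highest first.

    None marks an unobserved item (assumed encoding). Subjects with no
    observed items are left out of the ranking.
    """
    means: dict[str, float] = {}
    for name, scores in responses.items():
        seen = [s for s in scores if s is not None]
        if seen:
            means[name] = sum(seen) / len(seen)
    # Descending mean; ties broken alphabetically by name (an assumption).
    ordered = sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, name, mean) for rank, (name, mean) in enumerate(ordered, start=1)]


# Toy data with hypothetical subject names.
toy_responses = {
    "model-a": [1.0, 0.5, None],
    "model-b": [0.5, 0.5, 0.5],
    "model-c": [1.0, 1.0, 0.7],
}
ranking = rank_subjects(toy_responses)
for rank, name, mean in ranking:
    print(f"{rank:<5} {name:<10} {mean:.3f}")
```

Because the mean is taken over observed items only, a subject with a few unobserved cells is not penalized for them, which matters when coverage is below 100%.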

Full data is available on Hugging Face.