Multimodal

MMDocRAG

4,055 expert-annotated QA pairs with multi-page, cross-modal evidence for document RAG.

2,000 items
66 subjects
99% observed
Subject type: Model

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 66 subjects × 2,000 items, 99% of cells evaluated.
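The page does not say how the ordering is computed. As a minimal sketch, assuming rows and columns are simply sorted by their mean observed score, the layout could be reproduced as follows. The function name `order_response_matrix` and the use of `None` for unobserved cells are illustrative choices, not part of the benchmark.

```python
from statistics import mean


def _mean_observed(values):
    """Mean of the observed (non-None) cells; -inf if nothing was observed."""
    observed = [v for v in values if v is not None]
    return mean(observed) if observed else float("-inf")


def order_response_matrix(matrix):
    """Sort rows (subjects) and columns (items) by descending mean score.

    Assumption: ordering by simple means. The dashboard may use a
    different method (for example, a fitted ability/difficulty model).
    `matrix` is a list of equal-length rows; None marks an unobserved cell.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Strongest subjects first.
    row_order = sorted(range(n_rows), key=lambda r: -_mean_observed(matrix[r]))
    # Easiest items (highest mean score) first.
    col_order = sorted(
        range(n_cols),
        key=lambda c: -_mean_observed([row[c] for row in matrix]),
    )
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    # Fraction of cells that were actually evaluated.
    n_observed = sum(v is not None for row in matrix for v in row)
    return ordered, row_order, col_order, n_observed / (n_rows * n_cols)


toy = [
    [0.2, None, 0.4],
    [0.9, 0.8, 1.0],
    [0.5, 0.1, 0.6],
]
ordered, row_order, col_order, coverage = order_response_matrix(toy)
```

With this toy input, high scores collect in the top-left corner and the single unobserved cell ends up in the bottom-right.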


[Figure: MMDocRAG response matrix, AI models (rows) against items (columns). Legend: low, high, unobserved.]

Scale: 0 to 1. Each cell is either an answer F1 score (divided by 100) or a 1–5 LLM-judge answer-quality score (divided by 5) on multimodal document QA.
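The rescaling described above can be sketched as a small helper. The metric labels `"f1"` and `"judge"` and the function name `normalize_score` are hypothetical. Only the divisors (100 and 5) and the 1–5 judge range come from the page, and the 0–100 range for F1 is implied by its divisor.

```python
def normalize_score(raw, metric):
    """Map a raw score onto the dashboard's 0-to-1 scale.

    metric: "f1" for answer F1 on a 0-100 scale, or "judge" for a
    1-5 LLM-judge answer-quality score. Both labels are illustrative.
    """
    if metric == "f1":
        if not 0 <= raw <= 100:
            raise ValueError(f"F1 out of range: {raw}")
        return raw / 100
    if metric == "judge":
        if not 1 <= raw <= 5:
            raise ValueError(f"judge score out of range: {raw}")
        return raw / 5
    raise ValueError(f"unknown metric: {metric}")


lowest_judge = normalize_score(1, "judge")
highest_judge = normalize_score(5, "judge")
f1_example = normalize_score(50, "f1")
```

Because the lowest judge score is 1, judge-scored cells can never fall below 0.2 after division by 5, whereas F1-scored cells can reach 0.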

Sample items

What the questions look like, and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Subjects

The models, agents, and reward models evaluated.

66 subjects, ranked by mean response across this benchmark's items.

  1. 5c51a66b: 0.821
  2. 253e70ce: 0.794
  3. 30414bb2: 0.788
  4. 13a647d4: 0.784
  5. 95819795: 0.774
  6. 6d734050: 0.773
  7. afae7433: 0.768
  8. 4d43a29a: 0.754
  9. b4441d4f: 0.752
  10. 6bc8dafd: 0.751
  11. 9d12a9fe: 0.75
  12. d73173a2: 0.748
  13. a2a69af7: 0.746
  14. 8a434280: 0.731
  15. 01f1332c: 0.73
  16. de40daba: 0.73
  17. Qwen3-14B: 0.73
  18. 3164e2e3: 0.728
  19. bc926377: 0.727
  20. e906d961: 0.727
  21. 148a2a89: 0.726
  22. 83a2f822: 0.726
  23. 24574482: 0.725
  24. c7037e3b: 0.724
  25. 41a4ee31: 0.722
  26. dafdd2a2: 0.718
  27. 3b6887a8: 0.717
  28. 4eb35bd4: 0.71
  29. 33752132: 0.708
  30. fd402b8d: 0.707
  31. d726e2c8: 0.705
  32. f9b30c0d: 0.702
  33. 0feeb53c: 0.7
  34. a478891e: 0.699
  35. 28616143: 0.699
  36. Qwen2.5-VL-72B: 0.695

+ 30 more subjects evaluated.