

CharXiv

CharXiv: realistic chart understanding for multimodal LLMs, built from real arXiv figures. The validation set has 1,000 charts, each with four descriptive questions (drawn from 19 fixed templates) and one free-form reasoning question, for 5,000 items in total. Every (model, question) pair receives a binary score: reasoning answers are scored by exact match, and descriptive answers are graded by GPT-4o against a rubric.

Items: 5,000
Subjects: 37
Observed: 100%
License: CC-BY-SA-4.0
Domain: general
Modality: image, text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 37 subjects × 5,000 items, 100% of cells evaluated.
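The ordering described above can be sketched in a few lines. This is a minimal illustration, not the page's actual code: it assumes rows are sorted by descending mean accuracy and columns by descending solve rate, and the function name, subject names, and toy data are all hypothetical.

```python
def order_response_matrix(matrix, subjects):
    """Sort rows by descending accuracy and columns by descending solve rate.

    `matrix` is a list of rows of 0/1 scores, one row per subject.
    Assumption: this sort criterion is a guess at how the display is ordered.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return [subjects[r] for r in row_order], col_order, ordered


# Hypothetical toy data: 3 subjects x 4 items.
toy_subjects = ["model_a", "model_b", "model_c"]
toy_matrix = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
names, col_order, ordered = order_response_matrix(toy_matrix, toy_subjects)
for name, row in zip(names, ordered):
    print(name, row)
```

With this ordering the correct cells (1) cluster in the top-left corner, which is what makes the staircase pattern visible in the matrix.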


[Figure: CharXiv response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
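An item's solve rate is the fraction of subjects that answered it correctly. The sketch below computes solve rates from a 0/1 response matrix and picks a spread of items across the difficulty range. The page does not say how its samples were chosen, so the evenly spaced selection, the function names, and the toy data are assumptions for illustration only.

```python
def solve_rates(matrix):
    """Fraction of subjects answering each item correctly (rows = subjects)."""
    n_subjects = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[c] for row in matrix) / n_subjects for c in range(n_items)]


def pick_spread(rates, k):
    """Pick k item indices evenly spaced along items sorted hardest to easiest."""
    by_difficulty = sorted(range(len(rates)), key=lambda c: rates[c])
    n = len(by_difficulty)
    if k >= n:
        return by_difficulty
    if k == 1:
        return [by_difficulty[n // 2]]
    # Integer arithmetic keeps the first and last picks at the extremes.
    return [by_difficulty[i * (n - 1) // (k - 1)] for i in range(k)]


# Hypothetical toy data: 4 subjects x 5 items.
toy_matrix = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]
rates = solve_rates(toy_matrix)
picked = pick_spread(rates, 3)
print(rates, picked)
```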

Item 1 · 0% solve rate · answer: simulated microlensing, 'random source, simulated CV

[Figure 1336] What are the names of the labels in the legend? (from top to bottom, then left to right)

How subjects answered

  • Cambrian-34B incorrect

    simulated microlesioning, simulated CV, random source

  • ChartGemma incorrect

    simulated microclensing, simulated CV, random source

  • Claude-3-5-Sonnet incorrect

    simulated microlensing, simulated CV, random source

  • Reka-Flash incorrect

    simulated microlensing, random source, simulated CV

  • SPHINX-V2-13B incorrect

    1.0, 1.2, 1.4

  • VILA-1-5-40B incorrect

    simulated microlesioning, random source

Item 2 · 8% solve rate · answer: Histogram of error by IHR (t_3)

Which subplot in the bottom row contains the bar with the highest density value?

How subjects answered

  • GPT-4o-mini correct

    Histogram of error by IHR (t₃)

  • GPT-4o correct

    Histogram of error by IHR ( _3")

  • Mini-Gemini-HD-Yi-34B correct

    Histogram of error by IHR (t(3))

  • Reka-Flash incorrect

    Histogram of error by IHR (t5) (C)

  • SPHINX-V2-13B incorrect

    Error (C)

  • VILA-1-5-40B incorrect

    Histogram of error by HR(c)

Item 3 · 16% solve rate · answer: 0.6

What is the approximate difference in key rate between n = 10^7 and n = 10^15 at a QBER of 0%?

How subjects answered

  • Cambrian-34B correct

    0.6

  • Claude-3-5-Sonnet correct

    0.6

  • Reka-Core correct

    0.6

  • Reka-Flash incorrect

    0.2

  • SPHINX-V2-13B incorrect
  • VILA-1-5-40B incorrect

    0.1

Item 4 · 27% solve rate · answer: Joint-CNN

Which model shows a greater decline in accuracy from Session 1 to Session 9 in the 5-way full-shot scenario?

How subjects answered

  • Cambrian-34B correct

    Joint-CNN

  • Claude-3-Opus correct

    Joint-CNN

  • InternLM-XComposer2-4KHD correct

    Joint-CNN

  • Reka-Edge incorrect

    iCaRL*

  • Reka-Flash incorrect

    Ft-CNN

  • SPHINX-V2-13B incorrect

    F-CNN

Item 5 · 35% solve rate · answer: 10^-2

[Figure 925] What is the spatially lowest labeled tick on the y-axis?

How subjects answered

  • Claude-3-Opus correct

    10^-2

  • Claude-3-Haiku correct

    10^-2

  • GPT-4o correct

    10^{-2}

  • Qwen-VL-Plus incorrect

    1e-08

  • SPHINX-V2-13B incorrect

    10

  • VILA-1-5-40B incorrect

    10

Item 6 · 43% solve rate · answer: 1

[Figure 722] What is the number of subplots?

How subjects answered

  • Gemini-1.0-Pro-Vision correct
  • Gemini-1-5-Pro correct
  • GPT-4o correct
  • Qwen-VL-Max incorrect
  • Reka-Flash incorrect
  • Reka-Edge incorrect

Item 7 · 51% solve rate · answer: Not Applicable

[Figure 1463] What is difference between consecutive numerical tick values on the y-axis?

How subjects answered

  • Claude-3-5-Sonnet correct

    Not Applicable

  • GPT-4o-mini correct
  • GPT-4o correct

    Not Applicable

  • Qwen-VL-Max incorrect
  • SPHINX-V2-13B incorrect
  • VILA-1-5-40B incorrect

Item 8 · 57% solve rate · answer: PPL (10^2)

[Figure 30] What is the label of the y-axis?

How subjects answered

  • Claude-3-Opus correct

    PPL (10^2)

  • GPT-4o correct

    PPL (10^2)

  • GPT-4V-Turbo correct

    PPL (10^2)

  • Qwen-VL-Plus incorrect

    PPL (102^2)

  • Reka-Edge incorrect

    PPL (102)

  • SPHINX-V2-13B incorrect

    PP (10^2)

Item 9 · 62% solve rate · answer: Yield

[Figure 1880] What is the label of the y-axis?

How subjects answered

  • Claude-3-Opus correct

    yield

  • Claude-3-Haiku correct

    Yield

  • GPT-4V-Turbo correct

    Yield

  • MoAI incorrect

    ylabel

  • Phi-3-Vision incorrect

    yield

  • SPHINX-V2-13B incorrect

Item 10 · 70% solve rate · answer: 10

[Figure 1581] What is the rightmost labeled tick on the x-axis?

How subjects answered

  • Cambrian-34B correct

    10

  • Claude-3-Opus correct
  • Claude-3-Haiku correct
  • Reka-Edge incorrect

    1.0

  • SPHINX-V2-13B incorrect
  • VILA-1-5-40B incorrect

Item 11 · 78% solve rate · answer: 0.2

[Figure 326] What is difference between consecutive numerical tick values on the x-axis?

How subjects answered

  • Cambrian-34B correct
  • Claude-3-5-Sonnet correct
  • Claude-3-Haiku correct
  • MiniCPM-V2 incorrect

    Not Applicable

  • Ovis1.5-Llama3-8B incorrect
  • SPHINX-V2-13B incorrect

Item 12 · 86% solve rate · answer: 4

[Figure 1664] How many discrete labels are there in the legend?

How subjects answered

  • Cambrian-34B correct
  • ChartGemma correct
  • Claude-3-5-Sonnet correct
  • GPT-4V-Turbo incorrect
  • LLaVA-1-6-Yi-34B incorrect

    Not Applicable

  • SPHINX-V2-13B incorrect

Subjects

The models, agents, and reward models evaluated.

37 subjects, ranked by mean response (accuracy) across this benchmark's items.
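A ranking by mean response can be sketched as follows. This is an illustration under stated assumptions, not the page's implementation: responses are taken to be 1 (correct), 0 (incorrect), or None (unobserved), the mean is taken over observed cells only, and the function names and toy data are hypothetical.

```python
def mean_response(row):
    """Mean of the observed 0/1 scores in a row; None if nothing was observed."""
    observed = [v for v in row if v is not None]
    return sum(observed) / len(observed) if observed else None


def rank_subjects(responses):
    """Return (rank, subject, mean) tuples, best first, means rounded to 4 places."""
    scored = {name: mean_response(row) for name, row in responses.items()}
    ranked = sorted(
        (name for name in scored if scored[name] is not None),
        key=lambda name: -scored[name],
    )
    return [(i + 1, name, round(scored[name], 4)) for i, name in enumerate(ranked)]


# Hypothetical toy data; None marks an unobserved cell.
toy_responses = {
    "model_a": [1, 0, None, 1],
    "model_b": [1, 1, 1, 0],
    "model_c": [0, 0, 1, None],
}
ranking = rank_subjects(toy_responses)
for rank, name, score in ranking:
    print(rank, name, score)
```

Averaging over observed cells only means a subject is not penalized for items it was never evaluated on. On this benchmark the distinction does not matter, because 100% of cells are observed.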

  1. Claude-3-5-Sonnet: 0.7964
  2. GPT-4o: 0.7698
  3. GPT-4V-Turbo: 0.7136
  4. InternVL2-Pro: 0.6949
  5. InternVL2-76B: 0.6792
  6. GPT-4o-mini: 0.6676
  7. Gemini-1-5-Pro: 0.6624
  8. Claude-3-Sonnet: 0.654
  9. Claude-3-Opus: 0.6329
  10. Claude-3-Haiku: 0.5846
  11. InternVL2-26B: 0.566
  12. Ovis1.5-Gemma2-9B: 0.5582
  13. Phi-3-Vision: 0.547
  14. Ovis1.5-Llama3-8B: 0.5392
  15. Cambrian-34B: 0.5324
  16. InternVL-Chat-V1-5: 0.5265
  17. MiniCPM-V2-5: 0.5245
  18. Reka-Flash: 0.5051
  19. Reka-Core: 0.5028
  20. InternLM-XComposer2-4KHD: 0.4872
  21. Gemini-1.0-Pro-Vision: 0.4807
  22. Mini-Gemini-HD-Yi-34B: 0.4716
  23. LLaVA-1-6-Yi-34B: 0.4536
  24. DeepSeek-VL: 0.4008
  25. Mini-Gemini-HD-LLaMA3-8B: 0.3938
  26. Qwen-VL-Max: 0.3812
  27. IDEFICS-2-Chatty: 0.3682
  28. VILA-1-5-40B: 0.3574
  29. InternLM-XComposer2: 0.3476
  30. MiniCPM-V2: 0.324
  31. Reka-Edge: 0.3163
  32. LLaVA-1-6-Mistral-7B: 0.3116
  33. IDEFICS-2: 0.2986
  34. SPHINX-V2-13B: 0.2744
  35. MoAI: 0.2647
  36. Qwen-VL-Plus: 0.2635