Multimodal

CharXiv

CharXiv tests realistic chart understanding in multimodal LLMs using figures taken from real arXiv papers. The validation set has 1,000 charts, each with four descriptive questions (drawn from 19 fixed templates) and one free-form reasoning question, for 5,000 items in total. Each (model, question) pair receives one score: reasoning answers are scored by exact match, and descriptive answers are graded by GPT-4o against a rubric.

5,000 items

37 subjects

100% observed

CC-BY-SA-4.0 license

general domain

image modality

text modality


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 37 subjects × 5,000 items, 100% of cells evaluated.

[Figure: CharXiv response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]
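
The ordering described above can be sketched as follows. This is a minimal illustration, assuming the matrix is a binary NumPy array with models as rows and items as columns, and that rows and columns are sorted by their means. The function name `order_response_matrix`, the choice of the top-left corner, and the toy values are assumptions for illustration, not the gallery's actual build code or CharXiv data.

```python
import numpy as np

def order_response_matrix(R):
    """Reorder a 0/1 response matrix so that strong models (rows)
    and easy items (columns) gather toward the top-left corner."""
    R = np.asarray(R, dtype=float)
    # Highest mean accuracy first; a stable sort keeps ties in input order.
    row_order = np.argsort(-R.mean(axis=1), kind="stable")
    # Highest solve rate (easiest item) first.
    col_order = np.argsort(-R.mean(axis=0), kind="stable")
    return R[np.ix_(row_order, col_order)], row_order, col_order

# Toy 3-model x 4-item matrix (made-up values, not CharXiv data).
toy = np.array([
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
])
ordered, rows, cols = order_response_matrix(toy)
print(rows, cols)
print(ordered)
```

Sorting by marginal means is only one plausible way to produce such a layout; the page does not state which ordering method it uses.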

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
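
As a rough sketch of how such a spread might be chosen, the snippet below computes each item's solve rate (the share of subjects that answered it correctly) and picks items at evenly spaced positions in the difficulty ranking. The helper names `solve_rates` and `pick_spread`, the even-spacing rule, and the toy matrix are assumptions; the page does not describe how its sample items were actually selected.

```python
import numpy as np

def solve_rates(R):
    """Per-item solve rate: the column mean of a 0/1 response matrix."""
    return np.asarray(R, dtype=float).mean(axis=0)

def pick_spread(rates, k):
    """Pick k item indices spread evenly from hardest to easiest."""
    order = np.argsort(rates, kind="stable")  # hardest item first
    positions = np.linspace(0, len(order) - 1, num=k).round().astype(int)
    return order[positions]

# Toy 4-subject x 5-item matrix (made-up values, not CharXiv data).
toy = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1],
])
rates = solve_rates(toy)
picked = pick_spread(rates, k=3)
print(rates)
print(picked, rates[picked])
```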

Item 1 · 0% solve rate · answer: simulated microlensing, 'random source, simulated CV

[Figure 1336] What are the names of the labels in the legend? (from top to bottom, then left to right)

How subjects answered

Cambrian-34B incorrect
simulated microlesioning, simulated CV, random source
ChartGemma incorrect
simulated microclensing, simulated CV, random source
Claude-3-5-Sonnet incorrect
simulated microlensing, simulated CV, random source
Reka-Flash incorrect
simulated microlensing, random source, simulated CV
SPHINX-V2-13B incorrect
1.0, 1.2, 1.4
VILA-1-5-40B incorrect
simulated microlesioning, random source

Item 2 · 8% solve rate · answer: Histogram of error by IHR (t_3)

Which subplot in the bottom row contains the bar with the highest density value?

How subjects answered

GPT-4o-mini correct
Histogram of error by IHR (t₃)
GPT-4o correct
Histogram of error by IHR ( _3")
Mini-Gemini-HD-Yi-34B correct
Histogram of error by IHR (t(3))
Reka-Flash incorrect
Histogram of error by IHR (t5) (C)
SPHINX-V2-13B incorrect
Error (C)
VILA-1-5-40B incorrect
Histogram of error by HR(c)

Item 3 · 16% solve rate · answer: 0.6

What is the approximate difference in key rate between n = 10^7 and n = 10^15 at a QBER of 0%?

How subjects answered

Cambrian-34B correct
0.6
Claude-3-5-Sonnet correct
0.6
Reka-Core correct
0.6
Reka-Flash incorrect
0.2
SPHINX-V2-13B incorrect
VILA-1-5-40B incorrect
0.1

Item 4 · 27% solve rate · answer: Joint-CNN

Which model shows a greater decline in accuracy from Session 1 to Session 9 in the 5-way full-shot scenario?

How subjects answered

Cambrian-34B correct
Joint-CNN
Claude-3-Opus correct
Joint-CNN
InternLM-XComposer2-4KHD correct
Joint-CNN
Reka-Edge incorrect
iCaRL*
Reka-Flash incorrect
Ft-CNN
SPHINX-V2-13B incorrect
F-CNN

Item 5 · 35% solve rate · answer: 10^-2

[Figure 925] What is the spatially lowest labeled tick on the y-axis?

How subjects answered

Claude-3-Opus correct
10^-2
Claude-3-Haiku correct
10^-2
GPT-4o correct
10^{-2}
Qwen-VL-Plus incorrect
1e-08
SPHINX-V2-13B incorrect
10
VILA-1-5-40B incorrect
10

Item 6 · 43% solve rate · answer: 1

[Figure 722] What is the number of subplots?

How subjects answered

Gemini-1.0-Pro-Vision correct
Gemini-1-5-Pro correct
GPT-4o correct
Qwen-VL-Max incorrect
Reka-Flash incorrect
Reka-Edge incorrect

Item 7 · 51% solve rate · answer: Not Applicable

[Figure 1463] What is difference between consecutive numerical tick values on the y-axis?

How subjects answered

Claude-3-5-Sonnet correct
Not Applicable
GPT-4o-mini correct
GPT-4o correct
Not Applicable
Qwen-VL-Max incorrect
SPHINX-V2-13B incorrect
VILA-1-5-40B incorrect

Item 8 · 57% solve rate · answer: PPL (10^2)

[Figure 30] What is the label of the y-axis?

How subjects answered

Claude-3-Opus correct
PPL (10^2)
GPT-4o correct
PPL (10^2)
GPT-4V-Turbo correct
PPL (10^2)
Qwen-VL-Plus incorrect
PPL (10 $^2$ )
Reka-Edge incorrect
PPL (102)
SPHINX-V2-13B incorrect
PP (10^2)

Item 9 · 62% solve rate · answer: Yield

[Figure 1880] What is the label of the y-axis?

How subjects answered

Claude-3-Opus correct
yield
Claude-3-Haiku correct
Yield
GPT-4V-Turbo correct
Yield
MoAI incorrect
ylabel
Phi-3-Vision incorrect
yield
SPHINX-V2-13B incorrect

Item 10 · 70% solve rate · answer: 10

[Figure 1581] What is the rightmost labeled tick on the x-axis?

How subjects answered

Cambrian-34B correct
10
Claude-3-Opus correct
Claude-3-Haiku correct
Reka-Edge incorrect
1.0
SPHINX-V2-13B incorrect
VILA-1-5-40B incorrect

Item 11 · 78% solve rate · answer: 0.2

[Figure 326] What is difference between consecutive numerical tick values on the x-axis?

How subjects answered

Cambrian-34B correct
Claude-3-5-Sonnet correct
Claude-3-Haiku correct
MiniCPM-V2 incorrect
Not Applicable
Ovis1.5-Llama3-8B incorrect
SPHINX-V2-13B incorrect

Item 12 · 86% solve rate · answer: 4

[Figure 1664] How many discrete labels are there in the legend?

How subjects answered

Cambrian-34B correct
ChartGemma correct
Claude-3-5-Sonnet correct
GPT-4V-Turbo incorrect
LLaVA-1-6-Yi-34B incorrect
Not Applicable
SPHINX-V2-13B incorrect

Subjects

The models, agents, and reward models evaluated.

37 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. Claude-3-5-Sonnet 0.7964
2. GPT-4o 0.7698
3. GPT-4V-Turbo 0.7136
4. InternVL2-Pro 0.6949
5. InternVL2-76B 0.6792
6. GPT-4o-mini 0.6676
7. Gemini-1-5-Pro 0.6624
8. Claude-3-Sonnet 0.654
9. Claude-3-Opus 0.6329
10. Claude-3-Haiku 0.5846
11. InternVL2-26B 0.566
12. Ovis1.5-Gemma2-9B 0.5582
13. Phi-3-Vision 0.547
14. Ovis1.5-Llama3-8B 0.5392
15. Cambrian-34B 0.5324
16. InternVL-Chat-V1-5 0.5265
17. MiniCPM-V2-5 0.5245
18. Reka-Flash 0.5051
19. Reka-Core 0.5028
20. InternLM-XComposer2-4KHD 0.4872
21. Gemini-1.0-Pro-Vision 0.4807
22. Mini-Gemini-HD-Yi-34B 0.4716
23. LLaVA-1-6-Yi-34B 0.4536
24. DeepSeek-VL 0.4008
25. Mini-Gemini-HD-LLaMA3-8B 0.3938
26. Qwen-VL-Max 0.3812
27. IDEFICS-2-Chatty 0.3682
28. VILA-1-5-40B 0.3574
29. InternLM-XComposer2 0.3476
30. MiniCPM-V2 0.324
31. Reka-Edge 0.3163
32. LLaVA-1-6-Mistral-7B 0.3116
33. IDEFICS-2 0.2986
34. SPHINX-V2-13B 0.2744
35. MoAI 0.2647
36. Qwen-VL-Plus 0.2635
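
A ranking of this kind can be reproduced from a response matrix by averaging each subject's row and sorting. The sketch below assumes one row per subject and one column per item. The function name `rank_subjects`, the placeholder subject names, and the toy values are hypothetical and not taken from the CharXiv data.

```python
import numpy as np

def rank_subjects(names, R):
    """Rank subjects by mean response (accuracy) across all items."""
    means = np.asarray(R, dtype=float).mean(axis=1)
    # Highest mean first; a stable sort keeps ties in input order.
    order = np.argsort(-means, kind="stable")
    return [(rank + 1, names[i], round(float(means[i]), 4))
            for rank, i in enumerate(order)]

# Toy 3-subject x 4-item matrix (made-up values, not CharXiv data).
names = ["model-a", "model-b", "model-c"]
toy = [
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
ranking = rank_subjects(names, toy)
for rank, name, score in ranking:
    print(f"{rank}. {name} {score}")
```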
