Multimodal

OCRBench v2

OCRBench v2: a large-scale bilingual (English + Chinese) text-centric benchmark for LMMs spanning eight capabilities (text recognition, referring, spotting, relation extraction, element parsing, mathematical calculation, visual text understanding, knowledge reasoning) with task-tailored metrics. Per-example predictions are ingested for InternVL2.5-26B, the only model with a released per-instance prediction file. Each response is a per-example correctness flag, and the raw model output is kept in the trace.

4,893 items

1 subject

100% observed

Apache-2.0 license

general domain

reasoning domain

image modality

text modality


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 1 subject × 4,893 items, 100% of cells evaluated.
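The ordering described above can be sketched in a few lines. This is a minimal illustration, not the gallery's build script: it assumes every cell is observed, that "strongest" means highest mean score per row, and that "easiest" means highest solve rate per column. The function name `order_response_matrix` and the toy data are hypothetical.

```python
def order_response_matrix(matrix, subjects, items):
    """Sort rows by mean score and columns by solve rate, both descending.

    Assumes a fully observed matrix of 0/1 (or fractional) scores.
    """
    n_rows, n_cols = len(matrix), len(items)
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda r: -row_mean[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_mean[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return (ordered,
            [subjects[r] for r in row_order],
            [items[c] for c in col_order])


# Toy data for illustration only; not taken from the benchmark.
subjects = ["model_a", "model_b"]
items = ["q1", "q2", "q3"]
matrix = [[0, 1, 0],
          [1, 1, 0]]

ordered, ranked_subjects, ranked_items = order_response_matrix(
    matrix, subjects, items)
```

With a single subject, as on this page, only the column ordering has any effect.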

[Figure: OCRBench v2 response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: ["{\"Brand(s) Applicable\": [\"Newport Parent, Lights. & 120's\"], \"Coupon Expiration Date\": [\"9 /30 00\"], \"Circulation(#)\": [\"APPROX 600. 000\"], \"Est. Redemption\": [\"13%\"], \"IV\": [\"AR. AZ.A K CA CO FL ID IA. CT, ME, MASS, MN, MT, NE, NV, NM, NY ND, DK, OR, RL, SD, WA, DC, WY\"], \"From\": [\"Lynnette Stevens\"], \"CC\": [\"George Baroody\"], \"Code Assigned\": [\"05787\"], \"Media Name\": [\"Competitive 21- 34 years\"], \"Pack and/ or Carton?\": [\"PACK\"]}"]

[dataset=FUNSD; type=key information mapping en] According to the information in the image, please pair the corresponding keys and values below: Keys that need to be paired are 'From', 'CC', 'Coupon Expiration Date', 'Est. Redemption', 'Code Assigned', 'Brand(s) Applicable', 'Pack and/ or Carton?', 'Circulation(#)', 'Media Name', 'IV'. Values that need to be paired are "Newport Parent, Lights. & 120's", 'Competitive 21- 34 years', '13%', 'APPROX 600. 000', '9 /30 00', 'George Baroody', 'AR. AZ.A K CA CO FL ID IA. CT, ME, MASS, MN, MT, NE, NV, NM, NY ND, DK, OR, RL, SD, WA, DC, WY', '05787', 'PACK', 'Lynnette Stevens'. Extract the required information in the text, and return the result in a dict directly. Your answer should be in the JSON format: { "key1": "..", # The firt key and value "key2": ".." # The second key and value .. }

How subjects answered

InternVL2.5-26B incorrect

{
    "From": "Lynnette Stevens",
    "CC": "George Baroody",
    "Coupon Expiration Date": "9/30/00",
    "Est. Redemption": "13%",
    "Code Assigned": "05787",
    "Brand(s) Applicable": "Newport Parent, Lights. & 120's",
    "Pack and/ or Carton?": "PACK",
    "Circulation(#)": "APPROX 600.000",
    "Media Name": "Competitive 21- 34 years",
    "IV": "AR. AZ.A K CA CO FL ID IA. CT, ME, MASS, MN, MT, NE, NV, NM, NY ND, DK, OR, RL, SD,  WA, DC, WY"
}
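To see where a prediction like the one above departs from the reference, a field-by-field comparison helps. The sketch below uses strict string equality, which is an assumption: this page does not state the metric the benchmark applies to key information mapping. The helper `mismatched_keys` is hypothetical, and only four of the ten fields are reproduced to keep the example short.

```python
import json


def mismatched_keys(reference, prediction):
    """Return keys whose predicted value is not among the accepted values.

    `reference` maps each key to a list of accepted strings, mirroring the
    list-valued answer format shown for this item.
    """
    return sorted(key for key, accepted in reference.items()
                  if prediction.get(key) not in accepted)


# Subset of the reference answer for Item 1.
reference = {
    "From": ["Lynnette Stevens"],
    "Coupon Expiration Date": ["9 /30 00"],
    "Circulation(#)": ["APPROX 600. 000"],
    "Est. Redemption": ["13%"],
}

# Matching subset of the model output, parsed from JSON.
prediction = json.loads(
    '{"From": "Lynnette Stevens",'
    ' "Coupon Expiration Date": "9/30/00",'
    ' "Circulation(#)": "APPROX 600.000",'
    ' "Est. Redemption": "13%"}'
)

diff = mismatched_keys(reference, prediction)
print(diff)
```

Under this strict comparison the two fields that differ do so only in spacing and punctuation, so a more lenient metric could score them differently.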

Item 2 · 0% solve rate · answer: ["{'34': '9787535151308'}"]

[dataset=ICDAR23-SVRD; type=key information extraction cn] 从图中提取: 34, 并按json格式返回 (English: "Extract from the image: 34, and return it in JSON format")

How subjects answered

InternVL2.5-26B incorrect

{
  "序号": "34",
  "ISBN": "9788753515130",
  "书名": "好选择，好人生",
  "版别": "湖北教育出版社",
  "订数": "5"
}

Item 3 · 0% solve rate · answer: ["2110-111"]

[dataset=RVL_CDIP; type=cognition VQA en] What is the code for the address of the person mentioned in the document?

How subjects answered

InternVL2.5-26B incorrect
2101

Item 4 · 0% solve rate · answer: ["4", "884", "61", "930"]

[dataset=spotting_totaltext; type=text grounding en] Where is the region of the text 'VIS'? Output the normalized coordinates of the left-top and right-bottom corners of the bounding box. The coordinates should be normalized ranging from 0 to 1000 by the image width and height. Your answer should be in the following format: (x1, y1, x2, y2) # x1, y1, x2, y2 are the normalized coordinates of the bounding box.

How subjects answered

InternVL2.5-26B incorrect
(0, 0, 1000, 1000)
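A common way to compare a predicted box against a reference box is intersection over union (IoU). Whether the benchmark scores text grounding this way is an assumption here, since this page does not say. The sketch parses the answer format requested in the prompt and computes IoU for Item 4; `parse_box` and `iou` are hypothetical helper names.

```python
import re


def parse_box(text):
    """Extract four numbers from a string such as '(x1, y1, x2, y2)'."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if len(numbers) != 4:
        raise ValueError(f"expected 4 coordinates, found {len(numbers)}")
    return tuple(float(n) for n in numbers)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Reference answer and model output for Item 4, as shown on this page.
reference_box = tuple(float(v) for v in ["4", "884", "61", "930"])
predicted_box = parse_box("(0, 0, 1000, 1000)")

score = iou(predicted_box, reference_box)
print(round(score, 6))
```

Predicting the whole image always contains the target, but the union is then the full 1000 × 1000 area, so the overlap ratio is tiny (about 0.0026 here).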

Item 5 · 0% solve rate · answer: ["Precipitation is more abundant near the equator and less abundant at the poles"]

[dataset=M6Doc; type=text translation cn] I would be grateful if you could translate the text from this image, specifically the portion within the coordinates [218, 285, 639, 311], into English. The coordinates have been normalized ranging from 0 to 1000 by the image width and height. Please provide the translated content directly.

How subjects answered

InternVL2.5-26B incorrect
(3) Factors affecting precipitation: latitude, altitude, and terrain.

Item 6 · 0% solve rate

[dataset=mmmu_pro; type=science QA en] image This population graphs shows demographic info mation for:

A. Kenya
B. Kuweit
C. Kazakhstan
D. Korea
E. Kosovo
F. Kyrgyz Republic
G. Kuwait
H. Kiribati
I. Kyrgyzstan
J. Kyrzstan

Answer with the option's letter from the given choices directly.

How subjects answered

InternVL2.5-26B incorrect
C

Item 7 · 50% solve rate · answer: ["Hayao Miyazaki"]

[dataset=websrc; type=cognition VQA en] Who is responsible for the film's screenplay?

How subjects answered

InternVL2.5-26B correct
I. A. L. Diamond

Item 8 · 100% solve rate · answer: ["B"]

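[dataset=examination_paper; type=handwritten answer extraction cn] 在单项选择题第52题中，考生答题的手写内容是什么？请输出考生选择的选项 (English: "In single-choice question 52, what did the examinee write by hand? Output the option the examinee selected")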

How subjects answered

InternVL2.5-26B correct
B

Item 9 · 100% solve rate · answer: ["exit"]

[dataset=STVQA; type=cognition VQA en] What does the sign on the ceiling read?

How subjects answered

InternVL2.5-26B correct
Exit

Item 10 · 100% solve rate · answer: ["Due to various intra-organizational matters."]

[dataset=RVL_CDIP; type=cognition VQA en] What was the reason for the termination of the arrangement between the American Tobacco Company and the American company?

How subjects answered

InternVL2.5-26B correct
Various intra-organizational matters

Item 11 · 100% solve rate · answer: ["3.0", "version 3.0"]

[dataset=rico; type=APP agent en] Which type of version is used?

How subjects answered

InternVL2.5-26B correct
3.0

Item 12 · 100% solve rate · answer: ["tv"]

[dataset=textvqa; type=VQA with position en] what is the word on the popcorn cup? Output the answer with 'answer' and 'bbox'. 'bbox' refers to the bounding box position of the 'answer' content in the image. The output format is "answer:gt, bbox:(x1,y1,x2,y2)", where the bbox is the coordinates of the top-left corner and the bottom-right corners. The 'bbox' should be normalized coordinates ranging from 0 to 1000 by the image width and height. Your answer should be in the JSON format: { "answer": "..", # The answer "bbox": "(x1,y1,x2,y2)" # The bounding box position of the 'answer' }

How subjects answered

InternVL2.5-26B correct
{ "answer": "MTV", "bbox": "(0.6,0.6,0.8,0.8)" }

Subjects

The models, agents, and reward models evaluated.

1 subject, ranked by mean response (accuracy) across this benchmark's items.

1 · InternVL2.5-26B · 0.3715
