Science & Engineering

CROP

CROP is a benchmark of 5,045 bilingual (Chinese/English) crop-science multiple-choice questions across three difficulty levels, derived from more than 2,000 crop-science academic papers. Each item asks the model to select the correct option; the recorded response is binary, 1 if the chosen option was correct and 0 otherwise. We ingest the authors' released per-(model, item) grading matrix for four commercial LLMs.

5,045 items

4 subjects

100% observed

CC-BY-NC-4.0 license

agriculture domain

science domain

text modality


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 4 subjects × 5,045 items, 100% of cells evaluated.
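The ordering described above can be sketched as follows. This is a minimal illustration, assuming rows are sorted by descending mean accuracy and columns by descending solve rate; the function name `order_response_matrix` and the toy data are hypothetical and are not taken from the CROP build script.

```python
def order_response_matrix(matrix, row_labels):
    """Sort rows by descending accuracy and columns by descending solve rate.

    `matrix` is a list of equal-length rows of 0/1 responses.
    Returns the reordered matrix, the reordered row labels, and the
    original column indices in their new order.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, [row_labels[r] for r in row_order], col_order


# Toy 3 x 4 matrix (hypothetical values, not CROP data).
toy_matrix = [
    [0, 1, 0, 0],  # model_a
    [1, 1, 0, 1],  # model_b
    [0, 1, 0, 1],  # model_c
]
toy_labels = ["model_a", "model_b", "model_c"]

ordered, labels, col_order = order_response_matrix(toy_matrix, toy_labels)
for label, row in zip(labels, ordered):
    print(label, row)
```

After sorting, the 1s in the toy matrix collect in the top-left corner, which is the pattern the description refers to.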

[Figure: CROP response matrix, AI models (rows) against items (columns). Cell legend: correct (1), incorrect (0), unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
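An item's solve rate is the fraction of subjects that answered it correctly, so with four subjects it can only take the values 0%, 25%, 50%, 75%, or 100%. A minimal sketch, assuming outcomes are stored as a mapping from subject name to a boolean; the helper `solve_rate` is hypothetical, and the outcomes are those listed for Item 2 below:

```python
def solve_rate(outcomes):
    """Fraction of subjects whose outcome is correct (True)."""
    return sum(outcomes.values()) / len(outcomes)


# Outcomes for Item 2 (benchmark ID 3718) as listed on this page.
item_2 = {
    "claude-3-opus-20240229": True,
    "gpt-3.5-turbo-0125": False,
    "gpt-4-turbo-2024-04-09": False,
    "qwen-max": False,
}

rate = solve_rate(item_2)
print(f"{rate:.0%}")  # 25%
```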

Item 1: 0% solve rate

CROP crop-science multiple-choice question (benchmark ID 5093; difficulty: difficult). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-3.5-turbo-0125 incorrect
gpt-4-turbo-2024-04-09 incorrect
claude-3-opus-20240229 incorrect
qwen-max incorrect

Item 2: 25% solve rate

CROP crop-science multiple-choice question (benchmark ID 3718; difficulty: difficult). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

claude-3-opus-20240229 correct
gpt-3.5-turbo-0125 incorrect
gpt-4-turbo-2024-04-09 incorrect
qwen-max incorrect

Item 3: 50% solve rate

CROP crop-science multiple-choice question (benchmark ID 2512; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
gpt-3.5-turbo-0125 incorrect
qwen-max incorrect

Item 4: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 1363; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct
gpt-3.5-turbo-0125 incorrect

Item 5: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 2904; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct
gpt-3.5-turbo-0125 incorrect

Item 6: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 1932; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct
gpt-3.5-turbo-0125 incorrect

Item 7: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 1209; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct
gpt-3.5-turbo-0125 incorrect

Item 8: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 4891; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct
gpt-3.5-turbo-0125 incorrect

Item 9: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 4889; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct
gpt-3.5-turbo-0125 incorrect

Item 10: 100% solve rate

CROP crop-science multiple-choice question (benchmark ID 1426; difficulty: easy). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-3.5-turbo-0125 correct
gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct

Item 11: 100% solve rate

CROP crop-science multiple-choice question (benchmark ID 4957; difficulty: easy). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-3.5-turbo-0125 correct
gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct

Item 12: 100% solve rate

CROP crop-science multiple-choice question (benchmark ID 4775; difficulty: easy). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

gpt-3.5-turbo-0125 correct
gpt-4-turbo-2024-04-09 correct
claude-3-opus-20240229 correct
qwen-max correct

Subjects

The models, agents, and reward models evaluated.

4 subjects, ranked by mean response (accuracy) across this benchmark's items.

1  claude-3-opus-20240229  0.9001
2  qwen-max  0.8664
3  gpt-4-turbo-2024-04-09  0.8569
4  gpt-3.5-turbo-0125  0.3286
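Because every cell is observed, each accuracy is a count of correct items divided by 5,045. The sketch below recovers approximate counts from the reported values. It assumes accuracy is the unweighted mean over all 5,045 items; the counts are inferred from accuracies rounded to four decimals, so treat them as estimates rather than published figures. The variable names are illustrative.

```python
N_ITEMS = 5045

# Accuracies as reported in the ranking above.
reported = {
    "claude-3-opus-20240229": 0.9001,
    "qwen-max": 0.8664,
    "gpt-4-turbo-2024-04-09": 0.8569,
    "gpt-3.5-turbo-0125": 0.3286,
}

# Nearest whole number of correct items implied by each accuracy (an estimate).
implied_correct = {model: round(acc * N_ITEMS) for model, acc in reported.items()}

# Rank subjects by accuracy, highest first.
ranking = sorted(reported, key=reported.get, reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(rank, model, reported[model], implied_correct[model])
```

Dividing each implied count by 5,045 and rounding to four decimals reproduces the reported accuracy, which is a quick consistency check on the table.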
