

CROP

CROP benchmark: 5,045 bilingual (Chinese/English) crop-science multiple-choice questions across three difficulty levels, derived from more than 2,000 crop-science academic papers. Each item asks the model to select the correct option; the recorded response is 1 if the chosen option is correct and 0 otherwise. We ingest the authors' released per-(model, item) grading matrix for four commercial LLMs.

5,045 items
4 subjects
100% observed
CC-BY-NC-4.0 license
agriculture domain
science domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 4 subjects × 5,045 items, 100% of cells evaluated.
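One plausible way to produce such an ordering is to sort rows by mean accuracy and columns by solve rate, both descending. The sketch below assumes that rule; the page does not state the exact sorting method, and the model names and values are made up for illustration.

```python
# Toy response matrix: rows are models, columns are items; 1 = correct, 0 = incorrect.
# Names and values are hypothetical, not taken from the CROP data.
responses = {
    "model_a": [1, 0, 1, 1],
    "model_b": [0, 0, 1, 0],
    "model_c": [1, 1, 1, 1],
}


def order_matrix(responses):
    """Return (row_order, col_order) with strongest models and easiest items first."""
    n_items = len(next(iter(responses.values())))
    # Rows: sort models by mean accuracy, highest first.
    row_order = sorted(responses, key=lambda m: -sum(responses[m]) / n_items)
    # Columns: sort items by the share of models that solved them, highest first.
    solve = [
        sum(row[j] for row in responses.values()) / len(responses)
        for j in range(n_items)
    ]
    col_order = sorted(range(n_items), key=lambda j: -solve[j])
    return row_order, col_order


row_order, col_order = order_matrix(responses)
```

Because Python's sort is stable, items with equal solve rates keep their original relative order.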


[Figure: CROP response matrix, AI models (rows) against items (columns). Cells are marked correct (1), incorrect (0), or unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
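With four subjects, an item's solve rate can only be 0%, 25%, 50%, 75%, or 100%. As a minimal sketch, the rate is the fraction of subjects marked correct; the example uses the outcomes listed on this page for benchmark ID 3718, and the function name `solve_rate` is an assumption for illustration.

```python
# Outcomes for the sample item with benchmark ID 3718 (1 = correct, 0 = incorrect).
outcomes = {
    "claude-3-opus-20240229": 1,
    "gpt-3.5-turbo-0125": 0,
    "gpt-4-turbo-2024-04-09": 0,
    "qwen-max": 0,
}


def solve_rate(outcomes):
    """Fraction of subjects that answered the item correctly."""
    return sum(outcomes.values()) / len(outcomes)


rate = solve_rate(outcomes)
print(f"{rate:.0%} solve rate")  # prints "25% solve rate"
```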

Item 1: 0% solve rate

CROP crop-science multiple-choice question (benchmark ID 5093; difficulty: difficult). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-3.5-turbo-0125 incorrect
  • gpt-4-turbo-2024-04-09 incorrect
  • claude-3-opus-20240229 incorrect
  • qwen-max incorrect
Item 2: 25% solve rate

CROP crop-science multiple-choice question (benchmark ID 3718; difficulty: difficult). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • claude-3-opus-20240229 correct
  • gpt-3.5-turbo-0125 incorrect
  • gpt-4-turbo-2024-04-09 incorrect
  • qwen-max incorrect
Item 3: 50% solve rate

CROP crop-science multiple-choice question (benchmark ID 2512; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • gpt-3.5-turbo-0125 incorrect
  • qwen-max incorrect
Item 4: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 1363; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
  • gpt-3.5-turbo-0125 incorrect
Item 5: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 2904; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
  • gpt-3.5-turbo-0125 incorrect
Item 6: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 1932; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
  • gpt-3.5-turbo-0125 incorrect
Item 7: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 1209; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
  • gpt-3.5-turbo-0125 incorrect
Item 8: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 4891; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
  • gpt-3.5-turbo-0125 incorrect
Item 9: 75% solve rate

CROP crop-science multiple-choice question (benchmark ID 4889; difficulty: moderate). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
  • gpt-3.5-turbo-0125 incorrect
Item 10: 100% solve rate

CROP crop-science multiple-choice question (benchmark ID 1426; difficulty: easy). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-3.5-turbo-0125 correct
  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
Item 11: 100% solve rate

CROP crop-science multiple-choice question (benchmark ID 4957; difficulty: easy). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-3.5-turbo-0125 correct
  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct
Item 12: 100% solve rate

CROP crop-science multiple-choice question (benchmark ID 4775; difficulty: easy). The model must select the single correct option (A-D); scored 1 if correct.

Subject outcomes

  • gpt-3.5-turbo-0125 correct
  • gpt-4-turbo-2024-04-09 correct
  • claude-3-opus-20240229 correct
  • qwen-max correct

Subjects

The models, agents, and reward models evaluated.

4 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. claude-3-opus-20240229: 0.9001
  2. qwen-max: 0.8664
  3. gpt-4-turbo-2024-04-09: 0.8569
  4. gpt-3.5-turbo-0125: 0.3286
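Since every cell is observed, each mean accuracy is a model's correct count divided by 5,045. The sketch below reproduces the ranking from the listed accuracies and estimates the implied correct counts. Those counts are an inference from rounded values, not figures published with the benchmark, and the variable names are illustrative.

```python
N_ITEMS = 5045  # item count reported on this page

# Mean accuracies as listed in the ranking above.
accuracy = {
    "claude-3-opus-20240229": 0.9001,
    "qwen-max": 0.8664,
    "gpt-4-turbo-2024-04-09": 0.8569,
    "gpt-3.5-turbo-0125": 0.3286,
}

# Rank subjects by mean accuracy, highest first.
ranking = sorted(accuracy, key=accuracy.get, reverse=True)

# Estimated correct-answer counts (assumption: derived by rounding
# accuracy * N_ITEMS; the published accuracies are themselves rounded).
implied_correct = {model: round(acc * N_ITEMS) for model, acc in accuracy.items()}

for rank, model in enumerate(ranking, start=1):
    print(rank, model, accuracy[model], implied_correct[model])
```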