Skip to main content

Reasoning & Knowledge

OOD-Prediction (Internal Causal Mechanisms)

OOD-Prediction correctness dataset (Huang et al., ICML 2025): released prompts for 5 tasks (IOI, PriceTag, RAVEL, MMLU, Unlearn Harry Potter) under in-distribution and OOD settings, each prompt labelled correct/wrong by the target model. subject = target model; item = the prompt; response = 1 correct / 0 wrong.

142,805items
2subjects
50%observed
MITlicense
generaldomain
reasoningdomain
textmodality

Response matrix

Fit to width. Hover for subject & item; click a cell for details.

OOD-Prediction (Internal Causal Mechanisms) response matrix: AI models (rows) against items (columns)
Correct (1)Incorrect (0)Unobserved

Scale: 1 = correct · 0 = incorrect

Sample items

Item 10% solve rate

<|start_header_id|>user<|end_header_id|>

[{"city": "Curico", "country": "Chile"}, {"city": "Osorio", "country": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct incorrect
Item 20% solve rate

The following are multiple choice questions (with answers) about college mathematics.

Let R be a ring with a multiplicative identity. If U is an additive subgroup of R such that ur in U for all u in U and for all r in R, then U is said to be a right ideal of R. If R has exactly two right ideals, which of the following must be true? I. R is commutative. II. R is a division ring (that is, all elements except the additive identity have multiplicative inverses). III. R is infinite.

  • A.

    I only

  • B.

    II only

  • C.

    III only

  • D.

    I and II only Answer:

Subject outcomes

  • Meta-Llama-3-8B-Instruct incorrect
Item 30% solve rate

<|start_header_id|>user<|end_header_id|>

[{"city": "Hubli", "country": "India"}, {"city": "Dhanbad", "country": "India"}, {"city": "Salama",<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"country": "

Subject outcomes

  • Meta-Llama-3-8B-Instruct incorrect
Item 40% solve rate

<|start_header_id|>user<|end_header_id|>

Does the following item cost between ₺2,73 and ₺8,18? Item: ₺8,65.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct incorrect
Item 50% solve rate

<|start_header_id|>user<|end_header_id|>

Does the item cost b/t $$5.16 && 8.8?!!Item:8.8?!! Item: 18.6.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Answer:

Subject outcomes

  • Meta-Llama-3-8B-Instruct incorrect
Item 60% solve rate

The following are multiple choice questions (with answers) about college medicine.

Oxygen is used:

  • A.

    in glycolysis.

  • B.

    in the conversion of fatty acids to acetyl CoA.

  • C.

    in the tricarboxylic acid cycle (Krebs' cycle).

  • D.

    in glycogenolysis. Answer:

Subject outcomes

  • Meta-Llama-3-8B-Instruct incorrect
Item 7100% solve rate

<|start_header_id|>user<|end_header_id|>

[{"city": "Estancia", "country": "Brazil"}, {"city": "Cebu", "country": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct correct
Item 8100% solve rate

<|start_header_id|>user<|end_header_id|>

Does the following item cost between ₺7,13 and ₺19,7? Item: ₺19.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct correct
Item 9100% solve rate

<|start_header_id|>user<|end_header_id|>

Does the following item cost between ₺4,47 and ₺10,2? Item: ₺3,62.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct correct
Item 10100% solve rate

<|start_header_id|>user<|end_header_id|>

Does the item cost b/t $$6.57 && 17.2?!!Item:17.2?!! Item: 13.8.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Answer:

Subject outcomes

  • Meta-Llama-3-8B-Instruct correct
Item 11100% solve rate

<|start_header_id|>user<|end_header_id|>

Does the following item cost between 5and5 and 13.8? Item: $9.98.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct correct
Item 12100% solve rate

<|start_header_id|>user<|end_header_id|>

Does the following item cost between 14.6and14.6 and 18.7? Item: $18.8.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Subject outcomes

  • Meta-Llama-3-8B-Instruct correct

Subjects

  1. 1Meta-Llama-3-8B-Instruct0.5105
  2. 2gpt20.5