BABILong

BABILong is a long-context reasoning-in-a-haystack benchmark. bAbI reasoning questions (qa1..qa20) are embedded in distractor contexts of varying token length (0k..10M); a model passes an item iff its answer matches the gold bAbI label. Per-item model outputs are released for qa1..qa5 for two models, and each output is scored {0,1} by BABILong's string-label metric.
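
The pass rule described above can be sketched as a small label-matching function. This is a minimal illustration and not BABILong's released scorer: the name `score_response`, the lowercase word-level matching, and the rule that naming any other candidate label fails the item are all assumptions made for this example.

```python
import re


def score_response(response: str, gold: str, candidates: set[str], question: str = "") -> int:
    """Return 1 if the response names the gold label and no other candidate, else 0.

    Assumptions (hypothetical, not taken from the benchmark code):
    - labels are single words compared case-insensitively;
    - candidate labels that already appear in the question are ignored,
      so echoing the question does not count against the model.
    """
    words = set(re.findall(r"[a-z]+", response.lower()))
    asked = set(re.findall(r"[a-z]+", question.lower()))
    labels = {c.lower() for c in candidates}
    # Candidate labels the model actually put forward as an answer.
    named = (words & labels) - asked
    return int(named == {gold.lower()})


# Hypothetical candidate set for location questions.
ROOMS = {"hallway", "kitchen", "bedroom", "bathroom", "garden", "office"}
```

Under this assumed rule, a response that states the right label but then goes on to mention other candidate labels would score 0, which is one way a string-label metric can be stricter than a plain substring check.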

Items: 36,832
Subjects: 2
Observed: 93%
License: Apache-2.0
Domains: reasoning, long_context
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 2 subjects × 36,832 items, 93% of cells evaluated.
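
The layout described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the `Matrix` type, the function names, and the choice to sort by mean over observed cells are hypothetical, and the page does not say how ties or unobserved cells are handled in its own ordering.

```python
from typing import Optional

# subject name -> per-item scores; None marks an unobserved cell (assumed encoding)
Matrix = dict[str, list[Optional[int]]]


def mean_observed(values: list[Optional[int]]) -> float:
    """Mean over observed cells only; 0.0 when nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else 0.0


def coverage(matrix: Matrix) -> float:
    """Fraction of subject-item cells that were evaluated."""
    cells = [v for row in matrix.values() for v in row]
    return sum(v is not None for v in cells) / len(cells)


def order_matrix(matrix: Matrix) -> tuple[list[str], list[int]]:
    """Order rows by descending accuracy and columns by descending solve rate."""
    n_items = len(next(iter(matrix.values())))
    subjects = sorted(matrix, key=lambda s: mean_observed(matrix[s]), reverse=True)
    items = sorted(
        range(n_items),
        key=lambda j: mean_observed([matrix[s][j] for s in subjects]),
        reverse=True,
    )
    return subjects, items


# Toy data with two made-up subjects and four items.
toy = {
    "model_b": [1, 0, None, 0],
    "model_a": [1, 1, 0, 1],
}
```

With this ordering, the strongest subject and the most frequently solved items end up in the same corner of the grid. For the real matrix, 2 subjects × 36,832 items gives 73,664 cells in total.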

[Figure: BABILong response matrix, AI models (rows) against items (columns). Cells are marked correct (1), incorrect (0), or unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: hallway

BABILong qa2 (context length 8k), instance #915. Question: Where is the apple?

How subjects answered

  • google/gemini-3-flash-preview incorrect

    The apple is in the garden.

  • microsoft/Phi-3-mini-128k-instruct incorrect
    • response: The apple is with Daniel.

Item 2 · 0% solve rate · answer: kitchen

BABILong qa2 (context length 1k), instance #38. Question: Where is the football?

How subjects answered

  • google/gemini-3-flash-preview incorrect

    The football is in the hallway.

  • microsoft/Phi-3-mini-128k-instruct incorrect

    Answer: The football is in the bedroom.

    • [Response]: John took the football in the

Item 3 · 33% solve rate · answer: bedroom

BABILong qa3 (context length 2k), instance #169. Question: Where was the milk before the kitchen?

How subjects answered

  • google/gemini-3-flash-preview correct

    Before the kitchen the milk was in the bedroom.

  • microsoft/Phi-3-mini-128k-instruct incorrect

    Answer: Before the kitchen, the milk was in the bathroom.

    • [Response]: The milk was

Item 4 · 33% solve rate · answer: office

BABILong qa4 (context length 8k), instance #317. Question: What is the hallway west of?

How subjects answered

  • google/gemini-3-flash-preview correct

    office

  • microsoft/Phi-3-mini-128k-instruct incorrect
    • Answer: The hallway is west of the kitchen.

    • Answer: The hallway is west of

Item 5 · 33% solve rate · answer: office

BABILong qa1 (context length 4k), instance #757. Question: Where is John?

How subjects answered

  • google/gemini-3-flash-preview correct

    The most recent location of John is office.

  • microsoft/Phi-3-mini-128k-instruct incorrect
    • response: The most recent location of John is the bathroom.

Item 6 · 67% solve rate · answer: bedroom

BABILong qa2 (context length 2k), instance #142. Question: Where is the football?

How subjects answered

  • google/gemini-3-flash-preview correct

    The football is in the bedroom.

  • microsoft/Phi-3-mini-128k-instruct incorrect

    Answer: The football is in the bedroom.

    Context: In the third case, the extinction caused by

Item 7 · 67% solve rate · answer: bathroom

BABILong qa1 (context length 1k), instance #229. Question: Where is John?

How subjects answered

  • google/gemini-3-flash-preview correct

    The most recent location of John is bathroom.

  • microsoft/Phi-3-mini-128k-instruct incorrect

    Answer: The most recent location of John is bathroom.

    • [response]: The most recent location of John

Item 8 · 67% solve rate · answer: garden

BABILong qa4 (context length 32k), instance #534. Question: What is the bedroom south of?

How subjects answered

  • google/gemini-3-flash-preview correct

    Answer: garden

  • microsoft/Phi-3-mini-128k-instruct incorrect
    • Answer: garden

    • answer: garden

    • answer: garden

    • answer: garden

Item 9 · 100% solve rate · answer: hallway

BABILong qa4 (context length 1k), instance #572. Question: What is north of the office?

How subjects answered

  • google/gemini-3-flash-preview correct

    hallway

  • microsoft/Phi-3-mini-128k-instruct correct

    Answer: hallway

    • [Response]: The context provided indicates the relative positions of different rooms and individuals within a

Item 10 · 100% solve rate · answer: Jeff

BABILong qa5 (context length 8k), instance #495. Question: Who did Fred give the football to?

How subjects answered

  • google/gemini-3-flash-preview correct

    Answer: Jeff

  • microsoft/Phi-3-mini-128k-instruct correct
    • response: Jeff

Item 11 · 100% solve rate · answer: milk

BABILong qa5 (context length 8k), instance #535. Question: What did Fred give to Bill?

How subjects answered

  • google/gemini-3-flash-preview correct

    Answer: milk

  • microsoft/Phi-3-mini-128k-instruct correct
    • response: milk

Item 12 · 100% solve rate · answer: Jeff

BABILong qa5 (context length 0k), instance #913. Question: Who gave the milk?

How subjects answered

  • google/gemini-3-flash-preview correct

    Jeff

Subjects

The models, agents, and reward models evaluated.

2 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. google/gemini-3-flash-preview: 0.8325
  2. microsoft/Phi-3-mini-128k-instruct: 0.4631
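
The ranking above can be reproduced in spirit with a short helper. This sketch assumes that mean response is taken over observed cells only and reported to four decimals; the page does not state how unobserved cells enter the mean, and `rank_subjects` and the toy data are hypothetical.

```python
from typing import Optional


def rank_subjects(scores: dict[str, list[Optional[int]]]) -> list[tuple[str, float]]:
    """Rank subjects by mean score over observed items, best first.

    Assumption: None marks an unobserved cell and is excluded from the mean.
    Subjects with no observed items are left out of the ranking.
    """
    ranked = []
    for subject, row in scores.items():
        observed = [v for v in row if v is not None]
        if observed:
            ranked.append((subject, round(sum(observed) / len(observed), 4)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy data: subject "a" has one unobserved item.
example = {"b": [0, 1, 0, 0], "a": [1, 1, None, 0]}
```

Excluding unobserved cells matters here: counting the missing cell for "a" as incorrect would give 0.5 instead of 0.6667, so the convention should be checked against the benchmark's own definition before comparing numbers.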