Reasoning & Knowledge

Few-Shot TTT (BIG-Bench Hard)

Per-item BIG-Bench Hard predictions for Llama-3.1-8B-Instruct under zero-shot, few-shot ICL, and test-time training (TTT) interventions.

1,310 items
11 subjects
50% observed
MIT license
reasoning domain
text modality

Response matrix

Few-Shot TTT (BIG-Bench Hard) response matrix: AI models (rows) against items (columns)
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect
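
Per-item solve rates and per-subject scores can be derived from a response matrix of this shape by averaging over observed cells only. The following is a minimal sketch under that assumption; the page does not state how its own figures are computed, and the toy matrix, subject names, and function names below are hypothetical.

```python
import math

NAN = float("nan")

# Hypothetical toy response matrix: rows are subjects, columns are items.
# 1.0 = correct, 0.0 = incorrect, NaN = unobserved.
responses = {
    "subject_a": [1.0, 0.0, NAN, 1.0],
    "subject_b": [NAN, 0.0, 1.0, 1.0],
    "subject_c": [0.0, NAN, NAN, 1.0],
}


def mean_observed(values):
    """Average the observed (non-NaN) cells; NaN if nothing was observed."""
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed) if observed else NAN


def item_solve_rates(matrix):
    """Solve rate per item (column), ignoring unobserved cells."""
    return [mean_observed(column) for column in zip(*matrix.values())]


def subject_accuracies(matrix):
    """Accuracy per subject (row), ignoring unobserved cells."""
    return {name: mean_observed(row) for name, row in matrix.items()}


def observed_fraction(matrix):
    """Share of cells in the matrix that hold an observation."""
    cells = [v for row in matrix.values() for v in row]
    return sum(not math.isnan(v) for v in cells) / len(cells)


print(item_solve_rates(responses))    # [0.5, 0.0, 1.0, 1.0]
print(subject_accuracies(responses))
print(observed_fraction(responses))
```

Because unobserved cells are skipped rather than counted as incorrect, two subjects with the same accuracy may have been scored on different subsets of items.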

Sample items

Item 1 (0% solve rate)

Alice, Bob, and Claire are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Sam, Bob is dancing with Jamie, and Claire is dancing with Melissa. Throughout the song, the dancers often trade partners. First, Alice and Claire switch partners. Then, Claire and Bob switch partners. Finally, Alice and Bob switch partners. At the end of the dance, Alice is dancing with Options:

  • A.

    Sam

  • B.

    Jamie

  • C.

    Melissa

How subjects answered

  • Llama-3.1-8B-Instruct (zero-shot) incorrect

    (b)
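
Partner-tracking items like this one can be checked mechanically by applying each swap in order. The sketch below is only an illustration of that bookkeeping, not part of the dataset or of how any subject was evaluated; the function name `track_swaps` is hypothetical, and the names and swaps are taken from the prompt above.

```python
def track_swaps(initial, swaps):
    """Apply pairwise swaps in order and return the final assignment."""
    state = dict(initial)  # copy so the caller's mapping is left untouched
    for first, second in swaps:
        state[first], state[second] = state[second], state[first]
    return state


# Starting partners and swap sequence as stated in the prompt.
partners = {"Alice": "Sam", "Bob": "Jamie", "Claire": "Melissa"}
swaps = [("Alice", "Claire"), ("Claire", "Bob"), ("Alice", "Bob")]

final = track_swaps(partners, swaps)
print(final["Alice"])  # Sam
```

Under a literal reading of the prompt, Alice ends with Sam (option A), which is consistent with the recorded answer (b) being marked incorrect.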

Item 2 (0% solve rate)

The following translations from German to English contain a particular error. That error will be one of the following types: Named Entities: An entity (names, places, locations, etc.) is changed to a different entity. Numerical Values: Numerical values (ordinals or cardinals), dates, and/or units are changed. Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are changed. Negation or Antonyms: Introduce or remove a negation or change comparatives to their antonyms. Facts: Trivial factual errors not pertaining to the above classes are introduced in the translations. Dropped Content: A significant clause in the translation is removed. Please identify that error. Source: Die Stockpeilung ist eine Methode, um die Höhe eines relativ nahestehenden Objektes zu schätzen. Translation: Stock bearing is a method of estimating the weight of a relatively close object. The translation contains an error pertaining to Options:

  • A.

    Modifiers or Adjectives

  • B.

    Numerical Values

  • C.

    Negation or Antonyms

  • D.

    Named Entities

  • E.

    Dropped Content

  • F.

    Facts

How subjects answered

  • Llama-3.1-8B-Instruct (zero-shot) incorrect

    (a)

Item 3 (0% solve rate)

Which sentence has the correct adjective order: Options:

  • A.

    massive cardboard good Thai sweater

  • B.

    good massive Thai cardboard sweater

How subjects answered

  • Llama-3.1-8B-Instruct (zero-shot) incorrect

    (a)

Item 4 (0% solve rate)

The following translations from German to English contain a particular error. That error will be one of the following types: Named Entities: An entity (names, places, locations, etc.) is changed to a different entity. Numerical Values: Numerical values (ordinals or cardinals), dates, and/or units are changed. Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are changed. Negation or Antonyms: Introduce or remove a negation or change comparatives to their antonyms. Facts: Trivial factual errors not pertaining to the above classes are introduced in the translations. Dropped Content: A significant clause in the translation is removed. Please identify that error. Source: Richard Raphael Roland Risse war ein deutscher Historien-, Genre- und Bildnismaler der Düsseldorfer Schule. Translation: Risse was a German historical, genre and portrait painter of the Düsseldorf School. The translation contains an error pertaining to Options:

  • A.

    Modifiers or Adjectives

  • B.

    Numerical Values

  • C.

    Negation or Antonyms

  • D.

    Named Entities

  • E.

    Dropped Content

  • F.

    Facts

How subjects answered

  • Llama-3.1-8B-Instruct (10-shot ICL) incorrect

    (f)

  • Llama-3.1-8B-Instruct (10-shot ICL, majority vote) incorrect

    (f)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) incorrect

    (e)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote) incorrect

    (e)

  • Llama-3.1-8B-Instruct + Shared E2E direct I/O (10-shot) incorrect

    (e)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) incorrect

    (e)

Item 5 (0% solve rate)

Alice, Bob, Claire, Dave, Eve, Fred, and Gertrude are holding a white elephant gift exchange. At the start of the event, they are each holding a present of a different color: Alice has a orange ball, Bob has a brown present, Claire has a pink ball, Dave has a blue present, Eve has a green present, Fred has a yellow present, and Gertrude has a white present. As the event progresses, pairs of people swap gifts. First, Dave and Gertrude swap their gifts. Then, Gertrude and Alice swap their gifts. Then, Claire and Bob swap their gifts. Then, Eve and Claire swap their gifts. Then, Fred and Alice swap their gifts. Then, Gertrude and Alice swap their gifts. Finally, Bob and Gertrude swap their gifts. At the end of the event, Gertrude has the Options:

  • A.

    orange ball

  • B.

    brown present

  • C.

    pink ball

  • D.

    blue present

  • E.

    green present

  • F.

    yellow present

  • G.

    white present

How subjects answered

  • Llama-3.1-8B-Instruct (10-shot ICL) incorrect

    (b)

  • Llama-3.1-8B-Instruct (10-shot ICL, majority vote) incorrect

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) incorrect

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote) incorrect

    (b)

  • Llama-3.1-8B-Instruct + Shared E2E direct I/O (10-shot) incorrect

    (b)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) incorrect

    (b)

Item 6 (20% solve rate)

On the table, you see a bunch of things arranged in a row: a pink crayon, a black pair of sunglasses, and an orange scrunchiephone charger. What is the color of the thing directly to the right of the crayon? Options:

  • A.

    red

  • B.

    orange

  • C.

    yellow

  • D.

    green

  • E.

    blue

  • F.

    brown

  • G.

    magenta

  • H.

    fuchsia

  • I.

    mauve

  • J.

    teal

  • K.

    turquoise

  • L.

    burgundy

  • M.

    silver

  • N.

    gold

  • O.

    black

  • P.

    grey

  • Q.

    purple

  • R.

    pink

How subjects answered

  • Llama-3.1-8B-Instruct (10-shot ICL) correct

    (o)

  • Llama-3.1-8B-Instruct (10-shot ICL, majority vote) correct

    (o)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) incorrect

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote) incorrect

    (b)

  • Llama-3.1-8B-Instruct + Shared E2E direct I/O (10-shot) incorrect

    (b)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) incorrect

    (b)

Item 7 (40% solve rate)

The following translations from German to English contain a particular error. That error will be one of the following types: Named Entities: An entity (names, places, locations, etc.) is changed to a different entity. Numerical Values: Numerical values (ordinals or cardinals), dates, and/or units are changed. Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are changed. Negation or Antonyms: Introduce or remove a negation or change comparatives to their antonyms. Facts: Trivial factual errors not pertaining to the above classes are introduced in the translations. Dropped Content: A significant clause in the translation is removed. Please identify that error. Source: Paul Böhringer war ein Schweizer Theologe. Translation: Paul Böhringer was a Swiss mathematician. The translation contains an error pertaining to Options:

  • A.

    Modifiers or Adjectives

  • B.

    Numerical Values

  • C.

    Negation or Antonyms

  • D.

    Named Entities

  • E.

    Dropped Content

  • F.

    Facts

How subjects answered

  • Llama-3.1-8B-Instruct + TTT (10-shot, loss on last output) correct

    (f)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) correct

    (f)

  • Llama-3.1-8B-Instruct + TTT (10-shot, majority vote) correct

    (f)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle) incorrect

    (d)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote) incorrect

    (d)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) incorrect

    (d)

Item 8 (80% solve rate)

Today, David went to the gas station. Between what times could they have gone? We know that: David woke up at 7am. Michael saw David waiting at the train station from 7am to 12pm. Thomas saw David sitting on a rooftop from 12pm to 2pm. Tiffany saw David getting a coffee at the cafe from 2pm to 3pm. James saw David fixing their computer at the electronic store from 3pm to 5pm. Mark saw David reading at the library from 5pm to 6pm. The gas station was closed after 8pm. Between what times could David have gone to the gas station? Options:

  • A.

    2pm to 3pm

  • B.

    6pm to 8pm

  • C.

    5pm to 6pm

  • D.

    7am to 12pm

How subjects answered

  • Llama-3.1-8B-Instruct + TTT (10-shot, loss on all tokens) correct

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) correct

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle) correct

    (b)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) correct

    (b)

  • Llama-3.1-8B-Instruct (10-shot ICL, majority vote) incorrect

    (d)

  • Llama-3.1-8B-Instruct (10-shot ICL) incorrect

    (d)

Item 9 (95% solve rate)

Which of the following is a humorous edit of this artist or movie name: 'full metal jacket'? Options:

  • A.

    fullmmetal jacket

  • B.

    full mpetal jacket

  • C.

    full metal jactet

  • D.

    dull metal jacket

How subjects answered

  • Llama-3.1-8B-Instruct (10-shot ICL) correct

    (d)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) correct

    (d)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle) correct

    (d)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote) correct

    (d)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) correct

    (d)

  • Llama-3.1-8B-Instruct (10-shot ICL, majority vote) incorrect

    (d)

Item 10 (100% solve rate)

In the following sentences, explain the antecedent of the pronoun (which thing the pronoun refers to), or state that it is ambiguous. Sentence: The nurse was running from the laborer because she smelled awful. Options:

  • A.

    The nurse smelled awful

  • B.

    The laborer smelled awful

  • C.

    Ambiguous

How subjects answered

  • Llama-3.1-8B-Instruct (zero-shot) correct

    (c)

Item 11 (100% solve rate)

Today is Sep 9, 1909. What is the date one year ago from today in MM/DD/YYYY? Options:

  • A.

    09/09/1920

  • B.

    09/09/1908

  • C.

    07/05/1908

  • D.

    09/09/1870

  • E.

    09/04/1908

  • F.

    09/13/1908

How subjects answered

  • Llama-3.1-8B-Instruct (10-shot ICL) correct

    (b)

  • Llama-3.1-8B-Instruct (10-shot ICL, majority vote) correct

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, main) correct

    (b)

  • Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote) correct

    (b)

  • Llama-3.1-8B-Instruct + Shared E2E direct I/O (10-shot) correct

    (b)

  • Llama-3.1-8B-Instruct + Shared TTT (10-shot) correct

    (b)

Item 12 (100% solve rate, answer: 7)

I have a fridge, an oven, a car, a toaster, a microwave, a table, and a bed. How many objects do I have?

How subjects answered

  • Llama-3.1-8B-Instruct (zero-shot) correct

    7

Subjects

  1. Llama-3.1-8B-Instruct + TTT (10-shot, main): 0.5822
  2. Llama-3.1-8B-Instruct + TTT (10-shot, majority vote): 0.5822
  3. Llama-3.1-8B-Instruct + Shared TTT (10-shot): 0.5763
  4. Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle): 0.5644
  5. Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote): 0.557
  6. Llama-3.1-8B-Instruct + TTT (10-shot, loss on all tokens): 0.5467
  7. Llama-3.1-8B-Instruct (10-shot ICL): 0.5274
  8. Llama-3.1-8B-Instruct + Shared E2E direct I/O (10-shot): 0.5244
  9. Llama-3.1-8B-Instruct + TTT (10-shot, loss on last output): 0.5215
  10. Llama-3.1-8B-Instruct (10-shot ICL, majority vote): 0.5081
  11. Llama-3.1-8B-Instruct (zero-shot): 0.4267
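
To read the ranking as differences rather than absolute scores, each score can be compared against the zero-shot baseline. The sketch below copies the listed scores verbatim and assumes they are comparable accuracy-like values; the page does not say which items each score covers, and with only 50% of cells observed the subjects may have been scored on different item subsets, so the gaps are indicative only.

```python
# Scores copied from the Subjects list; interpreting them as accuracies
# over each subject's observed items is an assumption.
scores = {
    "Llama-3.1-8B-Instruct + TTT (10-shot, main)": 0.5822,
    "Llama-3.1-8B-Instruct + TTT (10-shot, majority vote)": 0.5822,
    "Llama-3.1-8B-Instruct + Shared TTT (10-shot)": 0.5763,
    "Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle)": 0.5644,
    "Llama-3.1-8B-Instruct + TTT (10-shot, no shuffle, majority vote)": 0.557,
    "Llama-3.1-8B-Instruct + TTT (10-shot, loss on all tokens)": 0.5467,
    "Llama-3.1-8B-Instruct (10-shot ICL)": 0.5274,
    "Llama-3.1-8B-Instruct + Shared E2E direct I/O (10-shot)": 0.5244,
    "Llama-3.1-8B-Instruct + TTT (10-shot, loss on last output)": 0.5215,
    "Llama-3.1-8B-Instruct (10-shot ICL, majority vote)": 0.5081,
    "Llama-3.1-8B-Instruct (zero-shot)": 0.4267,
}

baseline = scores["Llama-3.1-8B-Instruct (zero-shot)"]

# Gap to the zero-shot baseline, rounded to the precision of the listing.
gains = {name: round(score - baseline, 4) for name, score in scores.items()}

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gain:+.4f}  {name}")
```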