
Multimodal

EgoSchema

EgoSchema is a diagnostic long-form egocentric video question-answering benchmark derived from Ego4D, with 5-way multiple-choice questions about ~3-minute first-person videos. This build ingests real per-(model, item) predictions over the public 500-question labeled subset from the released LLoVi runs (CeeZh/LLoVi output.zip): 11 LLoVi configurations (LLM backbone + caption source + prompt strategy), each evaluated on the same 500 items. The response is per-item correctness: 1 if the model's choice matches the gold answer, 0 otherwise.
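As a rough illustration of the per-item correctness rule, the sketch below extracts a leading choice letter from a raw model response and compares it with the gold letter. The names `extract_choice` and `score_item` and the regular expression are assumptions made for this example; the parsing actually used for this build is not described here and may differ, for instance for responses that do not begin with a letter.

```python
import re
from typing import Optional

# Hypothetical pattern: a leading letter A-E followed by '.', ':' or ')',
# or standing alone. Matches "B", "B: ...", and "C. ...".
_CHOICE_RE = re.compile(r"^\s*([A-E])(?:[.:)]|\s*$)")


def extract_choice(raw_response: str) -> Optional[str]:
    """Return the leading choice letter, or None if none is found."""
    match = _CHOICE_RE.match(raw_response)
    return match.group(1) if match else None


def score_item(raw_response: str, gold: str) -> int:
    """Return 1 if the extracted choice equals the gold letter, else 0."""
    return int(extract_choice(raw_response) == gold)
```

Under these assumptions, a response such as "C is making a stir-fry." yields no choice, because the leading "C" is not followed by punctuation; such cases would need separate handling.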

500 items
11 subjects
100% observed
License: MIT
Domains: video understanding, general
Modalities: video, text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 11 subjects × 500 items, 100% of cells evaluated.
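One simple way to obtain such an ordering is to sort rows by mean accuracy and columns by solve rate, both descending, so that the densest block of correct cells ends up in the top-left corner. The sketch below assumes this is the rule; the page does not state the exact sorting used, and `order_matrix` is a hypothetical helper.

```python
from typing import List, Tuple


def order_matrix(
    matrix: List[List[int]],
) -> Tuple[List[int], List[int], List[List[int]]]:
    """Sort rows and columns of a 0/1 matrix by descending mean.

    Returns the row order, the column order, and the reordered matrix.
    Ties keep their original relative order because sorted() is stable.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [
        sum(matrix[r][c] for r in range(n_rows)) / n_rows
        for c in range(n_cols)
    ]
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    reordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return row_order, col_order, reordered
```

For this benchmark the input would be an 11 × 500 matrix with one row per LLoVi configuration.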


[Figure: EgoSchema response matrix, AI models (rows) against items (columns). Cell values: 1 = correct · 0 = incorrect · unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
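The solve rate of an item is presumably the share of the 11 subjects that answered it correctly, shown as a whole percentage. A minimal sketch under that assumption follows; `solve_rate_percent` is a hypothetical name, and rounding to the nearest integer is inferred from the displayed values rather than documented.

```python
def solve_rate_percent(n_correct: int, n_subjects: int = 11) -> int:
    """Share of subjects answering an item correctly, as a whole percent."""
    if not 0 <= n_correct <= n_subjects:
        raise ValueError("n_correct must lie between 0 and n_subjects")
    return round(100 * n_correct / n_subjects)


# With 11 subjects only twelve distinct solve rates are possible.
possible_rates = [solve_rate_percent(k) for k in range(12)]
# possible_rates == [0, 9, 18, 27, 36, 45, 55, 64, 73, 82, 91, 100]
```

Every solve rate shown for the sample items is one of these twelve values.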

Item 1 · 0% solve rate

Question: Identify the primary tools and ingredients c used throughout the video, and discuss how they contributed to the overall dessert preparation process.

  • A.

    The primary tools and ingredients c used throughout the video were a spatula, a strainer, a dessert bowl, cocoa powder, and tissue paper. the spatula was used to spread the batter in the bowl. the strainer was used to sieve the cocoa powder into the bowl. the dessert bowl was used to hold the batter. the cocoa powder was used to add flavor and color to the batter. the tissue paper was used to wipe the edges of the bowl.

  • B.

    The primary tools and ingredients c used throughout the video were a mixing bowl, a whisk, a baking dish, flour, sugar, eggs, and butter. the mixing bowl was used to mix the batter. the whisk was used to beat the eggs. the baking dish was used to bake the cake. the flour was used to add structure to the cake. the sugar was used to add sweetness to the cake. the eggs were used to add moisture to the cake. the butter was used to add flavor and richness to the cake.

  • C.

    The primary tools and ingredients c used throughout the video were a muffin tin, a spoon, chocolate chips, flour, sugar, eggs, and butter. the muffin tin was used to bake the muffins. the spoon was used to scoop the batter into the muffin tin. the chocolate chips were used to add flavor and color to the muffins. the flour was used to add structure to the muffins. the sugar was used to add sweetness to the muffins. the eggs were used to add moisture to the muffins. the butter was used to add flavor and richness to the muffins.

  • D.

    The primary tools and ingredients c used throughout the video were a cake pan, a spatula, chocolate frosting, flour, sugar, eggs, and butter. the cake pan was used to bake the cake. the spatula was used to spread the frosting on the cake. the chocolate frosting was used to add flavor and color to the cake. the flour was used to add structure to the cake. the sugar was used to add sweetness to the cake. the eggs were used to add moisture to the cake. the butter was used to add flavor and richness to the cake.

  • E.

    The primary tools and ingredients c used throughout the video were a mixing bowl, a whisk, a baking sheet, flour, sugar, eggs, and butter. the mixing bowl was used to mix the batter. the whisk was used to beat the eggs. the baking sheet was used to bake the cookies. the flour was used to add structure to the cookies. the sugar was used to add sweetness to the cookies. the eggs were used to add moisture to the cookies. the butter was used to add flavor and richness to the cookies.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) incorrect

    B

  • LLoVi (GPT-3.5-turbo, few-shot) incorrect

    B: The primary tools and ingredients c used throughout the video were a mixing bowl, a whisk, a baking dish, flour, sugar, eggs, and butter. the mixing bowl was used to mix the batter. the whisk was used to beat the eggs. the baking dish was used to bake the cake. the flour was used to add structure to the cake. the sugar was used to add sweetness to the cake. the eggs were used to add moisture to the cake. the butter was used to add flavor and richness to the cake.

  • LLoVi (GPT-3.5-turbo, few-shot + summary) incorrect

    C. The primary tools and ingredients C used throughout the video were a mixing bowl, a whisk, a baking dish, flour, sugar, eggs, and butter. The mixing bowl was used to mix the batter, the whisk was used to beat the eggs, the baking dish was used to bake the dessert, the flour was used to add structure, the sugar was used to add sweetness, the eggs were used to add moisture, and the butter was used to add flavor and richness to the dessert.

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) incorrect

    B

  • LLoVi (GPT-4, standard QA) incorrect

    B

  • LLoVi (GPT-4-1106, standard QA) incorrect

    C

Item 2 · 9% solve rate

Question: From a high-level perspective, how would you describe the interaction or relationship between c and the lady, and their respective roles in the process depicted in the video?

  • A.

    C is essentially the individual who supplies the dough paste; meanwhile, the lady acts as the person carefully wrapping the nylon paper around the freshly molten dough mixture.

  • B.

    C and the lady, working together, are both jointly responsible for carefully wrapping the nylon paper around the hot, molten dough.

  • C.

    Both c and the lady share equal responsibility for supplying the dough paste required.

  • D.

    C and the lady are not related to each other.

  • E.

    C is the person who wraps the nylon paper around the molten dough, while the lady is the person who provides the dough paste.

How subjects answered

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    E. C is the person who wraps the nylon paper around the molten dough, while the lady is the person who provides the dough paste.

  • LLoVi (GPT-3.5-turbo, chain-of-thought) incorrect

    B

  • LLoVi (GPT-3.5-turbo, few-shot + summary) incorrect

    D: C and the lady are not related to each other.

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) incorrect

    A

  • LLoVi (GPT-4, standard QA) incorrect

    D

  • LLoVi (GPT-4-1106, standard QA) incorrect

    D

Item 3 · 18% solve rate

Question: Based on the actions taken by c in this video, how would you concisely describe c's overall goal and the key tools used during the process?

  • A.

    The primary objective of c's overall goal is to precisely cut the wood pieces. the essential key tools used during this woodworking process are the saw and the chisel.

  • B.

    C's overall goal is to smooth and refine the wood pieces. the key tools used during the process are the hand planer and the sander.

  • C.

    C's overall primary goal is to efficiently assemble the wood pieces together. the essential key tools utilized during this assembly process are the hammer and nails.

  • D.

    C's overall goal is to paint the wood pieces. the key tools used during the process are the brush and the paint.

  • E.

    The primary objective of c's overall goal is to effectively varnish the wooden pieces. the crucial tools employed during this process are the brush and the varnish itself.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) correct

    B

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) correct

    B

  • LLoVi (GPT-3.5-turbo, few-shot) incorrect

    A: The primary objective of c's overall goal is to precisely cut the wood pieces. the essential key tools used during this woodworking process are the saw and the chisel.

  • LLoVi (GPT-3.5-turbo-1106, standard QA) incorrect

    C: C's overall primary goal is to efficiently assemble the wood pieces together. the essential key tools utilized during this assembly process are the hammer and nails.

  • LLoVi (GPT-4, standard QA) incorrect

    C

  • LLoVi (GPT-4-1106, standard QA) incorrect

    C

Item 4 · 18% solve rate

Question: What is the primary activity c is engaged in throughout the video, and how do his actions with various tools contribute to that task?

  • A.

    C is engaged in measuring various objects in the video. he uses a foldable ruler to measure the length, width, and height of several objects, including a piece of wood, a plank of wood, and a piece of trim.

  • B.

    Currently, c is actively engaged in the process of building a piece of furniture meticulously.

  • C.

    C is currently engaged in the process of repairing a damaged piece of furniture meticulously.

  • D.

    C is engaged in cleaning a piece of furniture.

  • E.

    Currently, c is actively engaged in organizing and assembling a specific piece of furniture.

How subjects answered

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    A: C is engaged in measuring various objects in the video. he uses a foldable ruler to measure the length, width, and height of several objects, including a piece of wood, a plank of wood, and a piece of trim.

  • LLoVi (GPT-3.5-turbo-1106, standard QA) correct

    A

  • LLoVi (GPT-3.5-turbo, chain-of-thought) incorrect

    B

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) incorrect

    B

  • LLoVi (GPT-4, standard QA) incorrect

    B

  • LLoVi (GPT-4-1106, standard QA) incorrect

    B

Item 5 · 36% solve rate

Question: In the context of the video, identify the main objective that c is working towards and explain how their interaction with various objects helps achieve that goal. please compress the information and avoid listing each action.

  • A.

    C is trying to set up a tent.

  • B.

    Currently, c is diligently attempting to repair a damaged, broken table.

  • C.

    Currently, c is actively attempting to construct a warm fire.

  • D.

    C is trying to cook a meal.

  • E.

    C is diligently attempting to clean up and organize the campsite area.

How subjects answered

  • LLoVi (GPT-3.5-turbo, standard QA) correct

    A

  • LLoVi (GPT-3.5-turbo-1106, QA + summary) correct

    A

  • LLoVi (GPT-3.5-turbo, QA + summary) correct

    A

  • LLoVi (GPT-3.5-turbo-1106, standard QA) incorrect

    C: Currently, C is actively attempting to construct a warm fire. C's interactions with various objects, such as looking at the roof, adjusting the car boot, holding the staircase, and lifting the metal, indicate efforts towards creating a warm fire for the campsite.

  • LLoVi (GPT-4, standard QA) incorrect

    B

  • LLoVi (GPT-4-1106, standard QA) incorrect

    B

Item 6 · 45% solve rate

Question: What is the overarching goal of c's actions throughout the video?

  • A.

    To adequately water and nourish the beautiful flowers.

  • B.

    To plant the flowers in the flower pots.

  • C.

    To effectively fertilize and nurture the growth of the flowers.

  • D.

    To prune the flowers.

  • E.

    Gently proceed to remove the dead leaves from the blooming flowers to maintain their appearance.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) correct

    B

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    B: To plant the flowers in the flower pots.

  • LLoVi (GPT-3.5-turbo, few-shot + summary) correct

    B: To plant the flowers in the flower pots.

  • LLoVi (GPT-3.5-turbo-1106, QA + summary) incorrect

    C: To effectively fertilize and nurture the growth of the flowers.

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) incorrect

    C

  • LLoVi (GPT-3.5-turbo-1106, standard QA) incorrect

    C

Item 7 · 64% solve rate

Question: Describe the primary goal or objective of the game being played, keeping the main actions in mind.

  • A.

    The primary goal or objective of the game being played is to have fun. the participants do this by playing the game and interacting with each other in a way that is enjoyable for them.

  • B.

    The primary goal or objective of the game being played is to win by getting the best hand of cards. the participants do this by carefully manipulating the cards and dice in order to try to get the best possible hand.

  • C.

    The primary goal or objective of the game being played is fundamentally to learn new skills effectively. the eager participants accomplish this by actively playing the game and diligently paying attention to how the game's mechanics are played.

  • D.

    The primary goal or objective of the game being played is to help players relax. the participants achieve this by engaging in the game and taking their mind off of their worries and troubles.

  • E.

    The primary goal or objective of the game being played involves fostering socialization. the participants achieve this by actively playing the game and interacting with each other within a friendly social setting.

How subjects answered

  • LLoVi (GPT-3.5-turbo, few-shot + summary) correct

    B: The primary goal or objective of the game being played is to win by getting the best hand of cards. The participants do this by carefully manipulating the cards and dice in order to try to get the best possible hand.

  • LLoVi (GPT-3.5-turbo, QA + summary) correct

    B

  • LLoVi (LLaMA backbone) correct

    [INST]<<SYS>> I want you to act as an expert in video analysis. I will provide video descriptions and you will provide a single-letter answer (A, B, C, D, E) to the following multiple-choice question about the video. Your answer must be one of the letters (A, B, C, D, or E). The video is 1.0 FPS and the descriptions are the captions from all frames.

    <</SYS>>

    Here are the descriptions: #C C picks a card. #O a lady X picks the card.. #C C plays card. #O a woman A picks a peg.. #C C picks a card. #O a lady X picks a token. #O a woman A plays the game.. #O A woman A places down a card.. #O a man X places the tile on the table. #O A woman X picks the token . #C C plays the cards.. #O a lady …

  • LLoVi (GPT-3.5-turbo, chain-of-thought) incorrect

    E

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) incorrect

    E

  • LLoVi (GPT-4-1106, standard QA) incorrect

    E

Item 8 · 73% solve rate

Question: What is the primary method c uses for preparing the leaves, and why might they need to complete this process multiple times?

  • A.

    C uses a knife to cut the leaves, and then wipes them with a paper towel.

  • B.

    C uses scissors to cut the leaves, and then wipes them with a cloth.

  • C.

    C uses a blender to cut the leaves, and then wipes them with a sponge.

  • D.

    C uses a food processor to cut the leaves, and then wipes them with a hand towel.

  • E.

    C uses a pair of chopsticks to cut the leaves, and then wipes them with a napkin.

How subjects answered

  • LLoVi (GPT-3.5-turbo, few-shot + summary) correct

    B: C uses scissors to cut the leaves, and then wipes them with a cloth.

    C might need to complete this process multiple times to ensure that the leaves are properly trimmed and prepared for their intended use or arrangement. This could involve removing any damaged or unwanted parts of the leaves, ensuring they are the desired size or shape, or preparing them for use in a specific arrangement or display.

  • LLoVi (GPT-3.5-turbo, QA + summary) correct

    B

  • LLoVi (LLaMA backbone) correct

    [INST]<<SYS>> I want you to act as an expert in video analysis. I will provide video descriptions and you will provide a single-letter answer (A, B, C, D, E) to the following multiple-choice question about the video. Your answer must be one of the letters (A, B, C, D, or E). The video is 1.0 FPS and the descriptions are the captions from all frames.

    <</SYS>>

    Here are the descriptions: #C C holds the flower. #C C puts the flowers in the bucket. #C C puts the flower in the plastic lid. #C C pours # unsure in the water . #C C picks a flower. #C C picks up a decorative paper . #C C places the white blanket in the bag. #C C picks the wooden board. #c c removes the wood. #C C picks the wood on …

  • LLoVi (GPT-3.5-turbo, chain-of-thought) incorrect

    D

  • LLoVi (GPT-3.5-turbo, few-shot) incorrect

    A: C uses a knife to cut the leaves, and then wipes them with a paper towel.

  • LLoVi (GPT-3.5-turbo-1106, QA + summary) incorrect

    A

Item 9 · 82% solve rate

Question: Please provide a succinct summary of the primary objective and the key actions c performs in this video.

  • A.

    C is preparing a salad.

  • B.

    Currently, c is in the process of making a delicious sandwich.

  • C.

    Currently, c is in the process of skillfully making a delicious smoothie.

  • D.

    C is making a stir-fry.

  • E.

    In the kitchen, c is currently preparing and making a delicious soup.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) correct

    A

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    A: C is preparing a salad.

  • LLoVi (LLaMA backbone) correct

    [INST]<<SYS>> I want you to act as an expert in video analysis. I will provide video descriptions and you will provide a single-letter answer (A, B, C, D, E) to the following multiple-choice question about the video. Your answer must be one of the letters (A, B, C, D, or E). The video is 1.0 FPS and the descriptions are the captions from all frames.

    <</SYS>>

    Here are the descriptions: #C C walks around the room. #C C moves around the kitchen. #C C opens the cabinet. #C C opens a cabinet. #C C picks a tin . #C C opens the fridge. #C C pushes the shelf. #C C picks a bunch of cucumber. #C C puts the cabbage in the fridge. #C C picks a bottle of lemon juice. #C C moves to the kitchen. #C C …

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) correct

    A

  • LLoVi (GPT-3.5-turbo, few-shot + summary) incorrect

    D: C is making a stir-fry.

  • LLoVi (GPT-3.5-turbo-1106, QA + summary) incorrect

    C: Currently, #C is in the process of skillfully making a delicious smoothie.

Item 10 · 91% solve rate

Question: What was the primary tool used by c in the video, and how did c utilize this tool in order to create a desired outcome?

  • A.

    The primary tool used by c in the video is a rake.

  • B.

    The primary tool used by c in the video is a shovel.

  • C.

    In the video, the primary tool prominently utilized by character c is a gardening instrument known as a hoe.

  • D.

    In the video, the primary tool predominantly utilized by c is, interestingly, a pickaxe.

  • E.

    In the video, the primary tool utilized by individual c is indeed a wheelbarrow for transporting materials.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) correct

    A

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    A: The primary tool used by c in the video is a rake.

  • LLoVi (GPT-3.5-turbo, few-shot + summary) correct

    A: The primary tool used by c in the video is a rake.

  • LLoVi (GPT-4, standard QA) correct

    A

  • LLoVi (GPT-4-1106, standard QA) correct

    A

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) incorrect

    B

Item 11 · 100% solve rate

Question: By analyzing c's activities in this video, determine their overall focus or intention, and describe the progression of tasks leading to its fulfillment.

  • A.

    C's overall focus or intention in this video is to set up their work station and start working on their computer.

  • B.

    The central objective in this video, c's overall focus or intention, is to engage in conversation with the person.

  • C.

    C's overall focus or intention in this video is to eat the wrapper.

  • D.

    C's overall focus or primary intention in this specific video content is to effectively play the guitar.

  • E.

    The primary objective and intention in this particular c's video is to demonstrate how to effectively clean the cloth.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) correct

    A

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    A: C's overall focus or intention in this video is to set up their work station and start working on their computer.

  • LLoVi (GPT-3.5-turbo, few-shot + summary) correct

    A: C's overall focus or intention in this video is to set up their work station and start working on their computer.

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) correct

    A

  • LLoVi (GPT-4, standard QA) correct

    A

  • LLoVi (GPT-4-1106, standard QA) correct

    A

Item 12 · 100% solve rate

Question: Explain how c's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task. what events lead to this switch, and how did c manage his belongings throughout this process?

  • A.

    C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. he then put the tools away and picked up his jacket, and then he went to the bathroom.

  • B.

    C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. he then put the tools away and picked up his jacket, and then he went to the kitchen to make a sandwich.

  • C.

    C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. he then put the tools away and picked up his jacket.

  • D.

    C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. he then put the tools away and picked up his jacket, and then he went to the living room to watch tv.

  • E.

    C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. he then put the tools away and picked up his jacket, and then he went to the bedroom to take a nap.

How subjects answered

  • LLoVi (GPT-3.5-turbo, chain-of-thought) correct

    C

  • LLoVi (GPT-3.5-turbo, few-shot) correct

    C. C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. He then put the tools away and picked up his jacket.

  • LLoVi (GPT-3.5-turbo, few-shot + summary) correct

    C: C's actions transitioned from working with the wood and wooden shelf to handling his jacket and preparing his next task when he finished fixing the wood to the wooden shelf. He then put the tools away and picked up his jacket. Throughout this process, C managed his belongings by placing them on shelves, on the floor, or in pockets.

  • LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA) correct

    C

  • LLoVi (GPT-4, standard QA) correct

    C

  • LLoVi (GPT-4-1106, standard QA) correct

    C

Subjects

The models, agents, and reward models evaluated.

11 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. LLoVi (GPT-4-1106, standard QA): 0.612
  2. LLoVi (GPT-3.5-turbo, few-shot + summary): 0.602
  3. LLoVi (GPT-4, standard QA): 0.59
  4. LLoVi (GPT-3.5-turbo-1106, QA + summary): 0.588
  5. LLoVi (GPT-3.5-turbo, few-shot): 0.576
  6. LLoVi (GPT-3.5-turbo, QA + summary): 0.574
  7. LLoVi (LLaMA backbone): 0.554
  8. LLoVi (GPT-3.5-turbo, chain-of-thought): 0.552
  9. LLoVi (GPT-3.5-turbo-1106, standard QA): 0.552
  10. LLoVi (GPT-3.5-turbo, standard QA): 0.512
  11. LLoVi (GPT-3.5-turbo, BLIP-2 captions, standard QA): 0.474