Multimodal

Vript-Hard

Vript-Hard: video-understanding challenge tasks on densely-captioned clips. Vript-RR asks scene-specific multiple-choice questions given a hint; Vript-ERO asks for the correct temporal order of three scenes. Responses are per-(video-LLM, item) graded outcomes reproduced from the upstream verification pipeline.
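
To make "graded outcome" concrete for the two task types, the sketch below maps a free-text response to a 0/1 score. It is a simplified stand-in, not the upstream verification pipeline: the function names (`extract_order`, `extract_choice`, `grade`) and the regular expressions are assumptions made for this example, and the reference answer `"B"` in the second call is assumed for illustration.

```python
import re
from typing import Optional


def extract_order(response: str) -> Optional[str]:
    """Return the first three 'Scene (X)' mentions as a string such as 'BAC'.

    Hypothetical helper: returns None unless those three mentions are a
    permutation of A, B, C.
    """
    letters = re.findall(r"Scene \(([ABC])\)", response)[:3]
    return "".join(letters) if sorted(letters) == ["A", "B", "C"] else None


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter from a response that starts with '(A)', '(B)', etc."""
    match = re.match(r"\s*\(([A-D])\)", response)
    return match.group(1) if match else None


def grade(predicted: Optional[str], answer: str) -> int:
    """1 = correct, 0 = incorrect; an unparseable response counts as incorrect."""
    return int(predicted == answer)


# Vript-ERO style: the response orders the scenes B, A, C; the reference is CAB.
ero_response = "Scene (B) -> Scene (A) -> Scene (C)\n\nIn Scene (B), the person mixes ingredients."
print(grade(extract_order(ero_response), "CAB"))  # 0

# Vript-RR style: the response leads with an option letter.
rr_response = "(B) Pumping the filter."
print(grade(extract_choice(rr_response), "B"))  # 1
```

A real grader would need to handle responses that state the order in other ways (for example "C, then B, and lastly A"), which this sketch deliberately ignores.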

286 items
3 subjects
51% observed
Apache-2.0 license
general domain
video modality
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 3 subjects × 286 items, 51% of cells evaluated.
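
The ordering described above can be sketched on a toy matrix. This is one plausible way to do it, assuming rows are sorted by mean accuracy and columns by solve rate, both computed over observed cells only; the data and the names `mean_observed`, `order_matrix`, and `observed_fraction` are hypothetical and not taken from the site's implementation.

```python
from typing import Optional

# Toy response matrix (hypothetical data): 1 = correct, 0 = incorrect,
# None = unobserved. Keys are subjects, list positions are items.
matrix: dict[str, list[Optional[int]]] = {
    "model_a": [0, None, 1, 0],
    "model_b": [1, 1, 1, None],
    "model_c": [None, 0, 1, 0],
}


def mean_observed(cells: list[Optional[int]]) -> float:
    """Mean over observed cells; -1.0 if nothing was observed, so it sorts last."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else -1.0


def order_matrix(m: dict[str, list[Optional[int]]]) -> tuple[list[str], list[int]]:
    """Return (row_order, col_order): strongest subjects and easiest items first."""
    n_items = len(next(iter(m.values())))
    row_order = sorted(m, key=lambda s: -mean_observed(m[s]))
    col_order = sorted(
        range(n_items),
        key=lambda j: -mean_observed([m[s][j] for s in m]),
    )
    return row_order, col_order


def observed_fraction(m: dict[str, list[Optional[int]]]) -> float:
    """Share of (subject, item) cells that carry a graded outcome."""
    cells = [c for row in m.values() for c in row]
    return sum(c is not None for c in cells) / len(cells)


rows, cols = order_matrix(matrix)
print(rows)                       # ['model_b', 'model_a', 'model_c']
print(cols)                       # [2, 0, 1, 3]
print(observed_fraction(matrix))  # 0.75
```

Ties keep their original order because Python's sort is stable. Averaging over observed cells only matters here: with 51% of cells evaluated, treating unobserved cells as incorrect would penalize subjects that were simply not run on an item.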

[Figure: Vript-Hard response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · blank = unobserved]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: CAB

Below are the descriptions of three scenes in the video: [Scene (A)] The scene starts with a close-up of a person's hands pouring white glue into a blue measuring cup, indicating the beginning of a DIY slime-making process. The environment looks like a clean, bright kitchen with marble countertops. After the glue, shaving cream is added into an orange measuring cup; the texture is thick and foamy. The person's nails are painted in alternating green and blue colors, showing attention to detail and an interest in color. Following this, pink food coloring is held between fingers and squirted into the mixture; however, we do not see the actual mixing of food coloring with the other ingredients within these frames. The food coloring bottle is small and round with a dark label, indicating it might be a gel-based dye. The overall lighting is bright and evenly lit, making all details clearly visible.

[Scene (B)] In a bright kitchen setting with marble countertops, a person's hands are shown mixing ingredients in an orange mixing bowl. Their nails are painted teal. They add cornstarch to a white semi-liquid substance to thicken it into slime. Following this, they carefully drop two drops of pink food coloring into the mixture. The color starts to blend with the white base as they stir with a yellow spatula, resulting in a light pink hue. The person is off-camera; only their hands and the bowl are visible. The voice-over expresses excitement about the resulting color, comparing it to a 'bubblegum pink rug', suggesting a vibrant and playful tone for the activity of making slime.

[Scene (C)] The video features close-up shots of a person's hands working on a textured mat covered in pink pom poms. The mat is densely packed with these soft decorations, giving it a plush, uneven surface. The pom poms are a bubblegum pink color, which the voice-over humorously compares to chewed bubblegum, albeit not in an unappealing way. In the first two frames, the hands are seen pressing and running over the pom poms, showcasing their texture and how they fill the mat. The third frame transitions to a medium shot of the person at a crafting station, where more pom poms and DIY materials are visible. The background is vibrant with a warm orange color, creating a cheerful and creative atmosphere. The person's nails are painted in a bright turquoise shade, providing a pop of contrasting color against the pink mat.

What is the correct temporal order of the Scenes (A), (B), (C) based on the video?

How subjects answered

  • Claude-Opus incorrect

    Based on the order of the scenes shown in the video, the correct temporal sequence is:

    Scene (B) -> Scene (A) -> Scene (C)

    In Scene (B), the person mixes ingredients including cornstarch and pink food coloring to create a light pink slime base.

    This is followed by Scene (A), where the person adds white glue and shaving cream to the mixture, further developing the slime. The pink food coloring is added last in this scene.

    Finally, in Scene (C), the finished pink slime is shown being pressed onto a mat covered in matching pink pom poms, showcasing the end result and texture of the DIY craft project.

    So in summary, the slime making process starts in Scene (B), continues in Scene (A), and …

Item 2 · 0% solve rate

Hint: In the video, there is mention of a dimly lit stage with stage lights that creates a dramatic atmosphere appropriate for a talent competition. Characters are wearing casual and stylish clothing with variations of blues, blacks, and whites. Celebratory gestures like raised hands can be observed. Question: What is event that they are experiencing? Options:

  • A.

    The band wins the championship of the show.

  • B.

    The band has a new member to be added in the show.

  • C.

    The band moves to the next stage in the show.

  • D.

    The band has released a new album in the show.

How subjects answered

  • VideoChat incorrect

    (A) The band wins the championship of the show.

  • VideoChat2 incorrect

    (A) The band wins the championship of the show.

Item 3 · 0% solve rate · answer: ABC

Below are the descriptions of three scenes in the video: [Scene (A)] The video begins with a medium shot of a bright yellow sports car in a spacious garage filled with various classic and modern cars. The garage has a high ceiling with string lights and an industrial aesthetic. As the camera moves, the focus shifts to a person walking towards another person standing among the cars. The interior is dimly lit, except for natural light entering from the side and the festive string lights above. We see sports cars of various colors like red and white. Two men are having a casual conversation in a friendly manner. The camera then switches to a selfie-style rotation, showing two individuals from a closer perspective. Their attire is casual; one wears a patterned black hoodie, while the other has a brown leather jacket. The scene conveys an informal gathering of car enthusiasts in a private club-like setting.

[Scene (B)] The video begins with a first-person perspective of someone moving along a city pathway, possibly riding a bike given the speed and height of the camera. The environment is an urban setting with buildings, leafless trees, overcast skies, and a few cars and pedestrians. The scene transitions to an indoor setting with warm lighting, cluttered with various objects such as monitors showing CCTV footage, vintage memorabilia on the walls, a bottle of green liquid, and an orange strap on the table. A person wearing a red and black jacket and camouflage pants is seen from behind and then turns around to interact with something off-camera. Next, there's a close-up of their hands as they hand over a closed laptop with a red sticker on it.

[Scene (C)] The scene takes place at a cluttered table, likely within a workshop or creative space given the informal setting and various items scattered around. A Sony camera is centrally positioned on the table, with its lens facing the viewer. The subject's hands are interacting with the camera, applying what appears to be matches to its surface, possibly as part of a customization or decorative process. The hands are deliberate in their actions, suggesting a level of care or attention to detail. Surrounding the camera are other small objects, including what looks like a battery and possibly other electronic components, hinting at some technical activity taking place. The environment is warmly lit, casting soft shadows and highlighting the colors, such as the red circular design on the table, and the multi-colored hoodie sleeves of the subject.

What is the correct temporal order of the Scenes (A), (B), (C) based on the video?

How subjects answered

  • Claude-Opus incorrect

    Based on my analysis of the video scenes and the provided descriptions, the correct temporal order of the scenes appears to be:

    Scene (B) Scene (C) Scene (A)

    Scene (B) comes first, showing a person moving outdoors and then transitioning to an indoor setting where they hand over a laptop.

    Scene (C) follows, depicting hands working on customizing or decorating a camera on a cluttered table in a workshop-like space.

    Scene (A) comes last, with two men having a friendly conversation in a spacious garage filled with various classic and modern cars, suggesting an informal gathering of car enthusiasts.

Item 4 · 0% solve rate · answer: CAB

Below are the descriptions of three scenes in the video: [Scene (A)] The clip features a close-up shot of a person's hand holding an iPhone. The phone is displaying a live feed or recorded footage from a surveillance camera app named 'Cameras'. The app interface shows the name 'Jackson Khal' at the top, signaling it may be the name of the camera or account. It's 3:50 PM according to the phone, and the battery is almost full. The video feed on the phone displays what appears to be a baby's crib in a moderately lit room, with no baby present at the moment. Two timestamped events are listed below the feed, both indicating '0 events' on 'July 23, 2016'. The environment suggests a domestic setting, likely a parent monitoring their child's room. Colors are neutral with white dominating the crib and walls. No drastic actions or movements occur in these frames.

[Scene (B)] The video features two individuals in a warmly lit hallway with yellow walls, engaging in an activity together. One individual, wearing a gray top and light-colored shorts, is seen mimicking the other smaller individual's movements. Both are on their hands and knees on a carpeted floor. The environment has a domestic feel, likely a home, with an open door at the end of the hallway revealing a room with visible furnishings. A green toy is seen abandoned on the left side of the frame. As they move down the hallway, they approach a louvered door. The light is consistent and ambient, suggesting an indoor setting with artificial lighting. The color palette is warm, dominated by yellows and neutrals. Their interaction seems playful and mimicry-based, suggesting a learning or bonding experience.

[Scene (C)] The video clip seems to take place in a cozy, well-lit living room during the day, given the natural light coming through the windows. A woman's voice narrates her recent activities, mentioning someone named Jackson had lunch while she had hot Cheetos. A packet of Hot Cheetos can be seen resting on the couch, which is white and paired with matching pillows. There is a smartphone lying near the Hot Cheetos. The room appears lived-in and comfortable, with a child's toy visible in the background suggesting a family environment. The walls are painted in a neutral color, complementing the dark wood furniture and lighter upholstery. Overall, the scene depicts a casual, relaxed setting with a homely atmosphere.

What is the correct temporal order of the Scenes (A), (B), (C) based on the video?

How subjects answered

  • Claude-Opus incorrect

    Based on the details shown in the video, the correct temporal order of the scenes is:

    Scene (C), Scene (B), Scene (A)

    In Scene (C), the woman mentions having hot Cheetos while the baby had lunch, suggesting it is midday.

    Scene (B) then shows the playful interaction in the hallway, which likely takes place sometime after lunch.

    Finally, Scene (A) shows the phone surveillance feed of the empty crib room at 3:50 PM, indicating it occurs later in the afternoon after the earlier two scenes.

    So in chronological sequence, the video depicts Scene (C) (lunchtime), followed by Scene (B) (post-lunch playtime), and ends with Scene (A) (late afternoon crib monitoring).

Item 5 · 0% solve rate · answer: BCA

Below are the descriptions of three scenes in the video: [Scene (A)] In an indoor basketball court, presumably at Duke given the logo on the floor, multiple players are engaged in a game. The court is marked with blue and white, reflecting Duke University's colors. Players are dressed in various casual sports attire with no uniformity, suggesting an informal game or practice session. In the foreground, a player in yellow shorts attempts to guard another player who is dribbling a basketball. Other players are scattered around, watching or waiting for their turn. The lighting is bright, likely from overhead gym lights, casting a clear view of the action with minimal shadows. Around the court, some onlookers sit against the wall, observing the game. The general atmosphere is one of casual competition typical of a summer camp setting.

[Scene (B)] The video clip appears to be from a location that showcases memorabilia or trophies, likely related to sports history. The first frame features a white and blue basketball jersey with 'DUKE' and the number '32' prominently displayed, suggesting it belongs to Duke University. The jersey is encased in a glass display alongside other sports garments, indicating an exhibition of athletic achievements. The next two frames reveal a large, ornate silver trophy placed on a wooden pedestal within a similar glass display case. Surrounding this trophy are plaques, photographs, and other items that exude prestige and honor. The environment suggests an indoor setting, likely a hall of fame or sports museum. The lighting is soft and focused on the displays to highlight the items' significance. The colors are mostly neutral with blue accents from the Duke memorabilia.

[Scene (C)] The video depicts a scene inside a well-lit indoor basketball court with visible markings and a hoop. A group of young male players in various casual sports attires, such as shorts and T-shirts in different colors like orange, blue, black, and green, are present. They are gathered around a person wearing a blue shirt and black shorts who seems to be instructing or coaching the players. The person in blue appears to be addressing the team in a semi-circle formation. The walls are padded with blue mats for safety, and 'COURT 1' signage is visible on the wall. The players are attentive, suggesting an ongoing discussion or strategy session. The environment suggests an organized team activity, possibly a practice session or a summer camp basketball event.

What is the correct temporal order of the Scenes (A), (B), (C) based on the video?

How subjects answered

  • Claude-Opus incorrect

    Based on my analysis of the video frames provided, the correct temporal order of the scenes appears to be:

    Scene (C), Scene (A), Scene (B)

    The video starts with Scene (C), showing the man in the blue shirt and hat addressing a group of young basketball players in an indoor court, likely providing instructions or coaching before a practice or game.

    This is followed by Scene (A), which depicts the actual basketball game or practice taking place on the indoor court, with players engaged in casual competition.

    Finally, Scene (B) comes last, showing what seems to be a display of Duke University basketball memorabilia and trophies, possibly visited by the man and boys after their basketball …

Item 6 · 0% solve rate · answer: ACB

Below are the descriptions of three scenes in the video: [Scene (A)] In a dark and eerie room lit with soft blueish tones, presumably during nighttime, we see a close-up of a bed with white and grey bedding. The camera holds steady, capturing the suspenseful atmosphere of a horror setting. The bed occupies most of the frame, and we see a figure pulling the sheets away, revealing no one underneath. Despite the absence of a visible character, the sheets move as if manipulated by an unseen force, hinting at supernatural elements. The environment is typical for a horror film - minimalistic with an unsettling feeling of emptiness and silence. The room seems to be a bedroom with plain walls and little to no decoration, emphasizing the starkness and isolation. Overall, the scene uses visual tension and minimalism to create an atmosphere of anticipation and dread.

[Scene (B)] The video frames depict a melancholic atmosphere, likely reflecting the protagonist's troubled state. In the first frame, the character is seen in a dimly lit room with sunlight peeking through the window, suggesting daytime. He's wearing a plaid shirt and facing away from the camera, creating a sense of distance. The second frame shows him sitting alone at a bar, which is empty except for scattered bottles and glasses, conveying loneliness or contemplation. The bar's darkness contrasts with the daylight visible through the door. In the third frame, we see an outdoor night scene at a high school named 'Bishop Hayes.' Parking lot lights illuminate the area with cars parked, indicating an event or end of day. The environment in all scenes uses muted colors and shadow to evoke a somber mood reflective of the character's life 'fallen to pieces.'

[Scene (C)] The clip appears to be from an animated feature, presumably taking place in a room resembling a wizard's study, filled with books and mystical artifacts. The center of attention is an animated character, who seems to be performing a magical spell. The character has pointy ears, suggesting they could be an elf or similar mythical being. The room is dimly lit, primarily illuminated by the glowing light of the spell being cast. The walls are adorned with shelves carrying various objects that reflect the light from the spell, casting dynamic shadows and creating an atmosphere of wonder and anticipation. The magic emits a radiant purple hue that saturates the surroundings, contrasting with the warm tones of the room. As the camera pans, objects in the room such as books, papers, and items on a table subtly react to the magical energy emanating from the spell.

What is the correct temporal order of the Scenes (A), (B), (C) based on the video?

How subjects answered

  • Claude-Opus incorrect

    Based on the images and descriptions provided, the correct temporal order of the scenes appears to be:

    Scene (B), Scene (C), Scene (A)

    The video starts by introducing the main character, an actor, who mentions being "new at home". This likely corresponds to Scene (B) which depicts the character in a troubled, contemplative state in dimly lit interior settings, suggesting his life has "fallen to pieces".

    Next, an animated clip is shown with magical elements, which matches the description of Scene (C) taking place in a wizard's study with a mythical character casting a glowing spell.

    Finally, the video transitions to a dark, suspenseful bedroom scene with horror elements and an unseen …

Item 7 · 25% solve rate

Hint: In the video, there's a moment where the focus is on the scoreboard displaying the standings of the women's team final at the Commonwealth Games. The leading country is visible along with two other countries following. The scoreboard seems to be a digital display with clear and legible text, providing real-time updates to the audience and athletes in attendance. Question: What is the name of the country that is leading on this scoreboard? Options:

  • A.

    India.

  • B.

    England.

  • C.

    Australia.

  • D.

    Canada.

How subjects answered

  • VideoChat incorrect

    (A) India.

  • VideoChat2 incorrect

    (C) Australia.

Item 8 · 25% solve rate

Hint: In the scene, there is an individual who seems to be in a position of authority, wearing a striped blue-and-white outfit. He is seen gesticulating and displaying a demeanor that suggests he is upset as he wipes the window, possibly reprimanding someone. Question: Can you identify what is the reason of his frustration? Options:

  • A.

    The dissatisfaction with the cleaning job.

  • B.

    The lack of proper equipment for cleaning.

  • C.

    The sunshine reflected from the window flashes his eyes.

  • D.

    The window is locked that he can't clean the inside.

How subjects answered

  • VideoChat incorrect

    (B) The lack of proper equipment for cleaning.

  • VideoChat2 incorrect

    (A) The dissatisfaction with the cleaning job.

Item 9 · 50% solve rate

Hint: In the video, there is a spacious garage that has a high ceiling adorned with string lights and an industrial look. It is dimly lit apart from the natural light entering from the side and the overhead festive string lights. This garage is filled with various cars, both classic and modern. Two individuals are seen engaging in a friendly conversation amidst these vehicles. Question: What is the color of the car with a brand in a 'V' shape? Options:

  • A.

    Black.

  • B.

    White.

  • C.

    Red.

  • D.

    Yellow.

How subjects answered

  • VideoChat correct

    (A) Black. The car with the brand in a 'V' shape is black. It is not clear what type of car it is.

  • VideoChat2 correct

    (C) Red.

Item 10 · 50% solve rate

Hint: In the video, there is a person attending to a component of a boat's engine in a well-lit engine room. The person, wearing blue gloves, focuses on two filters – one green and the other white with a label. The person is seen opening the top of one filter, which shows some liquid and involves an action to remove air that causes a spurt. Bubbles are visible during this process until they subside. Question: What action does the person's left hand perform that results in the spurt of liquid? Options:

  • A.

    Turning the filter clockwise.

  • B.

    Pumping the filter.

  • C.

    Pulling the filter upwards.

  • D.

    Shaking the filter.

How subjects answered

  • VideoChat correct

    (B) Pumping the filter. The person's left hand is pumping the filter, as seen by the action of the arm moving up and down while holding the filter. The spurt of liquid is likely caused by air pockets within the filter that are expelled as the pumping motion continues.

  • VideoChat2 incorrect

    (A) Turning the filter clockwise.

Item 11 · 75% solve rate

Hint: In the video, there is a character who has a distinctive appearance, wearing a suit with a pinstripe pattern that is unique and suggests a formal yet eccentric fashion sense. His hair is untamed and gray. This character is seen making a theatrical hand gesture, with his fingers interlaced. Question: Why this character do this hand gesture in the context of the scene? Options:

  • A.

    He is checking something written on his palm.

  • B.

    He is casting a spell to summon something.

  • C.

    He is blowing a whistle with his fingers to get someone's attention.

  • D.

    He is covering his face in shock because of something he saw.

How subjects answered

  • VideoChat correct

    (A) He is checking something written on his palm. The character is wearing a suit with a pinstripe pattern that is unique and suggests a formal yet eccentric fashion sense. He is holding his palm and making a theatrical hand gesture with his fingers. He is not casting a spell to summon something or blowing a whistle with his fingers. Therefore, option (A) is the best option that accurately addresses the question.

  • VideoChat2 incorrect

    (A) He is checking something written on his palm.

Item 12 · 100% solve rate · answer: CBA

Below are the descriptions of three scenes in the video: [Scene (A)] The clip seems to be a playful reenactment using Teenage Mutant Ninja Turtles action figures. The voice-over indicates a victorious moment followed by sudden confusion as the characters collapse. The figures are depicted with cartoonish proportions and bright, primary colors. They are accessorized with weapons, hinting at their readiness for battle. In the background, there are two larger robotic figures, possibly enemies. The tabletop environment is simple, with a vibrant pink backdrop that contrasts the figures. Lighting appears even, avoiding harsh shadows and highlighting the scene's playful nature.

[Scene (B)] In the brightly colored interior of what appears to be a toy vehicle, we see an action figure of Raphael from the Teenage Mutant Ninja Turtles. Raphael is green with a red bandana and brown belts, holding silver sai weapons. He seems to be standing inside a yellow and green compartment, which resembles the Turtle Van, with circular patterns resembling tires on either side. The environment suggests a playful setting typical of children's action figure playsets. The light is consistent and bright, highlighting the vibrant colors of the toy figures and their surroundings.

[Scene (C)] In a vibrantly colored setting with a blue and pink gradient background, action figures resembling characters from the Teenage Mutant Ninja Turtles universe are involved in a storyline. Two figures are inside a jail cell within a plastic playset resembling a high-tech facility with metallic gray and blue accents. The hand of a person is seen interacting with the scene, placing another action figure into the cell through the black barred door. The lighting is bright and even, casting no shadows and highlighting the colorful environment and the detailed designs on the figures.

What is the correct temporal order of the Scenes (A), (B), (C) based on the video?

How subjects answered

  • Claude-Opus correct

    Based on my analysis of the video, the correct temporal order of the scenes is:

    Scene (C) Scene (B) Scene (A)

    In Scene (C), the Teenage Mutant Ninja Turtles action figures are placed inside a jail cell playset by a person's hand.

    This is followed by Scene (B), which shows the Raphael figure alone inside what appears to be the Turtle Van vehicle.

    Finally, Scene (A) depicts the Turtles figures in a victorious pose before collapsing, with enemy robot figures in the background.

    So the chronological sequence of events in the video is C, then B, and lastly A.

Subjects

The models, agents, and reward models evaluated.

3 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. VideoChat2: 0.4704
  2. VideoChat: 0.2796
  3. Claude-Opus: 0.2313