Agents & Tool Use

EmbodiedBench

MLLMs act as vision-driven embodied agents across four simulated environments; response is per-episode binary task success.

969 items
11 subjects
93% observed
License: MIT
Domains: reasoning, tool_use
Modalities: text, image

Response matrix

EmbodiedBench response matrix: AI models (rows) against items (columns)
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect
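
As a rough illustration of how a matrix like this can be summarized, the sketch below computes a per-item solve rate and the observed fraction from a toy matrix. The toy data, the function names, and the choice to divide by observed cells only (skipping unobserved ones) are assumptions made for illustration; the page does not state how its solve rates are computed.

```python
from typing import Optional, Sequence

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def solve_rate(column: Sequence[Cell]) -> Optional[float]:
    """Fraction of observed subjects that succeeded on one item."""
    observed = [c for c in column if c is not None]
    if not observed:
        return None  # no subject was evaluated on this item
    return sum(observed) / len(observed)


def observed_fraction(matrix: Sequence[Sequence[Cell]]) -> float:
    """Share of cells in the matrix that hold a 0 or a 1."""
    cells = [c for row in matrix for c in row]
    return sum(c is not None for c in cells) / len(cells)


# Toy matrix (hypothetical data): 3 subjects (rows) x 4 items (columns).
toy = [
    [1, 0, None, 0],
    [0, 0, 1, 0],
    [1, None, 1, 0],
]
columns = list(zip(*toy))  # transpose so each entry is one item's column
rates = [solve_rate(col) for col in columns]
```

Under this convention an item that only some subjects attempted is scored against those subjects alone, so its rate need not be a multiple of 1/11.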

Sample items

Item 1 · 0% solve rate

Move a strawberry from the right counter to the right receptacle of the left counter.

How subjects answered

  • Claude 3.5 Sonnet incorrect

    [{"visual_description": "The image shows a kitchen environment with various cabinets and counters. There appears to be a right counter area and a left counter area with storage spaces. The environment has wooden flooring and white cabinets mounted on the walls.", "reasoning_and_reflection": "This is a new task without any previous interaction history. The task requires moving a strawberry from the right counter to a specific part of the left counter. Based on the instruction mentioning the 'right receptacle of the left counter', we'll need to use the left drawer of the kitchen counter as the destination.", "language_plan": "1. Navigate to the right counter in the kitchen where the …

  • Claude 3.7 Sonnet incorrect

    [{"visual_description": "The image shows a hallway or corridor with light wood flooring. On the right side, there appears to be a blue and white object that might be a bottle or container. The space has light-colored walls and what looks like a door or cabinet on the left side. This appears to be a narrow passage in what might be a kitchen or home environment.", "reasoning_and_reflection": "Based on the image, I can see what appears to be a hallway or corridor in a home environment. The instruction asks me to move a strawberry from the right counter to the right receptacle of the left counter. However, I cannot clearly see a strawberry or counters in this particular view. I will need to …

  • Gemini 1.5 Pro incorrect

    [{"visual_description": "I can see a strawberry on the right counter. The right drawer of the left counter is closed but I can navigate there and place the strawberry.", "reasoning_and_reflection": "The task is to move a strawberry from the right counter to the right drawer of the left counter. I'll navigate to the right counter, pick up the strawberry and go to the right drawer of the kitchen counter and place it.", "language_plan": "First navigate to the right counter. Then pick up the strawberry. Finally, navigate to the right drawer of the kitchen counter and place the strawberry there.", "executable_plan": [{"step_id": 1, "img_path": …

  • Ovis2-34B incorrect

    [{"visual_description": "The image shows a hallway with a wooden floor and a door on the left. A robotic arm is visible on the right side, suggesting the robot is in the scene. The environment appears to be a home setting, but the specific location of the strawberry or the right receptacle of the left counter is not visible in this image.", "reasoning_and_reflection": "Based on the image, the robot is in a hallway, and the strawberry is not visible. To complete the task, the robot must first navigate to the right counter to locate the strawberry. Once the strawberry is found, the robot should pick it up and then navigate to the left counter to place it in the right receptacle. The plan …

  • Qwen-VL-Max incorrect

    [{"visual_description": "The image shows a room with a wooden floor, a closed door, and a cabinet. There is no visible strawberry or right counter in the current view. The robot appears to be near a doorway.", "reasoning_and_reflection": "Since the strawberry and the right counter are not visible in the current visual state, the first step should be to navigate to the right counter where the strawberry might be located. Once there, the robot can pick up the strawberry if it's present. After picking up the strawberry, the robot needs to move to the left counter and place the strawberry in its right receptacle. If the strawberry is not found at the right counter, the robot may need to explore …

  • Qwen2.5-VL-72B-Instruct incorrect

    [{"visual_description": "The image shows a room with a wooden floor, a closed door, and a cabinet. There is no visible strawberry or right counter in the current view. The robot appears to be near a doorway.", "reasoning_and_reflection": "Since the strawberry and the right counter are not visible in the current visual state, the first step should be to navigate to the right counter where the strawberry might be located. Once there, the robot can pick up the strawberry if it's present. After picking up the strawberry, the robot needs to move to the left counter's right receptacle to place the strawberry. If the strawberry is not found at the right counter, further exploration may be …

Item 2 · 0% solve rate

Put a pan with an apple in it in the fridge.

How subjects answered

  • Claude 3.5 Sonnet incorrect

    [{"visual_description": "The image shows what appears to be a kitchen environment with a stainless steel refrigerator visible. The floor appears to be tiled and there are some shadows cast on it. The refrigerator has a water/ice dispenser on the front.", "reasoning_and_reflection": "To complete this task, we need to first locate and pick up an apple, then find a pan, put the apple in the pan, and finally place the pan with the apple in the refrigerator. The task requires careful sequencing of actions to ensure each object is properly handled and placed.", "language_plan": "1. Find an apple in the environment\n2. Pick up the apple\n3. Locate a pan\n4. Put down the apple\n5. Pick up the …

  • Claude 3.7 Sonnet incorrect

    [{"visual_description": "The image shows a kitchen environment with a stainless steel refrigerator visible in the center. The refrigerator has a water/ice dispenser on its door. To the left side of the image, there appears to be a wooden cabinet or door panel. On the right side, there seems to be a dark-colored cabinet or appliance. The floor is tiled in a light brown color. There appears to be some light coming in, creating shadow patterns on the floor. The kitchen appears to be modern and clean.", "reasoning_and_reflection": "Based on the image, I can see a refrigerator which will be needed for the final step of the task. To complete the instruction 'Put a pan with an apple in it in the …

  • Gemini 1.5 Pro incorrect

    [{"visual_description": "I see a fridge, a table, cabinets, and part of an oven.", "reasoning_and_reflection": "I need to find a pan and an apple. Then I will put the apple in the pan and place the pan with the apple into the fridge.", "language_plan": "First, I'll find a pan and pick it up. Then, I'll find an apple and pick it up. I'll put the apple in the pan. Finally, I'll find the fridge, open it, put the pan with the apple inside, and close the fridge.", "executable_plan": [{"step_id": 1, "img_path": "images/gemini-1.5-pro_vlm_subset_10shots_imgsize500/base/episode_35/episode_35_step_1.png", "action": [7, "find a Pan"], "action_success": 1.0, "env_feedback": "Last action executed …

  • Ovis2-34B incorrect

    [{"visual_description": "The image shows a kitchen scene with a refrigerator, a stove, and a countertop. There is a blue plate on the countertop, and a shadow is cast on the floor. The refrigerator is closed, and there is a towel hanging on the oven door.", "reasoning_and_reflection": "To complete the task, I need to locate a pan and an apple, place the apple in the pan, and then put the pan in the fridge. I will start by finding the pan and the apple, then proceed to place the apple in the pan, and finally, place the pan in the fridge.", "language_plan": "First, locate a pan and pick it up. Next, find an apple and pick it up. Place the apple in the pan. Then, find the refrigerator, open …

  • Qwen-VL-Max incorrect

    [{"visual_description": "The image shows a kitchen scene with a refrigerator, a countertop, and a stove. The floor is tiled, and there are cabinets on the left side of the image. No visible pan or apple is present in this view.", "reasoning_and_reflection": "To put a pan with an apple in it into the fridge, I need to first locate a pan and an apple. Since neither is visible in the current view, I will start by finding a pan. Once the pan is found, I can then find an apple and place it inside the pan. After that, I will open the fridge and place the pan inside before closing the fridge. If the pan or apple is not immediately visible, I may need to explore other parts of the room or check …

  • Qwen2.5-VL-72B-Instruct incorrect

    [{"visual_description": "The image shows a kitchen scene with a refrigerator, a countertop, and a stove. The floor is tiled, and there are cabinets on the left side of the image. No visible objects like a pan or an apple are present in this view.", "reasoning_and_reflection": "To put a pan with an apple in it into the fridge, I need to first locate a pan and an apple. Since neither is visible in the current view, I should start by finding them. Once found, I can pick up the apple, place it in the pan, and then move towards the fridge to open it and place the pan inside. If the pan or apple is not immediately visible, I may need to explore other parts of the room or check different …
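
The responses above appear to be JSON lists whose entries carry `visual_description`, `reasoning_and_reflection`, `language_plan`, and `executable_plan` fields, with each plan step holding `step_id`, `img_path`, `action`, `action_success`, and `env_feedback`. The sketch below shows one way such a record might be summarized. The function name and the example record are hypothetical, and because the samples on this page are truncated they would not parse as shown.

```python
import json
from typing import Any


def summarize_response(raw: str) -> list[dict[str, Any]]:
    """Return one summary dict per entry in a response string."""
    entries = json.loads(raw)  # the samples start with '[', i.e. a JSON list
    summary = []
    for entry in entries:
        steps = entry.get("executable_plan", [])
        summary.append({
            "language_plan": entry.get("language_plan", ""),
            # each action looks like [action_id, action_name]
            "actions": [s["action"][1] for s in steps],
            # action_success appears as 1.0 in one sample and true in another
            "all_succeeded": all(bool(s.get("action_success")) for s in steps),
        })
    return summary


# Hypothetical complete record built from the field names visible above.
example = json.dumps([{
    "visual_description": "A kitchen with a fridge.",
    "reasoning_and_reflection": "Find the pan first.",
    "language_plan": "Find a pan.",
    "executable_plan": [
        {"step_id": 1, "action": [7, "find a Pan"], "action_success": 1.0},
    ],
}])
result = summarize_response(example)
```

Note that step-level `action_success` is not the same thing as the per-episode task success recorded in the response matrix: a plan can execute every step and still fail the task.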

Item 3 · 0% solve rate

Get near the flat, round item with a shiny golden surface.

How subjects answered

  • Claude 3.5 Sonnet incorrect

    [{"visual_description": "I can see a dining area with a table, and there appears to be a round, metallic golden plate or decorative item on the table. The item is positioned slightly to the front and right of my current viewpoint.", "reasoning_and_reflection": "From the examples provided, I learned that the most efficient approach is to use forward and sideways movements as the primary navigation strategy, while avoiding unnecessary rotations. The target object is visible and accessible, so I can start moving toward it directly. Following Example 0's approach of getting as close as possible, I should start with forward movement to approach the table.", "language_plan": "1. Move forward to …

  • Claude 3.7 Sonnet incorrect

    [{"visual_description": "I can see a dining area with a wooden table in the center. On the table, there appears to be a blue object and what looks like a round, flat item with a golden/shiny surface - likely a plate or decorative dish. The room has light-colored walls, wooden flooring, and some framed artwork visible on the wall to the left. There are chairs positioned around the table, and the space appears to be part of a home interior.", "reasoning_and_reflection": "Based on the image, I can identify that the target object - a flat, round item with a shiny golden surface (likely a decorative plate) - is located on the dining table in front of me. Following the strategy provided, I should …

  • Gemini 1.5 Pro incorrect

    [{"visual_description": "I am in a room with light blue walls. In front of me is a partially visible dining table with white chairs. To my left, there's a framed picture on the wall, and what appears to be the edge of some furniture, possibly a changing table or the back of a sofa. The lighting in the room is soft and even.", "reasoning_and_reflection": "I've analyzed the provided examples and understand the importance of efficient navigation and obstacle avoidance. The strategy emphasizes using forward and lateral movements as the primary means of navigation, reserving rotations for when the target object is out of view. Currently, I cannot locate the "flat, round item with a shiny …

  • Ovis2-34B incorrect

    [{"visual_description": "The image shows a dining area with a table in the center. On the table, there is a blue vase and a flat, round item with a shiny golden surface. The item is located near the center of the table.", "reasoning_and_reflection": "Based on the visual observation, the target object, a flat, round item with a shiny golden surface, is on the table. To approach it, I need to move forward towards the table. There are no immediate obstacles in the way, so I can proceed directly.", "language_plan": "1. Move forward towards the table to get closer to the flat, round item with a shiny golden surface.", "executable_plan": [{"step_id": 1, "img_path": …

  • Qwen-VL-Max incorrect

    [{"visual_description": "The current view shows a dining area with a wooden table, chairs, and a blue vase on the table. The room has light-colored walls, a framed picture on the left wall, and a window on the right side. There is no visible flat, round item with a shiny golden surface in the immediate field of view.", "reasoning_and_reflection": "Based on the visual state, the target object (flat, round item with a shiny golden surface) is not currently visible. Following the strategy, I need to rotate my view to locate the target object. Since rotation changes the field of view, I will plan one rotation at a time until the object appears. No previous actions have been executed yet, so …

  • Qwen2.5-VL-72B-Instruct incorrect

    [{"visual_description": "The current view shows a room with a dining table in the center, surrounded by chairs. There is a blue vase on the table and framed artwork on the walls. The floor is wooden, and there are no visible flat, round items with a shiny golden surface.", "reasoning_and_reflection": "Based on the visual observation, the target object (a flat, round item with a shiny golden surface) is not currently visible. To locate it, I need to rotate my view to explore other parts of the room. Since rotation changes the field of view, I will perform a single rotation action at a time until the target object appears.", "language_plan": "Step 1: Rotate to the right by 90 degrees to …

Item 4 · 13% solve rate

Stack the maroon cylinder and the navy cylinder in sequence.

How subjects answered

  • Qwen2.5-VL-72B-Instruct correct

    [{"visual_description": "From left to right, I can see a green cylinder at [63, 32, 17], a navy cylinder at [51, 42, 18], a maroon cylinder at [63, 58, 17] and a red cylinder at [37, 64, 17].", "reasoning_and_reflection": "To stack the objects in sequence, the maroon cylinder at [63, 58, 17] should be at the bottom and the navy cylinder at [51, 42, 18] should be placed on top. The plan involves moving the gripper to the navy cylinder's position, closing the gripper to pick it up, moving to the top of the maroon cylinder, and opening the gripper to release the navy cylinder onto the maroon cylinder.", "language_plan": "Step 1: Move to the navy cylinder\nStep 2: Pick up the navy …

  • Claude 3.5 Sonnet incorrect

    [{"visual_description": "From left to right, I can see a green cylinder at [63, 32, 17], a navy cylinder at [51, 42, 18], a maroon cylinder at [63, 58, 17], and a red cylinder at [37, 64, 17].", "reasoning_and_reflection": "I need to understand the instruction first. To stack the objects in sequence, the maroon cylinder at [63, 58, 17] should be at the bottom and the navy cylinder at [51, 42, 18] should be placed on top. The plan involves moving the gripper to the navy cylinder's position, closing the gripper to pick it up, moving to the top of the maroon cylinder, and opening the gripper to release the navy cylinder onto the maroon cylinder.", "language_plan": "1. Move gripper above navy …

  • Gemini 2.0 Flash incorrect

    [{"visual_description": "From left to right, I can see a lime cylinder at [63, 32, 17], a blue cylinder at [51, 42, 18], a maroon cylinder at [63, 58, 17] and a navy cylinder at [37, 64, 17].", "reasoning_and_reflection": "I need to understand the instruction first. To stack the objects in sequence, the navy cylinder at [37, 64, 17] should be at the bottom and the maroon cylinder at [63, 58, 17] should be placed on top. The plan involves moving the gripper to the maroon cylinder's position, closing the gripper to pick it up, moving to the top of the navy cylinder, and opening the gripper to release the maroon cylinder onto the navy cylinder.", "language_plan": "1. Move to the maroon …

  • InternVL2.5-78B incorrect

    [{"visual_description": "From left to right, I can see a green cylinder at [63, 32, 17], a blue cylinder at [51, 42, 18], a red cylinder at [63, 58, 17], and a maroon cylinder at [37, 64, 17].", "reasoning_and_reflection": "I need to understand the instruction first. To stack the objects in sequence, the maroon cylinder at [37, 64, 17] should be at the bottom and the navy cylinder at [51, 42, 18] should be placed on top. The plan involves moving the gripper to the navy cylinder's position, closing the gripper to pick it up, moving to the top of the maroon cylinder, and opening the gripper to release the navy cylinder onto the maroon cylinder.", "language_plan": "1. Move gripper to the navy …

  • Llama-3.2-90B-Vision-Instruct incorrect

    [{"visual_description": "From left to right, I can see a red cylinder at [63, 32, 17], a navy cylinder at [51, 42, 18], a maroon cylinder at [63, 58, 17] and a green cylinder at [37, 64, 17].", "reasoning_and_reflection": "I need to understand the instruction first. To stack the objects in sequece, the maroon cylinder at [63, 58, 17] should be at the bottom and the navy cylinder at [51, 42, 18] should be placed on top. The plan involves moving the gripper to the navy cylinder's position, closing the gripper to pick it up, moving to the top of the maroon cylinder, and opening the gripper to release the navy cylinder onto the maroon cylinder.", "language_plan": "[[51, 42, 26, 0, 60, 34, 1], …

  • Qwen-VL-Max incorrect

    [{"visual_description": "From left to right, I can see a green cylinder at [63, 32, 17], a navy cylinder at [51, 42, 18], a maroon cylinder at [63, 58, 17] and a red cylinder at [37, 64, 17].", "reasoning_and_reflection": "To stack the objects in sequence, the maroon cylinder at [63, 58, 17] should be at the bottom and the navy cylinder at [51, 42, 18] should be placed on top. The plan involves moving the gripper to the navy cylinder's position, closing the gripper to pick it up, moving to the top of the maroon cylinder, and opening the gripper to release the navy cylinder onto the maroon cylinder.", "language_plan": "Step 1: Move to the navy cylinder\nStep 2: Pick up the navy …

Item 5 · 18% solve rate

Move all the hammer from the brown table to the right counter.

How subjects answered

  • GPT-4o correct

    [{"visual_description": "The image shows a room with a kitchen area in the background. There is a brown table visible with some objects on it. The right counter in the kitchen is also visible.", "reasoning_and_reflection": "The task is to move all hammers from the brown table to the right counter. The brown table is identified as table 2. The robot needs to navigate to table 2, pick up any hammers present, and then navigate to the right counter to place them there. The plan involves checking table 2 for hammers and moving them to the right counter.", "language_plan": "First, navigate to table 2 to check for hammers. If a hammer is found, pick it up and navigate to the right counter to place …

  • Qwen-VL-Max correct

    [{"visual_description": "The image shows a room with various pieces of furniture, including a sofa, a chair, a table (possibly the brown table), and a TV stand. There is no visible hammer on the brown table in this view. The floor has a patterned rug, and there are other objects like a lamp and some cabinets in the background.", "reasoning_and_reflection": "To move all hammers from the brown table to the right counter, I need to first locate the hammer. Since it's not visible in the current visual state, I should navigate to the brown table (table 2) to check if the hammer is there. If found, I will pick it up and then move to the right counter to place it. If the hammer is not at the brown …

  • Claude 3.5 Sonnet incorrect

    [{"visual_description": "The image shows a living room environment with various furniture pieces including a brown table (table 2) visible in the scene. From this viewpoint, I can see a hammer on the brown table.", "reasoning_and_reflection": "Based on the visual information, I can see a hammer on the brown table (table 2). The task requires moving this hammer to the right counter in the kitchen. This is a straightforward task that requires navigating to the hammer's location, picking it up, and then moving it to the target location.", "language_plan": "1. Navigate to table 2 (brown table) where the hammer is located\n2. Pick up the hammer\n3. Navigate to the right counter in the …

  • Llama-3.2-90B-Vision-Instruct incorrect

    [{"visual_description": "The robot is currently in a room with a brown table and a right counter. The brown table has a hammer on it, and the right counter is empty.", "reasoning_and_reflection": "To move all the hammers from the brown table to the right counter, the robot needs to first navigate to the brown table, pick up the hammer, and then move to the right counter to place it. Since there is only one hammer on the brown table, the robot only needs to perform this sequence of actions once.", "language_plan": "The plan is to navigate to the brown table, pick up the hammer, and then move to the right counter to place it.", "executable_plan": [{"step_id": 1, "img_path": …

  • Ovis2-34B incorrect

    [{"visual_description": "The image shows a room with a brown table, a sofa, and a chair. There is a blue robotic arm visible, suggesting the robot is in the room. The brown table is in the foreground, and the sofa is in the background. The right counter is not visible in the image.", "reasoning_and_reflection": "The task is to move all hammers from the brown table to the right counter. The image shows the brown table, but the hammers are not visible. The robot must navigate to the brown table to locate the hammers. Once the hammers are found, the robot will pick them up and navigate to the right counter to place them there.", "language_plan": "First, navigate to the brown table to locate …

  • Qwen2.5-VL-72B-Instruct incorrect

    [{"visual_description": "The image shows a living room with various pieces of furniture. There is a sofa, an armchair, a table (possibly the brown table), and a TV stand. The floor has two rugs, and there is a lamp near the TV stand. No hammer is visible on the brown table in this view.", "reasoning_and_reflection": "Since the hammer is not visible on the brown table in the current visual state, it may be necessary to navigate closer to the table or explore other areas where the hammer might be located. Once the hammer is found, it can be picked up and moved to the right counter. If the hammer is indeed on the brown table but not visible from this angle, navigating directly to the table …

Item 6 · 27% solve rate

Put a glass with a butter knife in it and put them in a sink.

How subjects answered

  • Claude 3.5 Sonnet correct

    [{"visual_description": "The image shows a corner of what appears to be a room with beige or tan colored walls and tiled flooring. The walls meet at a corner and appear clean and plain.", "reasoning_and_reflection": "To complete this task, we need to handle multiple objects (a glass and a butter knife) and place them together in a sink. We'll need to first locate both objects, then combine them, and finally place them in the sink. Since we can't directly put the knife into a glass that we're not holding, we'll need to pick up the glass first, then the knife, and finally move both to the sink.", "language_plan": "1. Find a glass bottle to use as the container. 2. Pick up the glass bottle. 3. …

  • Claude 3.7 Sonnet correct

    [{"visual_description": "The image shows a corner of a room with a plain brown wall and a tiled floor. There appears to be a wall junction visible. The space is empty with no visible objects, furniture, or appliances in the frame.", "reasoning_and_reflection": "Based on the visual input, I can see an empty room corner with no objects visible. To complete the task of putting a glass with a butter knife in a sink, I will need to first locate these objects since they are not visible in the current view. I will need to find a glass bottle (as there is no specific 'glass' in the action list), a butter knife, and a sink. Then I will need to arrange them as instructed.", "language_plan": "1. Find …

  • GPT-4o correct

    [{"visual_description": "The image shows a section of a room with a yellow wall and a tiled floor. There are no visible objects or furniture in the immediate view.", "reasoning_and_reflection": "The task requires placing a glass with a butter knife inside it into a sink. The current visual state does not show any objects, so the first step is to locate the necessary items: a glass, a butter knife, and a sink. The previous plan failed because there was no attempt to locate these objects first.", "language_plan": "First, locate a glass and pick it up. Then, find a butter knife and place it inside the glass. Next, search for a sink and place the glass with the butter knife inside it.", …

  • Ovis2-34B incorrect

    [{"visual_description": "The image shows a plain wall with a tiled floor, suggesting a room with minimal visible objects. The perspective is from a low angle, possibly indicating the robot is positioned near the floor or a lower surface.", "reasoning_and_reflection": "The current visual state does not provide information about the location of the glass or the butter knife. To accomplish the task, the robot must first locate these items. The plan will involve finding the glass and butter knife, picking them up, and then placing them in the sink. The robot should ensure that it follows the action rules and does not perform invalid actions, such as attempting to pick up an object without first …

  • Qwen-VL-Max incorrect

    [{"visual_description": "The image shows a corner of a room with a plain wall and tiled floor. There are no visible objects, furniture, or receptacles in the immediate view.", "reasoning_and_reflection": "Since there are no visible objects or receptacles in the current view, I need to locate the glass and butter knife first. The task requires finding these items, placing the butter knife inside the glass, and then putting both into the sink. Given the lack of visible objects, I should start by searching for the glass and butter knife using the 'find' action.", "language_plan": "First, find the glass. Once located, pick it up. Next, find the butter knife and place it inside the glass. Then, …

  • Qwen2.5-VL-72B-Instruct incorrect

    [{"visual_description": "The image shows a corner of a room with a plain wall and tiled floor. There are no visible objects, furniture, or receptacles in the immediate view.", "reasoning_and_reflection": "Since there are no visible objects or receptacles in the current visual state, I need to locate the glass and butter knife first. The task requires finding these items before putting them in the sink. Given the lack of visible objects, I will start by searching for the glass and then the butter knife.", "language_plan": "First, find the glass. Once located, pick it up. Next, find the butter knife and place it inside the glass. Then, find the sink and put the glass with the butter knife …

Item 7 · 36% solve rate

Move closer to the small round object with a green surface and a cylindrical shape.

How subjects answered

  • Qwen-VL-Max correct

    [{"visual_description": "The current view shows a kitchen with tiled walls, wooden cabinets, and a countertop. A small round green object with a cylindrical shape is visible on the right side of the counter near the sink.", "reasoning_and_reflection": "Based on the visual observation, the target object (small round green object) is located to the right of the current position. The previous actions have not been executed yet, so there is no feedback to reflect upon. To achieve the goal, I need to move closer to the object by navigating towards it using forward and rightward movements.", "language_plan": "Step 1: Move rightward by 0.25\nStep 2: Move forward by 0.25", "executable_plan": …

  • Qwen2.5-VL-72B-Instruct correct

    [{"visual_description": "The current view shows a kitchen with tiled walls, wooden cabinets, and various appliances. A small round object with a green surface and cylindrical shape is visible on the countertop near the sink.", "reasoning_and_reflection": "Based on the visual observation, the target object (small round green object) is located on the countertop near the sink. The previous actions have not been executed yet, so there is no feedback to reflect upon. To achieve the goal, I need to move closer to the object by navigating towards it using forward and right/leftward movements.", "language_plan": "Step 1: Move forward by 0.25\nStep 2: Move rightward by 0.25", "executable_plan": …

  • GPT-4o correct

    [{"visual_description": "The kitchen is visible with a stove on the left, a sink in the center, and a counter with various objects. A small round object with a green surface and cylindrical shape is on the counter near the sink.", "reasoning_and_reflection": "The target object, a small round object with a green surface and cylindrical shape, is located on the counter near the sink, slightly to the right. There are no immediate obstacles blocking the path to the object. Following the strategy, I will move forward and slightly to the right to get closer to the object.", "language_plan": "1. Move forward by 0.25. 2. Move rightward by 0.25.", "executable_plan": [{"step_id": 1, "img_path": …

  • InternVL2.5-78B incorrect

    [{"visual_description": "I am currently in a small kitchen with wooden cabinets and a tiled floor. There is a gas stove in front of me to the left and a kitchen counter to the right. Various kitchen items are scattered on the counter. A fridge is visible in the far corner near the sink, which has a window above it. The sky visible through the window indicates daytime.", "reasoning_and_reflection": "Based on the visual state, I need to locate the small round object with a green surface and a cylindrical shape. My current view does not include this object, so I will first need to locate it by rotating. Once found, I can plan a navigation path to move closer to it. My main focus is on …

  • Gemini 2.0 Flash incorrect

    [{"visual_description": "I can see the small round object with a green surface and a cylindrical shape in the right side of the image, it is on the counter. I need to move closer to it.", "reasoning_and_reflection": "The object is on the right side in front of me. I will move forward and rightward to get closer to it. I am following the strategy of using move forward and move right/left as main strategy.", "language_plan": "1: Move forward by 0.25. 2: Move rightward by 0.25.", "executable_plan": [{"step_id": 1, "img_path": "images/gemini-2.0-flash_additional_nav_no_c_h/visual_appearance/episode_15/episode_15_step_1.png", "action": [0, "Move forward by 0.25"], "action_success": true, …

  • Llama-3.2-90B-Vision-Instruct incorrect

    [{"visual_description": "The current visual state shows a kitchen scene with various objects, including a sink, stove, and cabinets. The small round object with a green surface and a cylindrical shape is located on the counter near the sink.", "reasoning_and_reflection": "Based on the previous examples and the strategy provided, I understand that I need to navigate to the target object by moving forward, backward, left, or right, and rotating to change my view. Since the target object is in front of me, I will start by moving forward to get closer to it.", "language_plan": "Step 1: Move forward by 0.25 to approach the counter where the target object is located. Step 2: Move leftward by 0.25 …

Item 8 · 45% solve rate

When the fridge is open, relocate an plate to the black table; if it's closed, shift an screwdriver to the black table.

How subjects answered

  • Claude 3.5 Sonnet correct

    [{"visual_description": "In the current scene, I can see a refrigerator in a kitchen environment. The refrigerator appears to be closed based on the visual.", "reasoning_and_reflection": "Since the refrigerator is currently closed, according to the conditional instruction, I should focus on moving a screwdriver to the black table (table 1). This requires first finding the screwdriver, then navigating to the black table to place it.", "language_plan": "Since the refrigerator is closed, the plan is to move a screwdriver to the black table. The steps are: 1. Navigate to the TV stand where the screwdriver might be located, 2. Pick up the screwdriver, 3. Navigate to table 1 (the black table), 4. …

  • Claude 3.7 Sonnet correct

    [{"visual_description": "The image shows a kitchen environment with light gray cabinets along the wall. There's a countertop with drawers underneath. The floor is wooden with a light finish. A small blue rug or mat is visible on the floor. On the right side of the image, there appears to be part of a blue and white object. The refrigerator is visible but appears to be closed in this view. The kitchen has a clean, minimalist design with neutral colors.", "reasoning_and_reflection": "Based on the visual information, I can see that the refrigerator appears to be closed. According to the instruction, I need to determine whether the refrigerator is open or closed, and then perform the …

  • Gemini 1.5 Pro correct

    [{"visual_description": "The refrigerator is closed. There is a brown table (table 2), a black table (table 1), a right counter, and a left counter. The screwdriver is on the right counter.", "reasoning_and_reflection": "The task is conditional, depending on the state of the refrigerator. Since the refrigerator is currently closed, the second part of the instructions (moving the screwdriver) is triggered. I will execute the plan to move the screwdriver to table 1. If the refrigerator were open, the plan would involve moving a plate to table 1 instead. ", "language_plan": "I observe that the refrigerator is closed. Therefore, I should locate a screwdriver and move it to the black table …

  • InternVL2.5-78B incorrect

    [{"visual_description": "The image shows a living room with a sofa, a table with bananas, and a cabinet. There is no visible refrigerator or plate in the image.", "reasoning_and_reflection": "The task requires checking the state of the refrigerator and then moving either a plate or a screwdriver to the black table. Since the refrigerator is not visible in the image, the first step is to navigate to the refrigerator to check its state.", "language_plan": "1. Navigate to the refrigerator to check if it is open or closed. 2. If the refrigerator is open, locate and pick up a plate. 3. If the refrigerator is closed, locate and pick up a screwdriver. 4. Navigate to the black table (table 1) and …

  • Ovis2-34B incorrect

    [{"visual_description": "The image shows a kitchen scene with a refrigerator on the left, a counter with a hammer on it, and a robot arm in the foreground. The refrigerator appears to be closed.", "reasoning_and_reflection": "The refrigerator is currently closed, so the task requires relocating a screwdriver to the black table. The robot should first navigate to the location of the screwdriver, pick it up, and then move to the black table to place it there.", "language_plan": "The plan is to navigate to the location of the screwdriver, pick it up, and then move to the black table to place it there.", "executable_plan": [{"step_id": 1, "img_path": …

  • Llama-3.2-90B-Vision-Instruct incorrect

    [{"visual_description": "The current visual state shows a room with various objects and furniture. The refrigerator is open, and there is a plate on the left counter. There is also a screwdriver on the table 2. The black table is empty.", "reasoning_and_reflection": "Based on the human instruction, if the fridge is open, the robot should relocate the plate to the black table. Since the fridge is indeed open, the robot will focus on this task. The robot will first navigate to the left counter to pick up the plate, then move to the black table (identified as table 1) to place the plate there.", "language_plan": "To achieve the goal, the robot must locate the plate on the left counter and then …

Item 9 · 55% solve rate

Grab an item for securing the loose screw within the left counter.

How subjects answered

  • Claude 3.5 Sonnet correct

    [{"visual_description": "The scene shows a kitchen environment with various storage areas including cabinets, counters, and a refrigerator. There's a left counter visible in the kitchen area, and various tools and items appear to be scattered across different surfaces.", "reasoning_and_reflection": "To secure a loose screw, we would need a tool like a screwdriver. The most appropriate tool for this task would be a screwdriver, which should be located within the left counter according to the instruction. We need to navigate to the left counter first to access the tool.", "language_plan": "1. Navigate to the left counter in the kitchen to locate the screwdriver\n2. Pick up the screwdriver …

  • InternVL3-78B correct

    [{"visual_description": "The image shows a room with a wooden floor, a sofa, a table, a TV stand, and a kitchen area with cabinets and a refrigerator. The robot is positioned near the kitchen area.", "reasoning_and_reflection": "The task is to grab an item for securing a loose screw within the left counter. The robot needs to navigate to the left counter to find a suitable tool, such as a screwdriver or wrench, to secure the screw.", "language_plan": "First, navigate to the left counter in the kitchen. Then, pick up a suitable tool, such as a screwdriver or wrench, from the counter. Finally, use the tool to secure the loose screw.", "executable_plan": [{"step_id": 1, "img_path": …

  • Gemini 2.0 Flash correct

    [{"visual_description": "The image shows a room with a chair, a blue cabinet, a table, a TV stand, and a partial view of a door. There is a white bowl on top of the blue cabinet. A gray chair is in the middle of the scene. I do not see any tool now.", "reasoning_and_reflection": "The user is asking me to find an item to secure a loose screw inside the left counter. I should navigate to the left counter first. Then, I need to pick an item from the counter. I will pick up the screwdriver since it can be used to secure loose screws.", "language_plan": "The robot needs to navigate to the left counter in the kitchen and then pick up the screwdriver.", "executable_plan": [{"step_id": 1, …

  • GPT-4o incorrect

    [{"visual_description": "The image shows a kitchen area with a counter and some utensils on it. There is a visible screwdriver on the counter.", "reasoning_and_reflection": "The task requires securing a loose screw, which typically involves using a screwdriver. The screwdriver is visible on the counter, making it accessible for picking up. The robot needs to navigate to the left counter, pick up the screwdriver, and then proceed to secure the screw.", "language_plan": "To secure the loose screw, the robot should first navigate to the left counter where the screwdriver is located. Then, it should pick up the screwdriver to use it for securing the screw.", "executable_plan": [{"step_id": 1, …

  • InternVL2.5-78B incorrect

    [{"visual_description": "The image shows a kitchen with a wooden floor, cabinets, a sink, and various items on the counters. There is a blue mat on the floor near the sink.", "reasoning_and_reflection": "The task is to grab an item for securing a loose screw within the left counter. The most suitable item for this task would be a screwdriver. The screwdriver is likely to be found in a drawer or cabinet. The robot should navigate to the left counter and check the drawers or cabinets for the screwdriver.", "language_plan": "1. Navigate to the left counter in the kitchen. 2. Open the left drawer of the kitchen counter. 3. Pick up the screwdriver from the drawer. 4. Close the left drawer of the …

  • Llama-3.2-90B-Vision-Instruct incorrect

    [{"visual_description": "The image shows a kitchen with various objects, including a left counter, a right counter, a sink, and several cabinets. On the left counter, there is a sponge, a plate, and a bowl. On the right counter, there is a can, a peach, and a spoon. In the background, there is a refrigerator, a stove, and a table with a chair.", "reasoning_and_reflection": "To secure the loose screw within the left counter, we need to find an item that can be used for this purpose. Based on the visual state, we can see that there is a sponge on the left counter, which could potentially be used to secure the screw. However, it's not clear if the sponge is the best option or if there are …

Item 10 · 64% solve rate

The morning light streaming in through the window is soft and gentle. A small plant on the windowsill adds a touch of green to the room. Could you please move close to the alarm clock and remain there? The quietness of the room is perfect for a relaxing start to the day.

How subjects answered

  • Claude 3.7 Sonnet correct

    [{"visual_description": "I'm in what appears to be a bedroom. There's a desk with a computer setup on the left side, a bed on the right side, and a ceiling fan above. The room has light-colored walls with some framed items hanging on them. There's a small nightstand visible near the bed which likely contains the alarm clock mentioned in the instruction.", "reasoning_and_reflection": "Based on the strategy and examples provided, I need to first locate the target object (alarm clock) and then navigate toward it using primarily forward and lateral movements, minimizing rotations. From my current position, I can see the bedroom layout but cannot clearly identify the alarm clock. Following the …

  • InternVL2.5-78B correct

    [{"visual_description": "The image shows a room with a desk, chair, and a bed with a nightstand. The alarm clock is on the nightstand next to the bed.", "reasoning_and_reflection": "I need to locate the alarm clock which is on the nightstand beside the bed. The nightstand is to the right side of the room, and I see the bed and nightstand in the front-right direction.", "language_plan": "Step 1: Move forward twice.\nStep 2: Move rightward twice.", "executable_plan": [{"step_id": 1, "img_path": "images/InternVL2_5-78B_additional_nav_no_c_h/complex_instruction/episode_42/episode_42_step_1.png", "action": [0, "Move forward by 0.25"], "action_success": true, "env_feedback": "Last action …

  • GPT-4o correct

    [{"visual_description": "The room contains a desk with books and a chair, a bed, a nightstand with a lamp, and a ceiling fan. The alarm clock is on the nightstand next to the bed.", "reasoning_and_reflection": "The alarm clock is located on the nightstand to the right of the bed. To reach it, I need to navigate around the bed and nightstand. The strategy suggests using forward and side movements to approach the target object. I will move forward and then rightward to get closer to the alarm clock.", "language_plan": "1. Move forward by 0.25. 2. Move rightward by 0.25.", "executable_plan": [{"step_id": 1, "img_path": …

  • Claude 3.5 Sonnet incorrect

    [{"visual_description": "From my current position, I can see a desk with a computer setup on the left side of the room. There appears to be a bed on the right side, and I can see what looks like a nightstand or small table near the bed which likely has the alarm clock.", "reasoning_and_reflection": "Learning from the provided strategy and examples, I understand that I should first locate the target object and then use primarily forward and sideways movements to reach it. Since I can't directly see the alarm clock from my current position, I should first rotate to get it in view, following Example 2's approach of using rotation sparingly and only when necessary to locate the target.", …

  • Gemini 2.0 Flash incorrect

    [{"visual_description": "I can see the alarm clock on the nightstand to my right. There is also a lamp on the nightstand. I am currently facing the desk and chair.", "reasoning_and_reflection": "The alarm clock is to my right. I need to turn to the right and then move forward to get closer to it. I will use the strategy of rotating and then moving forward.", "language_plan": "Step 1: Rotate to the right by 90 degrees. Step 2: Move forward by 0.25.", "executable_plan": [{"step_id": 1, "img_path": "images/gemini-2.0-flash_additional_nav_no_c_h/complex_instruction/episode_42/episode_42_step_1.png", "action": [4, "Rotate to the right by 90 degrees."], "action_success": true, "env_feedback": …

  • Qwen-VL-Max incorrect

    [{"visual_description": "The room is well-lit with soft morning light coming through a window. A small plant sits on the windowsill, adding greenery to the space. The alarm clock is located on a bedside table near the bed, and there are various objects like a desk, chair, floor lamp, and books in the room.", "reasoning_and_reflection": "Based on the current visual state, the alarm clock is positioned on the bedside table near the bed. To reach it, I need to move towards the right side of the room. There are no immediate obstacles blocking the path. From previous examples, I understand that moving forward and adjusting my position using left/right movements will help me get closer to the …

Item 11 · 82% solve rate

The soft crackling of a candle fills the room with a soothing background sound. There's a pile of mail sitting on the table, unopened. Please move towards the pot for me. Everything feels calm and in place.

How subjects answered

  • Claude 3.5 Sonnet correct

    [{"visual_description": "I observe a kitchen environment with white cabinets and countertops. There appears to be a pot visible on the counter or stove area in the front-right portion of my view.", "reasoning_and_reflection": "From the examples provided, I learned that the most efficient approach is to use forward and sideways movements as the primary navigation strategy, while using rotation sparingly. The examples also showed that it's better to plan 1-2 actions initially to assess the situation before making longer sequences of moves. Since the pot is visible in my front-right field of view, I can start moving towards it without needing any rotation.", "language_plan": "1. Move forward …

  • Claude 3.7 Sonnet correct

    [{"visual_description": "I can see a kitchen area with white cabinets and countertops. There's a small dining table or counter with what appears to be some items on it in the foreground. The kitchen has standard appliances including what looks like a stove or cooktop on the right side. The walls are a light beige color, and there's a ceiling light visible. The floor appears to be tiled.", "reasoning_and_reflection": "Based on the human instruction, I need to navigate toward a pot in the kitchen. From the current view, I can see the kitchen area, but I need to look more carefully to locate the pot. Following the strategy provided, I should use forward and lateral movements as my main …

  • Gemini 1.5 Pro correct

    [{"visual_description": "I am in a kitchen. I see a table in front of me with a plate, a loaf of bread, a wine glass, and a bottle of wine. To my right is a counter with a microwave, sink, and stove. There are cabinets above the counter. To my left is a blank wall.", "reasoning_and_reflection": "The instruction is to move towards the pot. I can see the pot on the stove to my right. Currently, there are no obstacles between me and the stove. I have learned from the examples to prioritize moving forward and sideways and to use rotation sparingly. I also learned to assess the environment and adjust my actions based on obstacles and the location of the target object. I will move to the right …

  • Qwen2.5-VL-72B-Instruct correct

    [{"visual_description": "The current view shows a kitchen with various appliances and furniture. A dining table is visible on the left side, and a stove with a pot on it is located towards the right side of the room. The floor has a tiled pattern, and there are cabinets and countertops along the walls.", "reasoning_and_reflection": "Based on the visual observation, the target object (the pot) is located on the stove to the right side of the room. There are no immediate obstacles in the path towards the pot. Following the strategy, I should move forward and slightly to the right to approach the pot. Since the pot is clearly visible, rotation is not necessary at this step.", "language_plan": …

  • InternVL2.5-78B incorrect

    [{"visual_description": "I am in a well-lit kitchen area. The room contains a sink, a stove, a microwave on the counter, and a table with a chair. The pot mentioned is not visible in this initial view.", "reasoning_and_reflection": "My initial view does not include the pot. According to the provided strategy, I should aim to rotate until I can see the pot. This strategy suggests using rotation only until the target object appears in the field of view. I have not yet executed any actions.", "language_plan": "1. Rotate right by 90 degrees to get a new perspective. 2. Look for any visible signs of the target pot. If the pot is not found, repeat the rotation until the pot is visible in the …

  • GPT-4o incorrect

    [{"visual_description": "The kitchen is visible with a table on the left, a counter with a sink and various items on the right, and a pot on the stove in the back right corner.", "reasoning_and_reflection": "The task is to move towards the pot, which is located on the stove in the back right corner of the kitchen. The path seems clear, with the table on the left and the counter on the right. Following the strategy, I will move forward and slightly to the right to approach the pot.", "language_plan": "1. Move forward by 0.25. 2. Move forward by 0.25. 3. Move rightward by 0.25.", "executable_plan": [{"step_id": 1, "img_path": …

Item 12 · 91% solve rate

When organizing reading spaces, kindly move the book to the couch. It’s perfect for when the next chapter awaits.

How subjects answered

  • Claude 3.5 Sonnet correct

    [{"visual_description": "The scene shows a living room setup with a coffee table in the center and a couch/sofa with brown cushions. There's also a TV mounted on a white cabinet and some orange-colored chairs. The room has hardwood flooring with a light-colored area rug.", "reasoning_and_reflection": "This is a new task that requires moving a book to the couch. The task is straightforward and can be accomplished in a few steps since both the book and couch are common furniture items that should be accessible in the living room setting.", "language_plan": "1. First, locate the book in the room. 2. Once found, pick up the book carefully. 3. Next, find the sofa's location. 4. Finally, place …

  • Claude 3.7 Sonnet correct

    [{"visual_description": "The image shows a living room with wooden furniture. There is a coffee table in the center with what appears to be a statue or decorative object on it. There are two orange/wooden armchairs visible, and a couch or sofa with a brown/patterned cushion. A white TV stand with a flat-screen TV is mounted on the wall. The floor has a light gray carpet area over what looks like wooden flooring. There appears to be a book on the coffee table that needs to be moved to the couch according to the instruction.", "reasoning_and_reflection": "Based on the visual information, I can see a living room setup with a coffee table that likely has a book on it, and a couch where the book …

  • Gemini 1.5 Pro correct

    [{"visual_description": "The image shows a living room scene. There's a wooden coffee table in the center with a statue, a cardboard box, and some other small items on it. Two wooden armchairs are present, one of which has a brown cushion. A TV is mounted on a white cabinet against the wall. A trash can is visible near the wall. The floor is wooden, and a large rug covers a portion of it.", "reasoning_and_reflection": "The task is to move a book to the couch. To achieve this, I need to first find the book, then pick it up. After that, I need to locate the couch and finally put the book down on it. This is a straightforward pick-and-place task.", "language_plan": "First, I'll locate the book …

  • Ovis2-34B correct

    [{"visual_description": "The image shows a living room with a coffee table in the center, a TV on a white cabinet, and two armchairs. There is a box on the coffee table, and a statue is visible. A couch is partially visible in the foreground. The task is to move a book to the couch.", "reasoning_and_reflection": "The image does not show a book, so the first step is to locate the book. Once the book is found, it can be picked up and moved to the couch. The plan should ensure that the book is placed on the couch, as instructed.", "language_plan": "First, locate the book in the room. Once the book is found, pick it up. Then, find the couch and place the book on it.", "executable_plan": …

  • Qwen2.5-VL-72B-Instruct correct

    [{"visual_description": "The image shows a living room with various pieces of furniture. There is a coffee table in the center with some objects on it, including a statue and possibly a book. A couch is visible on the right side of the image. The room also contains an armchair, a TV on a stand, and a rug covering part of the floor.", "reasoning_and_reflection": "To move the book to the couch, I need to first locate the book. It might be on the coffee table or another surface in the room. Once found, I will pick up the book and then move towards the couch to place it there. If the book is not immediately visible, I may need to explore other surfaces like the TV stand or shelves.", …

  • InternVL2.5-78B incorrect

    [{"visual_description": "The room contains a coffee table with a box and a statue, a TV on a white cabinet, two wooden armchairs, and a brown couch. There is no visible book in the scene.", "reasoning_and_reflection": "The task requires moving a book to the couch. However, there is no visible book in the current scene. The next step is to locate the book.", "language_plan": "1. Locate the book in the room. 2. Pick up the book. 3. Find the couch. 4. Place the book on the couch.", "executable_plan": [{"step_id": 1, "img_path": "images/InternVL2_5-78B_vlm_subsets_step30_10examples_size500/complex_instruction/episode_25/episode_25_step_1.png", "action": [54, "find a Book"], "action_success": …
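The truncated responses above appear to be JSON lists whose entries carry `visual_description`, `reasoning_and_reflection`, `language_plan`, and `executable_plan` fields, with each plan step holding `step_id`, `img_path`, `action` (an `[id, name]` pair), `action_success`, and `env_feedback`. A minimal sketch of how such a record might be read, assuming the full JSON follows the visible fragments; the sample record and the helper name `summarize_actions` are made up for illustration and are not taken from the dataset:

```python
import json
from typing import List, Optional, Tuple

# Hypothetical record built from the field names visible in the truncated
# responses. The values are illustrative, not dataset content.
SAMPLE = json.dumps([{
    "visual_description": "A bedroom with a nightstand to the right.",
    "reasoning_and_reflection": "The target is ahead and to the right.",
    "language_plan": "1. Move forward by 0.25. 2. Rotate to the right by 90 degrees.",
    "executable_plan": [
        {"step_id": 1, "action": [0, "Move forward by 0.25"],
         "action_success": True},
        {"step_id": 2, "action": [4, "Rotate to the right by 90 degrees."],
         "action_success": False},
    ],
}])


def summarize_actions(raw: str) -> List[Tuple[int, str, Optional[bool]]]:
    """Return (step_id, action_name, action_success) for every plan step."""
    steps = []
    for turn in json.loads(raw):
        # A turn without an executable plan contributes no steps.
        for step in turn.get("executable_plan", []):
            _action_id, action_name = step["action"]
            steps.append((step["step_id"], action_name, step.get("action_success")))
    return steps


print(summarize_actions(SAMPLE))
```

Because the responses on this page are cut off with an ellipsis, they will not parse as shown; the sketch only applies to complete records.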

Subjects

  1. Claude 3.7 Sonnet · 0.5522
  2. Claude 3.5 Sonnet · 0.5121
  3. GPT-4o · 0.4832
  4. InternVL3-78B · 0.4556
  5. Gemini 1.5 Pro · 0.4149
  6. Qwen-VL-Max · 0.3843
  7. Gemini 2.0 Flash · 0.3792
  8. Ovis2-34B · 0.3767
  9. Qwen2.5-VL-72B-Instruct · 0.3594
  10. InternVL2.5-78B · 0.3449
  11. Llama-3.2-90B-Vision-Instruct · 0.2805
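
The page does not say how the subject scores or item solve rates are computed. One plausible reading, given a binary response matrix with unobserved cells, is the mean of the observed outcomes per row (subject) or per column (item). The sketch below works under that assumption; the toy matrix, subject names, and function names are hypothetical and do not reproduce the figures above.

```python
from typing import Dict, List, Optional

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved

# Hypothetical toy matrix: subjects as rows, items as columns.
TOY: Dict[str, List[Cell]] = {
    "model_a": [1, 0, None, 1],
    "model_b": [0, 0, 1, 1],
}


def observed_mean(cells: List[Cell]) -> Optional[float]:
    """Mean of observed cells, or None when nothing was observed."""
    observed = [c for c in cells if c is not None]
    return sum(observed) / len(observed) if observed else None


def subject_scores(matrix: Dict[str, List[Cell]]) -> Dict[str, Optional[float]]:
    """Per-subject success rate over that subject's observed items."""
    return {name: observed_mean(row) for name, row in matrix.items()}


def item_solve_rates(matrix: Dict[str, List[Cell]]) -> List[Optional[float]]:
    """Per-item solve rate over the subjects observed on that item."""
    return [observed_mean(list(col)) for col in zip(*matrix.values())]


print(subject_scores(TOY))
print(item_solve_rates(TOY))
```

Skipping unobserved cells rather than counting them as failures matters when coverage differs across subjects; treating them as 0 would lower the scores of subjects with more missing cells.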