
Multimodal

WAGIBench

WAGIBench evaluates egocentric multimodal goal inference for assistive wearable agents. Given video, audio, digital, and longitudinal context from smart glasses, a vision-language model (VLM) must infer the user's goal. 8 VLMs are evaluated on the generative (open-set goal generation) task and scored per recording by an LLM judge. The item text shown is the actual multimodal multiple-choice goal question; per-recording predictions are released for 7 input modalities.

Items: 2,921
Subjects: 8
Observed: 100%
Subject type: Model
License: CC-BY-NC-4.0
Domains: general, gui_agent
Modalities: text, image, audio

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 8 subjects × 2,921 items, 100% of cells evaluated.
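The page does not say how the rows and columns are ordered. One simple way to get the "strongest models and easiest items toward one corner" layout is to sort rows by mean score and columns by mean score, both descending. The sketch below assumes that approach; the function name and the toy data are hypothetical and are not taken from the benchmark.

```python
from statistics import mean

def order_response_matrix(scores):
    """Sort subjects and items by mean score, highest first.

    scores: dict mapping subject name -> list of per-item scores in [0, 1],
            all lists the same length (an assumed input layout).
    Returns (subjects, item_order, matrix), where matrix rows follow
    `subjects` and matrix columns follow `item_order`.
    """
    subjects = sorted(scores, key=lambda s: mean(scores[s]), reverse=True)
    n_items = len(next(iter(scores.values())))
    item_means = [mean(scores[s][j] for s in scores) for j in range(n_items)]
    item_order = sorted(range(n_items), key=lambda j: item_means[j], reverse=True)
    matrix = [[scores[s][j] for j in item_order] for s in subjects]
    return subjects, item_order, matrix

# Toy data (hypothetical): 3 subjects x 4 items.
toy = {
    "model_a": [0, 1, 0, 0],
    "model_b": [1, 1, 0, 1],
    "model_c": [0, 1, 0, 1],
}
subjects, item_order, matrix = order_response_matrix(toy)
```

On the toy data the high scores collect in the top-left corner, which is the visual pattern the description refers to.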

[Figure: WAGIBench response matrix, AI models (rows) against items (columns). Color scale runs from low to high; unobserved cells are marked separately.]

Scale: 0 to 1 (per modality combination): multiple-choice accuracy or mean LLM-judge score across vision / audio / language inputs.

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
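The page does not describe how the sample items were picked. As a rough illustration, the sketch below assumes that an item's solve rate is the fraction of subjects scoring 1 on it, and that a "spread" means items evenly spaced along the difficulty ranking. Both function names and the toy scores are hypothetical.

```python
def solve_rates(scores):
    """Per-item fraction of subjects that scored 1.

    scores: dict mapping subject name -> list of 0/1 per-item scores
            (an assumed input layout).
    """
    n_subjects = len(scores)
    n_items = len(next(iter(scores.values())))
    return [sum(row[j] for row in scores.values()) / n_subjects
            for j in range(n_items)]

def spread_sample(rates, k):
    """Pick k item indices evenly spaced from hardest to easiest."""
    order = sorted(range(len(rates)), key=lambda j: rates[j])
    if k >= len(order):
        return order
    if k == 1:
        return [order[0]]
    step = (len(order) - 1) / (k - 1)
    return [order[round(i * step)] for i in range(k)]

# Toy data (hypothetical): 4 subjects x 5 items.
toy_scores = {
    "s1": [0, 1, 1, 0, 1],
    "s2": [0, 1, 1, 0, 1],
    "s3": [0, 0, 1, 1, 1],
    "s4": [0, 0, 0, 0, 1],
}
rates = solve_rates(toy_scores)
picked = spread_sample(rates, 3)
```

Here `picked` contains the hardest item, a mid-difficulty item, and the easiest item, mirroring the 0% to 86% range of the samples on this page.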

Item 1: 0% solve rate

Infer the wearable-agent user's goal from the observed egocentric context.

Observed context:

  • The user searches the room for something. [vision]
  • The user says "I can't find my keys". [audio]

Which goal is the user most likely pursuing?

  • A.

    Do a search for 'how to brew counter culture using drip brew'

  • B.

    Remember I ate 12 total grams of protein

  • C.

    Do a search for 'where are my keys'

  • D.

    Send a message to Dana saying 'thank you note for gift'

How subjects answered

  • InternVL-2B (score 0)

    Based on the frames provided:

    • The user is in a kitchen and kitchen island area. A digital action they might want to perform could be "search" related to their current needs, such as getting knives or kitchen utensils.

    Fill in the TEMPLATE with this information:

    {
        "action": "search",
        "query_item": "knives",
        "source": "kitchen",
        "content": "What knives are available?"
    }

    This indicates the user is likely in a need-sensing scenario, possibly deciding where to search for specific kitchen items.

  • Qwen2.5-VL-72B (score 0)
    {
        "search": {
            "type": "search",
            "source": ["world"],
            "query_item": "backpack",
            "query": "items needed for hiking"
        }
    }

    Reasoning: The user appears to be preparing for an outdoor activity, possibly hiking, as indicated by the backpack and items like bananas and snacks on the table. They may want to search for information related to hiking essentials or checklists on their phone.

  • Qwen2.5-VL-3B (score 0)

    { "communication": { "type": "communication", "action": "message", "target": "<name of person>", "content": "Hey, I'm leaving soon. Can you take care of the kids while I'm away?" } }

Item 2: 20% solve rate

Infer the wearable-agent user's goal from the observed egocentric context.

Observed context:

  • The user takes Omega-3 (Fish Oil). [vision]
  • Reminder to "take Omega-3 (Fish Oil)" goes off. [digital]

Which goal is the user most likely pursuing?

  • A.

    Shop for zip up shirt medium

  • B.

    Set a timer/reminder for daily with the note 'set a reminder to take Omega-3 (Fish Oil)'

  • C.

    Do a search for 'who is cynthia erivo'

  • D.

    Remember I ate 8 total grams of protein

How subjects answered

  • Llama-4-Scout (score 1)

    To determine the digital action the user might want to do on their phone, let's analyze the context provided by the images.

    1. Contextual Clues: The user is in a dimly lit room, sitting on what appears to be a bed, with a red blanket or sheet. There's a phone on the bed, and the user interacts with it and a piece of paper. The room has a calm and possibly a pre-sleep or relaxation atmosphere.

    2. User Actions: The user picks up the phone, looks at it, and then seems to be interacting with it directly. They also handle a piece of paper, possibly a note or a ticket.

    3. Possible Intentions: Given the calm atmosphere and the actions, the user might be looking to wind down or prepare for sleep. The presence of a piece of paper that resembles a lottery ticket or a similar document suggests they might be checking something.

    4. Digital Actions Based on Context:

      • Search: The user might want to search for information related to the paper they are holding, possibly to verify a code or check a result.
      • Communication: There's no clear indication they want to communicate with someone immediately.
      • Store Memory: They might want to save the information on the paper to their memory or notes.
      • Temporal Attention: There's no clear indication of setting a reminder.
      • Guided Activity: The atmosphere suggests relaxation but no specific guided activity is indicated.
      • Control Environment: There's no indication of controlling smart devices.
      • Translate: Not indicated.
      • Find Directions: Not indicated.
      • Shop: Not indicated.
      • Entertain: The user might look for something entertaining, like music or videos, but the primary focus seems to be on the paper.

    Given these considerations, the most likely action seems to be related to either checking information on the paper through search or saving it for later reference.

    Proposed Action:

    {
        "search": {
            "type": "search",
            "source": ["world"],
            "query_item": "lottery ticket",
            "query": "check lottery ticket"
        }
    }

    Or,

    {
        "store_memory": {
            "type": "store_memory",
            "content": "lottery ticket details"
        }
    }

    However, since the exact nature of the paper (whether it's a lottery ticket, a note, etc.) isn't clear but seems to be something the user is focused on and possibly needs to verify or remember, the search action seems more universally applicable for checking or verifying information related to the paper.

    Thus, focusing on a likely scenario:

    {
        "search": {
            "type": "search",
            "source": ["world"],
            "query_item": "verify lottery ticket",
            "query": "check if I won"
        }
    }
  • InternVL-78B (score 1)
    {
        "store_memory": {
            "type": "store_memory",
            "content": "Took pills and had a drink of water"
        }
    }
  • Qwen2.5-VL-7B (score 0)

    Based on the images provided, it seems that the user is in a bedroom and is interacting with their phone. The user picks up the phone, which appears to be displaying a message or notification. After checking the phone, the user then picks up a water bottle and a piece of paper, possibly a calendar or schedule.

    Given these observations, a plausible digital action the user might want to perform could be related to setting a reminder or managing their daily tasks. Here's a suggested digital action:

    {
        "store_memory": {
            "type": "store_memory",
            content: "Check the calendar for important dates.",
            # e.g., "Add bowtie pasta to my grocery list"
        }
    }

    This action suggests that the user might want to store a memory of checking the calendar for important dates, which aligns with the context of picking up a calendar-like item.

Item 3: 31% solve rate

Infer the wearable-agent user's goal from the observed egocentric context.

Observed context:

  • The user has a purchase receipt. [vision]
  • friend says, “thanks for getting breakfast, let me know how much I owe you”. [audio]

Which goal is the user most likely pursuing?

  • A.

    Do a search for 'how much #x owes you for breakfast cost of breakfast'

  • B.

    Do a search for 'nearby movie times for Mufasa: The Lion King'

  • C.

    Send a message to Ed saying 'thank you note for gift'

  • D.

    Remember I ate 6 total grams of protein

How subjects answered

  • InternVL-8B (score 1)
    {
        “search” : {
            type: “search”,
            source: ["world", "memory"],
            query_item: "Costco receipt",
            query: "Costco product list"
        }
    }
  • InternVL-78B (score 1)
    {
        "search": {
            "type": "search",
            "source": ["world"],
            "query_item": "receipt",
            "query": "itemized list of products"
        }
    }
  • Qwen2.5-VL-7B (score 0)

    Based on the context provided in the video, it appears that the user is holding a Costco receipt and a pen, possibly reviewing or planning something related to the receipt. The person sitting on the couch seems to be engaged with their phone. Given this scenario, a plausible digital action could be related to organizing or managing the receipt information.

    Here's a suggested digital action:

    {
        "shop": {
            "type": "shop",
            "content": "Reviewing Costco receipt"
        }
    }

    This action suggests that the user might want to review the items purchased at Costco and possibly plan future shopping trips based on the receipt.

Item 4: 39% solve rate

Infer the wearable-agent user's goal from the observed egocentric context.

Observed context:

  • The user walks towards the front door. [vision]
  • The user looks at different pairs of shoes. [vision]

Which goal is the user most likely pursuing?

  • A.

    Do a search for 'information about the nearby forecast'

  • B.

    Launch a guided activity to do elliptical recovery stretching routine with a duration of 0:03:18

  • C.

    Do a search for 'fact check what animal does ham come from'

  • D.

    Shop for glasses

How subjects answered

  • Llama-4-Scout (score 1)

    Based on the images provided, which show a person organizing their closet, the digital action they might want to do on their phone could be related to managing their daily schedule or tasks, possibly triggered by the act of organizing or planning. Given the context, a likely action could be:

    {
        "temporal_attention": {
            "action": "set",
            "time": "later today",
            "content": "Get shoes for the kids"
        }
    }

    This action suggests that while organizing the closet, the user might be reminded of a task they need to accomplish, such as getting shoes for the kids, and decides to set a reminder for later in the day. This action aligns with the context of organizing and possibly running errands or planning for the future based on the visual cues from the closet organization.

  • Qwen2.5-VL-3B (score 0)

    { "type": "communication", "action": "message", "target": "<name of person>", "content": "Hey, I'm packing my bag and need some help with organizing my clothes." }

  • InternVL-2B (score 0)
    {
        "search": {
            "type": "search",
            "source": ["world"],
            "query_item": "smartphones"
        },
        "store_memory": {
            "type": "store_memory",
            "content": "Fetching phone update",
            "time": "Now"
        },
        "temporal_attention": {
            "type": "temporal_attention",
            "action": "set",
            "time": "10 min"
        },
        "guided_activity": {
            "type": "guided_activity",
            "content": "Fishing activity",
            "time": "10 min"
        },
        "control_environment": {
            "type": "control_environment",
            "target": "smart lights in kitchen",
            "value": "on"
        },
        "communication": {
            "type": "communication",
            "action": "message",
            "target": "<name of person>",
            "content": "Fetching update for smartphones"
        },
        "translate": {
            "type": "translate",
            "src_language": "en"
        },
        "find_directions": {
            "type": "find_directions",
            "to": "charging station"
        },
        "shop": {
            "type": "shop",
            "content": "Battery pack"
        },
        "entertain": {
            "type": "entertain",
            "modality": "vision",
            "content": "Music playlist"
        }
    }

Item 5: 47% solve rate

Infer the wearable-agent user's goal from the observed egocentric context.

Observed context:

  • The user struggles to do Drawing. [vision]

Which goal is the user most likely pursuing?

  • A.

    Do a search for 'fact check tupac is the greates rapper of all time'

  • B.

    Translate vision from dutch to English

  • C.

    Shop for hand lotion?

  • D.

    Do a search for 'tutorials for Drawing'

How subjects answered

  • Llama-4-Scout (score 1)

    Based on the images provided, it seems that the user is in a living room, sitting at a coffee table with a notebook, colored pencils, and other items. The user appears to be writing or drawing.

    Given this context, a likely digital action the user might want to do on their phone is to access their Notes app to potentially save or view some notes.

    Here is the digital action in the required template format:

    { "search": { "type": "search", "source": ["world"], "query_item": "Notes app", "query": "open notes" } }

  • Qwen2.5-VL-72B (score 1)

    { "search": { "type": "search", "source": ["world"], "query_item": "drawing techniques", "query": "basic drawing tips" } }

  • InternVL-2B (score 0)

    Based on the sequence of images provided, it appears that the user is using a green pen to write or draw in a notebook at the table.

    Given the context of the scene:

    • The primary focus is on the act of writing in the notebook.
    • There is no mention of scheduling, memory, or environmental control actions in the timeline.
    • The user appears to focus on their task, utilizing the notebook and stationery.

    Based on these observations, the appropriate digital action from the provided template seems:

    {
        "entertain": {
            "modality": ["vision"],
            "content": "Note-taking"
        }
    }

    This fills in the template by emphasizing the act of writing, which is central to the activity. Additionally, it specifies that the activity is for entertainment, focusing on the act of note-taking in the visual mode.

Item 6: 86% solve rate

Infer the wearable-agent user's goal from the observed egocentric context.

Observed context:

  • Trying to solve a Rubik's cube. [vision]

Which goal is the user most likely pursuing?

  • A.

    Remember I ate 10 total grams of protein

  • B.

    Do a search for 'tutorials for Rubiks cube'

  • C.

    Do a search for 'information about hope studio'

  • D.

    Control thermostat, set to 78

How subjects answered

  • Llama-4-Scout (score 1)

    Based on the images provided, which show a person holding and manipulating a Rubik's Cube in a room, the digital action they might want to do on their phone could be related to solving the cube or learning new techniques. Here is the action:

    { "search": { "type": "search", "source": ["world"], "query_item": "Rubik's Cube solution", "query": "How to solve a Rubik's Cube" } }

  • Qwen2.5-VL-7B (score 1)

    Based on the images provided, it appears that the user is holding a Rubik's cube and seems to be in the process of solving it. Given this context, a likely digital action the user might want to perform on their phone could be related to finding instructions or tips for solving the Rubik's cube.

    Here is the corresponding digital action:

    { "search": { "type": "search", "source": ["world"], "query_item": "Rubik's cube", "query": "solving guide" } }

  • InternVL-2B (score 0)

    r
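
Many of the sample answers above wrap a JSON action in free-form reasoning, and some are not valid JSON at all (curly quotes, unquoted keys, stray comments). WAGIBench scores answers with an LLM judge, so the following is not part of its scoring; it is only a minimal sketch, using Python's standard `json` module, of how a reader might pull the first well-formed JSON object out of such an answer for their own analysis. The helper name `first_json_object` is hypothetical.

```python
import json

def first_json_object(text):
    """Return the first well-formed JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at index pos
            # and ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
    return None

# Answer text copied from the Llama-4-Scout response to Item 5.
answer = (
    'Here is the digital action in the required template format:\n\n'
    '{ "search": { "type": "search", "source": ["world"], '
    '"query_item": "Notes app", "query": "open notes" } }'
)
action = first_json_object(answer)

# Curly quotes and unquoted keys, as in some answers, are not valid JSON.
malformed = '{ “search” : { type: “search” } }'
nothing = first_json_object(malformed)
```

Malformed answers come back as `None`, so a caller can count how often each model fails to follow the template rather than silently dropping those cases.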

Subjects

The models, agents, and reward models evaluated.

8 subjects, ranked by mean response across this benchmark's items.

  1. Qwen2.5-VL-72B: 0.467
  2. InternVL-78B: 0.454
  3. Llama-4-Scout: 0.414
  4. Qwen2.5-VL-7B: 0.389
  5. InternVL-8B: 0.325
  6. Qwen2.5-VL-3B: 0.273
  7. Llama-3.2-11B-Vision: 0.255
  8. InternVL-2B: 0.231