Embodied Web Agents

Embodied Web Agents (indoor cooking, text-based): per-task pass/fail results for the GPT-4o (gpt-4o) run. Each task pairs a web subtask (finding or buying the recipe items) with an embodied subtask (bringing the recipe's objects to their final states in AI2-THOR). Released as two per-task score files.

876 items
1 subject
100% observed
MIT license
gui_agent domain
reasoning domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 1 subject × 876 items, 100% of cells evaluated.

Embodied Web Agents response matrix: AI models (rows) against items (columns)
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect
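
The ordering described above can be sketched as a sort by mean score on both axes. This is a minimal illustration, not the site's actual implementation: the function name `order_response_matrix`, the nested-dict layout, and the toy subjects and items are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def order_response_matrix(
    responses: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str]]:
    """Sort subjects by accuracy and items by solve rate, both descending.

    `responses[subject][item]` is 1 (correct) or 0 (incorrect); a missing
    key means the cell is unobserved. The layout is hypothetical.
    """
    subjects = list(responses)
    items = sorted({item for row in responses.values() for item in row})

    def subject_mean(subject: str) -> float:
        row = responses[subject]
        return sum(row.values()) / len(row) if row else 0.0

    def item_mean(item: str) -> float:
        scores = [responses[s][item] for s in subjects if item in responses[s]]
        return sum(scores) / len(scores) if scores else 0.0

    # Ties are broken by name so the ordering is deterministic.
    ordered_subjects = sorted(subjects, key=lambda s: (-subject_mean(s), s))
    ordered_items = sorted(items, key=lambda i: (-item_mean(i), i))
    return ordered_subjects, ordered_items


# Toy data, not taken from the benchmark.
toy = {
    "model_a": {"t1": 1, "t2": 0, "t3": 1},
    "model_b": {"t1": 1, "t2": 0, "t3": 0},
}
subjects, items = order_response_matrix(toy)
print(subjects)  # strongest subject first
print(items)     # easiest item first
```

With a single subject, as in this benchmark, the row order is trivial and only the column sort has any effect.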

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate

Cook Apple and Tomato Sandwich. Difficulty level: Simple. Include ingredient: Bread. Recipe steps:

  • Slice the apple.
  • Place the sliced apple on the bread and serve them on a plate.
  • Put tomato slices on top of the bread.

Subject outcomes

  • GPT-4o incorrect

Item 2 · 0% solve rate

Cook Garden Salad. Exclude ingredient: Egg. Recipe steps:

  • Pick up the Apple and slice it.
  • Place the sliced Apple into the Bowl.
  • Pick up the Lettuce and slice it.
  • Place the sliced Lettuce into the Bowl.
  • Add the tomato by placing it into the Bowl.

Subject outcomes

  • GPT-4o incorrect

Item 3 · 0% solve rate

Cook Rustic Vegetable Medley. Use equipment: Pan. Recipe steps:

  • Slice the apple, bread, and potato.
  • Place sliced apple, bread, and potato into the pan.
  • Cook the apple, bread, and potato in the pan on the stove.
  • Slice the tomato.
  • Combine all ingredients in the bowl.

Subject outcomes

  • GPT-4o incorrect

Item 4 · 0% solve rate

Cook Fresh Apple Salad. Include ingredient: Egg. Recipe steps:

  • Slice the apple.
  • Slice the bread.
  • Slice the potato.
  • Crack the egg and cook it.
  • Mix all sliced ingredients in a bowl.
  • Add the cooked egg on top of the salad.

Subject outcomes

  • GPT-4o incorrect

Item 5 · 0% solve rate

Cook Fresh Apple Salad. Include ingredient: Potato. Recipe steps:

  • Slice the apple.
  • Slice the bread.
  • Slice the potato.
  • Crack the egg and cook it.
  • Mix all sliced ingredients in a bowl.
  • Add the cooked egg on top of the salad.

Subject outcomes

  • GPT-4o incorrect

Item 6 · 50% solve rate

Cook Fresh Apple Salad. Diet type: Non-vegetarian. Use equipment: StoveBurner. Recipe steps:

  • Slice the apple.
  • Slice the bread.
  • Slice the potato.
  • Crack the egg and cook it.
  • Mix all sliced ingredients in a bowl.
  • Add the cooked egg on top of the salad.

Subject outcomes

  • GPT-4o correct

Item 7 · 50% solve rate

Cook Apple and Egg Breakfast Bowl. Difficulty level: Hard. Recipe steps:

  • Slice the apple.
  • Slice the bread.
  • Crack the egg.
  • Place the sliced bread in the pan.
  • Cook the bread in the pan.
  • Place the cracked egg in the pan.
  • Cook the egg in the pan.
  • Place the sliced apple, cooked bread, and cooked egg into the bowl.
  • Add lettuce to the bowl.
  • Serve with a plate underneath the bowl.

Subject outcomes

  • GPT-4o correct

Item 8 · 50% solve rate

Cook Rustic Vegetables. Difficulty level: Hard. Include ingredient: Potato. Recipe steps:

  • Slice the apple, potato, and tomato.
  • Place the sliced apple, potato, and tomato into the bowl.
  • Transfer the sliced potato into the pan.
  • Cook the sliced potato in the pan on the stove burner.
  • Serve everything in the bowl.

Subject outcomes

  • GPT-4o correct

Item 9 · 50% solve rate

Cook Simple Vegetable Omelette. Use equipment: Pan. Recipe steps:

  • Slice the apple and potato, then break the egg into a bowl.
  • Cook the sliced potato in the pan until soft, then add the egg, cooking until the egg is set.
  • Serve the omelette with the sliced apple on a plate.

Subject outcomes

  • GPT-4o correct

Item 10 · 50% solve rate

Cook Spanish Potato Omelette. Diet type: Non-vegetarian. Use equipment: Plate. Recipe steps:

  • Slice the potato.
  • Crack the egg into the pan.
  • Slice the tomato.
  • Place the sliced potato into the pan with the egg.
  • Cook the mixture on the stove burner until the potato is cooked and the egg is set.

Subject outcomes

  • GPT-4o correct

Item 11 · 50% solve rate

Cook Fresh Garden Salad. Exclude ingredient: Egg. Recipe steps:

  • Slice the apple, lettuce, and tomato.
  • Mix the sliced apple, lettuce, and tomato in the bowl.
  • Serve the salad on the plate.

Subject outcomes

  • GPT-4o correct

Item 12 · 50% solve rate

Cook Simple Apple and Potato Dish. Include ingredient: Apple. Use equipment: Bowl. Recipe steps:

  • Slice the apple and potato.
  • Cook the potato in the pan.
  • Serve the apple and potato in the bowl.

Subject outcomes

  • GPT-4o correct

Subjects

The models, agents, and reward models evaluated.

1 subject, ranked by mean response (accuracy) across this benchmark's items.

  1. GPT-4o · 0.339
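
As a rough illustration of the ranking, the sketch below computes mean accuracy from per-task pass/fail scores and sorts subjects by it. It assumes the mean response is a plain pass rate over all items; the helper names and the toy scores are hypothetical. Under that assumption, 297 passes out of 876 is the only whole-number count that rounds to the reported 0.339, but that count is inferred here, not published on this page.

```python
from typing import Dict, List, Tuple


def mean_accuracy(scores: List[int]) -> float:
    """Mean of 0/1 task outcomes (assumed to be a plain pass rate)."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)


def rank_subjects(per_subject: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Return (subject, mean accuracy) pairs, best first, ties by name."""
    means = {name: mean_accuracy(s) for name, s in per_subject.items()}
    return sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores for two made-up subjects, not benchmark data.
ranking = rank_subjects({"agent_x": [1, 0, 0, 1], "agent_y": [1, 1, 0, 1]})
print(ranking)

# Consistency check for the reported figure, under the pass-rate assumption:
# which integer pass counts out of 876 round to 0.339?
n_items = 876
candidates = [k for k in range(n_items + 1) if round(k / n_items, 3) == 0.339]
print(candidates)
```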