Agents & Tool Use

REAL

REAL evaluates autonomous web agents on deterministic simulations of real websites. Tasks span high-fidelity website clones (Staynb, DashDish, ...), and each task has a deterministic success checker made of one or more evals. The response is the per-attempt binary task success (1 if and only if all of the task's evals pass).
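The scoring rule described above (a task succeeds only when every one of its evals passes) can be sketched as follows. This is a minimal illustration, not the benchmark's own checker: the function name `task_success` and the list-of-booleans input are assumptions made for the example.

```python
from typing import Sequence


def task_success(eval_results: Sequence[bool]) -> int:
    """Return 1 if every eval for the task passed, otherwise 0.

    `eval_results` is a hypothetical representation: one boolean per eval
    attached to the task. The source says each task has one or more evals,
    so an empty list is treated as an error here.
    """
    if not eval_results:
        raise ValueError("a task needs at least one eval")
    return int(all(eval_results))


print(task_success([True, True]))   # 1
print(task_success([True, False]))  # 0
```

Under this rule a single failed eval zeroes the attempt, so there is no partial credit.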

233 items
44 subjects
42% observed
License: unknown
Domain: gui_agent
Modality: text, gui_screenshot

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 44 subjects × 233 items, 42% of cells evaluated.
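A rough sketch of how such a matrix could be summarized and ordered is shown below. The page does not state its exact ordering method, so sorting rows and columns by their mean over observed cells is an assumption, as are the names `mean_observed`, `observed_fraction`, and `order_matrix` and the toy data. Unobserved cells are represented as `None`.

```python
from typing import Iterable, Optional

# subject -> item -> response (1, 0, or None when the cell was not evaluated)
Matrix = dict[str, dict[str, Optional[int]]]


def mean_observed(values: Iterable[Optional[int]]) -> float:
    """Mean of the observed (non-None) responses; 0.0 if none were observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else 0.0


def observed_fraction(matrix: Matrix, items: list[str]) -> float:
    """Share of subject-by-item cells that were actually evaluated."""
    cells = [matrix[s].get(i) for s in matrix for i in items]
    return sum(c is not None for c in cells) / len(cells)


def order_matrix(matrix: Matrix, items: list[str]) -> tuple[list[str], list[str]]:
    """Strongest subjects first, easiest items first (assumed ordering rule)."""
    rows = sorted(matrix, reverse=True,
                  key=lambda s: mean_observed(matrix[s].get(i) for i in items))
    cols = sorted(items, reverse=True,
                  key=lambda i: mean_observed(matrix[s].get(i) for s in matrix))
    return rows, cols


# Hypothetical toy data, not taken from the benchmark.
toy = {
    "agent_c": {"t1": None, "t2": 0, "t3": None},
    "agent_b": {"t1": 1, "t2": 0, "t3": 0},
    "agent_a": {"t1": 1, "t2": 1, "t3": None},
}
items = ["t3", "t1", "t2"]

print(round(observed_fraction(toy, items), 2))  # 0.67
print(order_matrix(toy, items))
# (['agent_a', 'agent_b', 'agent_c'], ['t1', 't2', 't3'])
```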

[Figure: REAL response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · blank = unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate

Show me places in Provence, France for 2 adults and 1 child for the dates August 1st-4th. I need wifi.

Subject outcomes

  • amazon/nova-act-v1.0 incorrect
  • anthropic-computer_use incorrect
  • deepseek/deepseek-v3.2-exp incorrect
  • x-ai/grok-4-fast incorrect
  • anthropic/claude-sonnet-4.5 incorrect
  • google/gemini-2.5-pro incorrect
Item 2 · 0% solve rate

Compare all the stays listed by review quality and price and rank them by affordability. Please provide a ranked list showing each stay's name, rating, and price.

Subject outcomes

  • amazon/nova-act-v1.0 incorrect
  • GBOX incorrect
  • Test incorrect
  • Llama 4 maverick incorrect
  • o1 mini incorrect
  • Gemma 3 27b it incorrect
Item 3 · 4% solve rate

Create a job post for a UI/UX Designer with expertise in Figma, Sketch, and Adobe Creative Suite, including project details, timeline, and required skills (Wireframing, Prototyping, Responsive Design).

Subject outcomes

  • AGI agent 0 correct
  • GBOX incorrect
  • amazon/nova-act-v1.0 incorrect
  • Llama 4 maverick incorrect
  • o1 mini incorrect
  • Gemma 3 27b it incorrect
Item 4 · 13% solve rate

Find and book the stay with the best value for money (cheapest stay with the best reviews) for 1 day. For fields you don't know the answer for, just fill them in with anything of your choice.

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • AGI agent 0 correct
  • Claude 3.7 Sonnet:thinking correct
  • Llama 4 maverick incorrect
  • o1 mini incorrect
  • Gemma 3 27b it incorrect
Item 5 · 20% solve rate

Find a home in San Diego priced under $150,000 with at least 2 bedrooms and request a tour. Use these details: Contact Name: John Doe, Email: johndoe@example.com, Phone: 555-123-4567, Tour Time: 2:00 PM, Tour Date: First available.

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • GBOX correct
  • KISS-1 correct
  • Llama 4 maverick incorrect
  • o1 mini incorrect
  • Gemma 3 27b it incorrect
Item 6 · 28% solve rate

Filter the listings in San Jose to display only "Townhomes" within a price range of $750,000 to $1,000,000. How many results are displayed?

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • GBOX correct
  • KISS-1 correct
  • GPT 4o incorrect
  • Llama 4 maverick incorrect
  • Gemma 3 27b it incorrect
Item 7 · 40% solve rate

Mark the first email in the Inbox as "read".

Subject outcomes

  • GBOX correct
  • KISS-1 correct
  • Test correct
  • Llama 4 maverick incorrect
  • o1 mini incorrect
  • Gemma 3 27b it incorrect
Item 8 · 50% solve rate

Show me places in San Francisco with wifi for the dates September 27-29th for 2 people.

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • anthropic-computer_use correct
  • anthropic/claude-sonnet-4 correct
  • anthropic/claude-3.7-sonnet:thinking incorrect
  • openai/gpt-5 incorrect
  • google/gemini-2.5-flash incorrect
Item 9 · 63% solve rate

Email Brian King and tell him to send me the meeting notes.

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • anthropic-computer_use correct
  • anthropic/claude-sonnet-4.5 correct
  • openai/gpt-5-nano incorrect
  • anthropic/claude-sonnet-4 incorrect
  • google/gemini-2.5-flash incorrect
Item 10 · 76% solve rate

List the top 5 most recently shared posts on the homepage.

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • GBOX correct
  • KISS-1 correct
  • Qwen2.5 vl 32b instruct incorrect
  • Llama 3.1 8b instruct incorrect
  • Gemma 3 27b it incorrect
Item 11 · 86% solve rate

Book me a ride from 1001 Castro Street to 1030 Post Street Apartments.

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • anthropic-computer_use correct
  • anthropic/claude-sonnet-4 correct
  • google/gemini-2.5-flash correct
  • anthropic/claude-sonnet-4.5 correct
  • meta-llama/llama-4-maverick incorrect
Item 12 · 100% solve rate

DashDish: Place an order for any type of sub-sandwich, keep the total under $30

Subject outcomes

  • amazon/nova-act-v1.0 correct
  • anthropic-computer_use correct
  • deepseek/deepseek-v3.2-exp correct
  • google/gemini-2.5-flash correct
  • openai/gpt-5 correct
  • anthropic/claude-sonnet-4.5 correct

Subjects

The models, agents, and reward models evaluated.

44 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. anthropic/claude-sonnet-4.5 · 0.7473
  2. anthropic-computer_use · 0.7273
  3. anthropic/claude-3.7-sonnet:thinking · 0.7179
  4. google/gemini-2.5-pro · 0.6782
  5. anthropic/claude-sonnet-4 · 0.6667
  6. google/gemini-2.5-flash · 0.6604
  7. x-ai/grok-4-fast · 0.6429
  8. amazon/nova-act-v1.0 · 0.6352
  9. openai/gpt-5 · 0.618
  10. GBOX · 0.5982
  11. KISS-1 · 0.5714
  12. brassbunny · 0.4902
  13. Anthropic computer use · 0.4783
  14. deepseek/deepseek-v3.2-exp · 0.4754
  15. AGI agent 0 · 0.4595
  16. meta-llama/llama-4-maverick · 0.4444
  17. Web_Agent_GPT-4o · 0.4348
  18. Claude 3.7 Sonnet:thinking · 0.4107
  19. Claude-Opus-4:Thinking · 0.4107
  20. Sonnet-4:thinking · 0.3929
  21. Gemini 2.5 pro · 0.3839
  22. gpt-4o-prsm · 0.3768
  23. Magellanes · 0.375
  24. Browser Use Claude 3.7 Sonnet:thinking · 0.3514
  25. o3 · 0.3482
  26. Claude 3.7 Sonnet · 0.3393
  27. openai/gpt-5-nano · 0.3382
  28. Browser Use GPT 4o · 0.3119
  29. GPT 4.1 · 0.2812
  30. Update Eval · 0.2667
  31. o3 mini · 0.25
  32. Stagehand Open Operator GPT 4o · 0.2018
  33. Deepseek chat v3 0324 · 0.1964
  34. o1 · 0.1607
  35. o1 mini · 0.1518
  36. GPT 4o · 0.1351
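
A ranking by mean response like the one above could be computed from long-format records along these lines. This is a sketch under assumptions: the `(subject, item, response)` record shape, the function name `rank_subjects`, and the toy records are all hypothetical, and only observed cells are assumed to appear in the input.

```python
from collections import defaultdict
from typing import Iterable


def rank_subjects(records: Iterable[tuple[str, str, int]]) -> list[tuple[str, float, int]]:
    """Rank subjects by mean response over the items they were evaluated on.

    Returns (subject, mean rounded to 4 decimals, number of observed items),
    highest mean first.
    """
    totals = defaultdict(lambda: [0, 0])  # subject -> [sum of responses, count]
    for subject, _item, response in records:
        totals[subject][0] += response
        totals[subject][1] += 1
    scored = [(s, round(total / n, 4), n) for s, (total, n) in totals.items()]
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Hypothetical toy records, not taken from the benchmark.
records = [
    ("agent_a", "t1", 1), ("agent_a", "t2", 1), ("agent_a", "t3", 0),
    ("agent_b", "t1", 1), ("agent_b", "t2", 0), ("agent_b", "t3", 0),
    ("agent_c", "t1", 1),
]

for row in rank_subjects(records):
    print(row)
# ('agent_c', 1.0, 1)
# ('agent_a', 0.6667, 3)
# ('agent_b', 0.3333, 3)
```

Returning the observed-item count next to the mean makes it visible when a score rests on only a few items, which matters when just part of the matrix is evaluated.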