Agents & Tool Use

REAL

REAL evaluates autonomous web agents on deterministic simulations of real websites. Tasks span high-fidelity website clones (Staynb, DashDish, ...), and each task has a deterministic success checker made up of one or more evals. The response is the per-attempt binary task success (1 if and only if all of the task's evals pass, 0 otherwise).
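The scoring rule above can be sketched in a few lines. This is only an illustration, not the benchmark's actual checker: the function name `task_success` and the representation of evals as a list of booleans are assumptions made here.

```python
from typing import Iterable


def task_success(eval_results: Iterable[bool]) -> int:
    """Collapse one attempt's eval outcomes into a binary response.

    Hypothetical helper: `eval_results` holds one boolean per eval
    attached to the task (True = that eval passed).
    """
    results = list(eval_results)
    if not results:
        # The description says each task has one or more evals,
        # so an empty list is treated as a caller error here.
        raise ValueError("a task needs at least one eval")
    # 1 only when every eval passed; a single failure gives 0.
    return int(all(results))


all_pass = task_success([True, True, True])
one_fail = task_success([True, False, True])
```

Under this rule there is no partial credit: an attempt that passes two of three evals scores the same as one that passes none.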

233 items
44 subjects
42% observed
License: unknown
Domain: gui_agent
Modalities: text, gui_screenshot


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 44 subjects × 233 items, 42% of cells evaluated.

[Figure: REAL response matrix, AI models (rows) against items (columns). Legend: Correct (1), Incorrect (0), Unobserved.]
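One plausible way to produce the ordering described above is to sort rows by each subject's mean over its observed cells and columns by each item's solve rate. The sketch below assumes that approach and a list-of-lists matrix with `None` for unobserved cells. The page does not say how the real matrix is sorted, and the function names and toy labels are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def _mean(cells: Sequence[Cell]) -> float:
    """Mean over observed cells; rows or columns with none sort last."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else -1.0


def observed_fraction(matrix: List[List[Cell]]) -> float:
    """Share of cells that were actually evaluated."""
    total = sum(len(row) for row in matrix)
    seen = sum(cell is not None for row in matrix for cell in row)
    return seen / total


def order_matrix(
    matrix: List[List[Cell]], rows: List[str], cols: List[str]
) -> Tuple[List[List[Cell]], List[str], List[str]]:
    """Put the strongest subjects first and the easiest items first."""
    row_order = sorted(range(len(rows)),
                       key=lambda r: _mean(matrix[r]), reverse=True)
    col_order = sorted(range(len(cols)),
                       key=lambda c: _mean([row[c] for row in matrix]),
                       reverse=True)
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, [rows[r] for r in row_order], [cols[c] for c in col_order]


# Toy data, not taken from the benchmark.
toy = [
    [0, None, 1],   # "weak"
    [1, 1, None],   # "strong"
    [None, 0, 1],   # "mid"
]
ordered, row_labels, col_labels = order_matrix(
    toy, ["weak", "strong", "mid"], ["item_a", "item_b", "item_c"]
)
coverage = observed_fraction(toy)
```

For the real matrix, 44 subjects × 233 items gives 10,252 cells, so 42% coverage corresponds to roughly 4,300 evaluated cells.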

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1: 0% solve rate

Show me places in Provence, France for 2 adults and 1 child for the dates August 1st-4th. I need wifi.

Subject outcomes

amazon/nova-act-v1.0 incorrect
anthropic-computer_use incorrect
deepseek/deepseek-v3.2-exp incorrect
x-ai/grok-4-fast incorrect
anthropic/claude-sonnet-4.5 incorrect
google/gemini-2.5-pro incorrect

Item 2: 0% solve rate

Compare all the stays listed by review quality and price and rank them by affordability. Please provide a ranked list showing each stay's name, rating, and price.

Subject outcomes

amazon/nova-act-v1.0 incorrect
GBOX incorrect
Test incorrect
Llama 4 maverick incorrect
o1 mini incorrect
Gemma 3 27b it incorrect

Item 3: 4% solve rate

Create a job post for a UI/UX Designer with expertise in Figma, Sketch, and Adobe Creative Suite, including project details, timeline, and required skills (Wireframing, Prototyping, Responsive Design).

Subject outcomes

AGI agent 0 correct
GBOX incorrect
amazon/nova-act-v1.0 incorrect
Llama 4 maverick incorrect
o1 mini incorrect
Gemma 3 27b it incorrect

Item 4: 13% solve rate

Find and book the stay with the best value for money (cheapest stay with the best reviews) for 1 day. For fields you don't know the answer for, just fill them in with anything of your choice.

Subject outcomes

amazon/nova-act-v1.0 correct
AGI agent 0 correct
Claude 3.7 Sonnet:thinking correct
Llama 4 maverick incorrect
o1 mini incorrect
Gemma 3 27b it incorrect

Item 5: 20% solve rate

Find a home in San Diego priced under $150,000 with at least 2 bedrooms and request a tour. Use these details: Contact Name: John Doe, Email: johndoe@example.com, Phone: 555-123-4567, Tour Time: 2:00 PM, Tour Date: First available.

Subject outcomes

amazon/nova-act-v1.0 correct
GBOX correct
KISS-1 correct
Llama 4 maverick incorrect
o1 mini incorrect
Gemma 3 27b it incorrect

Item 6: 28% solve rate

Filter the listings in San Jose to display only "Townhomes" within a price range of $750,000 to $1,000,000. How many results are displayed?

Subject outcomes

amazon/nova-act-v1.0 correct
GBOX correct
KISS-1 correct
GPT 4o incorrect
Llama 4 maverick incorrect
Gemma 3 27b it incorrect

Item 7: 40% solve rate

Mark the first email in the Inbox as "read".

Subject outcomes

GBOX correct
KISS-1 correct
Test correct
Llama 4 maverick incorrect
o1 mini incorrect
Gemma 3 27b it incorrect

Item 8: 50% solve rate

Show me places in San Francisco with wifi for the dates September 27-29th for 2 people.

Subject outcomes

amazon/nova-act-v1.0 correct
anthropic-computer_use correct
anthropic/claude-sonnet-4 correct
anthropic/claude-3.7-sonnet:thinking incorrect
openai/gpt-5 incorrect
google/gemini-2.5-flash incorrect

Item 9: 63% solve rate

Email Brian King and tell him to send me the meeting notes.

Subject outcomes

amazon/nova-act-v1.0 correct
anthropic-computer_use correct
anthropic/claude-sonnet-4.5 correct
openai/gpt-5-nano incorrect
anthropic/claude-sonnet-4 incorrect
google/gemini-2.5-flash incorrect

Item 10: 76% solve rate

List the top 5 most recently shared posts on the homepage.

Subject outcomes

amazon/nova-act-v1.0 correct
GBOX correct
KISS-1 correct
Qwen2.5 vl 32b instruct incorrect
Llama 3.1 8b instruct incorrect
Gemma 3 27b it incorrect

Item 11: 86% solve rate

Book me a ride from 1001 Castro Street to 1030 Post Street Apartments.

Subject outcomes

amazon/nova-act-v1.0 correct
anthropic-computer_use correct
anthropic/claude-sonnet-4 correct
google/gemini-2.5-flash correct
anthropic/claude-sonnet-4.5 correct
meta-llama/llama-4-maverick incorrect

Item 12: 100% solve rate

DashDish: Place an order for any type of sub-sandwich, keep the total under $30

Subject outcomes

amazon/nova-act-v1.0 correct
anthropic-computer_use correct
deepseek/deepseek-v3.2-exp correct
google/gemini-2.5-flash correct
openai/gpt-5 correct
anthropic/claude-sonnet-4.5 correct

Subjects

The models, agents, and reward models evaluated.

44 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject  Mean response
1  anthropic/claude-sonnet-4.5  0.7473
2  anthropic-computer_use  0.7273
3  anthropic/claude-3.7-sonnet:thinking  0.7179
4  google/gemini-2.5-pro  0.6782
5  anthropic/claude-sonnet-4  0.6667
6  google/gemini-2.5-flash  0.6604
7  x-ai/grok-4-fast  0.6429
8  amazon/nova-act-v1.0  0.6352
9  openai/gpt-5  0.618
10  GBOX  0.5982
11  KISS-1  0.5714
12  brassbunny  0.4902
13  Anthropic computer use  0.4783
14  deepseek/deepseek-v3.2-exp  0.4754
15  AGI agent 0  0.4595
16  meta-llama/llama-4-maverick  0.4444
17  Web_Agent_GPT-4o  0.4348
18  Claude 3.7 Sonnet:thinking  0.4107
19  Claude-Opus-4:Thinking  0.4107
20  Sonnet-4:thinking  0.3929
21  Gemini 2.5 pro  0.3839
22  gpt-4o-prsm  0.3768
23  Magellanes  0.375
24  Browser Use Claude 3.7 Sonnet:thinking  0.3514
25  o3  0.3482
26  Claude 3.7 Sonnet  0.3393
27  openai/gpt-5-nano  0.3382
28  Browser Use GPT 4o  0.3119
29  GPT 4.1  0.2812
30  Update Eval  0.2667
31  o3 mini  0.25
32  Stagehand Open Operator GPT 4o  0.2018
33  Deepseek chat v3 0324  0.1964
34  o1  0.1607
35  o1 mini  0.1518
36  GPT 4o  0.1351
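A ranking like the one above could be derived from per-attempt records roughly as follows. This is a sketch under two assumptions: the data is available as (subject, item, response) tuples, and each subject's mean is taken over the items it was actually evaluated on, since only 42% of cells are observed. The function name and the toy subject names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Record = Tuple[str, str, int]  # (subject, item id, response in {0, 1})


def rank_subjects(records: Iterable[Record]) -> List[Tuple[str, float]]:
    """Return (subject, mean response) pairs, best first."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [sum of responses, count]
    for subject, _item, response in records:
        totals[subject][0] += response
        totals[subject][1] += 1
    means = {s: total / n for s, (total, n) in totals.items()}
    # sorted() is stable, so tied subjects keep their first-seen order.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Toy records, not taken from the benchmark.
toy_records = [
    ("agent_a", "t1", 1), ("agent_a", "t2", 0),
    ("agent_b", "t1", 1), ("agent_b", "t2", 1), ("agent_b", "t3", 0),
]
ranking = rank_subjects(toy_records)
```

Because subjects are scored on different subsets of items, means computed this way are not strictly comparable: a subject evaluated mostly on easy items can rank above a stronger one evaluated on harder items.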
