Agents & Tool Use

AppWorld

AppWorld is an interactive coding agent benchmark. Per-(agent, task) pass/fail outcomes were decrypted from the public leaderboard's experiment bundles; the per-task natural-language instructions come from the released task data.

Items: 521

Subjects: 8

Observed: 100%

Subject type: Model

License: Apache-2.0

Domains: gui_agent, tool_use

Modalities: text, gui_screenshot


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 8 subjects × 521 items, 100% of cells evaluated.

[Figure: AppWorld response matrix, AI models (rows) against items (columns). Scale: 1 = correct, 0 = incorrect; unobserved cells are marked separately.]
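The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, assuming rows are sorted by each subject's mean score and columns by each item's solve rate; the page does not state the exact ordering rule, and the subject names and 0/1 values below are made up for the example, not taken from the benchmark.

```python
from typing import Dict, List, Tuple


def order_response_matrix(
    matrix: Dict[str, List[int]],
) -> Tuple[List[str], List[int], List[List[int]]]:
    """Sort rows by descending accuracy and columns by descending solve rate."""
    subjects = list(matrix)
    n_items = len(next(iter(matrix.values())))
    # Row score: mean of the subject's 0/1 outcomes.
    row_mean = {s: sum(matrix[s]) / n_items for s in subjects}
    # Column score: fraction of subjects that solved the item.
    col_mean = [
        sum(matrix[s][j] for s in subjects) / len(subjects)
        for j in range(n_items)
    ]
    row_order = sorted(subjects, key=lambda s: -row_mean[s])
    col_order = sorted(range(n_items), key=lambda j: -col_mean[j])
    ordered = [[matrix[s][j] for j in col_order] for s in row_order]
    return row_order, col_order, ordered


# Hypothetical toy data: 3 subjects x 4 items.
toy = {
    "model_a": [0, 1, 0, 0],
    "model_b": [1, 1, 0, 1],
    "model_c": [0, 1, 0, 1],
}
rows, cols, ordered = order_response_matrix(toy)
for name, row in zip(rows, ordered):
    print(name, row)
```

With this ordering, the 1s gather toward the top-left corner of the printed matrix, which is the visual pattern the figure relies on.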

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 (0% solve rate)

I maintain my work schedule in SimpleNote and track my tasks in Todoist. Every night, I delete the completed tasks from my "Today" project. Then, I move the maximum number of incomplete tasks from my Inbox to the "Today" project. The maximum here is assuming I work back-to-back as per my schedule and I find time for the left overs from the current day first. I am busy tonight, please do it for me. Note that the moved tasks must be identical to the original ones.

Subject outcomes

Qwen3-14B incorrect
gpt-4.1-2025-04-14 incorrect
deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 2 (6% solve rate)

My brother has sent me a few childhood photos in a zip over email. Add all photos to it from "~/photographs/childhood/" from my file system and send a new email to both my parents with the subject "Good Old Times", the body "See attached." and the updated zip attachment.

Subject outcomes

Qwen3-14B correct
gpt-4.1-2025-04-14 incorrect
deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 3 (11% solve rate)

My last 4 amazon orders were for my friend, Valerie. They have sent me the money for their total cost on venmo today. Check if it tallies up. If not, send or request the difference on venmo with a note "Adjusting amazon payment.".

Subject outcomes

gpt-4.1-2025-04-14 correct
gpt-4o-2024-08-06 correct
deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 4 (28% solve rate)

Label all email threads in my Gmail inbox/outbox involving any of my coworkers as "work" and involving any of my friends, partner, siblings or parents as "personal".

Subject outcomes

Qwen3-14B correct
gpt-4o-2024-05-13 correct
deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 5 (44% solve rate)

I jotted down some songs in Simple Note recently. Make a playlist titled "Songs from Simple Note" out of it.

Subject outcomes

Qwen3-14B correct
Qwen-2.5-32B-Instruct correct
deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 6 (83% solve rate)

How much money have I sent to others on venmo this month so far?

Subject outcomes

Qwen3-14B correct
gpt-4.1-2025-04-14 correct
gpt-4o-2024-05-13 incorrect

Subjects

The models, agents, and reward models evaluated.

8 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. Qwen3-14B: 0.732
2. gpt-4.1-2025-04-14: 0.621
3. Qwen-2.5-32B-Instruct: 0.545
4. gpt-4o-2024-08-06: 0.474
5. gpt-4o-2024-05-13: 0.27
6. gpt-4-turbo-2024-04-09: 0.182
7. meta-llama/Llama-3-70b-chat-hf: 0.082
8. deepseek-ai/deepseek-coder-33b-instruct: 0.043
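
A ranking of this kind can be reproduced from a response matrix with a short Python sketch. It assumes "mean response" is the plain average of a subject's 0/1 outcomes over its evaluated items; the page does not spell out the aggregation, and the subject names and outcomes below are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_subjects(responses: Dict[str, List[int]]) -> List[Tuple[int, str, float]]:
    """Return (rank, subject, mean response) tuples sorted from best to worst."""
    # Mean response: share of items the subject got correct.
    means = {name: sum(r) / len(r) for name, r in responses.items()}
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return [(i + 1, name, round(mean, 3)) for i, (name, mean) in enumerate(ordered)]


# Hypothetical toy data, not benchmark results.
toy_responses = {
    "agent_x": [1, 0, 1, 1],
    "agent_y": [0, 0, 1, 0],
    "agent_z": [1, 1, 1, 1],
}
ranking = rank_subjects(toy_responses)
for rank, name, mean in ranking:
    print(f"{rank}. {name}: {mean}")
```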
