Agents & Tool Use

AppWorld

AppWorld is an interactive coding-agent benchmark. Per-(agent, task) pass/fail outcomes are decrypted from the public leaderboard's experiment bundles; the per-task natural-language instructions come from the released task data.

Items: 521
Subjects: 8
Observed: 100%
Subject type: Model
License: Apache-2.0
Domains: gui_agent, tool_use
Modalities: text, gui_screenshot

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 8 subjects × 521 items, 100% of cells evaluated.
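
The ordering described above can be sketched as a sort on row and column means. This is a minimal illustration on toy data, not the page's actual implementation: the function name `order_response_matrix`, the toy subjects and items, and the tie-breaking rule (ties keep their input order) are all assumptions.

```python
def order_response_matrix(matrix, subjects, items):
    """Sort rows by descending accuracy and columns by descending solve rate.

    matrix[r][c] is 1 if subject r solved item c, else 0.
    Ties keep their original relative order (sorted() is stable).
    """
    n_rows, n_cols = len(matrix), len(items)
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                for c in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda r: -row_mean[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_mean[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return (ordered,
            [subjects[r] for r in row_order],
            [items[c] for c in col_order])


# Hypothetical toy data: 3 subjects x 4 items.
toy_subjects = ["model_a", "model_b", "model_c"]
toy_items = ["t1", "t2", "t3", "t4"]
toy_matrix = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]

ordered, ordered_subjects, ordered_items = order_response_matrix(
    toy_matrix, toy_subjects, toy_items
)
for name, row in zip(ordered_subjects, ordered):
    print(name, row)
```

After sorting, the 1s collect in the top-left corner: the strongest subject is the first row and the easiest item is the first column.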

[Figure: AppWorld response matrix, AI models (rows) against items (columns). Legend: 1 = correct, 0 = incorrect; unobserved cells are marked separately.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
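
As a rough sketch, an item's solve rate is the fraction of recorded outcomes that are correct, and a spread across the difficulty range can be drawn by taking evenly spaced positions in the hardest-to-easiest ordering. The page does not say how its samples were chosen or what denominator its solve rates use, so the helpers `solve_rates` and `pick_spread` and the toy outcomes below are assumptions for illustration only.

```python
def solve_rates(outcomes):
    """Map each item id to the fraction of its 0/1 outcomes that are 1."""
    return {item: sum(results) / len(results)
            for item, results in outcomes.items()}


def pick_spread(rates, k):
    """Pick k items evenly spaced along the hardest-to-easiest ordering."""
    ranked = sorted(rates, key=rates.get)  # lowest solve rate first
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[0]]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Hypothetical outcomes: item id -> one 0/1 result per evaluated run.
toy_outcomes = {
    "a": [0, 0, 0, 0],
    "b": [1, 0, 0, 0],
    "c": [1, 1, 0, 0],
    "d": [1, 1, 1, 0],
    "e": [1, 1, 1, 1],
}

toy_rates = solve_rates(toy_outcomes)
toy_sample = pick_spread(toy_rates, 3)
print(toy_sample)
```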

Item 1 (0% solve rate)

I maintain my work schedule in SimpleNote and track my tasks in Todoist. Every night, I delete the completed tasks from my "Today" project. Then, I move the maximum number of incomplete tasks from my Inbox to the "Today" project. The maximum here is assuming I work back-to-back as per my schedule and I find time for the left overs from the current day first. I am busy tonight, please do it for me. Note that the moved tasks must be identical to the original ones.

Subject outcomes

  • Qwen3-14B incorrect
  • gpt-4.1-2025-04-14 incorrect
  • deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 2 (6% solve rate)

My brother has sent me a few childhood photos in a zip over email. Add all photos to it from "~/photographs/childhood/" from my file system and send a new email to both my parents with the subject "Good Old Times", the body "See attached." and the updated zip attachment.

Subject outcomes

  • Qwen3-14B correct
  • gpt-4.1-2025-04-14 incorrect
  • deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 3 (11% solve rate)

My last 4 amazon orders were for my friend, Valerie. They have sent me the money for their total cost on venmo today. Check if it tallies up. If not, send or request the difference on venmo with a note "Adjusting amazon payment.".

Subject outcomes

  • gpt-4.1-2025-04-14 correct
  • gpt-4o-2024-08-06 correct
  • deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 4 (28% solve rate)

Label all email threads in my Gmail inbox/outbox involving any of my coworkers as "work" and involving any of my friends, partner, siblings or parents as "personal".

Subject outcomes

  • Qwen3-14B correct
  • gpt-4o-2024-05-13 correct
  • deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 5 (44% solve rate)

I jotted down some songs in Simple Note recently. Make a playlist titled "Songs from Simple Note" out of it.

Subject outcomes

  • Qwen3-14B correct
  • Qwen-2.5-32B-Instruct correct
  • deepseek-ai/deepseek-coder-33b-instruct incorrect

Item 6 (83% solve rate)

How much money have I sent to others on venmo this month so far?

Subject outcomes

  • Qwen3-14B correct
  • gpt-4.1-2025-04-14 correct
  • gpt-4o-2024-05-13 incorrect

Subjects

The models, agents, and reward models evaluated.

8 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. Qwen3-14B: 0.732
  2. gpt-4.1-2025-04-14: 0.621
  3. Qwen-2.5-32B-Instruct: 0.545
  4. gpt-4o-2024-08-06: 0.474
  5. gpt-4o-2024-05-13: 0.27
  6. gpt-4-turbo-2024-04-09: 0.182
  7. meta-llama/Llama-3-70b-chat-hf: 0.082
  8. deepseek-ai/deepseek-coder-33b-instruct: 0.043
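
A ranking by mean response can be sketched as follows: average each subject's 0/1 outcomes over all items, round, and sort in descending order. The function name `rank_subjects`, the three-decimal rounding, and the toy responses are assumptions for illustration; the scores above come from the page, not from this code.

```python
def rank_subjects(responses, digits=3):
    """Return (subject, mean response) pairs, highest mean first.

    responses maps a subject name to its list of 0/1 item outcomes.
    """
    means = {name: round(sum(r) / len(r), digits)
             for name, r in responses.items()}
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical responses: 3 subjects x 4 items.
toy_responses = {
    "model_a": [1, 0, 1, 1],
    "model_b": [0, 0, 1, 0],
    "model_c": [1, 1, 1, 1],
}

toy_ranking = rank_subjects(toy_responses)
for rank, (name, mean) in enumerate(toy_ranking, start=1):
    print(rank, name, mean)
```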