Find the page of the university that has most Turning Award winners on the map.
Subject outcomes
- claude2 incorrect
- codellama-13b incorrect
- codellama-34b incorrect
- mistral-7b incorrect
- text-davinci-003 incorrect
- vicuna-13b-16k incorrect
Agents & Tool Use
AgentBoard: an analytical evaluation of multi-turn LLM agents across 9 interactive environments (alfworld, scienceworld, babyai, jericho, pddl, webarena, webshop, tool-query, tool-operation). Each item is a task instance; the response is the per-example success_rate (1 if the agent completed the full task, 0 otherwise).
Response matrix
Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 13 subjects × 867 items, 95% of cells evaluated.
Scale: 1 = correct · 0 = incorrect
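The ordering described above can be sketched as follows. The page does not state the exact sorting rule, so this assumes rows are sorted by each subject's mean response and columns by each item's pass rate, both computed over evaluated cells only. The subject names, item IDs, and values are made up for illustration.

```python
from typing import Optional

# Toy response matrix (hypothetical values): rows are subjects, columns are items.
# 1 = correct, 0 = incorrect, None = cell not evaluated.
Matrix = dict[str, list[Optional[int]]]


def mean_ignoring_missing(values: list[Optional[int]]) -> float:
    """Mean over evaluated cells only; 0.0 if no cell was evaluated."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else 0.0


def order_matrix(matrix: Matrix, item_ids: list[str]) -> tuple[Matrix, list[str]]:
    """Sort rows by subject accuracy and columns by item pass rate, both descending."""
    row_order = sorted(
        matrix, key=lambda s: mean_ignoring_missing(matrix[s]), reverse=True
    )
    col_scores = [
        mean_ignoring_missing([matrix[s][j] for s in matrix])
        for j in range(len(item_ids))
    ]
    col_order = sorted(range(len(item_ids)), key=lambda j: col_scores[j], reverse=True)
    ordered = {s: [matrix[s][j] for j in col_order] for s in row_order}
    return ordered, [item_ids[j] for j in col_order]


def coverage(matrix: Matrix) -> float:
    """Fraction of cells that were evaluated (the page reports 95%)."""
    cells = [v for row in matrix.values() for v in row]
    return sum(v is not None for v in cells) / len(cells)


toy: Matrix = {
    "model_a": [0, 1, None, 0],
    "model_b": [1, 1, 1, 0],
    "model_c": [0, 1, 0, 0],
}
items = ["item_1", "item_2", "item_3", "item_4"]
ordered, ordered_items = order_matrix(toy, items)
```

With this ordering, the strongest subject and the easiest item end up in the top-left corner, and unevaluated cells are excluded from both means rather than counted as failures.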
Sample items
A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
Find the page of the university that has most Turning Award winners on the map.
Preview the Magento Blank theme for my shop
Post a notice on a virtual meetup for book reading enthusiasts on March 15th in the books subreddit
Your task is to use chemistry to create yellow-green paint. You should create green paint in a jug in the art studio firstly. Then you should create yellow-green paint in a jug in the art studio. The objects you can use are a jug, wood cups in the art studio.
i need a french vanilla soy wax candle, and price lower than 50.00 dollars
i want long lasting wrangler mens smoke storm cowboy cut jeans, and price lower than 70.00 dollars
Your task is to find a counter in the kitchen.. First, focus on the thing. Then, move it to the purple box in the kitchen.
i'm looking for a pair of water resistant brown pants, and price lower than 40.00 dollars
put a cool mug in coffeemachine.
pick up the blue key
i am looking for a short sleeve top for a teenage girl. it should in xx-large size, and price lower than 40.00 dollars
I'm going to Ulaanbaatar on a business trip. It is said that it is on a plateau. How much higher is the elevation there than here? Please give me a number as an answer.
Subjects
13 subjects, ranked by mean response (accuracy) across this benchmark's items.
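As a rough illustration of this ranking, the sketch below computes each subject's mean response from long-format (subject, item, response) records and sorts in descending order. The records are hypothetical, and breaking ties by subject name is an assumption; the page does not say how ties or unevaluated cells are handled.

```python
from collections import defaultdict


def rank_subjects(records):
    """Rank subjects by mean response over the items each one was evaluated on."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [sum of responses, count]
    for subject, _item, response in records:
        totals[subject][0] += response
        totals[subject][1] += 1
    means = {s: total / n for s, (total, n) in totals.items()}
    # Highest mean first; ties broken alphabetically (an assumption).
    return sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical records: (subject, item, response), with response in {0, 1}.
records = [
    ("subject_a", "item_1", 1),
    ("subject_a", "item_2", 0),
    ("subject_b", "item_1", 1),
    ("subject_b", "item_2", 1),
    ("subject_c", "item_1", 0),
]
ranking = rank_subjects(records)
```

Here `subject_c` is scored on a single item because only evaluated cells appear in the records, so its mean is taken over fewer items than the others.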