Agents & Tool Use

AgentBoard

AgentBoard is an analytical evaluation of multi-turn LLM agents across 9 interactive environments (alfworld, scienceworld, babyai, jericho, pddl, webarena, webshop, tool-query, tool-operation). Each item is a task instance; the response is the per-example success_rate (1 if the agent completed the full task, 0 otherwise).

Items: 867
Subjects: 13
Observed: 95%
License: GPL-3.0
Domain: agents
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 13 subjects × 867 items, 95% of cells evaluated.

[Figure: AgentBoard response matrix, AI models (rows) against items (columns). Each cell is correct (1), incorrect (0), or unobserved.]
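The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: it assumes rows and columns are sorted by their mean over observed cells, and the names `observed_mean` and `order_matrix`, like the toy data, are hypothetical.

```python
from typing import Optional

# 1 = correct, 0 = incorrect, None = unobserved.
Matrix = list[list[Optional[int]]]


def observed_mean(values: list[Optional[int]]) -> float:
    """Mean over observed (non-None) cells; 0.0 if nothing was observed (an assumption)."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else 0.0


def order_matrix(subjects: list[str], matrix: Matrix):
    """Sort rows by descending subject mean and columns by descending item mean,
    so strong subjects and easy items gather in the top-left corner."""
    n_items = len(matrix[0])
    row_order = sorted(range(len(matrix)), key=lambda r: -observed_mean(matrix[r]))
    col_order = sorted(
        range(n_items),
        key=lambda c: -observed_mean([row[c] for row in matrix]),
    )
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return [subjects[r] for r in row_order], col_order, ordered


# Hypothetical toy data, not AgentBoard results.
toy_subjects = ["a", "b", "c"]
toy_matrix = [
    [0, 0, 1],
    [1, None, 1],
    [0, 1, 1],
]
sorted_subjects, sorted_items, sorted_matrix = order_matrix(toy_subjects, toy_matrix)
```

Sorting on observed cells only keeps a subject from being penalized for items it was never run on, which matters when 5% of the matrix is unobserved.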

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate

Find the page of the university that has most Turning Award winners on the map.

Subject outcomes

  • claude2 incorrect
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 2 · 0% solve rate

Preview the Magento Blank theme for my shop

Subject outcomes

  • claude2 incorrect
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 3 · 0% solve rate

Post a notice on a virtual meetup for book reading enthusiasts on March 15th in the books subreddit

Subject outcomes

  • claude2 incorrect
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 4 · 0% solve rate

Your task is to use chemistry to create yellow-green paint. You should create green paint in a jug in the art studio firstly. Then you should create yellow-green paint in a jug in the art studio. The objects you can use are a jug, wood cups in the art studio.

Subject outcomes

  • claude2 incorrect
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 5 · 0% solve rate

i need a french vanilla soy wax candle, and price lower than 50.00 dollars

Subject outcomes

  • claude2 incorrect
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 6 · 0% solve rate

i want long lasting wrangler mens smoke storm cowboy cut jeans, and price lower than 70.00 dollars

Subject outcomes

  • claude2 incorrect
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 7 · 8% solve rate

Your task is to find a counter in the kitchen.. First, focus on the thing. Then, move it to the purple box in the kitchen.

Subject outcomes

  • gpt-4 correct
  • codellama-13b incorrect
  • codellama-34b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 8 · 8% solve rate

i'm looking for a pair of water resistant brown pants, and price lower than 40.00 dollars

Subject outcomes

  • text-davinci-003 correct
  • claude2 incorrect
  • codellama-34b incorrect
  • llama2-70b incorrect
  • mistral-7b incorrect
  • vicuna-13b-16k incorrect
Item 9 · 15% solve rate

put a cool mug in coffeemachine.

Subject outcomes

  • deepseek-67b correct
  • text-davinci-003 correct
  • claude2 incorrect
  • llama2-70b incorrect
  • mistral-7b incorrect
  • vicuna-13b-16k incorrect
Item 10 · 19% solve rate

pick up the blue key

Subject outcomes

  • claude2 correct
  • gpt-4 correct
  • codellama-13b incorrect
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 11 · 33% solve rate

i am looking for a short sleeve top for a teenage girl. it should in xx-large size, and price lower than 40.00 dollars

Subject outcomes

  • deepseek-67b correct
  • llama2-70b correct
  • gpt-35-turbo-16k correct
  • mistral-7b incorrect
  • text-davinci-003 incorrect
  • vicuna-13b-16k incorrect
Item 12 · 58% solve rate

I'm going to Ulaanbaatar on a business trip. It is said that it is on a plateau. How much higher is the elevation there than here? Please give me a number as an answer.

Subject outcomes

  • claude2 correct
  • codellama-13b correct
  • deepseek-67b correct
  • llama2-13b incorrect
  • llama2-70b incorrect
  • vicuna-13b-16k incorrect

Subjects

The models, agents, and reward models evaluated.

13 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gpt-4: 0.4071
  2. claude2: 0.2619
  3. gpt-35-turbo: 0.2144
  4. deepseek-67b: 0.2099
  5. text-davinci-003: 0.1571
  6. gpt-35-turbo-16k: 0.1542
  7. codellama-13b: 0.1147
  8. codellama-34b: 0.0998
  9. lemur-70b: 0.0751
  10. vicuna-13b-16k: 0.0741
  11. llama2-70b: 0.07
  12. mistral-7b: 0.0619
  13. llama2-13b: 0.0158
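A ranking like the one above could be produced from per-item responses roughly as follows. This is a sketch under stated assumptions: the mean is taken over observed items only, scores are rounded to four decimals to match the list, and the function name `rank_subjects` and the toy responses are hypothetical rather than taken from AgentBoard.

```python
from typing import Optional


def rank_subjects(
    responses: dict[str, list[Optional[int]]],
) -> list[tuple[int, str, float]]:
    """Return (rank, subject, mean response) tuples, best subject first."""
    means: dict[str, float] = {}
    for subject, cells in responses.items():
        observed = [c for c in cells if c is not None]
        # Assumption: a subject with no observed cells scores 0.0.
        means[subject] = sum(observed) / len(observed) if observed else 0.0
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return [(i + 1, name, round(mean, 4)) for i, (name, mean) in enumerate(ordered)]


# Hypothetical toy responses: 1 = correct, 0 = incorrect, None = unobserved.
toy = {
    "model-a": [1, 0, None, 1],
    "model-b": [0, 0, 0, 1],
    "model-c": [1, 1, 1, None],
}
leaderboard = rank_subjects(toy)
for rank, name, mean in leaderboard:
    print(f"{rank}. {name}: {mean}")
```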