Each row is an AI model (a subject) and each column an item, ordered so that the strongest models and easiest items gather toward one corner. The full matrix is 93 subjects × 4,133 items, with 100% of cells evaluated. The heatmap shows a representative 279 of the 4,133 items, evenly sampled across difficulty, so that each cell stays square and legible.
Legend: Correct (1) · Incorrect (0) · Unobserved
Scale: 1 = correct · 0 = incorrect
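The ordering and sampling described above can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: it assumes subjects are ranked by mean accuracy and items by solve rate, that the "corner" is the top-left, and that "evenly sampled" means evenly spaced positions along the difficulty-sorted item list. The function name `order_and_sample` and its parameters are hypothetical.

```python
def order_and_sample(responses, n_shown):
    """Sort a 0/1 response matrix and pick evenly spaced item columns.

    responses: list of rows (one per subject), each a list of 0/1 scores.
    n_shown:   number of item columns to keep for display.
    Assumption: ranking uses plain means; the real page may use another score.
    """
    n_subjects = len(responses)
    n_items = len(responses[0])

    # Mean accuracy per subject and solve rate per item.
    subject_acc = [sum(row) / n_items for row in responses]
    solve_rate = [sum(row[j] for row in responses) / n_subjects
                  for j in range(n_items)]

    # Strongest subjects first, easiest items first (stable sort keeps ties in input order).
    row_order = sorted(range(n_subjects), key=lambda i: -subject_acc[i])
    col_order = sorted(range(n_items), key=lambda j: -solve_rate[j])

    # Evenly spaced positions along the difficulty-sorted item list.
    k = min(n_shown, n_items)
    positions = [0] if k == 1 else [p * (n_items - 1) // (k - 1) for p in range(k)]
    shown_cols = [col_order[p] for p in positions]

    grid = [[responses[i][j] for j in shown_cols] for i in row_order]
    return grid, row_order, shown_cols


# Toy example: 3 subjects x 3 items, showing 2 items.
toy = [[0, 0, 1],
       [1, 1, 1],
       [0, 1, 1]]
grid, row_order, shown_cols = order_and_sample(toy, 2)
```

With the real matrix, the same call would take the 93 × 4,133 grid and `n_shown=279`; the per-item solve rate computed here is the same quantity reported next to each sample item below.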
Sample items
What the questions look like — and how subjects answer.
A spread of items across the difficulty range, each shown with a few subjects' actual answers.
Item 1 · 0% solve rate
[user] As a high-ranking government official residing in Rivermist, I require assistance in determining the cost of first-class travel from my local airport to JFK International Airport on the 15th of December, 2024. I need the most competitive price available.
[user] After ascertaining the flight costs, secure a booking for me utilizing my secured access and charge it to my card with id card_4526 and access code secure_access_token_987654321, ensuring all the details align perfectly with my professional profile.
[user] Please retrieve the invoice from the flight booking recently made; I need to verify that all entries are correct before my departure.
[user] There has been a problem with my booking and I previously reached out to support without any feedback yet. Kindly contact customer support on my behalf, emphasizing the smooth facilitation of my travel arrangements.
[user] Due to unforeseen changes in my schedule, I am contemplating cancelling my flight. Initiate the cancellation and ensure this procedure is executed faultlessly and promptly.
[user] Lastly, draft and publish a tw …
How subjects answered
BitAgent_GoGoAgent incorrect
Error during inference: 'NoneType' object is not subscriptable
gpt-4o-mini-2024-07-18 incorrect
[["[get_nearest_airport_by_city(location='Rivermist')]", "[get_flight_cost(travel_from='RMS', travel_to='JFK', travel_date='2024-12-15', travel_class='first')]", "The most competitive price for first-class travel from Rivermist (RMS) to JFK International Airport on the 15th of December, 2024, is $1050.00."], ["[get_nearest_airport_by_city(location='Rivermist')]", "[get_flight_cost(travel_from='RMS', travel_to='JFK', travel_date='2024-12-15', travel_class='first')]", "The most competitive price for first-class travel from Rivermist (RMS) to JFK International Airport on the 15th of December, 202 …
[[[{"get_nearest_airport_by_city": {"location": "Rivermist"}}], "The nearest airport to your location in Rivermist is RMS. Now, I will proceed to find the most competitive price for first-class travel from RMS to JFK International Airport on the 15th of December, 2024. Please wait a moment."], [[{"get_flight_cost": {"travel_from": "RMS", "travel_to": "JFK", "travel_date": "2024-12-15", "travel_class": "first"}}], "The most competitive price for first-class travel from RMS to JFK International Airport on the 15th of December, 2024, is $1050.00."], [[{"book_flight": {"access_token": "secure_acce …
[user] I'm planning a trip to Washington D.C. and need to reserve a room for my stay. I'd like to check in on April 21, 2023, and will be staying for 3 nights. Find me a place?
I'll help you find a hotel in Washington D.C. First, I'll search for available hotels in the area.
[Hotels_4_SearchHotel(location='Washington, DC')]
Now, I'll proceed to reserve a hotel for your specified dates.
[Hotels_4_ReserveHotel(place_name='Washington Marriott at Metro Center', location='Washington, DC', check_in_date='2023-04-21', stay_length=3)]
Item 3 · 68% solve rate
[user] Can you find the address for the coordinates 40.748817, -73.985428 using the Geocoding API, and ensure the response is in geojson format? I'll be using my key 'YOUR-GEOCODE-API-KEY' for this request.
[user] I have a client who's planning a trip to Paris and they're looking for some detailed travel plans. Could we find out the exact latitude and longitude of Paris for this purpose? They're really into the specifics and would appreciate having the coordinates for their personal itinerary.