Agents & Tool Use

BFCL

Build a per-task binary response matrix and extract per-task error types from BFCL score files

4,133 items
93 subjects
100% observed
Subject type: Model
License: Apache-2.0
Domain: tool_use
Modality: text
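
As a rough illustration of how a per-task binary response matrix and per-task error types might be derived from score files, here is a minimal sketch. It assumes each score file is JSON Lines with a summary record first and then one record per failed task carrying `id` and `error_type`; that layout, the task ids, and the error label below are assumptions for illustration, not confirmed details of this dataset.

```python
import json

def failures_from_score_lines(lines):
    """Map task id -> error type for every failed task in one score file."""
    failed = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: the summary record has no "id"; only failures are listed.
        if "id" not in record:
            continue
        failed[record["id"]] = record.get("error_type", "unknown")
    return failed

def build_response_matrix(task_ids, failures_by_model):
    """Return {model: [1 or 0 per task]}, with 1 = correct and 0 = incorrect."""
    return {
        model: [0 if task in failed else 1 for task in task_ids]
        for model, failed in failures_by_model.items()
    }

# Hypothetical two-task score file for one hypothetical model.
score_lines = [
    '{"accuracy": 0.5, "correct_count": 1, "total_count": 2}',
    '{"id": "simple_1", "valid": false, "error_type": "value_error:others"}',
]
failed = failures_from_score_lines(score_lines)
matrix = build_response_matrix(["simple_0", "simple_1"], {"model_a": failed})
print(failed)
print(matrix)
```

Treating every task that is absent from the failure records as correct only works when the full list of task ids is known up front, which is why `task_ids` is passed in explicitly.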

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 93 subjects × 4,133 items, 100% of cells evaluated. The heatmap shows a representative 279 of 4,133 items — evenly sampled across difficulty — so each cell stays square and legible.
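
The ordering and sampling described above can be sketched as follows. This is one plausible reading (sort rows and columns by descending mean, then take evenly spaced positions along the sorted item axis), not necessarily the procedure used to draw the heatmap; the function names and the toy matrix are hypothetical.

```python
def order_matrix(matrix):
    """matrix: list of rows (subjects), each a list of 0/1 values per item.

    Returns (row_order, col_order): indices sorted by descending mean, so the
    strongest subjects and the easiest items come first.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda i: -row_mean[i])
    col_order = sorted(range(n_cols), key=lambda j: -col_mean[j])
    return row_order, col_order

def sample_evenly(ordered, k):
    """Pick k entries at evenly spaced positions from an already ordered list."""
    if k >= len(ordered):
        return list(ordered)
    if k == 1:
        return [ordered[0]]
    step = (len(ordered) - 1) / (k - 1)
    return [ordered[round(i * step)] for i in range(k)]

# Toy 3 x 3 matrix (hypothetical data).
toy = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
]
row_order, col_order = order_matrix(toy)
shown = sample_evenly(list(range(10)), 4)
print(row_order, col_order, shown)
```

For the real matrix the same call would be `sample_evenly(col_order, 279)` over the 4,133 ordered items.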

BFCL response matrix: AI models (rows) against items (columns)
Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate
[user] As a high-ranking government official residing in Rivermist, I require assistance in determining the cost of first-class travel from my local airport to JFK International Airport on the 15th of December, 2024. I need the most competitive price available.
[user] After ascertaining the flight costs, secure a booking for me utilizing my secured access and charge it to my card with id card_4526 and access code secure_access_token_987654321, ensuring all the details align perfectly with my professional profile.
[user] Please retrieve the invoice from the flight booking recently made; I need to verify that all entries are correct before my departure.
[user] There has been a problem with my booking and I previously reached out to support without any feedback yet. Kindly contact customer support on my behalf, emphasizing the smooth facilitation of my travel arrangements.
[user] Due to unforeseen changes in my schedule, I am contemplating cancelling my flight. Initiate the cancellation and ensure this procedure is executed faultlessly and promptly.
[user] Lastly, draft and publish a tw …

How subjects answered

  • BitAgent_GoGoAgent incorrect

    Error during inference: 'NoneType' object is not subscriptable

  • gpt-4o-mini-2024-07-18 incorrect

    [["[get_nearest_airport_by_city(location='Rivermist')]", "[get_flight_cost(travel_from='RMS', travel_to='JFK', travel_date='2024-12-15', travel_class='first')]", "The most competitive price for first-class travel from Rivermist (RMS) to JFK International Airport on the 15th of December, 2024, is $1050.00."], ["[get_nearest_airport_by_city(location='Rivermist')]", "[get_flight_cost(travel_from='RMS', travel_to='JFK', travel_date='2024-12-15', travel_class='first')]", "The most competitive price for first-class travel from Rivermist (RMS) to JFK International Airport on the 15th of December, 202 …

  • deepseek-ai_DeepSeek-Coder-V2-Instruct-0724 incorrect

    [[[{"get_nearest_airport_by_city": {"location": "Rivermist"}}], "The nearest airport to your location in Rivermist is RMS. Now, I will proceed to find the most competitive price for first-class travel from RMS to JFK International Airport on the 15th of December, 2024. Please wait a moment."], [[{"get_flight_cost": {"travel_from": "RMS", "travel_to": "JFK", "travel_date": "2024-12-15", "travel_class": "first"}}], "The most competitive price for first-class travel from RMS to JFK International Airport on the 15th of December, 2024, is $1050.00."], [[{"book_flight": {"access_token": "secure_acce …

Item 2 · 32% solve rate · answer: [{"Hotels_4_SearchHotel": {"location": ["Washington D.C."], "star_rating": ["", "dontcare"], "smoking_allowed": ["", false], "number_of_rooms": ["", "dontcare"]}}]
[user] I'm planning a trip to Washington D.C. and need to reserve a room for my stay. I'd like to check in on April 21, 2023, and will be staying for 3 nights. Find me a place?

How subjects answered

  • watt-ai_watt-tool-8B correct

    [Hotels_4_SearchHotel(location="Washington D.C.")]

  • gemini-1.5-pro-001 correct

    [Hotels_4_SearchHotel(location='Washington, DC')]

  • claude-3-5-haiku-20241022 incorrect

    I'll help you find a hotel in Washington D.C. First, I'll search for available hotels in the area. [Hotels_4_SearchHotel(location='Washington, DC')] Now, I'll proceed to reserve a hotel for your specified dates. [Hotels_4_ReserveHotel(place_name='Washington Marriott at Metro Center', location='Washington, DC', check_in_date='2023-04-21', stay_length=3)]

Item 3 · 68% solve rate
[user] Can you find the address for the coordinates 40.748817, -73.985428 using the Geocoding API, and ensure the response is in geojson format? I'll be using my key 'YOUR-GEOCODE-API-KEY' for this request.

How subjects answered

  • BitAgent_GoGoAgent correct

    [requests.get(url="https://geocode.maps.co/reverse", params={"lat": 40.748817, "lon": -73.985428, "api_key": "REDACTED_GEOCODE_API_KEY", "format": "geojson"})]

  • meta-llama_Llama-3.1-8B-Instruct correct

    requests.get(url='https://geocode.maps.co/reverse', params={'lat': 40.748817, 'lon': -73.985428, 'api_key': 'REDACTED_GEOCODE_API_KEY', 'format': 'geojson'})

  • gemini-2.0-flash-exp-FC incorrect

    [{"requests_get": {"url": "https://geocode.maps.co/reverse", "params": {"format": "geojson", "lon": -73.985428, "lat": 40.748817, "api_key": "YOUR-GEOCODE-API-KEY"}}}]

Item 4 · 82% solve rate · answer: [{"law_crimes.search": {"crime": ["money laundering"], "location": ["San Francisco", "SF"], "year": [2019]}}, {"law_crimes.search": {"crime": ["money laundering"], "location": ["Texas", "TX"], "year":
[user] Show me all individuals who were convicted for money laundering from San Francisco in 2019 and ones convicted for the same in Texas in 2018

How subjects answered

  • BitAgent_GoGoAgent correct

    [law_crimes.search(crime="money laundering", location="San Francisco", year=2019), law_crimes.search(crime="money laundering", location="Texas", year=2018)]

  • gpt-3.5-turbo-0125-FC correct

    [{"law_crimes_search": "{\"crime\": \"money laundering\", \"location\": \"San Francisco\", \"year\": 2019}"}, {"law_crimes_search": "{\"crime\": \"money laundering\", \"location\": \"Texas\", \"year\": 2018}"}]

  • claude-3-5-haiku-20241022-FC incorrect

    [{"law_crimes_search": "{\"crime\": \"money laundering\", \"location\": \"San Francisco\", \"year\": 2019}"}]
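
The `-FC` answers in this item carry each call's arguments as a JSON-encoded string and write the function name with an underscore where the answer key uses a dot. A minimal sketch of decoding that shape is shown below; the name comparison (dots replaced by underscores) is an assumption inferred from the examples here, and both helper names are hypothetical.

```python
import json

def decode_fc_calls(text):
    """Turn '[{"name": "<json-encoded args>"}]' into [{name: {param: value}}]."""
    calls = []
    for entry in json.loads(text):
        (name, args), = entry.items()
        # Arguments may arrive as a JSON string or as an already decoded object.
        if isinstance(args, str):
            args = json.loads(args)
        calls.append({name: args})
    return calls

def same_function(fc_name, reference_name):
    """Assumed normalization: dots in the reference name become underscores."""
    return fc_name == reference_name.replace(".", "_")

raw = r'[{"law_crimes_search": "{\"crime\": \"money laundering\", \"location\": \"San Francisco\", \"year\": 2019}"}]'
decoded = decode_fc_calls(raw)
print(decoded)
print(same_function("law_crimes_search", "law_crimes.search"))
```

Decoded this way, the incorrect answer above yields one call where the answer key lists two.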

Item 5 · 90% solve rate · answer: [{"calculate_compounded_interest": {"principal": [5000], "rate": [0.03], "time": [5], "n": [4]}}]
[user] Calculate the compounded interest on an initial deposit of $5000 at an annual interest rate of 3% for 5 years, compounded quarterly.

How subjects answered

  • BitAgent_GoGoAgent correct

    [calculate_compounded_interest(principal=5000, rate=0.03, time=5, n=4)]

  • gemini-1.5-pro-001-FC correct

    [{"calculate_compounded_interest": {"n": 4, "rate": 0.03, "time": 5, "principal": 5000}}]

  • MadeAgents_Hammer2.1-1.5b incorrect

    ``` [{"name": "calculate_compounded_interest", "arguments": {"principal": 5000, "rate": 0.03, "time": 5 * 4, "n": 4}}] ```
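
The answer key for this item lists accepted values per parameter. A minimal sketch of checking one decoded call against such a key is given below. It uses exact matching only, and it assumes that an empty string in an accepted list marks a parameter that may be omitted; the benchmark's own checker may apply further normalization that this sketch leaves out, and `call_matches` is a hypothetical name.

```python
def call_matches(call, possible_answer):
    """call: {func: {param: value}}; possible_answer: {func: {param: [accepted]}}."""
    (func, args), = call.items()
    if func not in possible_answer:
        return False
    accepted = possible_answer[func]
    # Reject parameters that the answer key does not list.
    if any(param not in accepted for param in args):
        return False
    for param, allowed in accepted.items():
        if param not in args:
            # Assumed convention: "" in the accepted list means "may be omitted".
            if "" not in allowed:
                return False
        elif args[param] not in allowed:
            return False
    return True

answer = {"calculate_compounded_interest": {
    "principal": [5000], "rate": [0.03], "time": [5], "n": [4]}}

good = {"calculate_compounded_interest": {
    "principal": 5000, "rate": 0.03, "time": 5, "n": 4}}
# time=20 stands in for the "5 * 4" written by the incorrect subject above.
bad = {"calculate_compounded_interest": {
    "principal": 5000, "rate": 0.03, "time": 20, "n": 4}}

print(call_matches(good, answer), call_matches(bad, answer))
```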

Item 6 · 100% solve rate
[user] I have a client who's planning a trip to Paris and they're looking for some detailed travel plans. Could we find out the exact latitude and longitude of Paris for this purpose? They're really into the specifics and would appreciate having the coordinates for their personal itinerary.

How subjects answered

  • BitAgent_GoGoAgent correct

    [get_coordinates_from_city(city_name="Paris")]

  • gpt-4o-2024-11-20 correct

    [get_coordinates_from_city(city_name="Paris")]

  • claude-3-5-haiku-20241022-FC correct

    [{"get_coordinates_from_city": "{\"city_name\": \"Paris\"}"}]
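
Several subjects above answer in a Python-call style such as `[get_coordinates_from_city(city_name="Paris")]`. A minimal sketch of turning that text into a structured call with the standard `ast` module follows. It handles keyword arguments with literal values only, and `parse_calls` is a hypothetical helper, not part of any benchmark tooling.

```python
import ast

def parse_calls(text):
    """Parse '[f(a=1), g.h(b="x")]' into [{"f": {"a": 1}}, {"g.h": {"b": "x"}}]."""
    tree = ast.parse(text.strip(), mode="eval")
    # Accept either a bracketed list of calls or a single bare call.
    nodes = tree.body.elts if isinstance(tree.body, ast.List) else [tree.body]
    calls = []
    for node in nodes:
        if not isinstance(node, ast.Call):
            raise ValueError("expected a function call")
        name = ast.unparse(node.func)  # requires Python 3.9+; keeps dotted names
        # literal_eval accepts AST nodes and only evaluates plain literals.
        args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({name: args})
    return calls

single = parse_calls('[get_coordinates_from_city(city_name="Paris")]')
dotted = parse_calls(
    '[law_crimes.search(crime="money laundering", location="Texas", year=2018)]'
)
print(single)
print(dotted)
```

Using `ast.literal_eval` rather than `eval` means an answer such as `time=5 * 4` raises an error instead of being silently computed.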

Subjects

The models, agents, and reward models evaluated.

93 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. watt-ai_watt-tool-70B · 0.783
  2. gpt-4o-2024-11-20 · 0.782
  3. gpt-4o-2024-11-20-FC · 0.762
  4. gpt-4-turbo-2024-04-09-FC · 0.755
  5. watt-ai_watt-tool-8B · 0.747
  6. o1-2024-12-17 · 0.747
  7. gemini-2.0-flash-exp · 0.734
  8. meetkai_functionary-medium-v3.1-FC · 0.724
  9. gemini-1.5-pro-002 · 0.719
  10. o1-mini-2024-09-12 · 0.716
  11. gpt-4o-mini-2024-07-18-FC · 0.714
  12. MadeAgents_Hammer2.1-7b · 0.714
  13. Qwen_Qwen2.5-72B-Instruct · 0.714
  14. gemini-2.0-flash-exp-FC · 0.712
  15. gemini-1.5-pro-001 · 0.708
  16. gpt-4o-mini-2024-07-18 · 0.707
  17. Team-ACE_ToolACE-8B · 0.706
  18. gemini-exp-1206-FC · 0.706
  19. gemini-1.5-pro-002-FC · 0.705
  20. gemini-1.5-pro-001-FC · 0.698
  21. Qwen_Qwen2.5-32B-Instruct · 0.696
  22. MadeAgents_Hammer2.1-3b · 0.695
  23. gemini-1.5-flash-002 · 0.687
  24. Qwen_Qwen2.5-14B-Instruct · 0.685
  25. gemini-1.5-flash-002-FC · 0.683
  26. Salesforce_xLAM-8x22b-r · 0.678
  27. meetkai_functionary-small-v3.1-FC · 0.677
  28. deepseek-ai_DeepSeek-Coder-V2-Instruct-0724 · 0.674
  29. BitAgent_GoGoAgent · 0.674
  30. gemini-1.5-flash-001-FC · 0.671
  31. claude-3-5-sonnet-20241022-FC · 0.669
  32. claude-3-opus-20240229-FC · 0.668
  33. mistral-large-2407-FC · 0.666
  34. DeepSeek-V3 · 0.663
  35. mistral-medium-2312 · 0.659
  36. gpt-4-turbo-2024-04-09 · 0.657

+ 57 more subjects evaluated.