Skip to main content

Coding & Software

BIRD (text-to-SQL)

BIRD: a large-scale, cross-domain text-to-SQL benchmark (NL question + gold SQL over 95 real databases, 37 domains), scored by execution accuracy. The official repo ships questions/gold SQL/eval harness but no baseline predictions; this build ingests petavue/NL2SQL-Benchmark's public 'Inference Level Dataset' of 43,200 per-inference rows over a 360-question subset of BIRD dev. Each response is one model's binary execution-match correctness on one question under one prompting config; items are the real NL questions, correct_answer is the gold SQL. Models served under multiple provider aliases are merged to one canonical subject, with the serving platform recorded in test_condition.

360items
16subjects
100%observed
MITlicense
codedomain
databasesdomain
textmodality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 16 subjects × 360 items, 100% of cells evaluated.

Fit to width. Hover for subject & item; click a cell for details.

BIRD (text-to-SQL) response matrix: AI models (rows) against items (columns)
Correct (1)Incorrect (0)Unobserved

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 10% solve rateanswer: SELECT t1.buildUpPlaySpeed FROM Team_Attributes AS t1 INNER JOIN Team AS t2 ON t1.team_api_id = t2.team_api_id ORDER BY t1.buildUpPlayDribbling ASC LIMIT 4

What are the speed in which attacks are put together of the top 4 teams with the highest build Up Play Speed?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 20% solve rateanswer: SELECT DISTINCT t2.defensive_work_rate FROM Player AS t1 INNER JOIN Player_Attributes AS t2 ON t1.player_api_id = t2.player_api_id WHERE t1.player_name = 'David Wilson'

What is the defensive work rate of the football player David Wilson ?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 30% solve rateanswer: SELECT T1.originalReleaseDate, T2.format FROM cards AS T1 INNER JOIN legalities AS T2 ON T1.uuid = T2.uuid WHERE T1.rarity = 'mythic' AND T1.originalReleaseDate IS NOT NULL AND T2.status = 'Legal' ORDER BY T1.originalReleaseDate LIMIT 1

When was the oldest mythic card released and what are its legal play formats?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 40% solve rateanswer: SELECT COUNT(T1.ID) FROM Patient AS T1 INNER JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE T2.IGA BETWEEN 80 AND 500 AND T1.`First Date` > '1990-01-01'

How many patients with a normal Ig A level came to the hospital after 1990/1/1?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 50% solve rateanswer: SELECT T1.CustomerID FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID = T2.CustomerID WHERE T1.Currency = 'CZK' AND T2.Date BETWEEN 201101 AND 201112 GROUP BY T1.CustomerID ORDER BY SUM(T2.Consumption) DESC LIMIT 1

Which customers, paying in CZK, consumed the most gas in 2011?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 61% solve rateanswer: SELECT COUNT(*) FROM Examination WHERE Thrombosis = 2 AND `ANA Pattern` = 'S' AND `aCL IgM` > (SELECT AVG(`aCL IgM`) * 1.2 FROM Examination WHERE Thrombosis = 2 AND `ANA Pattern` = 'S')

What number of patients with a degree of thrombosis level 2 and ANA pattern of only S, have a level of anti-Cardiolip in antibody (IgM) 20% higher than average?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 73% solve rateanswer: SELECT T2.account_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN disp AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'OWNER' AND T1.A2 = 'Tabor'

Please list the accounts whose district is Tabor that are eligible for loans.

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • databricks/dbrx-instruct incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 87% solve rateanswer: SELECT T3.constructorRef, T3.url FROM results AS T1 INNER JOIN races AS T2 on T1.raceId = T2.raceId INNER JOIN constructors AS T3 on T1.constructorId = T3.constructorId WHERE T2.name = 'Singapore Grand Prix' AND T2.year = 2009 AND T1.time LIKE '_:%:__.___'

What is the constructor reference name of the champion in the 2009 Singapore Grand Prix? Please give its website.

Subject outcomes

  • databricks/dbrx-instruct correct
  • defog/sqlcoder-70b-alpha correct
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 911% solve rateanswer: SELECT translation FROM set_translations WHERE setCode IN ( SELECT code FROM sets WHERE name = 'Mirrodin' ) AND language = 'Chinese Simplified'

Show the Simplified Chinese translation of the name of the set "Mirrodin"?

Subject outcomes

  • defog/sqlcoder-70b-alpha correct
  • WizardLM/WizardCoder-33B-V1.1 incorrect
  • codellama/CodeLlama-34b-Instruct-hf incorrect
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 1016% solve rateanswer: SELECT T2.event_name FROM attendance AS T1 INNER JOIN event AS T2 ON T2.event_id = T1.link_to_event INNER JOIN member AS T3 ON T1.link_to_member = T3.member_id WHERE T3.position = 'Vice President' AND T2.location = '900 E. Washington St.' AND T2.type = 'Social'

What is the name of the social event that was attended by the vice president of the Student_Club located at 900 E. Washington St.?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 correct
  • databricks/dbrx-instruct correct
  • codellama/CodeLlama-34b-Instruct-hf correct
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 1123% solve rateanswer: SELECT CAST(SUM(t2.home_team_goal) AS REAL) / COUNT(t2.id) FROM Country AS t1 INNER JOIN Match AS t2 ON t1.id = t2.country_id WHERE t1.name = 'Poland' AND t2.season = '2010/2011'

Calculate the average home team goal in the 2010/2011 season in the country of Poland.

Subject outcomes

  • databricks/dbrx-instruct correct
  • defog/sqlcoder-70b-alpha correct
  • defog/sqlcoder-7b-2 correct
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect
Item 1230% solve rateanswer: SELECT T2.colour FROM superhero AS T1 INNER JOIN colour AS T2 ON T1.eye_colour_id = T2.id WHERE T1.superhero_name = 'Abomination'

What is Abomination's eye colour?

Subject outcomes

  • WizardLM/WizardCoder-33B-V1.1 correct
  • databricks/dbrx-instruct correct
  • codellama/CodeLlama-34b-Instruct-hf correct
  • gpt-4-turbo-preview incorrect
  • gpt-3.5-turbo-16k incorrect
  • gemini-1.0-pro incorrect

Subjects

The models, agents, and reward models evaluated.

16 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. 1WizardLM/WizardCoder-33B-V1.10.31
  2. 2defog/sqlcoder-70b-alpha0.3
  3. 3databricks/dbrx-instruct0.2439
  4. 4defog/sqlcoder-7b-20.2272
  5. 5codellama/CodeLlama-34b-Instruct-hf0.1683
  6. 6mistralai/Mixtral-8x7B-Instruct-v0.10.1296
  7. 7codellama/CodeLlama-70b-Instruct-hf0.1136
  8. 8mistralai/Mistral-7B-Instruct-v0.20.0853
  9. 9mistralai/Mistral-7B-Instruct-v0.10.0353
  10. 10meta-llama/Llama-2-70b-chat-hf0.0106
  11. 11claude-3-haiku-202403070
  12. 12claude-3-sonnet-202402290
  13. 13gemini-1.0-pro0
  14. 14gpt-4-turbo-preview0
  15. 15claude-3-opus-202402290
  16. 16gpt-3.5-turbo-16k0