Coding & Software

BIRD (text-to-SQL)

BIRD: a large-scale, cross-domain text-to-SQL benchmark (NL question + gold SQL over 95 real databases, 37 domains), scored by execution accuracy. The official repo ships questions/gold SQL/eval harness but no baseline predictions; this build ingests petavue/NL2SQL-Benchmark's public 'Inference Level Dataset' of 43,200 per-inference rows over a 360-question subset of BIRD dev. Each response is one model's binary execution-match correctness on one question under one prompting config; items are the real NL questions, correct_answer is the gold SQL. Models served under multiple provider aliases are merged to one canonical subject, with the serving platform recorded in test_condition.

360items

16subjects

100%observed

MITlicense

codedomain

databasesdomain

textmodality

Original source Paper Build script ← All benchmarks

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 16 subjects × 360 items, 100% of cells evaluated.

BIRD (text-to-SQL) response matrix: AI models (rows) against items (columns) — Correct (1)Incorrect (0)Unobserved
Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 10% solve rateanswer: SELECT t1.buildUpPlaySpeed FROM Team_Attributes AS t1 INNER JOIN Team AS t2 ON t1.team_api_id = t2.team_api_id ORDER BY t1.buildUpPlayDribbling ASC LIMIT 4

What are the speed in which attacks are put together of the top 4 teams with the highest build Up Play Speed?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 20% solve rateanswer: SELECT DISTINCT t2.defensive_work_rate FROM Player AS t1 INNER JOIN Player_Attributes AS t2 ON t1.player_api_id = t2.player_api_id WHERE t1.player_name = 'David Wilson'

What is the defensive work rate of the football player David Wilson ?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 30% solve rateanswer: SELECT T1.originalReleaseDate, T2.format FROM cards AS T1 INNER JOIN legalities AS T2 ON T1.uuid = T2.uuid WHERE T1.rarity = 'mythic' AND T1.originalReleaseDate IS NOT NULL AND T2.status = 'Legal' ORDER BY T1.originalReleaseDate LIMIT 1

When was the oldest mythic card released and what are its legal play formats?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 40% solve rateanswer: SELECT COUNT(T1.ID) FROM Patient AS T1 INNER JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE T2.IGA BETWEEN 80 AND 500 AND T1.`First Date` > '1990-01-01'

How many patients with a normal Ig A level came to the hospital after 1990/1/1?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 50% solve rateanswer: SELECT T1.CustomerID FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID = T2.CustomerID WHERE T1.Currency = 'CZK' AND T2.Date BETWEEN 201101 AND 201112 GROUP BY T1.CustomerID ORDER BY SUM(T2.Consumption) DESC LIMIT 1

Which customers, paying in CZK, consumed the most gas in 2011?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 61% solve rateanswer: SELECT COUNT(*) FROM Examination WHERE Thrombosis = 2 AND `ANA Pattern` = 'S' AND `aCL IgM` > (SELECT AVG(`aCL IgM`) * 1.2 FROM Examination WHERE Thrombosis = 2 AND `ANA Pattern` = 'S')

What number of patients with a degree of thrombosis level 2 and ANA pattern of only S, have a level of anti-Cardiolip in antibody (IgM) 20% higher than average?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 73% solve rateanswer: SELECT T2.account_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN disp AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'OWNER' AND T1.A2 = 'Tabor'

Please list the accounts whose district is Tabor that are eligible for loans.

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 incorrect
databricks/dbrx-instruct incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 87% solve rateanswer: SELECT T3.constructorRef, T3.url FROM results AS T1 INNER JOIN races AS T2 on T1.raceId = T2.raceId INNER JOIN constructors AS T3 on T1.constructorId = T3.constructorId WHERE T2.name = 'Singapore Grand Prix' AND T2.year = 2009 AND T1.time LIKE '_:%:__.___'

What is the constructor reference name of the champion in the 2009 Singapore Grand Prix? Please give its website.

Subject outcomes

databricks/dbrx-instruct correct
defog/sqlcoder-70b-alpha correct
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 911% solve rateanswer: SELECT translation FROM set_translations WHERE setCode IN ( SELECT code FROM sets WHERE name = 'Mirrodin' ) AND language = 'Chinese Simplified'

Show the Simplified Chinese translation of the name of the set "Mirrodin"?

Subject outcomes

defog/sqlcoder-70b-alpha correct
WizardLM/WizardCoder-33B-V1.1 incorrect
codellama/CodeLlama-34b-Instruct-hf incorrect
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 1016% solve rateanswer: SELECT T2.event_name FROM attendance AS T1 INNER JOIN event AS T2 ON T2.event_id = T1.link_to_event INNER JOIN member AS T3 ON T1.link_to_member = T3.member_id WHERE T3.position = 'Vice President' AND T2.location = '900 E. Washington St.' AND T2.type = 'Social'

What is the name of the social event that was attended by the vice president of the Student_Club located at 900 E. Washington St.?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 correct
databricks/dbrx-instruct correct
codellama/CodeLlama-34b-Instruct-hf correct
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 1123% solve rateanswer: SELECT CAST(SUM(t2.home_team_goal) AS REAL) / COUNT(t2.id) FROM Country AS t1 INNER JOIN Match AS t2 ON t1.id = t2.country_id WHERE t1.name = 'Poland' AND t2.season = '2010/2011'

Calculate the average home team goal in the 2010/2011 season in the country of Poland.

Subject outcomes

databricks/dbrx-instruct correct
defog/sqlcoder-70b-alpha correct
defog/sqlcoder-7b-2 correct
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Item 1230% solve rateanswer: SELECT T2.colour FROM superhero AS T1 INNER JOIN colour AS T2 ON T1.eye_colour_id = T2.id WHERE T1.superhero_name = 'Abomination'

What is Abomination's eye colour?

Subject outcomes

WizardLM/WizardCoder-33B-V1.1 correct
databricks/dbrx-instruct correct
codellama/CodeLlama-34b-Instruct-hf correct
gpt-4-turbo-preview incorrect
gpt-3.5-turbo-16k incorrect
gemini-1.0-pro incorrect

Subjects

The models, agents, and reward models evaluated.

16 subjects, ranked by mean response (accuracy) across this benchmark's items.

1WizardLM/WizardCoder-33B-V1.10.31
2defog/sqlcoder-70b-alpha0.3
3databricks/dbrx-instruct0.2439
4defog/sqlcoder-7b-20.2272
5codellama/CodeLlama-34b-Instruct-hf0.1683
6mistralai/Mixtral-8x7B-Instruct-v0.10.1296
7codellama/CodeLlama-70b-Instruct-hf0.1136
8mistralai/Mistral-7B-Instruct-v0.20.0853
9mistralai/Mistral-7B-Instruct-v0.10.0353
10meta-llama/Llama-2-70b-chat-hf0.0106
11claude-3-haiku-202403070
12claude-3-sonnet-202402290
13gemini-1.0-pro0
14gpt-4-turbo-preview0
15claude-3-opus-202402290
16gpt-3.5-turbo-16k0

Full data on Hugging Face Back to the gallery