Mathematics

StatQA

StatQA: statistical-analysis question answering over tabular datasets. Given a dataset and a refined question, a model selects the relevant columns and the applicable statistical methods; it is correct on an item if and only if both its column selection and its method selection exactly match the gold set. GPT and LLaMA models are evaluated across prompting strategies (zero-shot, one-shot, CoT, stats-prompt).
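The scoring rule described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual grader: the function name `is_correct` is hypothetical, and it assumes that responses are JSON objects with `columns` and `methods` keys (as in the sample items on this page), that matching ignores order, and that names are compared as exact, case-sensitive strings.

```python
import json


def is_correct(response_text: str, gold: dict) -> bool:
    """Return True iff both columns and methods equal the gold sets.

    Assumptions (not confirmed by the source): order is ignored and
    names are compared as exact, case-sensitive strings.
    """
    try:
        pred = json.loads(response_text)
    except json.JSONDecodeError:
        return False  # unparseable output counts as incorrect
    if not isinstance(pred, dict):
        return False
    try:
        return (
            set(pred.get("columns", [])) == set(gold["columns"])
            and set(pred.get("methods", [])) == set(gold["methods"])
        )
    except TypeError:
        return False  # e.g. a null or nested value where a list of names was expected


# Gold answer taken from the GPA sample item on this page.
gold = {
    "columns": ["gpa"],
    "methods": [
        "Anderson-Darling Test",
        "Kolmogorov-Smirnov Test for Normality",
        "Lilliefors Test",
    ],
}
response = '{"columns": ["gpa"], "methods": ["Lilliefors Test", "Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality"]}'
print(is_correct(response, gold))
```

Under this rule there is no partial credit: a response that adds one extra method, or swaps one method for a close alternative, scores the same as a response that is wrong throughout.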

1,158 items
10 subjects
100% observed
License: CC-BY-4.0
Domains: statistics, reasoning
Modalities: text, tabular


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 10 subjects × 1,158 items, 100% of cells evaluated.
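The ordering described above can be sketched in a few lines. This is an illustrative assumption about how such a view might be built, not the page's build script: the helper name `order_response_matrix` and the toy data are hypothetical. Rows are sorted by mean accuracy and columns by solve rate, both descending, so that correct cells gather toward the top-left corner.

```python
def order_response_matrix(matrix, subjects, items):
    """Sort rows by subject accuracy and columns by item solve rate (descending).

    `matrix[i][j]` is 1 if subject i answered item j correctly, else 0.
    Ties keep their original relative order because `sorted` is stable.
    """
    n_rows, n_cols = len(matrix), len(items)
    accuracy = [sum(row) / n_cols for row in matrix]
    solve_rate = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda i: -accuracy[i])
    col_order = sorted(range(n_cols), key=lambda j: -solve_rate[j])
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], [items[j] for j in col_order]


# Toy data (hypothetical): three subjects, three items.
toy = [
    [0, 0, 1],  # model_a
    [1, 0, 1],  # model_b
    [0, 0, 0],  # model_c
]
ordered, row_names, col_names = order_response_matrix(
    toy, ["model_a", "model_b", "model_c"], ["q1", "q2", "q3"]
)
print(row_names, col_names)
for row in ordered:
    print(row)
```

The same per-column mean is the "solve rate" shown on each sample item, and the per-row mean is the accuracy used to rank subjects.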

[Figure: StatQA response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · blank = unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: {"columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test"]}

Dataset: Los Angeles Library Monthly Statistics
Question: Do the total number of computer usage sessions (including wireless and PC sessions) and the number of attendees at children's and teen programming events held at the library exhibit similar variance characteristics?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-3.5-turbo incorrect
{ "columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }
gpt-4 incorrect
{ "columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["F-Test for Variance", "Levene Test", "Bartlett Test"] }
gpt-4o incorrect
{ "columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }
llama3_8b_instruct_sft incorrect
{"columns": ["Total Computer Usage (Wireless + PC Sessions)", "Total Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test"]}
llama3_8b incorrect
{"columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}
llama3_8b_sft incorrect
{"columns": ["Total Computer Usage (Wireless + PC Sessions)", "Total eMedia Circulation"], "methods": ["Mood Variance Test", "Levene Test"]}

Item 2 · 5% solve rate · answer: {"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

Dataset: GPA Study Hours Dataset
Question: Is the distribution of the grade point average (GPA) for students approximately normal?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

llama3_8b_sft correct
{"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama2_7b_sft correct
{"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
gpt-4 incorrect
{"columns": ["gpa"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama3_8b_instruct incorrect
{"columns": ["gpa", "study_hours"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama3_8b_instruct_sft incorrect
{"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality"]}
llama3_8b incorrect
{"columns": ["gpa", "study_hours"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test", "Kolmogorov-Smirnov Test", "Kolmogorov-Smirnov Test for Uniform distribution", "Kolmogorov-Smirnov Test for Gamma distribution", "Kolmogorov-Smirnov Test for Exponential distribution"]}

Item 3 · 8% solve rate · answer: {"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

Dataset: Exercise and Fitness Metrics Dataset
Question: What is the likelihood that the variable "Age" is drawn from a normal distribution?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

llama3_8b_instruct_sft correct
{"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama2_7b_sft correct
{"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama3_8b_sft correct
{"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama2_13b incorrect
{ "columns": ["Age"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"] }
llama3_8b_instruct incorrect
{"columns": ["Age"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}
llama3_8b incorrect
{"columns": ["Age"], "methods": ["Shapiro-Wilk Test of Normality"]}

Item 4 · 11% solve rate · answer: {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}

Dataset: University Rank Dataset
Question: Is the variability in the Research Quality Score comparable to that of the Industry Score?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

llama3_8b_instruct_sft correct
{"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}
llama2_7b_sft correct
{"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}
llama3_8b_sft correct
{"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}
llama2_13b incorrect
{ "columns": ["Overall scores", "Research Quality Score", "Industry Score"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }
llama3_8b_instruct incorrect
{"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance", "Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality"]}
llama3_8b incorrect
{"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

Item 5 · 13% solve rate · answer: {"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}

Dataset: Stroke Prediction Dataset
Question: Do the average glucose level and body mass index exhibit similar patterns of variance to some extent?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

llama3_8b_instruct_sft correct
{"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}
llama2_7b_sft correct
{"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}
llama3_8b_sft correct
{"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}
llama2_13b incorrect
{ "columns": ["avg_glucose_level", "bmi"], "methods": ["Pearson Correlation Coefficient", "Partial Correlation Coefficient"] }
llama3_8b_instruct incorrect
{"columns": ["avg_glucose_level", "bmi"], "methods": ["Correlation Analysis", "Variance Test"]}
llama3_8b incorrect
{"columns": ["avg_glucose_level", "bmi"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient", "Partial Correlation Coefficient", "Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

Item 6 · 18% solve rate · answer: {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}

Dataset: Lung Cancer Survey Dataset
Question: Does the relationship between alcohol consumption and coughing symptoms change when stratified by smoking status?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

llama3_8b_instruct_sft correct
{"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}
llama2_7b_sft correct
{"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}
llama3_8b_sft correct
{"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}
llama2_13b incorrect
{ "columns": ["GENDER", "AGE", "SMOKING", "YELLOW_FINGERS", "ANXIETY", "PEER_PRESSURE", "CHRONIC DISEASE", "FATIGUE", "ALLERGY", "WHEEZING"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient", "Partial Correlation Coefficient"] }
llama3_8b_instruct incorrect
{"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Contingency Table Test", "Chi-square Independence Test", "Fisher Exact Test"]}
llama3_8b incorrect
{"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

Item 7 · 26% solve rate · answer: {"columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

Dataset: Crop Production Dataset
Question: Are the differences in annual crop production between Japan and China statistically significant?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-4 correct
{ "columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }
gpt-4o correct
{ "columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }
llama3_8b_instruct_sft correct
{"columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}
llama3_8b_instruct incorrect
{"columns": ["Nigeria", "Pakistan", "Japan", "United States", "Brazil", "Philippines", "Myanmar", "Thailand", "Vietnam", "Indonesia", "Bangladesh", "India", "China"], "methods": ["t-test", "ANOVA"]}
llama3_8b incorrect
{"columns": ["Nigeria", "Pakistan", "Japan", "United States", "Brazil", "Philippines", "Myanmar", "Thailand", "Vietnam", "Indonesia", "Bangladesh", "India", "China"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}
llama3_8b_sft incorrect
{"columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test"]}

Item 8 · 34% solve rate · answer: {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

Dataset: Heart Failure Prediction Dataset
Question: Does the presence of different sexes influence the distribution of exercise-induced angina?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-4 correct
{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}
llama2_7b_sft correct
{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}
llama3_8b_sft correct
{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}
llama2_13b incorrect
{ "columns": ["Age", "Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test"] }
llama3_8b_instruct incorrect
{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}
llama3_8b incorrect
{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

Item 9 · 42% solve rate · answer: {"columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

Dataset: Lung Cancer Survey Dataset
Question: Is there evidence in the data to indicate a relationship between the experience of peer pressure related to smoking and the presence of chronic disease?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-4 correct
{ "columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"] }
gpt-4o correct
{ "columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"] }
llama3_8b_sft correct
{"columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}
llama2_13b incorrect
{ "columns": ["GENDER", "SMOKING", "CHRONIC DISEASE"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }
llama2_7b incorrect
{ "columns": ["PEER_PRESSURE", "CHRONIC_DISEASE"], "methods": ["Correlation Analysis", "Contingency Table Test"] }
llama3_8b incorrect
{"columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

Item 10 · 55% solve rate · answer: {"columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"]}

Dataset: Netflix TV Shows and Movies Scores Dataset in IMDB
Question: Is there a direct relationship between the length of TV show episodes or movie duration and the number of IMDB votes received?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-3.5-turbo correct
{ "columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }
gpt-4 correct
{ "columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }
gpt-4o correct
{ "columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }
llama2_13b incorrect
{ "columns": ["release_year", "age_certification", "runtime", "imdb_id", "imdb_score", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }
llama3_8b_instruct incorrect
{"columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Partial Correlation Coefficient"]}
llama3_8b incorrect
{"columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient", "Partial Correlation Coefficient"]}

Item 11 · 71% solve rate · answer: {"columns": ["Fedu"], "methods": ["Range"]}

Dataset: Student Performance Prediction Dataset
Question: What is the range of values for the variable representing the education level of the father (Fedu)?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-3.5-turbo correct
{ "columns": ["Fedu"], "methods": ["Range"] }
gpt-4 correct
{ "columns": ["Fedu"], "methods": ["Range"] }
gpt-4o correct
{ "columns": ["Fedu"], "methods": ["Range"] }
llama3_8b correct
{"columns": ["Fedu"], "methods": ["Range"]}
llama3_8b_sft correct
{"columns": ["Fedu"], "methods": ["Range"]}
llama3_8b_instruct incorrect
{"columns": ["Fedu"], "methods": ["Descriptive Statistics", "Range"]}

Item 12 · 82% solve rate · answer: {"columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"]}

Dataset: Pokedex Dataset
Question: Can we verify the hypothesis that the distribution of Pokemon heights, measured in meters, follows a uniform distribution?
Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

gpt-3.5-turbo correct
{ "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }
gpt-4 correct
{ "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }
gpt-4o correct
{ "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }
llama3_8b_sft correct
{"columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"]}
llama2_7b incorrect
{ "columns": ["Total", "Height_m"], "methods": ["Kolmogorov-Smirnov Test", "Mood Variance Test"] }
llama2_13b incorrect
{ "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }

Subjects

The models, agents, and reward models evaluated.

10 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject                  Accuracy
1     llama3_8b_sft            0.7713
2     llama3_8b_instruct_sft   0.7592
3     llama2_7b_sft            0.6672
4     gpt-4o                   0.5037
5     gpt-4                    0.4688
6     gpt-3.5-turbo            0.4107
7     llama3_8b                0.2917
8     llama3_8b_instruct       0.2191
9     llama2_13b               0.1505
10    llama2_7b                0.1209
