StatQA

StatQA: statistical-analysis question answering over tabular datasets. Given a dataset and a refined question, a model selects the relevant columns and applicable statistical methods; it is correct on an item iff both its column selection and method selection exactly match the gold set. Evaluated across prompting strategies (zero-/one-shot, CoT, stats-prompt) for GPT and LLaMA models.
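
The exact-match rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scorer: the function name `is_correct` is hypothetical, and it assumes answers arrive as JSON strings with `columns` and `methods` lists, that order and duplicates are ignored, and that unparseable output counts as incorrect.

```python
import json


def is_correct(prediction: str, gold: str) -> bool:
    """Return True iff both the column set and the method set equal the gold sets."""
    ref = json.loads(gold)
    try:
        pred = json.loads(prediction)
    except json.JSONDecodeError:
        return False  # assumption: unparseable output counts as incorrect
    if not isinstance(pred, dict):
        return False
    for key in ("columns", "methods"):
        value = pred.get(key)
        # Assumption: lists are compared as sets, so order and duplicates are ignored.
        if not isinstance(value, list) or set(map(str, value)) != set(ref[key]):
            return False
    return True


gold = '{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}'
extra = '{"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}'

print(is_correct(gold, gold))   # exact match on both fields
print(is_correct(extra, gold))  # one extra method breaks the match
```

Under this rule a response that lists every gold method plus one additional test is scored the same as a completely wrong response; there is no partial credit.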

1,158 items
10 subjects
100% observed
License: CC-BY-4.0
Domains: statistics, reasoning
Modalities: text, tabular

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 10 subjects × 1,158 items, 100% of cells evaluated.
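
One plausible way to produce such an ordering is to sort rows by mean accuracy and columns by solve rate. The sketch below assumes exactly that; the page does not state the actual ordering method, and the function name `order_response_matrix` and the toy scores are hypothetical.

```python
def order_response_matrix(rows):
    """Sort models by mean score and items by solve rate, both descending.

    rows maps a model name to its list of 0/1 item scores (equal lengths).
    Returns (ordered model names, ordered item indices, reordered matrix).
    """
    n_items = len(next(iter(rows.values())))
    # Strongest models first.
    names = sorted(rows, key=lambda name: -sum(rows[name]) / n_items)
    # Easiest items first: solve rate is the share of models that got the item right.
    solve_rate = [sum(rows[name][j] for name in rows) / len(rows) for j in range(n_items)]
    items = sorted(range(n_items), key=lambda j: -solve_rate[j])
    matrix = [[rows[name][j] for j in items] for name in names]
    return names, items, matrix


# Toy data for illustration only; these are not benchmark results.
toy = {
    "model_a": [0, 1, 0, 1],
    "model_b": [1, 1, 0, 1],
    "model_c": [0, 1, 0, 0],
}
names, items, matrix = order_response_matrix(toy)
for name, row in zip(names, matrix):
    print(name, row)
```

With this ordering the 1s concentrate in the top-left corner, which makes it easy to spot items that stronger models miss while weaker ones solve.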

Figure: StatQA response matrix, AI models (rows) against items (columns).
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: {"columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test"]}

Dataset: Los Angeles Library Monthly Statistics Question: Do the total number of computer usage sessions (including wireless and PC sessions) and the number of attendees at children's and teen programming events held at the library exhibit similar variance characteristics? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-3.5-turbo incorrect

    { "columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }

  • gpt-4 incorrect

    { "columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["F-Test for Variance", "Levene Test", "Bartlett Test"] }

  • gpt-4o incorrect

    { "columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }

  • llama3_8b_instruct_sft incorrect

    {"columns": ["Total Computer Usage (Wireless + PC Sessions)", "Total Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama3_8b incorrect

    {"columns": ["Total Computer Usage (Wireless + PC Sessions)", "# of Attendees at Children's and Teen Programming"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

  • llama3_8b_sft incorrect

    {"columns": ["Total Computer Usage (Wireless + PC Sessions)", "Total eMedia Circulation"], "methods": ["Mood Variance Test", "Levene Test"]}

Item 2 · 5% solve rate · answer: {"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

Dataset: GPA Study Hours Dataset Question: Is the distribution of the grade point average (GPA) for students approximately normal? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • llama3_8b_sft correct

    {"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama2_7b_sft correct

    {"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • gpt-4 incorrect

    {"columns": ["gpa"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama3_8b_instruct incorrect

    {"columns": ["gpa", "study_hours"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama3_8b_instruct_sft incorrect

    {"columns": ["gpa"], "methods": ["Anderson-Darling Test", "Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality"]}

  • llama3_8b incorrect

    {"columns": ["gpa", "study_hours"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test", "Kolmogorov-Smirnov Test", "Kolmogorov-Smirnov Test for Uniform distribution", "Kolmogorov-Smirnov Test for Gamma distribution", "Kolmogorov-Smirnov Test for Exponential distribution"]}

Item 3 · 8% solve rate · answer: {"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

Dataset: Exercise and Fitness Metrics Dataset Question: What is the likelihood that the variable "Age" is drawn from a normal distribution? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • llama3_8b_instruct_sft correct

    {"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama2_7b_sft correct

    {"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama3_8b_sft correct

    {"columns": ["Age"], "methods": ["Anderson-Darling Test", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama2_13b incorrect

    { "columns": ["Age"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"] }

  • llama3_8b_instruct incorrect

    {"columns": ["Age"], "methods": ["Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality", "Lilliefors Test"]}

  • llama3_8b incorrect

    {"columns": ["Age"], "methods": ["Shapiro-Wilk Test of Normality"]}

Item 4 · 11% solve rate · answer: {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}

Dataset: University Rank Dataset Question: Is the variability in the Research Quality Score comparable to that of the Industry Score? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • llama3_8b_instruct_sft correct

    {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama2_7b_sft correct

    {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama3_8b_sft correct

    {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama2_13b incorrect

    { "columns": ["Overall scores", "Research Quality Score", "Industry Score"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }

  • llama3_8b_instruct incorrect

    {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance", "Shapiro-Wilk Test of Normality", "Kolmogorov-Smirnov Test for Normality"]}

  • llama3_8b incorrect

    {"columns": ["Research Quality Score", "Industry Score"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

Item 5 · 13% solve rate · answer: {"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}

Dataset: Stroke Prediction Dataset Question: Do the average glucose level and body mass index exhibit similar patterns of variance to some extent? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • llama3_8b_instruct_sft correct

    {"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama2_7b_sft correct

    {"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama3_8b_sft correct

    {"columns": ["avg_glucose_level", "bmi"], "methods": ["Mood Variance Test", "Levene Test"]}

  • llama2_13b incorrect

    { "columns": ["avg_glucose_level", "bmi"], "methods": ["Pearson Correlation Coefficient", "Partial Correlation Coefficient"] }

  • llama3_8b_instruct incorrect

    {"columns": ["avg_glucose_level", "bmi"], "methods": ["Correlation Analysis", "Variance Test"]}

  • llama3_8b incorrect

    {"columns": ["avg_glucose_level", "bmi"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient", "Partial Correlation Coefficient", "Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

Item 6 · 18% solve rate · answer: {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}

Dataset: Lung Cancer Survey Dataset Question: Does the relationship between alcohol consumption and coughing symptoms change when stratified by smoking status? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • llama3_8b_instruct_sft correct

    {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}

  • llama2_7b_sft correct

    {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}

  • llama3_8b_sft correct

    {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Mantel-Haenszel Test"]}

  • llama2_13b incorrect

    { "columns": ["GENDER", "AGE", "SMOKING", "YELLOW_FINGERS", "ANXIETY", "PEER_PRESSURE", "CHRONIC DISEASE", "FATIGUE", "ALLERGY", "WHEEZING"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient", "Partial Correlation Coefficient"] }

  • llama3_8b_instruct incorrect

    {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Contingency Table Test", "Chi-square Independence Test", "Fisher Exact Test"]}

  • llama3_8b incorrect

    {"columns": ["ALCOHOL CONSUMING", "COUGHING", "SMOKING"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

Item 7 · 26% solve rate · answer: {"columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

Dataset: Crop Production Dataset Question: Are the differences in annual crop production between Japan and China statistically significant? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-4 correct

    { "columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }

  • gpt-4o correct

    { "columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"] }

  • llama3_8b_instruct_sft correct

    {"columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test", "Bartlett Test", "F-Test for Variance"]}

  • llama3_8b_instruct incorrect

    {"columns": ["Nigeria", "Pakistan", "Japan", "United States", "Brazil", "Philippines", "Myanmar", "Thailand", "Vietnam", "Indonesia", "Bangladesh", "India", "China"], "methods": ["t-test", "ANOVA"]}

  • llama3_8b incorrect

    {"columns": ["Nigeria", "Pakistan", "Japan", "United States", "Brazil", "Philippines", "Myanmar", "Thailand", "Vietnam", "Indonesia", "Bangladesh", "India", "China"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

  • llama3_8b_sft incorrect

    {"columns": ["Japan", "China"], "methods": ["Mood Variance Test", "Levene Test"]}

Item 8 · 34% solve rate · answer: {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

Dataset: Heart Failure Prediction Dataset Question: Does the presence of different sexes influence the distribution of exercise-induced angina? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-4 correct

    {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

  • llama2_7b_sft correct

    {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

  • llama3_8b_sft correct

    {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

  • llama2_13b incorrect

    { "columns": ["Age", "Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test"] }

  • llama3_8b_instruct incorrect

    {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

  • llama3_8b incorrect

    {"columns": ["Sex", "ExerciseAngina"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

Item 9 · 42% solve rate · answer: {"columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

Dataset: Lung Cancer Survey Dataset Question: Is there evidence in the data to indicate a relationship between the experience of peer pressure related to smoking and the presence of chronic disease? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-4 correct

    { "columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"] }

  • gpt-4o correct

    { "columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"] }

  • llama3_8b_sft correct

    {"columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test"]}

  • llama2_13b incorrect

    { "columns": ["GENDER", "SMOKING", "CHRONIC DISEASE"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }

  • llama2_7b incorrect

    { "columns": ["PEER_PRESSURE", "CHRONIC_DISEASE"], "methods": ["Correlation Analysis", "Contingency Table Test"] }

  • llama3_8b incorrect

    {"columns": ["PEER_PRESSURE", "CHRONIC DISEASE"], "methods": ["Chi-square Independence Test", "Fisher Exact Test", "Mantel-Haenszel Test"]}

Item 10 · 55% solve rate · answer: {"columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"]}

Dataset: Netflix TV Shows and Movies Scores Dataset in IMDB Question: Is there a direct relationship between the length of TV show episodes or movie duration and the number of IMDB votes received? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-3.5-turbo correct

    { "columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }

  • gpt-4 correct

    { "columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }

  • gpt-4o correct

    { "columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }

  • llama2_13b incorrect

    { "columns": ["release_year", "age_certification", "runtime", "imdb_id", "imdb_score", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient"] }

  • llama3_8b_instruct incorrect

    {"columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Partial Correlation Coefficient"]}

  • llama3_8b incorrect

    {"columns": ["runtime", "imdb_votes"], "methods": ["Pearson Correlation Coefficient", "Spearman Correlation Coefficient", "Kendall Correlation Coefficient", "Partial Correlation Coefficient"]}

Item 11 · 71% solve rate · answer: {"columns": ["Fedu"], "methods": ["Range"]}

Dataset: Student Performance Prediction Dataset Question: What is the range of values for the variable representing the education level of the father (Fedu)? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-3.5-turbo correct

    { "columns": ["Fedu"], "methods": ["Range"] }

  • gpt-4 correct

    { "columns": ["Fedu"], "methods": ["Range"] }

  • gpt-4o correct

    { "columns": ["Fedu"], "methods": ["Range"] }

  • llama3_8b correct

    {"columns": ["Fedu"], "methods": ["Range"]}

  • llama3_8b_sft correct

    {"columns": ["Fedu"], "methods": ["Range"]}

  • llama3_8b_instruct incorrect

    {"columns": ["Fedu"], "methods": ["Descriptive Statistics", "Range"]}

Item 12 · 82% solve rate · answer: {"columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"]}

Dataset: Pokedex Dataset Question: Can we verify the hypothesis that the distribution of Pokemon heights, measured in meters, follows a uniform distribution? Task: select the relevant columns and the applicable statistical methods to answer this question.

How subjects answered

  • gpt-3.5-turbo correct

    { "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }

  • gpt-4 correct

    { "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }

  • gpt-4o correct

    { "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }

  • llama3_8b_sft correct

    {"columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"]}

  • llama2_7b incorrect

    { "columns": ["Total", "Height_m"], "methods": ["Kolmogorov-Smirnov Test", "Mood Variance Test"] }

  • llama2_13b incorrect

    { "columns": ["Height_m"], "methods": ["Kolmogorov-Smirnov Test for Uniform distribution"] }

Subjects

The models, agents, and reward models evaluated.

10 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. llama3_8b_sft: 0.7713
  2. llama3_8b_instruct_sft: 0.7592
  3. llama2_7b_sft: 0.6672
  4. gpt-4o: 0.5037
  5. gpt-4: 0.4688
  6. gpt-3.5-turbo: 0.4107
  7. llama3_8b: 0.2917
  8. llama3_8b_instruct: 0.2191
  9. llama2_13b: 0.1505
  10. llama2_7b: 0.1209