AI Measurement Science

A Science of Knowing Where AI Thrives, Where It Breaks, and How to Respond

Authors

Sang T. Truong

Sanmi Koyejo

Updated

March 22, 2026

Preface

Every claim about AI progress—better reasoning, broader capability, safer behavior—rests on some act of measurement. But how good are our measurements, really? Benchmarks multiply faster than anyone can understand them, leaderboards rank models on scores without scales, and a system’s average accuracy tells us little about why it succeeds or where it breaks. We have plenty of numbers. What we lack is a science behind them.

This book introduces the foundations and practical tools of measurement science for AI evaluation. Instead of treating evaluation as a purely engineering task—collect a dataset, compute a metric, publish a number—we present it as an inference problem, drawing on ideas from psychometrics, statistics, and the broader science of measurement. Throughout, we emphasize not only the methods themselves but also their assumptions, limitations, and the conditions under which they can and cannot be applied—understanding when an evaluation fails is as important as understanding when it works. By the end of this book, readers will be equipped with the key concepts and tools needed to design evaluation systems that are reliable, valid, and interpretable.

The book is intended for researchers, practitioners, and students who work on or care about AI evaluation. We assume some background in probability, statistics, and machine learning, but provide sufficient context and references for readers to follow the main ideas. The book also includes interactive code examples and datasets. The science of AI measurement is a fast-moving area with many open challenges, and we hope this book will inspire readers to further explore and advance it.

With this book, we hope to help the field move from ad hoc benchmarking toward principled measurement, so that our claims about AI systems rest on solid scientific ground.

Stanford, 2025, Truong & Koyejo

Structure of this book

The book opens with a survey of the evaluation data landscape, followed by three parts that introduce predictive measurement models, develop reliability and validity analysis, and discuss design, governance, and applications.

1  The Data Landscape surveys the AI evaluation data landscape: the response matrix abstraction, a taxonomy of 70+ benchmarks spanning knowledge, coding, agentic, multilingual, cultural, and preference evaluation, and the practical data quality issues that arise in real-world evaluation. It introduces the torch_measure toolkit for loading and analyzing benchmark data.
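To make the response matrix abstraction concrete, here is a minimal sketch; the model names, item names, and entries are made up for illustration, and this does not reflect the torch_measure API:

```python
import numpy as np
import pandas as pd

# A response matrix: rows are models, columns are benchmark items,
# entries are 1 (correct) or 0 (incorrect). All values are hypothetical.
models = ["model_a", "model_b", "model_c"]
items = ["item_1", "item_2", "item_3", "item_4"]
Y = pd.DataFrame(
    np.array([[1, 1, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 1, 1]]),
    index=models, columns=items)

# Row means give per-model accuracy; one minus the column means gives an
# empirical difficulty for each item.
accuracy = Y.mean(axis=1)
difficulty = 1 - Y.mean(axis=0)
print(accuracy)
print(difficulty)
```

Much of Part I can be read as asking what more we can infer from Y than these marginal averages.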

Part I: Measurement as Predictive Modeling

2  Foundations of Measurement lays the mathematical groundwork for the rest of the book. It covers probabilistic models for AI evaluation data—including Item Response Theory (Rasch, 2PL, 3PL), Bradley-Terry models for pairwise comparisons, and factor models for multidimensional ability. A central theme is the relationship between model assumptions and what they allow us to infer: sufficiency, specific objectivity, and the tradeoffs between model complexity and interpretability. The chapter also introduces prediction-powered evaluation—an amortized factor model that maps external features to latent parameters for cold-start prediction of unseen models and items.
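As a preview of two of the models named above, the two-parameter logistic (2PL) IRT model and the Bradley-Terry model take their standard forms:

```latex
% 2PL: probability that a model with ability \theta answers item j
% correctly, with discrimination a_j and difficulty b_j
P(y_j = 1 \mid \theta) = \frac{1}{1 + \exp\{-a_j(\theta - b_j)\}}

% Bradley-Terry: probability that model i beats model j in a pairwise
% comparison, with latent strengths \beta_i and \beta_j
P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
```

The Rasch model is the special case of the 2PL with all discriminations fixed at one, and the 3PL adds a guessing parameter.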

3  Learning studies how to estimate latent parameters from evaluation data. It covers maximum likelihood estimation, gradient-based optimization, the EM algorithm, Bayesian inference, regularization, and model selection. It also develops generalization experiments with various masking schemes to evaluate the robustness and transferability of learned factor models.

4  Efficient Measurement develops the statistical foundations for efficient evaluation design. It introduces Fisher information for item selection, Computerized Adaptive Testing (CAT), D-optimal design for constructing maximally informative item pools, and efficient paired-comparison schedules for model arenas.
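Under the Rasch model, the Fisher information an item carries about an ability θ is p(θ)(1 − p(θ)), which is maximized when the item's difficulty matches the ability. A minimal sketch of information-based item selection, with an illustrative item pool:

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of an item with difficulty b at ability theta."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

# Illustrative item difficulties; select the most informative item for
# the current ability estimate, as a CAT procedure would at each step.
difficulties = [-2.0, -0.5, 0.4, 1.5, 3.0]
theta_hat = 0.5
best = max(range(len(difficulties)),
           key=lambda j: item_information(theta_hat, difficulties[j]))
print(best)  # index of the item whose difficulty is nearest theta_hat
```

Information peaks at 0.25 exactly when θ equals the item difficulty, which is why adaptive testing keeps steering toward items matched to the current ability estimate.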

Part II: Measurement Reliability and Validity

5  Reliability addresses the question of signal versus noise in AI measurement. It covers Classical Test Theory, Generalizability Theory for decomposing multiple sources of error, and practical tools for assessing and improving the reliability of evaluation protocols—including LLM-as-a-judge designs and agentic measurement.
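Classical Test Theory's best-known reliability coefficient, Cronbach's alpha, can be computed directly from a response matrix. A sketch on synthetic data (the data-generating process here is invented for illustration):

```python
import numpy as np

def cronbach_alpha(Y):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    Y = np.asarray(Y, dtype=float)
    k = Y.shape[1]
    item_vars = Y.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = Y.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

# Synthetic binary responses: 200 models, 5 items, all driven by a
# shared latent ability plus item-level noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
Y = (ability[:, None] + rng.normal(size=(200, 5)) > 0).astype(int)
print(cronbach_alpha(Y))
```

Because the items share a common ability signal, alpha lands well above zero; with independent items it would hover near zero.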

6  Validity asks whether our evaluations measure what they claim to measure. It covers content, criterion, and construct validity; diagnostic tools for detecting benchmark contamination, differential item functioning, and construct-irrelevant variance; and principled approaches to instrument construction and revision, including the use of synthetic data.

7  Causality and Distribution Shift examines the causal foundations of AI measurement. It covers structural causal models for evaluation data, distribution shift (covariate, label, and concept shift) and when benchmark results generalize across contexts, interventional and counterfactual reasoning for diagnosing construct-irrelevant variance, and methods for robust prediction under shift including conformal inference and data attribution.

Part III: Design, Governance, and Applications

8  Information and Mechanism Design considers the strategic and economic dimensions of evaluation design. It introduces decision-making under ambiguity (Maxmin Expected Utility), Bayesian persuasion for optimal information disclosure, robust mechanism design, and the strategic considerations that arise when benchmark results influence development incentives.

9  Red-Teaming and Adversarial Evaluation applies the measurement framework to adversarial evaluation: red-teaming as structured measurement, attack success rate as an estimand (and when comparisons are valid), adversarial robustness as a multidimensional latent trait, synthetic data for evaluation at scale, and prediction-powered inference for combining human and automated judgments.
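Treating attack success rate as an estimand means reporting it with uncertainty rather than as a bare fraction. A sketch using the Wilson score interval for a binomial proportion (the counts are made up):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative: 37 successful attacks out of 200 attempts.
lo, hi = wilson_interval(37, 200)
print(f"ASR = {37/200:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Comparing two systems' ASRs is only meaningful when their intervals are computed over comparable attack distributions, which is exactly the estimand question the chapter takes up.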

Conclusion

10  Conclusion synthesizes the book’s themes, distills lessons from the field, identifies six open challenges (beyond binary responses, multidimensional ability, temporal dynamics, agentic evaluation, scalable oversight, and fairness in evaluation), presents ten capstone projects for the CS321M course, and offers a practitioner’s checklist for designing rigorous AI evaluations.

How to engage with this book

2  Foundations of Measurement underpins the entire book, so it is part of every reading pathway.

For practitioners and applied AI researchers, we recommend 2  Foundations of Measurement, 3  Learning, 4  Efficient Measurement, and 8  Information and Mechanism Design. This sequence covers the core modeling, estimation, and design methods needed to build and improve evaluation systems.

For researchers interested in evaluation quality, we recommend 2  Foundations of Measurement, 5  Reliability, 6  Validity, and 7  Causality and Distribution Shift. This pathway focuses on when and why evaluations fail, and how to diagnose and address those failures.

For those interested in the societal and governance dimensions of AI evaluation, we recommend 2  Foundations of Measurement, 8  Information and Mechanism Design, and 9  Red-Teaming and Adversarial Evaluation.

Prior knowledge

The book assumes knowledge of the fundamentals of statistics, linear algebra, and machine learning. Code examples are written in Python using PyTorch, scikit-learn, and pandas, so familiarity with Python is valuable for readers who wish to engage with the interactive examples.

For readers seeking to strengthen their prerequisites or deepen their understanding of the fields this book draws from, we recommend the following companion texts:

  • Murphy (2022) — A modern, comprehensive treatment of probabilistic modeling, Bayesian inference, and statistical learning. Excellent preparation for the latent variable models and estimation methods in Chapters 2–3.
  • Bishop (2006) — Thorough coverage of factor analysis, EM algorithms, and Bayesian methods, all of which appear throughout this book.
  • Borsboom (2005) — A philosophically rigorous account of what it means to measure a psychological attribute. Essential reading for the validity and construct modeling discussions in Chapters 5–6.
  • Lord and Novick (1968) — The classical statistical theory of mental testing that underlies much of modern psychometrics. For readers wanting depth in reliability, item analysis, and test construction before the measurement chapters.

Citation

Thanks for reading our book! We hope you find it useful in your research and teaching.

Truong, S. T., Koyejo, S. 2026. AI Measurement Science: A Science of Knowing Where AI Thrives, Where It Breaks, and How to Respond. Stanford University.

References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Borsboom, Denny. 2005. Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge, UK: Cambridge University Press.
Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.