AI Measurement Science

A Science of Knowing Where AI Thrives, Where It Breaks, and How to Respond

Authors

Sang T. Truong

Sanmi Koyejo

Updated

March 22, 2026

Preface

Every claim about AI progress—better reasoning, broader capability, safer behavior—rests on some act of measurement. But how good are our measurements, really? Benchmarks multiply faster than anyone can understand them, leaderboards rank models on scores without scales, and a system’s average accuracy tells us little about why it succeeds or where it breaks. We have plenty of numbers. What we lack is a science behind them.

This book introduces the foundations and practical tools of measurement science for AI evaluation. Instead of treating evaluation as a purely engineering task—collect a dataset, compute a metric, publish a number—we present it as an inference problem, drawing on ideas from psychometrics, statistics, and the broader science of measurement. Throughout, we emphasize not only the methods themselves but also their assumptions, limitations, and the conditions under which they can and cannot be applied—understanding when an evaluation fails is as important as understanding when it works. By the end of this book, readers will be equipped with the key concepts and tools needed to design evaluation systems that are reliable, valid, and interpretable.

The book is intended for researchers, practitioners, and students who work on or care about AI evaluation. We assume some background in probability, statistics, and machine learning, but provide sufficient context and references for readers to follow the main ideas. The book also includes interactive code examples and datasets. The science of AI measurement is a fast-moving area with many open challenges, and we hope this book will inspire readers to further explore and advance it.

With this book, we hope to help the field move from ad hoc benchmarking toward principled measurement, so that our claims about AI systems rest on solid scientific ground.

Stanford, 2025, Truong & Koyejo

Structure of this book

The book opens with a survey of the evaluation data landscape, followed by three parts that introduce predictive measurement models, develop reliability and validity analysis, and discuss design, governance, and applications.

1  The Data Landscape surveys the AI evaluation data landscape: the response matrix abstraction, a taxonomy of 70+ benchmarks spanning knowledge, coding, agentic, multilingual, cultural, and preference evaluation, and the practical data quality issues that arise in real-world evaluation. It introduces the torch_measure toolkit for loading and analyzing benchmark data.
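To make the response matrix abstraction concrete, here is a minimal sketch; the model names, item names, and entries are made up for illustration, and this does not reflect the torch_measure API:

```python
import numpy as np
import pandas as pd

# A response matrix: rows are models, columns are benchmark items,
# entries are 1 (correct) or 0 (incorrect). All values are hypothetical.
models = ["model_a", "model_b", "model_c"]
items = ["item_1", "item_2", "item_3", "item_4"]
Y = pd.DataFrame(
    np.array([[1, 1, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 1, 1]]),
    index=models, columns=items)

# Row means give per-model accuracy; one minus the column means gives an
# empirical difficulty for each item.
accuracy = Y.mean(axis=1)
difficulty = 1 - Y.mean(axis=0)
print(accuracy)
print(difficulty)
```

Much of Part I can be read as asking what more we can infer from Y than these marginal averages.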

Part I: Measurement as Predictive Modeling

2  Foundations of Measurement lays the mathematical groundwork for the rest of the book. It covers probabilistic models for AI evaluation data—including Item Response Theory (Rasch, 2PL, 3PL), Bradley-Terry models for pairwise comparisons, and factor models for multidimensional ability. A central theme is the relationship between model assumptions and what they allow us to infer: sufficiency, specific objectivity, and the tradeoffs between model complexity and interpretability. The chapter also introduces prediction-powered evaluation—an amortized factor model that maps external features to latent parameters for cold-start prediction of unseen models and items.
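As a preview of two of the models named above, the two-parameter logistic (2PL) IRT model and the Bradley-Terry model take their standard forms:

```latex
% 2PL: probability that a model with ability \theta answers item j
% correctly, with discrimination a_j and difficulty b_j
P(y_j = 1 \mid \theta) = \frac{1}{1 + \exp\{-a_j(\theta - b_j)\}}

% Bradley-Terry: probability that model i beats model j in a pairwise
% comparison, with latent strengths \beta_i and \beta_j
P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
```

The Rasch model is the special case of the 2PL with all discriminations fixed at one, and the 3PL adds a guessing parameter.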

3  Learning studies how to estimate latent parameters from evaluation data. It covers maximum likelihood estimation, gradient-based optimization, the EM algorithm, Bayesian inference, regularization, and model selection. It also develops generalization experiments with various masking schemes to evaluate the robustness and transferability of learned factor models.

4  Efficient Measurement develops the statistical foundations for efficient evaluation design. It introduces Fisher information for item selection, Computerized Adaptive Testing (CAT), D-optimal design for constructing maximally informative item pools, and efficient paired-comparison schedules for model arenas.
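Under the Rasch model, the Fisher information an item carries about an ability θ is p(θ)(1 − p(θ)), which is maximized when the item's difficulty matches the ability. A minimal sketch of information-based item selection, with an illustrative item pool:

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of an item with difficulty b at ability theta."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

# Illustrative item difficulties; select the most informative item for
# the current ability estimate, as a CAT procedure would at each step.
difficulties = [-2.0, -0.5, 0.4, 1.5, 3.0]
theta_hat = 0.5
best = max(range(len(difficulties)),
           key=lambda j: item_information(theta_hat, difficulties[j]))
print(best)  # index of the item whose difficulty is nearest theta_hat
```

Information peaks at 0.25 exactly when θ equals the item difficulty, which is why adaptive testing keeps steering toward items matched to the current ability estimate.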

Part II: Measurement Reliability and Validity

5  Reliability addresses the question of signal versus noise in AI measurement. It covers Classical Test Theory, Generalizability Theory for decomposing multiple sources of error, and practical tools for assessing and improving the reliability of evaluation protocols—including LLM-as-a-judge designs and agentic measurement.
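Classical Test Theory's best-known reliability coefficient, Cronbach's alpha, can be computed directly from a response matrix. A sketch on synthetic data (the data-generating process here is invented for illustration):

```python
import numpy as np

def cronbach_alpha(Y):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    Y = np.asarray(Y, dtype=float)
    k = Y.shape[1]
    item_vars = Y.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = Y.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

# Synthetic binary responses: 200 models, 5 items, all driven by a
# shared latent ability plus item-level noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
Y = (ability[:, None] + rng.normal(size=(200, 5)) > 0).astype(int)
print(cronbach_alpha(Y))
```

Because the items share a common ability signal, alpha lands well above zero; with independent items it would hover near zero.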

6  Validity asks whether our evaluations measure what they claim to measure. It covers content, criterion, and construct validity; diagnostic tools for detecting benchmark contamination, differential item functioning, and construct-irrelevant variance; and principled approaches to instrument construction and revision, including the use of synthetic data.

7  Causality and Distribution Shift examines the causal foundations of AI measurement. It covers structural causal models for evaluation data, distribution shift (covariate, label, and concept shift) and when benchmark results generalize across contexts, interventional and counterfactual reasoning for diagnosing construct-irrelevant variance, and methods for robust prediction under shift including conformal inference and data attribution.

Part III: Design, Governance, and Applications

8  Information and Mechanism Design considers the strategic and economic dimensions of evaluation design. It introduces decision-making under ambiguity (Maxmin Expected Utility), Bayesian persuasion for optimal information disclosure, robust mechanism design, and the strategic considerations that arise when benchmark results influence development incentives.

9  Red-Teaming and Adversarial Evaluation applies the measurement framework to adversarial evaluation: red-teaming as structured measurement, attack success rate as an estimand (and when comparisons are valid), adversarial robustness as a multidimensional latent trait, synthetic data for evaluation at scale, and prediction-powered inference for combining human and automated judgments.
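Treating attack success rate as an estimand means reporting it with uncertainty rather than as a bare fraction. A sketch using the Wilson score interval for a binomial proportion (the counts are made up):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative: 37 successful attacks out of 200 attempts.
lo, hi = wilson_interval(37, 200)
print(f"ASR = {37/200:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Comparing two systems' ASRs is only meaningful when their intervals are computed over comparable attack distributions, which is exactly the estimand question the chapter takes up.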

Conclusion

10  Conclusion synthesizes the book’s themes, distills lessons from the field, identifies six open challenges (beyond binary responses, multidimensional ability, temporal dynamics, agentic evaluation, scalable oversight, and fairness in evaluation), presents ten capstone projects for the CS321M course, and offers a practitioner’s checklist for designing rigorous AI evaluations.

How to engage with this book

2  Foundations of Measurement underpins the entire book, so it is part of every reading pathway.

For practitioners and applied AI researchers, we recommend 2  Foundations of Measurement, 3  Learning, 4  Efficient Measurement, and 8  Information and Mechanism Design. This sequence covers the core modeling, estimation, and design methods needed to build and improve evaluation systems.

For researchers interested in evaluation quality, we recommend 2  Foundations of Measurement, 5  Reliability, 6  Validity, and 7  Causality and Distribution Shift. This pathway focuses on when and why evaluations fail, and how to diagnose and address those failures.

For those interested in the societal and governance dimensions of AI evaluation, we recommend 2  Foundations of Measurement, 8  Information and Mechanism Design, and 9  Red-Teaming and Adversarial Evaluation.

Prior knowledge

The book assumes knowledge of the fundamentals of statistics, linear algebra, and machine learning. Code examples are written in Python using PyTorch, scikit-learn, and pandas, so familiarity with Python is valuable for readers who wish to engage with the interactive examples.

For readers seeking to strengthen their prerequisites or deepen their understanding of the fields this book draws from, we recommend the following companion texts:

  • Murphy (2022) — A modern, comprehensive treatment of probabilistic modeling, Bayesian inference, and statistical learning. Excellent preparation for the latent variable models and estimation methods in Chapters 2–3.
  • Bishop (2006) — Thorough coverage of factor analysis, EM algorithms, and Bayesian methods, all of which appear throughout this book.
  • Borsboom (2005) — A philosophically rigorous account of what it means to measure a psychological attribute. Essential reading for the validity and construct modeling discussions in Chapters 5–6.
  • Lord and Novick (1968) — The classical statistical theory of mental testing that underlies much of modern psychometrics. For readers wanting depth in reliability, item analysis, and test construction before the measurement chapters.

Citation

Thanks for reading our book! We hope you find it useful in your research and teaching.

Truong, S. T., Koyejo, S. 2026. AI Measurement Science: A Science of Knowing Where AI Thrives, Where It Breaks, and How to Respond. Stanford University.

References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Borsboom, Denny. 2005. Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge, UK: Cambridge University Press.
Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Murphy, Kevin P. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.