CS321M materials library
Everything for CS321M in one place. Find your materials by lecture or browse the thematic reading list.
The course textbook covers measurement theory, probabilistic models, reliability, validity, and evaluation design. Chapter references appear in the lecture cards below.
Learning outcomes
Model AI evaluation data with Rasch, IRT, Bradley-Terry, factor-analysis, and scaling-law-style methods.
Assess noise, reliability, and validity when evaluation results are used to compare models or justify decisions.
Design evaluation protocols, task construction, and sampling strategies that hold up in deployment and governance contexts.
Interpret modeling assumptions, limitations, and failure modes instead of taking benchmark outputs at face value.
Use diagnostics for uncertainty, noise analysis, and reliability assessment on real evaluation datasets.
Understand how leaderboards, benchmarks, and governance incentives shape what model builders optimize for.
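As a taste of the first outcome, here is a minimal sketch of fitting Bradley-Terry strengths to pairwise model-comparison counts using the classic MM (Zermelo) iteration. The data and function names are illustrative, not course materials:

```python
def bradley_terry(wins, n_iters=200):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i][j] = number of times model i beat model j in head-to-head
    comparisons. Uses the standard MM (Zermelo/Ford) fixed-point update.
    """
    n = len(wins)
    p = [1.0] * n  # strength parameters, initialized uniformly
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            # Total wins for model i
            num = sum(wins[i][j] for j in range(n) if j != i)
            # Weighted count of comparisons involving model i
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        # Normalize so strengths sum to n (fixes the scale indeterminacy)
        total = sum(new_p)
        p = [x * n / total for x in new_p]
    return p

# Three hypothetical models; model 0 wins most of its comparisons.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
```

The recovered strengths preserve the head-to-head ordering (model 0 strongest, model 2 weakest); lectures on pairwise-comparison models develop the full likelihood treatment and its connection to leaderboard scoring.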