CS321M · Stanford · Spring 2026

AI Measurement Science

Frameworks and methodologies for evaluating, benchmarking, and understanding AI systems.

Course description.

Artificial Intelligence (AI) measurement science provides frameworks and methodologies for evaluating, benchmarking, and understanding AI systems. As AI systems become more powerful and are deployed in high-stakes domains, rigorous measurement has become increasingly important. Current measurement approaches are often ad hoc, lack theoretical grounding, and fail to connect to real-world use cases. This has led to a measurement crisis characterized by benchmark saturation, inconsistent evaluation methodologies, and difficulty in making valid claims about AI capabilities. This course develops AI measurement science through three connected themes:

Theme 1

Measurement as Predictive Modeling

Probabilistic models of evaluation data (item-wise and pairwise response models, latent variable models), modeling benchmark response matrices, scaling laws, and sample-efficient measurement.
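
As a rough illustration of an item-wise response model, the sketch below uses the Rasch model, in which the probability of a correct response is a logistic function of model ability minus item difficulty. It is a minimal sketch rather than course material: the function names, the toy response matrix, and the parameter values are all made up for illustration.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """P(correct) for a model with ability theta on an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(responses, thetas, bs):
    """Bernoulli log-likelihood of a binary response matrix.

    responses[i][j] is 1 if model i answered item j correctly, else 0.
    """
    total = 0.0
    for i, row in enumerate(responses):
        for j, y in enumerate(row):
            p = rasch_prob(thetas[i], bs[j])
            total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Toy data (hypothetical): 2 models, 3 items.
responses = [[1, 1, 0], [1, 0, 0]]
thetas = [1.0, -0.5]     # assumed abilities
bs = [-1.0, 0.0, 1.5]    # assumed difficulties

print(rasch_prob(0.0, 0.0))  # ability equal to difficulty gives 0.5
print(log_likelihood(responses, thetas, bs))
```

In practice the abilities and difficulties are unknown and are estimated by maximizing this likelihood, or a regularized or Bayesian variant of it, over the observed response matrix.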

Theme 2

Measurement Validity and Reliability

Validity theory applied to AI evaluation (content, criterion, construct, external, and consequential validity), operationalizing constructs in AI systems, and reliability analysis including noise models and sources of measurement error.

Theme 3

Design, Governance, and Applications

Benchmark and instrument design, synthetic and adversarial evaluation, incentive-aware leaderboard design, and governance and policy considerations around AI measurement.

This is a graduate-level course. By the end of the course, students will be able to understand, implement, and critique state-of-the-art AI measurement approaches and be prepared to conduct research in these areas.

Given the rapid growth of this field, the course will consist of weekly lectures and student-led discussions of assigned papers. Graded work includes two homeworks focused on implementing and analyzing measurement approaches, three quizzes, and a final project in which students develop a novel measurement approach or analysis for an AI system or capability.

If you are a CS PhD student at Stanford, this course counts toward the breadth requirement for "Learning and Modeling".

Teaching staff.

Personal matters: cs321m-sp2526-staff@lists.stanford.edu

Sanmi Koyejo

Sang Truong

Mike Hardy

Anka Reuel

Colin Sullivan

Alyssa Unell

Nicole Chiou

Yegor Denisov-Blanch

Natalie Dullerud

Logistics.

Syllabus: Open the syllabus
Textbook: AIMS textbook
Lectures: Monday and Wednesday, 11:30 AM – 12:50 PM in CoDa B60
Assignments: 2 homeworks, 3 quizzes, 1 final project
Prerequisites: Machine Learning (e.g., CS 221, CS 229, CS 230, CS 224N); Probability & Statistics (e.g., CS 109 or equivalent); Linear Algebra & Calculus (e.g., MATH 51, CME 100); proficiency in Python

Showcase

Student projects, Spring 2026.

Final projects from this quarter: measurement-themed investigations across safety, validity, item response theory, and benchmark design. Write-ups and codebases are included where students opted in.

Measuring Safety in Autonomous Penetration Testing

Ethan Ho

Abstract

Existing cybersecurity benchmarks often measure whether language-model agents can solve technical tasks, but professional penetration testing also requires safe operation in business-critical environments. This project presents a pilot benchmark for measuring safe cybersecurity capability: making useful security progress while preserving availability and managing operational risk. We evaluate four models on two sandboxed scenarios, a SaaS web application and a simulated water-treatment industrial-control system, using a Kali-based Codex scaffold with a fixed time limit. Finding quality is scored by matching agent reports against severity-weighted gold findings. Safety is measured separately with two deterministic dimensions: availability and risk management, where risk management captures scenario-specific harms such as sensitive-data exposure, credential-sensitive views, and unsafe ICS state changes. The pilot run shows that models which find more vulnerabilities can also create larger operational safety costs. A Generalizability Theory analysis suggests meaningful model-level variation, but also a substantial model-by-scenario interaction, so the results should not be read as a definitive leaderboard. The main contribution is a measurement framework and pilot analysis showing how safe cybersecurity capability can be separated into finding quality, availability, and risk management, and why a larger scenario suite is needed for stronger claims.

Code

How Many Items Do You Really Need? IRT-Based Redundancy Analysis of LLM Benchmarks

Dinesh Katupputhur Ramprasath

Abstract

LLM benchmarks contain hundreds to thousands of evaluation items, yet it remains unclear how many are actually necessary to produce reliable model rankings. We apply Item Response Theory (IRT) to six LLM benchmarks from the Fantastic-Bugs dataset, fitting Rasch, Two-Parameter Logistic (2PL), and Logistic Factor Model (LogisticFM) models to binary response matrices containing 42–91 language models and 500–3,316 items. We systematically compare five item selection strategies — random, stratified, max-information at θ̂=0 (tinyBenchmarks), integrated information, and difficulty-coverage — and evaluate their ability to recover full-benchmark rankings using Spearman ρ, Kendall τ, and top-5 agreement. Our key finding is that IRT-guided item selection recovers model rankings using as few as 5% of benchmark items (ρ > 0.95) for well-designed benchmarks (MMLU, MedQA), while benchmarks with poor psychometric structure (BBQ) are irreducible. We characterize what makes a benchmark reducible — discrimination concentration, difficulty spread, and unidimensionality — and provide actionable design guidelines for benchmark developers.

Code
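
For readers unfamiliar with the max-information strategy named in the abstract above, here is a minimal sketch, not the project's code. It assumes a 2PL model, where an item with discrimination a and difficulty b has Fisher information a² · p · (1 − p) at ability θ, and it measures rank agreement with a tie-free Spearman ρ. The item parameters and scores are made-up toy values.

```python
import math

def item_information(a, b, theta=0.0):
    """Fisher information of a 2PL item (discrimination a, difficulty b) at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_max_information(items, k, theta=0.0):
    """Indices of the k items with the highest information at theta."""
    ranked = sorted(
        range(len(items)),
        key=lambda j: item_information(items[j][0], items[j][1], theta),
        reverse=True,
    )
    return sorted(ranked[:k])

def spearman_rho(x, y):
    """Spearman rank correlation, assuming no ties in x or y."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical (a, b) parameters for four items.
items = [(0.5, 0.0), (2.0, 0.1), (1.5, -0.2), (1.0, 3.0)]
print(select_max_information(items, k=2))

# Hypothetical accuracies of four models on the full benchmark and on a subset.
full_scores = [0.9, 0.7, 0.4, 0.2]
subset_scores = [0.8, 0.85, 0.3, 0.1]
print(spearman_rho(full_scores, subset_scores))
```

Selecting at a single ability point favors highly discriminating items with difficulty near that point, which is one reason the abstract also compares strategies that spread coverage over the ability range.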

Evaluating LLMs as Poker Players: An Item-Response Theory and Q-Matrix Analysis of PokerBench

Abhinav Sattiraju

Abstract

PokerBench, a recent static benchmark for evaluating large language models on no-limit Texas Hold'em poker decision-making, reports a single accuracy score per model and implicitly treats "poker skill" as a unidimensional construct. We examine this assumption using standard item-response and factor-analysis techniques. We evaluate a panel of seven LLMs spanning capability tiers, model families, and reasoning vs. base-instruct variants on PokerBench, and fit the resulting binary response matrix with unidimensional IRT and a multidimensional K-factor model. Exploratory factor analysis does not recover stable extra dimensions: model selection prefers K=1. We then turn to a confirmatory Q-matrix family that imposes externally-specified item structure (GTO action class, item phase, a rule-based skill heuristic, and an LLM-labeled skill taxonomy), and find that every Q-matrix variant explains held-out response cells substantially better than the single-ability baseline; a joint Action×Skill partition is the strongest. Permutation tests support the claim that the labels carry signal beyond what random partitions of the same shape achieve, and an LLM-free heuristic Q-matrix fits at least as well as the LLM-labeled one. The models rank differently across these axes, which means a single PokerBench accuracy score hides meaningful differences in how they succeed and fail.

Code
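
The idea of a confirmatory Q-matrix in the abstract above can be sketched as follows. This shows one common parameterization and is not necessarily the one the project uses: each item's row in a binary matrix Q flags which skills it loads on, and only those skills contribute to the logit. The skill labels, abilities, and difficulties below are hypothetical.

```python
import math

def qmatrix_prob(theta_i, q_j, b_j):
    """P(correct) when only the skills flagged in q_j contribute to the logit."""
    logit = sum(q * t for q, t in zip(q_j, theta_i)) - b_j
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical setup: two skills (say, "fold decisions" and "bet sizing").
Q = [
    [1, 0],  # item 0 loads only on skill 0
    [0, 1],  # item 1 loads only on skill 1
    [1, 1],  # item 2 loads on both skills
]
theta = [1.2, -0.4]   # assumed per-skill abilities of one model
b = [0.0, 0.0, 0.8]   # assumed item difficulties

probs = [qmatrix_prob(theta, Q[j], b[j]) for j in range(len(Q))]
print([round(p, 3) for p in probs])
```

With a single column of ones, Q reduces the model to a unidimensional Rasch-style baseline, so comparing held-out fit across different Q matrices tests whether the imposed item structure carries signal.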

Dimensionality and Measurement Precision in HLE's Multiple-Choice Subset

Mayank Sharma, Savira Nadela, Tyler Matteson

Abstract

Humanity's Last Exam (HLE) has emerged as a prominent benchmark for evaluating advanced language models, yet widespread adoption has outpaced systematic evaluation of its measurement properties. Most benchmark studies, including HLE, report aggregate accuracy and domain-specific subscores without testing whether the reported domains correspond to empirically distinct latent factors, and without characterizing where on the ability continuum measurement precision actually concentrates. This project evaluates a panel of frontier language models on the text-only multiple-choice subset of HLE (J = 428 items) and applies psychometric methods — McDonald's ω_h, principal component analysis of item-response profiles, residual correlation analysis, domain-level ability comparisons, and the test information function decomposed by subject domain — to ask two questions: (a) Does HLE's eight-domain structure reflect distinct latent constructs, or do the domains collapse into a general reasoning factor? and (b) Where along the ability continuum does HLE concentrate measurement precision, and which domains contribute most to discrimination among frontier models?

Code
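
A test information function decomposed by domain, as mentioned in the abstract above, can be illustrated with a short sketch. It assumes a 2PL model, which ignores the guessing floor of multiple-choice items, and uses made-up domain labels and item parameters, so it is not a reproduction of the project's analysis. Test information at ability θ is the sum of item information, and the standard error of the ability estimate is approximately 1 / sqrt(information).

```python
import math
from collections import defaultdict

def item_information(a, b, theta):
    """Fisher information of a 2PL item (discrimination a, difficulty b) at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def information_by_domain(items, theta):
    """Sum item information at theta within each domain.

    items is a list of (domain, a, b) tuples.
    """
    totals = defaultdict(float)
    for domain, a, b in items:
        totals[domain] += item_information(a, b, theta)
    return dict(totals)

# Hypothetical items: two hard math items and one easier biology item.
items = [("math", 1.5, 2.0), ("math", 1.0, 1.0), ("biology", 0.8, -0.5)]

for theta in (-1.0, 0.0, 1.0, 2.0):
    by_domain = information_by_domain(items, theta)
    total = sum(by_domain.values())
    se = 1.0 / math.sqrt(total)  # approximate standard error of the ability estimate
    print(theta, {d: round(v, 3) for d, v in by_domain.items()}, round(se, 3))
```

Plotting each domain's curve against θ shows where on the ability continuum that domain discriminates between models and which domains dominate at the ability levels of frontier models.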