Measurement foundations
Define what is being measured, how to represent it, and where metrics diverge from the construct.
About AIMS
A Stanford community building the methods, courses, and software to close the gap between benchmark scores and the constructs they claim to measure.
The measurement problem
A model scores 87% on a benchmark. Compared to what baseline? With what sampling error? Measuring which construct? Psychometrics, metrology, and educational testing have answers. AI evaluation can draw on that work.
AIMS defines constructs before collecting data, reports uncertainty with every score, and designs benchmarks for durability. The textbook (used in CS321M) covers the theory. The software and competitions put it into practice.
The framework
Each pillar addresses a core question in measuring AI systems rigorously.
Define what is being measured, how to represent it, and where metrics diverge from the construct.
Protocols that are comparable, decision-relevant, and built to last.
Uncertainty, sampling, and distribution shift as core parts of every evaluation.
Shared resources for teaching, competition, software, and discussion that reinforce each other.
Research, course updates, and releases.
"Measurement is the foundation of scientific progress."
Lord Kelvin