
About AIMS

AI evaluation lacks the measurement science that other fields rely on.

A Stanford community building the methods, courses, and software to close the gap.

The measurement problem

Better evaluation starts with better measurement.

A model scores 87% on a benchmark. Compared to what baseline? With what sampling error? Measuring which construct? Psychometrics, metrology, and educational testing have answers. AI evaluation can draw on that work.
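As a minimal sketch of what the sampling-error question means in practice (illustrative numbers and function name, not AIMS software), a Wilson score interval shows how much an 87% headline score depends on the number of benchmark items behind it:

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass rate on `total` items."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

# The same 87% looks very different on 100 items than on 10,000:
for n in (100, 10_000):
    lo, hi = wilson_interval(round(0.87 * n), n)
    print(f"n={n:>6}: 87% with 95% CI [{lo:.3f}, {hi:.3f}]")
```

On 100 items the interval spans roughly 79% to 92%, wide enough to swallow most leaderboard gaps; on 10,000 items it tightens to well under a point.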

AIMS defines constructs before collecting data, reports uncertainty with every score, and designs benchmarks for durability. The textbook (used in CS321M) covers the theory. The software and competitions put it into practice.

The framework

Four disciplines, one integrated science.

Each pillar addresses a core question in measuring AI systems well.

Measurement foundations

Define what is being measured, how to represent it, and where metrics diverge from the construct.

Evaluation science

Protocols that are comparable, decision-relevant, and built to last.

Statistical discipline

Uncertainty, sampling, and distribution shift as core parts of every evaluation.

Community infrastructure

Shared resources for teaching, competition, software, and discussion that reinforce each other.

Follow the work.

Research, course updates, and releases.


Measurement is the foundation of scientific progress.

Attributed to Lord Kelvin