AI Measurement Science

AI benchmarks don't generalize, and they're easy to game. We build what comes next.

Why it matters

Rigorous measurement is the foundation of trustworthy AI.

The methods used to evaluate AI systems often lack the rigor required for sound scientific claims. AIMS ties every claim to explicit constructs, measurable procedures, and clear uncertainty estimates.

01

AI claims outpace evidence

Benchmark scores are hard to interpret without explicit constructs, validated instruments, and uncertainty reporting.

02

Decisions depend on measurement quality

Deployment, regulation, and funding decisions all rely on evaluation results, so measurement errors carry through to each of them.

03

The field lacks shared infrastructure

No unified community, curriculum, or software stack exists yet.

What AIMS builds

Shared infrastructure for AI measurement.

Education, open software and data, a public competition, and a workshop — each reinforcing the others.

Education

Learn the field, end to end.

A textbook for shared concepts and notation, and CS321M — the Stanford course that brings them into a classroom.

Software & Data

Tools and data for careful measurement.

torch_measure for psychometric modeling, measurement-db for curated evaluation data, and interactive apps — Benchmark Caliper and AI Evarium — for probing benchmarks and the evaluation ecosystem.

Competition

Stress-test methods in public.

Common tasks, comparable baselines, and sharper empirical feedback.

Workshop

Convene the community.

The AI Measurement Science Workshop at COLM 2026 — invited talks, contributed papers, and a competition track.

Community

Open by design.

AIMS holds itself to scientific standards and runs with the energy of an open-source community.

Discord

Real-time discussion about measurement science, course material, and open problems.

Newsletter

Updates on course activity, new resources, and software milestones.

Events

Workshops, reading groups, and community calls. Sign up for the newsletter to stay informed.