AI claims outpace evidence
Benchmark scores are hard to interpret without explicit constructs, validated instruments, and uncertainty reporting.
AI benchmarks don't generalize, and they're easy to game. We build what comes next.
Why it matters
The methods used to evaluate AI systems often lack the rigor required for sound scientific claims. AIMS ties every claim to explicit constructs, measurable procedures, and clear uncertainty estimates.
Deployment, regulation, and funding all rely on evaluation results.
No unified community, curriculum, or software stack exists yet.
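To make "uncertainty reporting" concrete, here is a minimal sketch of one common approach: a percentile bootstrap confidence interval around a benchmark accuracy. It is an illustration only, not AIMS's prescribed procedure; the function name `bootstrap_accuracy_ci`, its parameters, and the example data are hypothetical.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over per-item 0/1 outcomes.

    Hypothetical helper for illustration; not part of any AIMS package.
    """
    if not correct:
        raise ValueError("need at least one item outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample benchmark items with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lo, hi


# Made-up example: 200 benchmark items, 150 answered correctly.
outcomes = [1] * 150 + [0] * 50
point, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.3f}, 95% CI ~ [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside the point estimate shows how much of a gap between two systems could come from item sampling alone. This sketch treats items as independent and ignores other sources of variance, such as prompt wording or decoding randomness.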
What AIMS builds
Education, open software and data, a public competition, and a workshop — each reinforcing the others.
Education
A textbook for shared concepts and notation, and CS321M — the Stanford course that brings them into a classroom.
Software & Data
torch_measure for psychometric modeling, measurement-db for curated evaluation data, and interactive apps — Benchmark Caliper and AI Evarium — for probing benchmarks and the evaluation ecosystem.
Competition
Common tasks, comparable baselines, and sharper empirical feedback.
Workshop
The AI Measurement Science Workshop at COLM 2026 — invited talks, contributed papers, and a competition track.
Community
AIMS holds itself to scientific standards and runs with the energy of an open-source community.
Real-time discussion about measurement science, course material, and open problems.
Updates on course activity, new resources, and software milestones.
Workshops, reading groups, and community calls. Sign up for the newsletter to stay informed.