Stanford CS321M
Most AI benchmarks are already saturated, partly memorized, or disconnected from the tasks they claim to measure. CS321M teaches how to build evaluations that hold up. Sign up to get the textbook, syllabus, and course materials as they are released.
Included
A direct link to the current textbook, with a PDF download when one is available.
A growing archive of lecture decks, posted throughout the quarter.
The live course arc, readings, events, and learning outcomes in one place.
Why this class exists
CS321M is built around the gap between benchmark scores and valid measurement. The course focuses on what breaks in practice, how to design evaluations that hold up, and how to reason about evidence rather than leaderboard theater.
Many popular AI benchmarks are already saturated, partly memorized, or weakly tied to real use.
Model choice, budget decisions, and policy arguments still lean heavily on fragile evaluation signals.
CS321M is about designing evaluations that remain interpretable, reliable, and decision-relevant.
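One concrete reason evaluation signals can be fragile: on a small benchmark, the gap between two models' scores often sits inside statistical noise. As a minimal sketch (the model names, scores, and benchmark size below are hypothetical, not from any real leaderboard), a Wilson score interval makes that uncertainty visible:

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy estimated from n items."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical comparison: two models on a 200-item benchmark.
lo_a, hi_a = wilson_interval(142, 200)  # "model A": 71.0% accuracy
lo_b, hi_b = wilson_interval(148, 200)  # "model B": 74.0% accuracy
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
print("intervals overlap:", lo_b < hi_a)
```

Here a 3-point lead disappears inside overlapping intervals, so the benchmark alone cannot support a claim that one model is better. Interpreting scores as estimates with uncertainty, rather than rankings, is one piece of what the course means by decision-relevant evaluation.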