Stanford CS321M
Most AI benchmarks are already saturated, partly memorized, or disconnected from the tasks they claim to measure. CS321M teaches how to build evaluations that hold up. Sign up to get the textbook, syllabus, and course materials as they are released.
Included
A direct link to the current textbook, with a PDF download when one is available.
A growing archive of lecture decks, posted throughout the quarter.
The live course arc, readings, events, and learning outcomes in one place.
Why this class exists
CS321M is built around the gap between benchmark scores and valid measurement. The course focuses on what breaks in practice, how to design evaluations that hold up, and how to reason about evidence rather than leaderboard theater.
Many popular AI benchmarks are already saturated, partly memorized, or weakly tied to real use.
Model choice, budget decisions, and policy arguments still lean heavily on fragile evaluation signals.
CS321M is about designing evaluations that remain interpretable, reliable, and decision-relevant.
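One concrete reason evaluation signals can be fragile: on a small benchmark, the gap between two models' scores often sits inside statistical noise. As a minimal sketch (the model names, scores, and benchmark size below are hypothetical, not from any real leaderboard), a Wilson score interval makes that uncertainty visible:

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy estimated from n items."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical comparison: two models on a 200-item benchmark.
lo_a, hi_a = wilson_interval(142, 200)  # "model A": 71.0% accuracy
lo_b, hi_b = wilson_interval(148, 200)  # "model B": 74.0% accuracy
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
print("intervals overlap:", lo_b < hi_a)
```

Here a 3-point lead disappears inside overlapping intervals, so the benchmark alone cannot support a claim that one model is better. Interpreting scores as estimates with uncertainty, rather than rankings, is one piece of what the course means by decision-relevant evaluation.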