AIMS Workshop · 2026
Treating evaluation as a first-class scientific problem for frontier AI.
We are excited to announce the inaugural AI Measurement Science Workshop, co-located with COLM 2026 at the Hilton Union Square in San Francisco. The workshop is a one-day, in-person event on October 9, 2026, bringing together researchers from machine learning, statistics, psychometrics, economics, and policy to develop a unified perspective on measurement for frontier AI systems.
As AI systems become embedded in real-world workflows, existing evaluation paradigms face fundamental limitations. Contemporary benchmarks typically assume i.i.d. samples, stationary concepts, and passive models. Modern systems increasingly violate these assumptions: they interact with environments, adapt to feedback, and are optimized against known evaluation signals.
This workshop frames evaluation as a problem of AI measurement science. We focus on three tightly coupled challenges (measurement of interactive systems, measurement under strategic optimization, and measurement under non-stationarity) and aim to define common abstractions and open research directions for this emerging area.
Submissions from educational testing, psychometrics, and related fields are strongly encouraged.
We invite work across theory, methodology, and systems, including but not limited to:
Topic 1: Interactive Measurement
Evaluation of policies, long-horizon behavior, tool-using agents, and statistical validity when observations are non-i.i.d. and path-dependent.
Topic 2
Evaluation as a game between learner and evaluator; robustness to gaming and contamination; incentive-compatible, private, randomized, or adaptive leaderboards.
Topic 3
Covariate, label, and construct drift; adaptive test design, recalibration, and longitudinal monitoring of deployed systems.
We also invite write-ups from participants of the Predictive AI Evaluation Challenge. Papers should describe each team's approach to predicting model responses from sparse observations, including methods, results, and lessons learned. Submissions follow the same 4–8 page COLM format as the research track and are reviewed on a later timeline so that final competition results can be included. A Best Competition Paper award will be presented alongside the workshop's research awards.
Competition track deadline: August 15, 2026
Short, focused contributions in COLM format. Review is double-blind.
Calendar
All deadlines are in Anywhere on Earth (AoE) time.
From academia, industry, and policy.

Daniel Kang
UIUC
Assistant professor working on agentic benchmarks, AI security evaluation, and ML deployment systems.

Serena Wang
Harvard / UBC
Postdoctoral researcher on long-term AI impacts, counterfactual metrics for social welfare, and robust distillation.

Jacob Steinhardt
UC Berkeley · Transluce
Professor and CEO of Transluce; LLM safety and reward alignment.

Elham Tabassi
Brookings
Led NIST's AI Risk Management Framework and the U.S. AI Safety Institute; named one of TIME's 100 most influential people in AI.

Coz Ududec
UK AI Safety Institute
Contributes to AISI's frontier model evaluations; former Chief Scientist at Invenia Labs.
One-day, in-person workshop. 8:00–17:00.
Opening remarks
Sanmi Koyejo
Invited talk 1 — Evaluation under Strategic Adaptation
Daniel Kang
Competition presentation — The Predictive AI Evaluation Challenge
Sang Truong
Invited talk 2 — Evaluation Design under Distribution Shift
Serena Wang
Coffee break
Breakout Session 1 — small-group discussion of key challenges
Poster session 1
Mentorship Session & Lightning Debates
Led by Berivan Isik
Invited talk 3 — Measuring Interactive Systems
Jacob Steinhardt
Networking lunch
Invited talk 4 — AI Standards and the Risk Management Framework
Elham Tabassi
Invited talk 5 — Frontier Evaluation under Institutional Constraints
Coz Ududec
Contributed talks (2 × 7-minute talks selected from submissions)
Breakout Session 2
Led by Luke Guerdan
Coffee break
Poster session 2
Panel — "Static Benchmarks in a Dynamic World: Is Rigorous Measurement Even Possible?"
Moderated by Sang Truong
Closing remarks & Best Paper Award
Olawale Salaudeen

Stanford University

Stanford University

Schmidt Sciences / MIT

Google DeepMind

Carnegie Mellon University

Microsoft Research Asia
Questions about submissions, mentorship, or partnership? Reach out to the organizing committee.