AIMS Workshop · 2026

AI Measurement Science: Toward Rigorous AI Evaluation

Treating evaluation as a first-class scientific problem for frontier AI.

Venue: Hilton San Francisco Union Square, USA
Date: October 9, 2026 · Co-located with COLM 2026
Format: One-day, in-person · 8:00–17:00

01 Call for Papers

We are excited to announce the inaugural AI Measurement Science Workshop, co-located with COLM 2026 at the Hilton San Francisco Union Square. The workshop is a one-day, in-person event on October 9, 2026, bringing together researchers from machine learning, statistics, psychometrics, economics, and policy to develop a unified perspective on measurement for frontier AI systems.

As AI systems become increasingly embedded in real-world workflows, existing evaluation paradigms face fundamental limitations. Contemporary benchmarks typically assume i.i.d. samples, stationary concepts, and passive models. These assumptions are increasingly violated in modern systems, which interact with environments, adapt to feedback, and are optimized against known evaluation signals.

This workshop frames evaluation as a problem of AI measurement science. We focus on three tightly coupled challenges (measurement of interactive systems, measurement under strategic optimization, and measurement under non-stationarity) and aim to define common abstractions and open research directions for this emerging area.

Submissions from educational testing, psychometrics, and related fields are strongly encouraged.

Two submission tracks.

Research Track

We invite work across theory, methodology, and systems, including but not limited to the three tightly coupled challenges below.

Interactive Measurement

Evaluation of policies, long-horizon behavior, tool-using agents, and statistical validity when observations are non-i.i.d. and path-dependent.

Measurement under Strategic Optimization

Evaluation as a game between learner and evaluator; robustness to gaming and contamination; incentive-compatible, private, randomized, or adaptive leaderboards.

Measurement under Non-Stationarity

Covariate, label, and construct drift; adaptive test design, recalibration, and longitudinal monitoring of deployed systems.

Competition Track

We also invite write-ups from participants of the Predictive AI Evaluation Challenge. Papers should describe each team's approach to predicting model responses from sparse observations, with methods, results, and lessons learned. Submissions follow the same 4–8 page COLM format as the research track and are reviewed on a later timeline so that final competition results can be included. The Best Competition Paper award will be presented alongside the workshop's research awards.

Competition track deadline: August 15, 2026


02 Submit

Submission guidelines.

Short, focused contributions in COLM format. Review is double-blind.

Papers of 4–8 pages in COLM format, with unlimited references.
Submitted via OpenReview; double-blind review, with at least three reviews per paper.
Accepted papers are non-archival.
Four contributed talks selected from accepted submissions.

Important dates.

All deadlines are in Anywhere on Earth (AoE) time.

Research track submission deadline: June 23, 2026
Research track notification: July 24, 2026
Competition track submission deadline: August 15, 2026
Camera-ready due: September 21, 2026
Workshop: October 9, 2026 (COLM 2026 workshops day)

Submissions are now open on OpenReview. The research track deadline is June 23, 2026.

03 Speakers

Invited speakers from academia, industry, and policy.

Daniel Kang

UIUC

Assistant professor working on agentic benchmarks, AI security evaluation, and ML deployment systems.

Serena Wang

Harvard / UBC

Postdoctoral researcher on long-term AI impacts, counterfactual metrics for social welfare, and robust distillation.

Jacob Steinhardt

UC Berkeley · Transluce

Professor at UC Berkeley and CEO of Transluce, working on LLM safety and reward alignment.

Elham Tabassi

Brookings

Led NIST's AI Risk Management Framework and served as chief technologist of the U.S. AI Safety Institute; named one of TIME's 100 most influential people in AI.

Coz Ududec

AI Security Institute

Contributes to AISI's frontier model evaluations; former Chief Scientist at Invenia Labs.

04 Schedule

Tentative program for the one-day workshop. October 9, 2026 · 8:00–17:00.

Morning

8:00 – 8:05 · Opening remarks · Sanmi Koyejo
8:05 – 8:25 · Invited talk 1: Evaluation under Strategic Adaptation · Daniel Kang
8:25 – 8:45 · Competition presentation: The Predictive AI Evaluation Challenge · Sang Truong
8:45 – 9:05 · Invited talk 2: Evaluation Design under Distribution Shift · Serena Wang
9:05 – 9:20 · Coffee break
9:20 – 10:20 · Breakout Session 1: small-group discussion of key challenges
10:20 – 11:00 · Poster session 1
11:00 – 12:00 · Mentorship Session & Lightning Debates · Led by Berivan Isik
12:00 – 12:30 · Invited talk 3: Measuring Interactive Systems · Jacob Steinhardt
12:30 – 13:30 · Networking lunch

Afternoon

13:30 – 13:50 · Invited talk 4: AI Standards and the Risk Management Framework · Elham Tabassi
13:50 – 14:10 · Invited talk 5: Frontier Evaluation under Institutional Constraints · Coz Ududec
14:10 – 14:25 · Contributed talks (2 × 7-minute talks selected from submissions)
14:25 – 15:25 · Breakout Session 2 · Led by Luke Guerdan
15:25 – 15:40 · Coffee break
15:40 – 16:20 · Poster session 2
16:20 – 16:55 · Panel: "Static Benchmarks in a Dynamic World: Is Rigorous Measurement Even Possible?" · Moderated by Sang Truong
16:55 – 17:00 · Closing remarks & Best Paper Award · Olawale Salaudeen