AIMS Workshop · 2026

AI Measurement Science: Toward Rigorous AI Evaluation

Treating evaluation as a first-class scientific problem for frontier AI.

Venue: Hilton Union Square, San Francisco, USA
Date: October 9, 2026 · Co-located with COLM 2026

Call for Papers

We are excited to announce the inaugural AI Measurement Science Workshop, co-located with COLM 2026 at the Hilton Union Square in San Francisco. The workshop is a one-day, in-person event on October 9, 2026, bringing together researchers from machine learning, statistics, psychometrics, economics, and policy to develop a unified perspective on measurement for frontier AI systems.

As AI systems become increasingly embedded in real-world workflows, existing evaluation paradigms face fundamental limitations. Contemporary benchmarks typically assume i.i.d. samples, stationary concepts, and passive models. These assumptions are increasingly violated in modern systems, which interact with environments, adapt to feedback, and are optimized against known evaluation signals.

This workshop frames evaluation as a problem of AI measurement science. We focus on three tightly coupled challenges — measurement of interactive systems, measurement under strategic optimization, and measurement under non-stationarity — and aim to define common abstractions and open research directions for this emerging area.

Submissions from educational testing, psychometrics, and related fields are strongly encouraged.

Research Track

We invite work across theory, methodology, and systems, including but not limited to:

Topic 1

Interactive Measurement

Evaluation of policies, long-horizon behavior, and tool-using agents; statistical validity when observations are non-i.i.d. and path-dependent.

Topic 2

Measurement under Strategic Optimization

Evaluation as a game between learner and evaluator; robustness to gaming and contamination; incentive-compatible, private, randomized, or adaptive leaderboards.

Topic 3

Measurement under Non-Stationarity

Covariate, label, and construct drift; adaptive test design, recalibration, and longitudinal monitoring of deployed systems.

Competition Track

We also invite write-ups from participants in the Predictive AI Evaluation Challenge. Each paper should describe the team's approach to predicting model responses from sparse observations, covering methods, results, and lessons learned. Submissions follow the same 4–8 page COLM format as the research track and are reviewed on a later timeline so that final competition results can be included. A Best Competition Paper award will be presented alongside the workshop's research awards.

Competition track deadline: August 15, 2026

Submission guidelines.

Short, focused contributions in COLM format. Review is double-blind.

  • Papers of 4–8 pages in COLM format, with unlimited references.
  • Submitted via OpenReview; double-blind review, with at least three reviews per paper.
  • Accepted papers are non-archival.
  • Four contributed talks selected from accepted submissions.

Calendar

Important dates.

All deadlines are in Anywhere on Earth (AoE) time.

Research track submission deadline: June 23, 2026
Research track notification: July 24, 2026
Competition track submission deadline: August 15, 2026
Camera-ready due: September 21, 2026
Workshop: October 9, 2026 (COLM 2026 workshops day)

Invited speakers.

From academia, industry, and policy.

Daniel Kang

UIUC

Assistant professor working on agentic benchmarks, AI security evaluation, and ML deployment systems.

Serena Wang

Harvard / UBC

Postdoctoral researcher working on long-term AI impacts, counterfactual metrics for social welfare, and robust distillation.

Jacob Steinhardt

UC Berkeley · Transluce

Professor at UC Berkeley and CEO of Transluce, working on LLM safety and reward alignment.

Elham Tabassi

Brookings

Led NIST's AI Risk Management Framework and the U.S. AI Safety Institute; named one of TIME's 100 most influential people in AI.

Coz Ududec

UK AI Safety Institute

Contributes to AISI's frontier model evaluations; former Chief Scientist at Invenia Labs.

Tentative schedule.

One-day, in-person workshop. 8:00–17:00.

  1. 8:00 – 8:05

    Opening remarks

    Sanmi Koyejo

  2. 8:05 – 8:25

    Invited talk 1 — Evaluation under Strategic Adaptation

    Daniel Kang

  3. 8:25 – 8:45

    Competition presentation — The Predictive AI Evaluation Challenge

    Sang Truong

  4. 8:45 – 9:05

    Invited talk 2 — Evaluation Design under Distribution Shift

    Serena Wang

  5. 9:05 – 9:20

    Coffee break

  6. 9:20 – 10:20

    Breakout Session 1 — small-group discussion of key challenges

  7. 10:20 – 11:00

    Poster session 1

  8. 11:00 – 12:00

    Mentorship Session & Lightning Debates

    Led by Berivan Isik

  9. 12:00 – 12:30

    Invited talk 3 — Measuring Interactive Systems

    Jacob Steinhardt

  10. 12:30 – 13:30

    Networking lunch

  11. 13:30 – 13:50

    Invited talk 4 — AI Standards and the Risk Management Framework

    Elham Tabassi

  12. 13:50 – 14:10

    Invited talk 5 — Frontier Evaluation under Institutional Constraints

    Coz Ududec

  13. 14:10 – 14:25

    Contributed talks (2 × 7-minute talks selected from submissions)

  14. 14:25 – 15:25

    Breakout Session 2

    Led by Luke Guerdan

  15. 15:25 – 15:40

    Coffee break

  16. 15:40 – 16:20

    Poster session 2

  17. 16:20 – 16:55

    Panel — "Static Benchmarks in a Dynamic World: Is Rigorous Measurement Even Possible?"

    Moderated by Sang Truong

  18. 16:55 – 17:00

    Closing remarks & Best Paper Award

    Olawale Salaudeen

Organizers.

Get in touch.

Questions about submissions, mentorship, or partnership? Reach out to the organizing committee.