Papers and preprints.

We believe progress in AI depends on our ability to measure it well. Our research develops the theory, methods, and infrastructure needed to make AI evaluation a rigorous science, spanning foundations, prediction, validity, incentives, and real-world impact.

Years of work: 2024–2026

Predictive evaluation

Modeling, scaling laws, and forecasting evaluation outcomes.

6 papers

A Measurement Science Roadmap: From Human Assessment to AI Evaluation
S Truong, N Goodman, E Brunskill, B Domingue, N Haber, S Koyejo
Preprint · 2026

Pretraining Scaling Laws for Generative Evaluations of Language Models
R Schaeffer, N Levi, B Miranda, S Koyejo
ICLR 2026 poster · 2025

Reliable and Efficient Amortized Model-based Evaluation
S Truong, Y Tu, P Liang, B Li, S Koyejo
ICML 2025 poster · 2025

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
R Schaeffer, PS Koura, B Tang, R Subramanian, AK Singh, T Mihaylov, et al.
arXiv preprint · 2025

How Do Large Language Monkeys Get Their Power (Laws)?
R Schaeffer, J Kazdan, J Hughes, J Juravsky, S Price, A Lynch, E Jones, et al.
ICML 2025 oral · 2025

Quantifying Variance in Evaluation Benchmarks
L Madaan, AK Singh, R Schaeffer, A Poulton, S Koyejo, P Stenetorp, et al.
arXiv preprint · 2024

Validity and reliability

Whether benchmarks measure what they claim to measure, and how reliably they do so in the presence of bugs, saturation, and leakage.

4 papers

Incentive, design, and governance

Incentive-aware evaluation, leaderboard design, and the governance of deployed AI.

6 papers

Evaluation in the real world

Fairness, safety, and capability evaluated where AI systems actually operate, across domains, populations, and high-stakes settings.

9 papers

Collaborate

Measurement science is a community effort.

Much of this work is joint with collaborators across labs and institutions. Join the conversation, follow new papers as they land, or reach out to work with us.

Join the Discord

Real-time discussion about measurement science, open problems, and work in progress.

Follow the work

Get new papers, resources, and milestones delivered as they are released.

Papers and preprints.

Validity and reliability

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Quantifying the Effect of Test Set Contamination on Generative Evaluations

Fantastic Bugs and Where to Find Them in AI Benchmarks

Measurement to Meaning: A Validity-Centered Framework for AI Evaluation

Incentive, design, and governance

Strategic Evaluation: Incentivizing AI Capability Coverage with Private Benchmarks

Public AI Benchmarks Are Broken, But Are Private Benchmarks the Answer?

Stop Automating Peer Review Without Rigorous Evaluation

The Leaderboard Illusion

Let's Measure Information Step-by-Step: AI-Based Evaluation Beyond Vibes

Toward an Evaluation Science for Generative AI Systems

Evaluation in the real world

SWE-chat: Coding Agent Interactions From Real Users in the Wild

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

Holistic evaluation of large language models for medical tasks with MedHELM

Why Do Safety Guardrails Degrade Across Languages?

Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology

Development and evaluation of deep learning models for cardiotocography interpretation

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models
