Research
Papers and preprints.
Selected work on the foundations and practice of AI measurement science.
Predictive evaluation
Modeling, scaling laws, and forecasting evaluation outcomes.
A Measurement Science Roadmap: From Human Assessment to AI Evaluation
S Truong, N Goodman, E Brunskill, B Domingue, N Haber, S Koyejo
Preprint · 2026
Pretraining Scaling Laws for Generative Evaluations of Language Models
R Schaeffer, N Levi, B Miranda, S Koyejo
arXiv preprint · 2025
Reliable and efficient amortized model-based evaluation
S Truong, Y Tu, P Liang, B Li, S Koyejo
arXiv preprint · 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
R Schaeffer, PS Koura, B Tang, R Subramanian, AK Singh, T Mihaylov, et al.
arXiv preprint · 2025
How do large language monkeys get their power (laws)?
R Schaeffer, J Kazdan, J Hughes, J Juravsky, S Price, A Lynch, E Jones, et al.
arXiv preprint · 2025
Quantifying variance in evaluation benchmarks
L Madaan, AK Singh, R Schaeffer, A Poulton, S Koyejo, P Stenetorp, et al.
arXiv preprint · 2024
Validity and reliability
Whether benchmarks measure what they claim, and how reliably, in the presence of bugs, saturation, and leakage.
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
M Akhtar, A Reuel, P Soni, S Ahuja, PS Ammanamanchi, R Rawal, et al.
arXiv preprint · 2026
Quantifying the Effect of Test Set Contamination on Generative Evaluations
R Schaeffer, J Kazdan, B Abbasi, KZ Liu, B Miranda, A Ahmed, F Berez, et al.
arXiv preprint · 2026
Fantastic bugs and where to find them in AI benchmarks
S Truong, Y Tu, M Hardy, A Reuel-Lamparth, Z Tang, J Burapacheep, et al.
NeurIPS · 2025
Measurement to meaning: A validity-centered framework for AI evaluation
O Salaudeen, A Reuel, A Ahmed, S Bedi, Z Robertson, S Sundar, et al.
arXiv preprint · 2025
Incentive, design, and governance
Incentive-aware evaluation, leaderboard design, and the governance of deployed AI.
Strategic Evaluation: Incentivizing AI Capability Coverage with Private Benchmarks
S Truong, S Wang, N Haber, S Koyejo
Preprint · 2026
Stop Automating Peer Review Without Rigorous Evaluation
J Baumann, J Pei, S Koyejo, D Hovy
arXiv preprint · 2026
The leaderboard illusion
S Singh, Y Nan, A Wang, D Dsouza, S Kapoor, A Üstün, S Koyejo, Y Deng, et al.
NeurIPS · 2025
Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes
Z Robertson, S Koyejo
arXiv preprint · 2025
Toward an evaluation science for generative AI systems
L Weidinger, ID Raji, H Wallach, M Mitchell, A Wang, O Salaudeen, et al.
arXiv preprint · 2025
Evaluation in the real world
Fairness, safety, and capability evaluated where AI systems actually operate: across domains, populations, and high-stakes settings.
SWE-chat: Coding agent interactions from real users in the wild
J Baumann, V Padmakumar, X Li, J Yang, D Yang, S Koyejo
arXiv preprint · 2026
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Z Tang, ST Truong, D Owens, S Sharma, YJ Zhang, B Miranda, S Koyejo
arXiv preprint · 2026
Holistic evaluation of large language models for medical tasks with MedHELM
S Bedi, H Cui, M Fuentes, A Unell, M Wornow, JM Banda, N Kotecha, et al.
Nature Medicine · 2026
Why Do Safety Guardrails Degrade Across Languages?
M Zhang, A Patel, ST Truong, S Koyejo
arXiv preprint · 2026
Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness
S Pfohl, N Harris, C Nagpal, D Madras, V Mhasawade, O Salaudeen, et al.
NeurIPS · 2025
AutoRedTeamer: Autonomous red teaming with lifelong attack integration
A Zhou, K Wu, F Pinto, Z Chen, Y Zeng, Y Yang, S Yang, S Koyejo, J Zou, et al.
NeurIPS · 2025
The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology
F Patel, DQ Nguyen, ST Truong, J Vaynshtok, S Koyejo, N Haber
EMNLP · 2025
Development and evaluation of deep learning models for cardiotocography interpretation
N Chiou, N Young-Lin, C Kelly, J Cattiau, T Tiyasirichokchai, A Diack, et al.
npj Women's Health · 2025
Crossing linguistic horizons: Finetuning and comprehensive evaluation of Vietnamese large language models
S Truong, D Nguyen, TQ Nguyen, D Le, N Truong, T Quan, S Koyejo
NAACL · 2024