Research

Papers and preprints.

Selected work on the foundations and practice of AI measurement science.

Predictive evaluation

Modeling, scaling laws, and forecasting evaluation outcomes.

  1. A Measurement Science Roadmap: From Human Assessment to AI Evaluation

    S Truong, N Goodman, E Brunskill, B Domingue, N Haber, S Koyejo

    Preprint · 2026

  2. Pretraining Scaling Laws for Generative Evaluations of Language Models

    R Schaeffer, N Levi, B Miranda, S Koyejo

    arXiv preprint · 2025

  3. Reliable and efficient amortized model-based evaluation

    S Truong, Y Tu, P Liang, B Li, S Koyejo

    arXiv preprint · 2025

  4. Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

    R Schaeffer, PS Koura, B Tang, R Subramanian, AK Singh, T Mihaylov, et al.

    arXiv preprint · 2025

  5. How do large language monkeys get their power (laws)?

    R Schaeffer, J Kazdan, J Hughes, J Juravsky, S Price, A Lynch, E Jones, et al.

    arXiv preprint · 2025

  6. Quantifying variance in evaluation benchmarks

    L Madaan, AK Singh, R Schaeffer, A Poulton, S Koyejo, P Stenetorp, et al.

    arXiv preprint · 2024

Validity and reliability

Whether benchmarks measure what they claim — and how reliably — under bugs, saturation, and leakage.

  1. When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

    M Akhtar, A Reuel, P Soni, S Ahuja, PS Ammanamanchi, R Rawal, et al.

    arXiv preprint · 2026

  2. Quantifying the Effect of Test Set Contamination on Generative Evaluations

    R Schaeffer, J Kazdan, B Abbasi, KZ Liu, B Miranda, A Ahmed, F Berez, et al.

    arXiv preprint · 2026

  3. Fantastic bugs and where to find them in AI benchmarks

    S Truong, Y Tu, M Hardy, A Reuel-Lamparth, Z Tang, J Burapacheep, et al.

    NeurIPS · 2025

  4. Measurement to meaning: A validity-centered framework for AI evaluation

    O Salaudeen, A Reuel, A Ahmed, S Bedi, Z Robertson, S Sundar, et al.

    arXiv preprint · 2025

Incentive, design, and governance

Incentive-aware evaluation, leaderboard design, and the governance of deployed AI.

  1. Strategic Evaluation: Incentivizing AI Capability Coverage with Private Benchmarks

    S Truong, S Wang, N Haber, S Koyejo

    Preprint · 2026

  2. Stop Automating Peer Review Without Rigorous Evaluation

    J Baumann, J Pei, S Koyejo, D Hovy

    arXiv preprint · 2026

  3. The leaderboard illusion

    S Singh, Y Nan, A Wang, D Dsouza, S Kapoor, A Üstün, S Koyejo, Y Deng, et al.

    NeurIPS · 2025

  4. Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes

    Z Robertson, S Koyejo

    arXiv preprint · 2025

  5. Toward an evaluation science for generative AI systems

    L Weidinger, ID Raji, H Wallach, M Mitchell, A Wang, O Salaudeen, et al.

    arXiv preprint · 2025

Evaluation in the real world

Fairness, safety, and capability evaluated where AI systems actually operate — across domains, populations, and high-stakes settings.

  1. SWE-chat: Coding agent interactions from real users in the wild

    J Baumann, V Padmakumar, X Li, J Yang, D Yang, S Koyejo

    arXiv preprint · 2026

  2. In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

    Z Tang, ST Truong, D Owens, S Sharma, YJ Zhang, B Miranda, S Koyejo

    arXiv preprint · 2026

  3. Holistic evaluation of large language models for medical tasks with MedHELM

    S Bedi, H Cui, M Fuentes, A Unell, M Wornow, JM Banda, N Kotecha, et al.

    Nature Medicine · 2026

  4. Why Do Safety Guardrails Degrade Across Languages?

    M Zhang, A Patel, ST Truong, S Koyejo

    arXiv preprint · 2026

  5. Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness

    S Pfohl, N Harris, C Nagpal, D Madras, V Mhasawade, O Salaudeen, et al.

    NeurIPS · 2025

  6. Autoredteamer: Autonomous red teaming with lifelong attack integration

    A Zhou, K Wu, F Pinto, Z Chen, Y Zeng, Y Yang, S Yang, S Koyejo, J Zou, et al.

    NeurIPS · 2025

  7. The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology

    F Patel, DQ Nguyen, ST Truong, J Vaynshtok, S Koyejo, N Haber

    EMNLP · 2025

  8. Development and evaluation of deep learning models for cardiotocography interpretation

    N Chiou, N Young-Lin, C Kelly, J Cattiau, T Tiyasirichokchai, A Diack, et al.

    npj Women's Health · 2025

  9. Crossing linguistic horizons: Finetuning and comprehensive evaluation of Vietnamese large language models

    S Truong, D Nguyen, TQ Nguyen, D Le, N Truong, T Quan, S Koyejo

    NAACL · 2024