"Can you predict AI behavior from partial observations?"
As the number of large language models and evaluation benchmarks grows rapidly, exhaustive evaluation has become infeasible. This challenge reframes evaluation as a prediction problem: given partial and potentially contaminated evaluation data, can we predict unobserved model-item outcomes and recover trustworthy ability scores?
Two complementary tracks addressing core challenges in predictive AI evaluation. Participate in one or both.
Predict the held-out entries of a partially observed model-by-item response matrix: a structured matrix completion problem with binary outcomes.
Produce scalar ability scores that remain stable under benchmark contamination. The training data includes a mixture of genuine and artificially inflated responses.
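A minimal sketch of why contamination matters for Track 2: on synthetic scores, a naive mean is dragged upward by an inflated block of entries, while a trimmed mean (here SciPy's `trim_mean`) largely ignores it. The contamination rate and score distribution are invented for illustration:

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(2)

# Toy contamination scenario: a model's genuine scores cluster near 0.6,
# but 20% of recorded entries were artificially inflated to near-perfect.
n = 200
scores = np.clip(rng.normal(0.6, 0.05, size=n), 0.0, 1.0)
idx = rng.choice(n, size=n // 5, replace=False)
scores[idx] = rng.uniform(0.97, 1.0, size=n // 5)

naive = scores.mean()             # pulled upward by the inflated block
robust = trim_mean(scores, 0.25)  # trimming 25% per tail discards the outliers
```

The same intuition carries over to ability estimation: estimators that down-weight implausibly perfect responses recover scores closer to the genuine ones.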
The AI community faces an evaluation bottleneck: hundreds of models, thousands of benchmarks, and limited resources to run every combination. Predictive evaluation treats this as a data science problem: can we infer the full picture from partial observations?
This challenge brings together ideas from psychometrics (Item Response Theory), recommender systems (matrix factorization), and robust statistics (contamination-aware estimation) to build better tools for understanding AI capabilities.
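For instance, the simplest IRT model, the Rasch (1PL) model, summarizes each model with a single ability parameter and each item with a difficulty. A toy joint maximum-likelihood fit on synthetic responses might look like this (sizes, seed, and learning rate are arbitrary, and this is only one of the ingredients the paragraph above mentions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic responses: 15 models answer 40 items under a Rasch (1PL) model,
# P(correct) = sigmoid(ability - difficulty).
n_models, n_items = 15, 40
true_theta = rng.normal(size=n_models)     # latent model abilities
true_b = rng.normal(size=n_items)          # latent item difficulties
P = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
Y = (rng.random((n_models, n_items)) < P).astype(float)

# Joint maximum-likelihood fit by gradient ascent on the log-likelihood.
theta = np.zeros(n_models)
b = np.zeros(n_items)
lr = 0.01
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    resid = Y - pred
    theta += lr * resid.sum(axis=1)    # dL/d(theta_m) =  sum_i (y - p)
    b -= lr * resid.sum(axis=0)        # dL/d(b_i)     = -sum_m (y - p)
    theta -= theta.mean()              # pin down the location indeterminacy
```

The fitted `theta` vector is exactly the kind of scalar ability score Track 2 asks for; matrix factorization generalizes this by giving each model and item a vector of latent traits instead of a single number.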
Install the toolkit, explore the baselines, and submit your first predictions.
pip install torch_measure
Registration opens and datasets become available. Baselines and starter code are released.
Final submissions due for both tracks. Public leaderboard freezes.
Private leaderboard revealed. Winning teams notified and results published.
Prizes for top-ranking teams in each track, awarded on the primary evaluation metrics (AUC-ROC for Track 1, Kendall's tau for Track 2).
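Both primary metrics are available in standard libraries. A toy scoring sketch, assuming scikit-learn and SciPy are installed; all predictions and abilities below are made up:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

# Track 1-style scoring: AUC-ROC over held-out binary outcomes.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.65])
auc = roc_auc_score(y_true, y_prob)

# Track 2-style scoring: Kendall's tau between predicted and
# reference ability rankings.
ref_ability = np.array([2.0, 1.5, 0.3, -0.4, -1.1])
pred_ability = np.array([1.6, 1.8, 0.1, -0.2, -1.0])
tau, _ = kendalltau(ref_ability, pred_ability)
```

AUC-ROC rewards ranking positives above negatives regardless of threshold, and Kendall's tau depends only on the ordering of abilities, so both metrics are invariant to monotone rescaling of your scores.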
Special recognition for novel methodological contributions, creative approaches, and insightful analysis of the evaluation prediction problem.
Winning teams will be invited to contribute to the official competition report and present their methods.
Sponsor logos coming soon
We are seeking sponsors to support this challenge. Sponsorship opportunities include prize pool contributions, compute resources, and more.
Stanford University
If you use this competition in your research, please cite our work:
@inproceedings{truong2026predictiveeval,
  title     = {The Predictive AI Evaluation Challenge},
  author    = {Truong, Sang and Koyejo, Sanmi},
  booktitle = {Competition Track},
  year      = {2026},
}