"Can you predict AI behavior from partial observations?"
As the number of large language models and evaluation benchmarks grows rapidly, exhaustive evaluation has become infeasible. This challenge reframes evaluation as a prediction problem: given partial and potentially contaminated evaluation data, can we predict unobserved model-item outcomes and recover trustworthy ability scores?
Two complementary tracks addressing core challenges in predictive AI evaluation. Participate in one or both.
Predict the held-out entries of a partially observed model-by-item response matrix: a structured matrix completion problem with binary outcomes.
Produce scalar ability scores that remain stable under benchmark contamination. The training data includes a mixture of genuine and artificially inflated responses.
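A minimal sketch of why contamination matters for Track 2: on synthetic scores, a naive mean is dragged upward by an inflated block of entries, while a trimmed mean (here SciPy's `trim_mean`) largely ignores it. The contamination rate and score distribution are invented for illustration:

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(2)

# Toy contamination scenario: a model's genuine scores cluster near 0.6,
# but 20% of recorded entries were artificially inflated to near-perfect.
n = 200
scores = np.clip(rng.normal(0.6, 0.05, size=n), 0.0, 1.0)
idx = rng.choice(n, size=n // 5, replace=False)
scores[idx] = rng.uniform(0.97, 1.0, size=n // 5)

naive = scores.mean()             # pulled upward by the inflated block
robust = trim_mean(scores, 0.25)  # trimming 25% per tail discards the outliers
```

The same intuition carries over to ability estimation: estimators that down-weight implausibly perfect responses recover scores closer to the genuine ones.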
The AI community faces an evaluation bottleneck: hundreds of models, thousands of benchmarks, and limited resources to run every combination. Predictive evaluation treats this as a data science problem: can we infer the full picture from partial observations?
This challenge brings together ideas from psychometrics (Item Response Theory), recommender systems (matrix factorization), and robust statistics (contamination-aware estimation) to build better tools for understanding AI capabilities.
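For instance, the simplest IRT model, the Rasch (1PL) model, summarizes each model with a single ability parameter and each item with a difficulty. A toy joint maximum-likelihood fit on synthetic responses might look like this (sizes, seed, and learning rate are arbitrary, and this is only one of the ingredients the paragraph above mentions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic responses: 15 models answer 40 items under a Rasch (1PL) model,
# P(correct) = sigmoid(ability - difficulty).
n_models, n_items = 15, 40
true_theta = rng.normal(size=n_models)     # latent model abilities
true_b = rng.normal(size=n_items)          # latent item difficulties
P = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
Y = (rng.random((n_models, n_items)) < P).astype(float)

# Joint maximum-likelihood fit by gradient ascent on the log-likelihood.
theta = np.zeros(n_models)
b = np.zeros(n_items)
lr = 0.01
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    resid = Y - pred
    theta += lr * resid.sum(axis=1)    # dL/d(theta_m) =  sum_i (y - p)
    b -= lr * resid.sum(axis=0)        # dL/d(b_i)     = -sum_m (y - p)
    theta -= theta.mean()              # pin down the location indeterminacy
```

The fitted `theta` vector is exactly the kind of scalar ability score Track 2 asks for; matrix factorization generalizes this by giving each model and item a vector of latent traits instead of a single number.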
Install the toolkit, explore the baselines, and submit your first predictions.
pip install torch_measure
Registration opens and datasets become available. Baselines and starter code are released.
Final submissions due for both tracks. Public leaderboard freezes.
Private leaderboard revealed. Winning teams notified and results published.
Prizes for top-ranking teams in each track, awarded on the primary evaluation metrics (AUC-ROC for Track 1, Kendall's tau for Track 2).
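Both primary metrics are available in standard libraries. A toy scoring sketch, assuming scikit-learn and SciPy are installed; all predictions and abilities below are made up:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

# Track 1-style scoring: AUC-ROC over held-out binary outcomes.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.65])
auc = roc_auc_score(y_true, y_prob)

# Track 2-style scoring: Kendall's tau between predicted and
# reference ability rankings.
ref_ability = np.array([2.0, 1.5, 0.3, -0.4, -1.1])
pred_ability = np.array([1.6, 1.8, 0.1, -0.2, -1.0])
tau, _ = kendalltau(ref_ability, pred_ability)
```

AUC-ROC rewards ranking positives above negatives regardless of threshold, and Kendall's tau depends only on the ordering of abilities, so both metrics are invariant to monotone rescaling of your scores.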
Special recognition for novel methodological contributions, creative approaches, and insightful analysis of the evaluation prediction problem.
Winning teams will be invited to contribute to the official competition report and present their methods.
Sponsor logos coming soon
We are seeking sponsors to support this challenge. Sponsorship opportunities include prize pool contributions, compute resources, and more.
Stanford University
If you use this competition in your research, please cite our work:
@inproceedings{truong2026predictiveeval,
  title     = {The Predictive AI Evaluation Challenge},
  author    = {Truong, Sang and Koyejo, Sanmi},
  booktitle = {Competition Track},
  year      = {2026},
}