The Predictive AI Evaluation Challenge

"Can you predict AI behavior from partial observations?"

As the number of large language models and evaluation benchmarks grows rapidly, exhaustive evaluation has become infeasible. This challenge reframes evaluation as a prediction problem: given partial and potentially contaminated evaluation data, can we predict unobserved model-item outcomes and recover trustworthy ability scores?

Submit on Codabench | Getting Started

Submission Deadline


Deadline: TBD

Competition Tracks

Two complementary tracks address the core challenges in predictive AI evaluation: Track 1 focuses on predicting unobserved model-item outcomes, and Track 2 on recovering trustworthy ability scores from contaminated data. Participate in one or both.

About the Challenge

Why Predictive Evaluation?

The AI community faces an evaluation bottleneck: hundreds of models, thousands of benchmarks, and limited resources to run every combination. Predictive evaluation treats this as a data science problem -- can we infer the full picture from partial observations?

This challenge brings together ideas from psychometrics (Item Response Theory), recommender systems (matrix factorization), and robust statistics (contamination-aware estimation) to build better tools for understanding AI capabilities.
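To make the psychometric angle concrete, here is a minimal sketch of fitting a two-parameter logistic (2PL) IRT model by maximum likelihood in plain PyTorch. The data is synthetic and the variable names are illustrative assumptions; this is not the torch_measure API.

import torch

# 2PL IRT: P(model i solves item j) = sigmoid(a_j * (theta_i - b_j)),
# where theta_i is model i's ability, b_j is item j's difficulty,
# and a_j is item j's discrimination (kept positive via exp).
n_models, n_items = 50, 200
theta = torch.randn(n_models, requires_grad=True)   # model abilities
b = torch.randn(n_items, requires_grad=True)        # item difficulties
log_a = torch.zeros(n_items, requires_grad=True)    # log-discriminations

# Observed (model, item, outcome) triples; synthetic for illustration.
mi = torch.randint(0, n_models, (2000,))
ij = torch.randint(0, n_items, (2000,))
y = torch.randint(0, 2, (2000,)).float()

opt = torch.optim.Adam([theta, b, log_a], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    logits = log_a[ij].exp() * (theta[mi] - b[ij])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    loss.backward()
    opt.step()

# After fitting, theta holds ability estimates and sigmoid of the logits
# gives predicted outcome probabilities for unobserved model-item cells.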

Key Research Areas

  • Matrix Completion: Predicting unobserved model-item interactions from sparse binary data with rich metadata (see the sketch after this list).
  • Robust Measurement: Recovering trustworthy ability scores despite benchmark contamination and data inflation.
  • Latent Variable Models: Applying IRT, factor analysis, and neural measurement models to AI evaluation at scale.
  • Contamination Detection: Identifying and correcting for artificially inflated benchmark results.
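For the matrix completion angle, a minimal baseline sketch in plain PyTorch: logistic matrix factorization, which embeds models and items in a shared latent space and predicts each binary outcome from their inner product. Dimensions, hyperparameters, and data are illustrative assumptions, not competition settings.

import torch

# Logistic matrix factorization over a sparse binary outcome matrix:
# P(model i answers item j correctly) = sigmoid(u_i . v_j + bias).
n_models, n_items, rank = 100, 500, 8
U = torch.nn.Parameter(0.1 * torch.randn(n_models, rank))  # model embeddings
V = torch.nn.Parameter(0.1 * torch.randn(n_items, rank))   # item embeddings
bias = torch.nn.Parameter(torch.zeros(()))

# Observed entries as (row, col, outcome) triples; synthetic here.
rows = torch.randint(0, n_models, (5000,))
cols = torch.randint(0, n_items, (5000,))
y = torch.randint(0, 2, (5000,)).float()

opt = torch.optim.Adam([U, V, bias], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    logits = (U[rows] * V[cols]).sum(dim=1) + bias
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    loss = loss + 1e-4 * (U.pow(2).sum() + V.pow(2).sum())  # L2 penalty
    loss.backward()
    opt.step()

# Predict an unobserved cell, e.g. model 0 on item 42:
p_unseen = torch.sigmoid((U[0] * V[42]).sum() + bias)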

Ready to Get Started?

Install the toolkit, explore the baselines, and submit your first predictions.

pip install torch_measure
Getting Started Guide | Submit on Codabench

Timeline

Competition Opens (TBD)

Registration opens and datasets become available. Baselines and starter code are released.

Submission Deadline (TBD)

Final submissions due for both tracks. Public leaderboard freezes.

Winners Announced (TBD)

Private leaderboard revealed. Winning teams notified and results published.

Prizes & Recognition

Leaderboard Prizes

Performance-based prizes for top-ranking teams in each track, based on primary evaluation metrics (AUC-ROC for Track 1, Kendall's tau for Track 2).
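Both metrics are available in standard Python libraries. The sketch below shows how they are typically computed; the tiny arrays are made-up stand-ins for real predictions, and the exact leaderboard input format is an assumption here.

import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

# Track 1: AUC-ROC over predicted probabilities for held-out
# model-item cells (binary outcomes vs. predicted probabilities).
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])
print("Track 1 AUC-ROC:", roc_auc_score(y_true, y_prob))

# Track 2: Kendall's tau between estimated and reference ability scores
# (a rank-correlation measure, so only the ordering matters).
est = np.array([0.1, 0.5, 0.3, 0.9])
ref = np.array([0.2, 0.6, 0.1, 0.8])
tau, _ = kendalltau(est, ref)
print("Track 2 Kendall's tau:", tau)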

Innovation Awards

Special recognition for novel methodological contributions, creative approaches, and insightful analysis of the evaluation prediction problem.

Publication Opportunity

Winning teams will be invited to contribute to the official competition report and present their methods.

Sponsors

Become a Sponsor

We are seeking sponsors to support this challenge. Sponsorship opportunities include prize pool contributions, compute resources, and more.

Organizing Team

Sang Truong

Stanford University

Sanmi Koyejo

Stanford University

Cite This Work

If you use this competition in your research, please cite our work:

@inproceedings{truong2026predictiveeval,
  title        = {The Predictive AI Evaluation Challenge},
  author       = {Truong, Sang and Koyejo, Sanmi},
  booktitle    = {Competition Track},
  year         = {2026},
}