Predictive evaluation reframes AI evaluation as a prediction and forecasting problem rather than exhaustive testing. Instead of running every model on every benchmark (which is increasingly infeasible), we predict how models would perform on evaluations they haven't been tested on, using partial observations and metadata. This is analogous to how Item Response Theory in psychometrics can estimate a student's ability from a subset of test items.
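To make the psychometrics analogy concrete, here is a minimal sketch of the Rasch model: the probability of a correct response is a logistic function of ability minus item difficulty, and an ability estimate can be recovered from only a few observed items. The function names and the coarse grid search are illustrative, not part of any competition code.

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability that a respondent with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties):
    """Maximum-likelihood ability estimate from a *subset* of observed
    items, via a coarse grid search over theta in [-4, 4]."""
    grid = [i / 100 for i in range(-400, 401)]
    def loglik(theta):
        return sum(
            math.log(rasch_p(theta, b)) if y == 1 else math.log(1.0 - rasch_p(theta, b))
            for y, b in zip(responses, difficulties)
        )
    return max(grid, key=loglik)

# A strong respondent: correct on easy and medium items, wrong only on the hardest.
theta_hat = estimate_ability([1, 1, 1, 0], [-1.0, 0.0, 1.0, 3.0])
```

The same logic transfers directly to AI evaluation: replace "student" with "model" and "test items" with "benchmark items".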
torch_measure is an open-source PyTorch toolkit for measurement models applied to AI evaluation. It provides implementations of Item Response Theory (IRT) models (Rasch, 2PL, 3PL, Beta-IRT), factor analysis, and other latent variable models specifically designed for analyzing AI evaluation data. The toolkit includes dataset loaders, baseline implementations, and evaluation utilities for this competition.
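torch_measure's actual API is not reproduced here; as a hedged sketch, a Rasch model of the kind the toolkit implements can be fit in a few lines of plain PyTorch. All data below is synthetic and all variable names are illustrative.

```python
import torch

torch.manual_seed(0)

# Synthetic binary response matrix: 20 models x 50 items.
true_theta = torch.randn(20, 1)   # model abilities
true_b = torch.randn(1, 50)       # item difficulties
Y = torch.bernoulli(torch.sigmoid(true_theta - true_b))

# Rasch model: one learnable ability per model, one difficulty per item.
theta = torch.zeros(20, 1, requires_grad=True)
b = torch.zeros(1, 50, requires_grad=True)
opt = torch.optim.Adam([theta, b], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(theta - b, Y)
    loss.backward()
    opt.step()
```

A run this small finishes in seconds on CPU, which is why GPU access is not a prerequisite for the baselines.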
The competition is open to everyone -- academic researchers, industry practitioners, and independent participants. Teams of up to 5 members are allowed. You may participate in one or both tracks.
No. All tasks in both tracks are solvable on CPU. The baseline IRT models (Rasch, 2PL, 3PL) typically train in minutes on a modern CPU. A GPU will speed up training for larger neural models (VAEs, amortized IRT, deep matrix factorization) but is not required for competitive submissions.
Yes. If you want to use an LLM to generate features (e.g., embedding evaluation items or model descriptions), you can upload your model to HuggingFace and submit a configuration file with a pinned commit hash. Important constraints:
- No trust_remote_code -- your model must use standard HuggingFace architectures.

The core data is a binary response matrix of shape (M models x N items), where each entry is 0 (incorrect) or 1 (correct). The matrix is partially observed -- your task is to predict the held-out entries (Track 1) or estimate robust ability scores from potentially contaminated data (Track 2). Additional metadata includes model attributes (size, organization, release date) and item attributes (text embeddings, benchmark category).
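To make the data format concrete, here is a sketch of a partially observed response matrix and a naive baseline that fills held-out entries by blending row (model) and column (item) accuracies. The data is synthetic and the variable names are illustrative, not the competition's actual loaders.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 10, 40                         # M models x N items
Y = rng.integers(0, 2, size=(M, N)).astype(float)
observed = rng.random((M, N)) < 0.7   # ~70% of entries are observed

# Mask out held-out entries, then compute per-model and per-item
# accuracy over the observed cells only.
masked = np.where(observed, Y, np.nan)
model_acc = np.nanmean(masked, axis=1, keepdims=True)   # (M, 1)
item_acc = np.nanmean(masked, axis=0, keepdims=True)    # (1, N)

# Naive baseline: average the two marginals, clipped to [0, 1].
pred = np.clip((model_acc + item_acc) / 2, 0.0, 1.0)    # (M, N)
heldout_preds = pred[~observed]   # probabilities for the held-out entries
```

Real IRT or matrix-factorization models replace the marginal blend with learned latent abilities and difficulties, but the input/output shapes are the same.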
Benchmark contamination occurs when a model has been exposed to test data during training, artificially inflating its scores on those specific items so that they no longer reflect its true capabilities. In Track 2, the training data explicitly includes a mixture of genuine and contaminated responses; your scoring method must detect and account for this contamination to produce trustworthy ability estimates.
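One way to illustrate contamination detection -- a toy heuristic only, not the competition's required method -- is to flag correct answers whose standardized residual against a crude expected-accuracy model is unusually large: a weak model that suddenly solves a very hard item is suspicious.

```python
import numpy as np

def flag_contamination(Y, z_thresh=2.0):
    """Flag suspiciously correct answers. Expected accuracy is a crude
    blend of model accuracy and item accuracy; entries where a model is
    correct despite a low expectation get large positive residuals.
    Illustrative heuristic only."""
    model_acc = Y.mean(axis=1, keepdims=True)   # ability proxy
    item_acc = Y.mean(axis=0, keepdims=True)    # easiness proxy
    expected = np.clip((model_acc + item_acc) / 2, 0.01, 0.99)
    resid = (Y - expected) / np.sqrt(expected * (1 - expected))
    return (Y == 1) & (resid > z_thresh)        # unexpectedly correct

# Six models x five items; item 3 (index 3) is hard. Model 4 is weak
# overall yet solves the hard item -- the kind of pattern to flag.
Y = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],   # weak model, correct on the hard item
    [1, 0, 0, 0, 0],
])
flags = flag_contamination(Y, z_thresh=1.0)
```

Serious approaches model contamination jointly with ability (e.g., robust IRT or mixture models) rather than thresholding residuals, but the intuition is the same.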
Each track has a separate leaderboard with its own primary metric.
Both leaderboards use a public/private split. The public leaderboard (visible during the competition) uses a subset of the test data. Final rankings are determined by the private leaderboard, revealed after the competition ends.
During the competition, each team may submit up to 2 submissions per day. After the competition deadline, up to 2 submissions per week are allowed for the post-competition leaderboard. Plan your submissions carefully to make the most of your daily quota.
Three submission tracks are available; see the Getting Started page for detailed formatting instructions.
Only winning solutions are required to be open-sourced (within 30 days of the competition ending). However, we strongly encourage all participants to share their code and methods to advance the field of predictive AI evaluation.