Predictive evaluation reframes AI evaluation as a prediction and forecasting problem rather than exhaustive testing. Instead of running every model on every benchmark (which is increasingly infeasible), we predict how models would perform on evaluations they haven't been tested on, using partial observations and metadata. This is analogous to how Item Response Theory in psychometrics can estimate a student's ability from a subset of test items.
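To make the psychometrics analogy concrete, here is a minimal sketch of the Rasch model: the probability of a correct response is a logistic function of ability minus item difficulty, and an ability estimate can be recovered from only a few observed items. The function names and the coarse grid search are illustrative, not part of any competition code.

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability that a respondent with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties):
    """Maximum-likelihood ability estimate from a *subset* of observed
    items, via a coarse grid search over theta in [-4, 4]."""
    grid = [i / 100 for i in range(-400, 401)]
    def loglik(theta):
        return sum(
            math.log(rasch_p(theta, b)) if y == 1 else math.log(1.0 - rasch_p(theta, b))
            for y, b in zip(responses, difficulties)
        )
    return max(grid, key=loglik)

# A strong respondent: correct on easy and medium items, wrong only on the hardest.
theta_hat = estimate_ability([1, 1, 1, 0], [-1.0, 0.0, 1.0, 3.0])
```

The same logic transfers directly to AI evaluation: replace "student" with "model" and "test items" with "benchmark items".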
torch_measure is an open-source PyTorch toolkit for measurement models applied to AI evaluation. It provides implementations of Item Response Theory (IRT) models (Rasch, 2PL, 3PL, Beta-IRT), factor analysis, and other latent variable models specifically designed for analyzing AI evaluation data. The toolkit includes dataset loaders, baseline implementations, and evaluation utilities for this competition.
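torch_measure's actual API is not reproduced here; as a hedged sketch, a Rasch model of the kind the toolkit implements can be fit in a few lines of plain PyTorch. All data below is synthetic and all variable names are illustrative.

```python
import torch

torch.manual_seed(0)

# Synthetic binary response matrix: 20 models x 50 items.
true_theta = torch.randn(20, 1)   # model abilities
true_b = torch.randn(1, 50)       # item difficulties
Y = torch.bernoulli(torch.sigmoid(true_theta - true_b))

# Rasch model: one learnable ability per model, one difficulty per item.
theta = torch.zeros(20, 1, requires_grad=True)
b = torch.zeros(1, 50, requires_grad=True)
opt = torch.optim.Adam([theta, b], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(theta - b, Y)
    loss.backward()
    opt.step()
```

A run this small finishes in seconds on CPU, which is why GPU access is not a prerequisite for the baselines.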
The competition is open to everyone -- academic researchers, industry practitioners, and independent participants. Teams of up to 5 members are allowed. You may participate in one or both tracks.
No. All tasks in both tracks are solvable on CPU. The baseline IRT models (Rasch, 2PL, 3PL) typically train in minutes on a modern CPU. A GPU will speed up training for larger neural models (VAEs, amortized IRT, deep matrix factorization) but is not required for competitive submissions.
Yes. If you want to use an LLM to generate features (e.g., embedding evaluation items or model descriptions), you can upload your model to HuggingFace and submit a configuration file with a pinned commit hash. Important constraints:
- No trust_remote_code -- your model must use standard HuggingFace architectures.

The core data is a binary response matrix of shape (M models x N items), where each entry is 0 (incorrect) or 1 (correct). The matrix is partially observed -- your task is to predict the held-out entries (Track 1) or estimate robust ability scores from potentially contaminated data (Track 2). Additional metadata includes model attributes (size, organization, release date) and item attributes (text embeddings, benchmark category).
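To make the data format concrete, here is a sketch of a partially observed response matrix and a naive baseline that fills held-out entries by blending row (model) and column (item) accuracies. The data is synthetic and the variable names are illustrative, not the competition's actual loaders.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 10, 40                         # M models x N items
Y = rng.integers(0, 2, size=(M, N)).astype(float)
observed = rng.random((M, N)) < 0.7   # ~70% of entries are observed

# Mask out held-out entries, then compute per-model and per-item
# accuracy over the observed cells only.
masked = np.where(observed, Y, np.nan)
model_acc = np.nanmean(masked, axis=1, keepdims=True)   # (M, 1)
item_acc = np.nanmean(masked, axis=0, keepdims=True)    # (1, N)

# Naive baseline: average the two marginals, clipped to [0, 1].
pred = np.clip((model_acc + item_acc) / 2, 0.0, 1.0)    # (M, N)
heldout_preds = pred[~observed]   # probabilities for the held-out entries
```

Real IRT or matrix-factorization models replace the marginal blend with learned latent abilities and difficulties, but the input/output shapes are the same.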
Benchmark contamination occurs when a model has been exposed to test data during training, artificially inflating its scores on those specific items so that they no longer reflect its true capabilities. In Track 2, the training data explicitly includes a mixture of genuine and contaminated responses; your scoring method must detect and account for this contamination to produce trustworthy ability estimates.
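One way to illustrate contamination detection -- a toy heuristic only, not the competition's required method -- is to flag correct answers whose standardized residual against a crude expected-accuracy model is unusually large: a weak model that suddenly solves a very hard item is suspicious.

```python
import numpy as np

def flag_contamination(Y, z_thresh=2.0):
    """Flag suspiciously correct answers. Expected accuracy is a crude
    blend of model accuracy and item accuracy; entries where a model is
    correct despite a low expectation get large positive residuals.
    Illustrative heuristic only."""
    model_acc = Y.mean(axis=1, keepdims=True)   # ability proxy
    item_acc = Y.mean(axis=0, keepdims=True)    # easiness proxy
    expected = np.clip((model_acc + item_acc) / 2, 0.01, 0.99)
    resid = (Y - expected) / np.sqrt(expected * (1 - expected))
    return (Y == 1) & (resid > z_thresh)        # unexpectedly correct

# Six models x five items; item 3 (index 3) is hard. Model 4 is weak
# overall yet solves the hard item -- the kind of pattern to flag.
Y = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],   # weak model, correct on the hard item
    [1, 0, 0, 0, 0],
])
flags = flag_contamination(Y, z_thresh=1.0)
```

Serious approaches model contamination jointly with ability (e.g., robust IRT or mixture models) rather than thresholding residuals, but the intuition is the same.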
Each track has a separate leaderboard with its own primary metric.
Both leaderboards use a public/private split. The public leaderboard (visible during the competition) uses a subset of the test data. Final rankings are determined by the private leaderboard, revealed after the competition ends.
During the competition, each team may submit up to 2 submissions per day. After the competition deadline, up to 2 submissions per week are allowed for the post-competition leaderboard. Plan your submissions carefully to make the most of your daily quota.
Three submission tracks are available; see the Getting Started page for detailed formatting instructions.
Only winning solutions are required to be open-sourced (within 30 days of the competition ending). However, we strongly encourage all participants to share their code and methods to advance the field of predictive AI evaluation.