Track 1: Response Prediction

Matrix completion for AI evaluation

Given a partially observed binary response matrix (models x items), predict the held-out entries. Each entry indicates whether a model answered an evaluation item correctly (1) or incorrectly (0). This is a structured matrix completion problem with binary outcomes, where rich metadata about both models and items is available to inform predictions.

Task Details

Objective

You are given a sparse binary matrix R of shape (M models x N items), where observed entries are 0 or 1 indicating incorrect or correct responses. Your goal is to predict the probability of correctness for each held-out (unobserved) entry.
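As a point of reference, the setup above can be sketched in a few lines of NumPy. The toy matrix, the use of NaN to mark held-out entries, and the helper name `baseline_predict` are illustrative assumptions, not the official data format. The baseline averages smoothed per-model and per-item accuracy.

```python
import numpy as np

# Hypothetical toy matrix: 3 models x 4 items. NaN marks held-out entries
# (an assumed convention, not necessarily how the official data is stored).
R = np.array([
    [1.0, 0.0, np.nan, 1.0],
    [np.nan, 0.0, 0.0, 1.0],
    [1.0, np.nan, 1.0, 1.0],
])

observed = ~np.isnan(R)            # boolean mask of observed entries
held_out = np.argwhere(~observed)  # (model index, item index) pairs to predict


def baseline_predict(R, observed, smoothing=1.0):
    """Average the smoothed per-model and per-item accuracy for every cell."""
    filled = np.where(observed, R, 0.0)
    global_mean = filled.sum() / observed.sum()
    # Additive smoothing pulls sparsely observed rows/columns toward the global mean.
    model_acc = (filled.sum(axis=1) + smoothing * global_mean) / (observed.sum(axis=1) + smoothing)
    item_acc = (filled.sum(axis=0) + smoothing * global_mean) / (observed.sum(axis=0) + smoothing)
    return 0.5 * (model_acc[:, None] + item_acc[None, :])


P = baseline_predict(R, observed)  # predicted probabilities for all cells
predictions = [(int(i), int(j), float(P[i, j])) for i, j in held_out]
```

A baseline like this ignores metadata entirely, so it mainly gives a floor to compare stronger methods against.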

Available Data

Response matrix: Partially observed binary matrix (models x items)
Model metadata: Model size (parameters), release date, organization, model family
Item metadata: Item text embeddings, benchmark category, difficulty estimates

Evaluation Metrics

AUC-ROC

Primary Metric

Area under the receiver operating characteristic curve, measuring discriminative ability of predicted probabilities.
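For local validation, AUC-ROC can be computed from ranks (the Mann-Whitney U statistic). The sketch below is one NumPy-only way to do this. The function name `auc_roc` is illustrative, and the official scorer may differ in details such as how scores are pooled across models or items.

```python
import numpy as np


def auc_roc(y_true, y_score):
    """AUC via the rank-sum formulation; tied scores receive their average rank."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = int(y_true.size - n_pos)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")
    # np.unique returns sorted unique scores; `inverse` maps each score to its group.
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)                # last 1-based rank in each tie group
    avg_rank = ends - (counts - 1) / 2.0    # mean rank within each tie group
    ranks = avg_rank[inverse]
    rank_sum_pos = ranks[y_true].sum()
    return float((rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))


example_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because AUC depends only on the ordering of scores, any monotone rescaling of the predictions leaves it unchanged.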

Log-Loss

Secondary Metric

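Binary cross-entropy between the predicted probabilities and the held-out outcomes. Unlike AUC-ROC, it is sensitive to calibration: confident but wrong predictions are penalized heavily.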
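A minimal sketch of the log-loss computation follows. The clipping constant `eps` is an assumption made here to keep the logarithm finite; the value used by the official scorer, if any, is not stated.

```python
import numpy as np


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


uninformed = log_loss([1, 0], [0.5, 0.5])        # predicting 0.5 everywhere
overconfident = log_loss([1, 0], [0.01, 0.99])   # confident and wrong
```

Predicting 0.5 for every entry scores ln 2, which is a useful sanity threshold for any submission.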

Suggested Approaches

Participants are free to use any method. Here are some starting points from the literature.

IRT Models

Classic psychometric models: Rasch (1PL), 2PL, 3PL, and their modern variants like Beta-IRT and Amortized IRT.

Rasch, 2PL / 3PL, Beta-IRT
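To make the simplest of these concrete: the Rasch (1PL) model sets P(R_ij = 1) = sigmoid(theta_i - b_j), where theta_i is the ability of model i and b_j the difficulty of item j. The sketch below fits it by gradient ascent on the observed entries. The NaN convention for held-out entries, the L2 penalty, the hyperparameters, and the function name `fit_rasch` are all illustrative assumptions, not a reference implementation.

```python
import numpy as np


def fit_rasch(R, n_iters=500, lr=0.05, l2=0.01):
    """Fit P(R_ij = 1) = sigmoid(theta_i - b_j) on observed entries (NaN = held out)."""
    observed = ~np.isnan(R)
    Y = np.where(observed, R, 0.0)
    theta = np.zeros(R.shape[0])  # model abilities
    b = np.zeros(R.shape[1])      # item difficulties
    for _ in range(n_iters):
        P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        # Derivative of the log-likelihood with respect to each logit,
        # zeroed on held-out cells so they do not influence the fit.
        resid = np.where(observed, Y - P, 0.0)
        theta += lr * (resid.sum(axis=1) - l2 * theta)
        b += lr * (-resid.sum(axis=0) - l2 * b)
    return theta, b


# Hypothetical toy data: 3 models x 4 items.
R_toy = np.array([
    [1.0, 1.0, 1.0, np.nan],
    [1.0, 0.0, np.nan, 0.0],
    [1.0, 0.0, 0.0, 0.0],
])
theta, b = fit_rasch(R_toy)
P_rasch = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # includes held-out cells
```

The small L2 penalty keeps the estimates finite for models or items whose observed responses are all correct or all incorrect, where the unpenalized maximum likelihood estimate would diverge.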

Matrix Factorization

Low-rank decompositions and collaborative filtering approaches adapted for binary outcomes with side information.

SVD, NMF, Side Information
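One way to adapt low-rank factorization to binary outcomes is to pass the inner product of model and item factors through a sigmoid and maximize the Bernoulli likelihood on observed entries. The sketch below does this without side information. The rank, learning rate, penalty, NaN convention, and the name `fit_logistic_mf` are assumptions chosen for illustration.

```python
import numpy as np


def fit_logistic_mf(R, rank=2, n_iters=300, lr=0.05, l2=0.1, seed=0):
    """Logistic matrix factorization: P(R_ij = 1) = sigmoid(u_i . v_j)."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(R)
    Y = np.where(observed, R, 0.0)
    U = 0.1 * rng.standard_normal((R.shape[0], rank))  # model factors
    V = 0.1 * rng.standard_normal((R.shape[1], rank))  # item factors
    for _ in range(n_iters):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))
        resid = np.where(observed, Y - P, 0.0)  # held-out cells contribute no gradient
        grad_U = resid @ V - l2 * U
        grad_V = resid.T @ U - l2 * V
        U += lr * grad_U
        V += lr * grad_V
    return 1.0 / (1.0 + np.exp(-(U @ V.T)))


# Hypothetical toy data with two groups of models that succeed on different items.
R_blocks = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, np.nan],
    [0.0, 0.0, 1.0, 1.0],
    [np.nan, 0.0, 1.0, 1.0],
])
P_mf = fit_logistic_mf(R_blocks)
obs_mask = ~np.isnan(R_blocks)
train_accuracy = float(np.mean((P_mf[obs_mask] > 0.5) == (R_blocks[obs_mask] == 1.0)))
```

Side information could enter such a model by, for example, making the factors a function of the metadata features, but that extension is not shown here.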

Neural / Hybrid

Neural latent variable models, deep matrix factorization, and ensemble methods combining multiple approaches.

VAE, Neural CF, Ensemble

Submission Format

How to Submit

Submit a CSV file containing your predicted probabilities for each held-out entry in the response matrix.

CSV Format

model_id,item_id,predicted_probability
model_001,item_042,0.8734
model_001,item_105,0.2156
model_003,item_042,0.9512
...
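A file in this layout can be produced with Python's standard `csv` module. In the sketch below, the helper name `write_submission`, the range check, and the four-decimal formatting (which mirrors the example rows) are assumptions, since the page does not state a required precision.

```python
import csv
import io


def write_submission(rows, fileobj):
    """Write (model_id, item_id, probability) rows in the submission CSV layout."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["model_id", "item_id", "predicted_probability"])
    for model_id, item_id, prob in rows:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {model_id}, {item_id}: {prob}")
        writer.writerow([model_id, item_id, f"{prob:.4f}"])


# Rows taken from the example above; an in-memory buffer stands in for a file.
rows = [("model_001", "item_042", 0.8734), ("model_001", "item_105", 0.2156)]
buffer = io.StringIO()
write_submission(rows, buffer)
```

To write to disk instead, open the target with `open(path, "w", newline="")` and pass the file object in place of the buffer.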

Submission Options

Default track: CSV upload (prediction only)
Advanced track: Docker container (code submission)
LLM featurizer: Upload model to HuggingFace, plus a config with a pinned commit hash

Submit on Codabench