Matrix completion for AI evaluation
Given a partially observed binary response matrix (models x items), predict the held-out entries. Each entry indicates whether a model answered an evaluation item correctly (1) or incorrectly (0). This is a structured matrix completion problem with binary outcomes, where rich metadata about both models and items is available to inform predictions.
You are given a sparse binary matrix R of shape (M models x N items), where observed entries are 0 or 1 indicating incorrect or correct responses. Your goal is to predict the probability of correctness for each held-out (unobserved) entry.
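As a rough illustration of the task, the sketch below stores the observed entries as a dictionary keyed by (model, item) and fills each held-out entry with the model's smoothed observed accuracy. The identifiers, the toy responses, and the `prior_strength` parameter are all hypothetical; this is a trivial baseline under those assumptions, not a method described by the task.

```python
from collections import defaultdict

# Hypothetical observed entries: (model_id, item_id) -> 0/1 response.
observed = {
    ("model_001", "item_001"): 1,
    ("model_001", "item_002"): 0,
    ("model_002", "item_001"): 1,
    ("model_002", "item_003"): 1,
}


def model_mean_baseline(observed, held_out, prior_strength=1.0):
    """Predict each held-out entry with the model's smoothed observed accuracy."""
    global_mean = sum(observed.values()) / len(observed)
    correct = defaultdict(float)
    count = defaultdict(int)
    for (model_id, _item_id), response in observed.items():
        correct[model_id] += response
        count[model_id] += 1
    preds = {}
    for model_id, item_id in held_out:
        # Shrink toward the global mean so models with few or no
        # observations still receive a sensible probability.
        numerator = correct[model_id] + prior_strength * global_mean
        preds[(model_id, item_id)] = numerator / (count[model_id] + prior_strength)
    return preds


held_out = [("model_001", "item_003"), ("model_003", "item_001")]
preds = model_mean_baseline(observed, held_out)
```

A model with no observed responses falls back to the global mean, which keeps every prediction strictly inside the range the metrics can score.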
Area under the receiver operating characteristic curve (AUC-ROC), which measures how well the predicted probabilities rank correct responses above incorrect ones.
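For intuition, AUC equals the probability that a randomly chosen correct response receives a higher predicted probability than a randomly chosen incorrect one, with ties counted as half. The function below is a minimal sketch of that definition; the name `roc_auc` is hypothetical, and the official scoring code may handle ties or aggregation differently.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.3, 0.4, 0.6])
```

The pairwise loop is quadratic in the number of entries, so it suits small validation splits; a rank-based computation gives the same value faster on large ones.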
Binary cross-entropy (log loss), which measures how well calibrated the predicted probabilities are and heavily penalizes confident wrong predictions.
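The loss can be sketched as the mean negative log-likelihood of the observed responses. The clipping constant `eps` is an assumption added here to keep the logarithm finite at exactly 0 or 1; the official scorer may clip differently or not at all.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals log(2)
```

Because the penalty grows without bound as a wrong prediction approaches 0 or 1, it is usually safer to keep submitted probabilities slightly away from the extremes.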
Participants are free to use any method. Here are some starting points from the literature.
Item response theory (IRT) models from psychometrics: Rasch (1PL), 2PL, 3PL, and modern variants such as Beta-IRT and amortized IRT.
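As a concrete starting point, the Rasch (1PL) model gives each model an ability and each item a difficulty, and predicts correctness with a logistic function of their difference. The sketch below fits it by batch gradient ascent on the observed entries. The function names, the hyperparameters, and the toy data are assumptions for illustration, and the small L2 penalty is added only to pin down the otherwise arbitrary shift between abilities and difficulties.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_rasch(triples, n_models, n_items, lr=0.1, l2=0.01, epochs=200):
    """Fit P(correct) = sigmoid(theta[m] - b[i]) on (model, item, response) triples."""
    theta = [0.0] * n_models  # model abilities
    b = [0.0] * n_items       # item difficulties
    for _ in range(epochs):
        grad_theta = [0.0] * n_models
        grad_b = [0.0] * n_items
        for m, i, r in triples:
            # Derivative of the log-likelihood with respect to the logit.
            err = r - sigmoid(theta[m] - b[i])
            grad_theta[m] += err
            grad_b[i] -= err
        for m in range(n_models):
            theta[m] += lr * (grad_theta[m] - l2 * theta[m])
        for i in range(n_items):
            b[i] += lr * (grad_b[i] - l2 * b[i])
    return theta, b


# Toy data with integer indices; entry (model 2, item 2) is held out.
triples = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1),
    (1, 0, 1), (1, 1, 0), (1, 2, 0),
    (2, 0, 0), (2, 1, 0),
]
theta, b = fit_rasch(triples, n_models=3, n_items=3)
p_held_out = sigmoid(theta[2] - b[2])
```

The 2PL and 3PL variants extend the same logit with a per-item discrimination and a guessing floor, respectively.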
Low-rank decompositions and collaborative filtering approaches adapted for binary outcomes with side information.
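One simple instance of this family is logistic matrix factorization, where the logit of each entry is the inner product of a model vector and an item vector, trained by stochastic gradient ascent on the observed entries. The sketch below is an assumption-laden illustration: the function name, rank, learning rate, and toy data are hypothetical, and it omits bias terms and side information for brevity.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_logistic_mf(triples, n_models, n_items, rank=2, lr=0.05,
                    l2=0.01, epochs=300, seed=0):
    """Fit P(correct) = sigmoid(U[m] . V[i]) on (model, item, response) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_models)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    order = list(triples)
    for _ in range(epochs):
        rng.shuffle(order)
        for m, i, r in order:
            logit = sum(u * v for u, v in zip(U[m], V[i]))
            err = r - sigmoid(logit)  # log-likelihood gradient w.r.t. the logit
            for k in range(rank):
                u, v = U[m][k], V[i][k]
                U[m][k] += lr * (err * v - l2 * u)
                V[i][k] += lr * (err * u - l2 * v)

    def predict(m, i):
        return sigmoid(sum(u * v for u, v in zip(U[m], V[i])))

    return predict


# Toy block-structured responses; entry (model 0, item 1) is held out.
truth = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
triples = [(m, i, truth[m][i]) for m in range(4) for i in range(4)
           if (m, i) != (0, 1)]
predict = fit_logistic_mf(triples, n_models=4, n_items=4)
p_held_out = predict(0, 1)
```

Side information could enter this sketch by adding a learned function of model or item metadata to the logit, which is one way to handle models or items with very few observations.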
Neural latent variable models, deep matrix factorization, and ensemble methods combining multiple approaches.
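For the ensemble case, one common and simple combination rule is a weighted average in logit space, which keeps the result a valid probability. The sketch below assumes each method has already produced probabilities for the same held-out entries in the same order; the function name, the weights, and the clipping constant are hypothetical.

```python
import math


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # assumed clipping to keep the logit finite
    return math.log(p / (1.0 - p))


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_predictions(prob_lists, weights=None):
    """Combine aligned probability lists by weighted averaging of their logits."""
    n = len(prob_lists)
    if weights is None:
        weights = [1.0] * n
    if len(weights) != n:
        raise ValueError("need exactly one weight per prediction list")
    total = sum(weights)
    combined = []
    for probs in zip(*prob_lists):
        z = sum(w * _logit(p) for w, p in zip(weights, probs)) / total
        combined.append(_sigmoid(z))
    return combined


blended = ensemble_predictions([[0.2, 0.9], [0.8, 0.9]])
```

The weights could be tuned on a validation split of the observed entries, for example by minimizing binary cross-entropy.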
Submit a CSV file containing your predicted probabilities for each held-out entry in the response matrix.
model_id,item_id,predicted_probability
model_001,item_042,0.8734
model_001,item_105,0.2156
model_003,item_042,0.9512
...
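A file in the format shown above could be written with Python's standard `csv` module, as in the following sketch. The function name `write_submission` and the choice of four decimal places are assumptions based on the example rows; the required precision is not stated here.

```python
import csv
import io


def write_submission(predictions, fileobj):
    """Write {(model_id, item_id): probability} as a submission CSV."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["model_id", "item_id", "predicted_probability"])
    for (model_id, item_id), p in sorted(predictions.items()):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {model_id}, {item_id}: {p}")
        # Four decimals mirrors the example rows; this precision is an assumption.
        writer.writerow([model_id, item_id, f"{p:.4f}"])


predictions = {
    ("model_001", "item_042"): 0.8734,
    ("model_001", "item_105"): 0.2156,
    ("model_003", "item_042"): 0.9512,
}
buffer = io.StringIO()
write_submission(predictions, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("submission.csv", "w", newline="")`, where the file name is again only an example.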
Submission options:
Default track: CSV upload (predictions only).
Advanced track: Docker container (code submission).
LLM featurizer: upload the model to HuggingFace and provide a config with a pinned commit hash.