Matrix Completion
Fill in a sparse evaluation matrix from partial observations so you can measure models without testing every item.
Competition
Can you predict how an AI system will respond to a question it has never seen?
The challenge
Most benchmarks report aggregate scores and move on. This competition reframes evaluation as a prediction problem: given partial observations of how models perform, fill in the missing entries and produce reliable ability scores, even when observations are sparse and item coverage is uneven.
Observe
Sparse responses across models and evaluation items
Predict
Fill in missing entries with your model
Score
Evaluate predictions on hidden items via log-loss
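The scoring step above can be sketched concretely. This is a minimal, hypothetical illustration (the function name and toy numbers are ours, not the competition's official scorer): predictions are probabilities of a correct response, and the score is the mean negative log-likelihood on the held-out entries.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # guard against log(0)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical hidden outcomes (1 = correct) and a submission's predicted probabilities
y_hidden = np.array([1, 0, 1, 1])
p_submit = np.array([0.9, 0.2, 0.8, 0.6])
print(round(log_loss(y_hidden, p_submit), 4))  # → 0.2656
```

Lower is better; confidently wrong predictions (a probability near 1 on an incorrect answer) are penalized far more heavily than calibrated uncertainty.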
Research areas
Core research areas that inform solution design and evaluation methodology.
Apply low-rank matrix completion to impute the missing model-by-item entries from the partial observations you do have.
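One simple baseline for this is iterative truncated-SVD imputation. The sketch below is an assumption-laden toy (the function, the rank-1 choice, and the 3×3 matrix are all illustrative, not a prescribed method): fill missing cells with the observed mean, repeatedly project onto a low-rank approximation, and keep the observed entries fixed.

```python
import numpy as np

def complete_matrix(M, mask, rank=1, n_iters=50):
    """Impute missing entries of M (where mask is False) via iterative truncated SVD."""
    X = np.where(mask, M, np.nanmean(M[mask]))  # initialize missing cells with the observed mean
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-r approximation
        X = np.where(mask, M, X_low)                  # re-clamp observed entries
    return X

# Toy accuracy matrix: rows = models, cols = items; np.nan marks untested cells
M = np.array([[1.0, 1.0, 0.0],
              [1.0, np.nan, 0.0],
              [0.0, 0.0, np.nan]])
mask = ~np.isnan(M)
X = complete_matrix(M, mask, rank=1)
```

Here the second model's row matches the first on every observed item, so the rank-1 structure pulls its missing entry toward 1, while the weak third model's missing entry stays near 0.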
Apply psychometric models like Rasch and 2PL to estimate latent ability and item difficulty from observed response patterns.
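As a sketch of the Rasch case (a deliberately minimal joint maximum-likelihood fit; the hyperparameters and toy data are our own assumptions, and real pipelines often use more robust estimators): each model gets an ability θ, each item a difficulty b, and P(correct) = σ(θ − b) is fit by gradient ascent on the Bernoulli log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(R, mask, lr=0.1, n_iters=500):
    """Joint maximum-likelihood fit of a Rasch model on a binary response matrix.

    R[i, j] = 1 if model i answered item j correctly; mask marks observed cells.
    """
    n_models, n_items = R.shape
    theta = np.zeros(n_models)   # model abilities
    b = np.zeros(n_items)        # item difficulties
    for _ in range(n_iters):
        P = sigmoid(theta[:, None] - b[None, :])
        G = np.where(mask, R - P, 0.0)   # gradient of the Bernoulli log-likelihood
        theta += lr * G.sum(axis=1)
        b -= lr * G.sum(axis=0)
        b -= b.mean()                    # anchor mean difficulty at 0 (identifiability)
    return theta, b

# Toy data: three models, three items, every cell observed, no all-correct rows/columns
R = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
mask = np.ones_like(R, dtype=bool)
theta, b = fit_rasch(R, mask)
```

Missing cells can then be predicted as `sigmoid(theta[:, None] - b[None, :])`; the 2PL extension adds a per-item discrimination parameter multiplying (θ − b).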
Use factor analysis and latent trait models to recover the hidden abilities and item properties that drive evaluation outcomes.
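A quick way to see this working is on synthetic data (everything below — the sizes, the seed, the generative model — is an invented example, not the competition's data): generate responses from hidden abilities and difficulties, then check that the first principal component of the centered response matrix recovers the ability ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items = 60, 40
ability = rng.normal(size=n_models)         # hidden per-model ability
difficulty = rng.normal(size=n_items)       # hidden per-item difficulty
P = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
R = (rng.random((n_models, n_items)) < P).astype(float)  # simulated binary responses

# First principal component of the centered response matrix as a latent-trait estimate
X = R - R.mean(axis=0, keepdims=True)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
trait = U[:, 0] * s[0]

# The sign of a principal component is arbitrary, so compare by magnitude
r = np.corrcoef(trait, ability)[0, 1]
```

Even though the data is binary and the link is logistic, this linear one-factor recovery correlates strongly with the true ability, which is why factor-analytic structure is a useful prior for filling in the matrix.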
Strategically select which items to label using the optional labeling interface, maximizing information per evaluation.
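One standard selection heuristic, sketched under a Rasch assumption (the function and item pool are hypothetical, not part of the labeling interface): pick the unlabeled item with the highest Fisher information at the current ability estimate, which for Rasch is I = p(1 − p) and peaks where item difficulty matches ability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pick_next_item(theta_hat, difficulties, unlabeled):
    """Return the unlabeled item with maximal Fisher information I = p * (1 - p)
    at the current ability estimate theta_hat (Rasch model assumption)."""
    p = sigmoid(theta_hat - difficulties[unlabeled])
    info = p * (1 - p)
    return unlabeled[int(np.argmax(info))]

# Hypothetical item pool: difficulties on the same scale as ability
difficulties = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])
unlabeled = np.array([0, 2, 3, 4])
print(pick_next_item(0.0, difficulties, unlabeled))  # → 2 (difficulty closest to θ = 0)
```

Items far too easy or too hard carry almost no information about a model near θ, so spending the labeling budget near the ability frontier maximizes information per evaluation.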
Competition announcements, new baselines, and results. One email when it matters.