Competition

The Predictive AI Evaluation Challenge

Can you predict how an AI system will respond to a question it has never seen?

[Figure: evaluation matrix of models (Dune, Alder, Ivy, Heath, Fern, Birch, Jade, Cedar, Elm, Grove) against evaluation items (Q10 to Q15). Legend: Strong, Moderate, Low, Weak.]

The challenge

Evaluation as a prediction problem.

Most benchmarks report aggregate scores and move on. This competition reframes evaluation as a prediction problem: given partial observations of how models perform, fill in the missing entries and produce reliable ability scores, even when observations are sparse and item coverage is uneven.

Observe

Sparse responses across models and evaluation items

Predict

Fill in missing entries with your model

Score

Evaluate predictions on hidden items via log-loss
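The three steps above can be sketched end to end. This is a minimal illustration, not the competition's scoring code: it assumes binary outcomes (1 = correct, 0 = incorrect), and the observations, the hidden entries, and the helper names `model_mean_baseline` and `log_loss` are all hypothetical.

```python
import math

# Observe: hypothetical sparse responses, keyed by (model, item).
observed = {
    ("Dune", "Q10"): 1, ("Dune", "Q11"): 1, ("Dune", "Q13"): 0,
    ("Alder", "Q10"): 0, ("Alder", "Q12"): 0, ("Alder", "Q14"): 1,
}
# Hypothetical hidden entries, used only for scoring.
hidden = {("Dune", "Q12"): 1, ("Alder", "Q11"): 0}

def model_mean_baseline(observed, prior=0.5, strength=2.0):
    """Predict: return a function giving each model's smoothed accuracy."""
    totals, counts = {}, {}
    for (model, _item), y in observed.items():
        totals[model] = totals.get(model, 0) + y
        counts[model] = counts.get(model, 0) + 1
    rates = {
        m: (totals[m] + prior * strength) / (counts[m] + strength)
        for m in counts
    }
    # Models with no observations fall back to the prior.
    return lambda model, item: rates.get(model, prior)

def log_loss(targets, predict, eps=1e-15):
    """Score: mean binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for (model, item), y in targets.items():
        p = min(max(predict(model, item), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)

predict = model_mean_baseline(observed)
score = log_loss(hidden, predict)
print(f"log-loss on hidden entries: {score:.4f}")
```

Lower is better. A predictor that always outputs 0.5 scores ln 2 (about 0.693), which makes a convenient reference point for any submission.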

Research areas

The science behind the challenge.

Core research areas that inform solution design and evaluation methodology.

Matrix Completion

Fill in a sparse evaluation matrix from partial observations so you can measure models without testing every item.
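One common way to do this is low-rank factorization fitted only on the observed cells. The sketch below is an assumption-laden illustration rather than a competition baseline: it uses squared error and plain SGD for brevity, and the function name `complete_matrix`, the hyperparameters, and the toy scores are all hypothetical.

```python
import random

def complete_matrix(observed, n_models, n_items, rank=2,
                    lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit U (models x rank) and V (items x rank) on observed cells only."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_models)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), y in entries:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = y - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    # Dense reconstruction: every cell, observed or not, gets a prediction.
    return [[sum(U[i][k] * V[j][k] for k in range(rank))
             for j in range(n_items)] for i in range(n_models)]

# Hypothetical scores for 3 models x 4 items; three cells are left unobserved.
observed = {
    (0, 0): 0.90, (0, 1): 0.72, (0, 3): 0.81,
    (1, 0): 0.60, (1, 2): 0.30, (1, 3): 0.54,
    (2, 1): 0.24, (2, 2): 0.15, (2, 3): 0.27,
}
filled = complete_matrix(observed, n_models=3, n_items=4)
print([round(x, 2) for x in filled[0]])
```

Because the challenge scores with log-loss, a probabilistic variant (for example, passing the dot product through a sigmoid and fitting with cross-entropy) is likely a better match for binary outcomes.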

Item Response Theory

Apply psychometric models like Rasch and 2PL to estimate latent ability and item difficulty from observed response patterns.
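As a rough illustration of the two models named above: Rasch gives the probability of a correct response as a logistic function of ability minus difficulty, and 2PL adds a per-item discrimination parameter. The function names and the toy responses are hypothetical, and the ability estimate assumes item difficulties are already known.

```python
import math

def rasch_prob(theta, b):
    """Rasch: P(correct) given ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def two_pl_prob(theta, a, b):
    """2PL: as Rasch, with item discrimination a scaling the slope."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, difficulties, lr=0.5, steps=200):
    """Maximum-likelihood ability under Rasch via gradient ascent.

    The log-likelihood gradient is sum(y - p). The estimate diverges if
    every response is correct or every response is incorrect.
    """
    theta = 0.0
    for _ in range(steps):
        grad = sum(y - rasch_prob(theta, b)
                   for y, b in zip(responses, difficulties))
        theta += lr * grad
    return theta

# Hypothetical model: right on the easy and medium items, wrong on the hard one.
theta_hat = estimate_ability([1, 1, 0], [-1.0, 0.0, 1.0])
print(f"estimated ability: {theta_hat:.2f}")
```

In practice abilities and item parameters are estimated jointly, and sparse data usually calls for priors or regularization on both.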

Latent Variable Models

Use factor analysis and latent trait models to recover the hidden abilities and item properties that drive evaluation outcomes.
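One way to write such a model, given here as an illustrative assumption rather than the competition's specification, is a multidimensional logistic model in which model i has a K-dimensional ability vector and item j has a loading vector and an offset. All symbols below are introduced for this sketch.

```latex
P(y_{ij} = 1 \mid \theta_i) = \sigma\!\left(a_j^{\top}\theta_i - b_j\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\qquad \theta_i,\, a_j \in \mathbb{R}^{K},\; b_j \in \mathbb{R}
```

With K = 1 and a_j = 1 this reduces to the Rasch model with item difficulty b_j. Larger K lets different items draw on different abilities, which is the factor-analytic view of the same matrix.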

Adaptive Evaluation

Strategically select which items to label using the optional labeling interface, maximizing information per evaluation.
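A standard heuristic for this kind of selection is to label the item with the highest Fisher information at the current ability estimate. The sketch below assumes a 2PL model with known item parameters; the item parameters, the function names, and the idea that the labeling interface accepts one item at a time are all assumptions.

```python
import math

def two_pl_prob(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information an item carries about theta under 2PL: a^2 * p * (1 - p)."""
    p = two_pl_prob(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, labeled):
    """Return the unlabeled item with maximum information, or None if none remain."""
    candidates = [name for name in items if name not in labeled]
    if not candidates:
        return None
    return max(candidates,
               key=lambda name: fisher_information(theta, *items[name]))

# Hypothetical item parameters: name -> (discrimination a, difficulty b).
items = {
    "Q10": (1.0, -2.0),
    "Q11": (1.0, 0.1),
    "Q12": (1.0, 2.0),
    "Q13": (1.5, 0.0),
}
next_item = pick_next_item(theta=0.0, items=items, labeled={"Q13"})
print(next_item)
```

Information peaks where difficulty is close to the current ability estimate, so the rule tends to skip items the model would almost certainly pass or fail.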

Follow the work.

Competition announcements, new baselines, and results. One email when it matters.