
Competition

The Predictive AI Evaluation Challenge

Can you predict how an AI system will respond to a question it has never seen?

The challenge

Evaluation as a prediction problem.

Most benchmarks report aggregate scores and move on. This competition reframes evaluation as a prediction problem: given partial observations of how models perform, fill in the missing entries and produce reliable ability scores, even when observations are sparse and item coverage is uneven.

Observe

Sparse responses across models and evaluation items

Predict

Fill in missing entries with your model

Score

Evaluate predictions on hidden items via log-loss

The task

Response Prediction

You receive a sparse response matrix of AI models evaluated across multiple benchmarks. Most entries are missing. Your task is to predict how each model responds to hidden test items; a fresh sample of these items is drawn for each submission.
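To make the setup concrete, here is a minimal sketch of a sparse response matrix and a naive baseline that predicts each model's smoothed accuracy. The dictionary layout, the IDs, and the smoothing constants are illustrative assumptions, not the competition's data format or one of its provided baselines.

```python
from collections import defaultdict

# Toy observations: (model_id, item_id) -> 1 (correct) or 0 (incorrect).
# Pairs absent from the dict are the missing entries to be predicted.
observed = {
    ("model_a", "item_1"): 1,
    ("model_a", "item_2"): 1,
    ("model_b", "item_1"): 0,
    ("model_b", "item_3"): 1,
    ("model_c", "item_2"): 0,
}


def fit_model_means(observed, prior=0.5, strength=2.0):
    """Smoothed per-model accuracy, shrunk toward `prior` when data is sparse."""
    correct = defaultdict(float)
    count = defaultdict(int)
    for (model_id, _item_id), y in observed.items():
        correct[model_id] += y
        count[model_id] += 1
    return {
        m: (correct[m] + prior * strength) / (count[m] + strength)
        for m in count
    }


def predict_pair(model_means, model_id, prior=0.5):
    """Predict P(correct) for any item; unseen models fall back to `prior`."""
    return model_means.get(model_id, prior)


model_means = fit_model_means(observed)
p_missing = predict_pair(model_means, "model_a")  # e.g. the pair (model_a, item_3)
```

This baseline ignores item difficulty entirely, which is the gap that matrix completion and item response theory methods are meant to close.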

  • Matrix completion
  • Collaborative filtering
  • Item response theory
  • Adaptive labeling

The data

  • Multiple benchmarks spanning diverse AI evaluation domains
  • Hundreds of models with tens of thousands of visible training items
  • Over a million observed binary responses
  • A sample of hidden items drawn per submission for evaluation

Scoring

Log-loss

Primary

Negative log-likelihood of predicted probabilities on held-out binary responses.

AUC-ROC

Secondary

Area under the ROC curve for predicted response probabilities.
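For local validation, both metrics can be computed in a few lines of plain Python. This is a sketch under assumptions: the mean (rather than the sum) over responses, the clipping constant eps, and the handling of ties are choices made here for illustration, and the platform's exact implementation may differ.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def auc_roc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. The pairwise loop is O(n^2), which is fine
    for a small check but not for large evaluation sets.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7]
loss = log_loss(y_true, y_prob)
auc = auc_roc(y_true, y_prob)
```

Log-loss penalizes confident mistakes heavily, so well-calibrated probabilities matter more for the primary metric than they do for AUC-ROC, which depends only on the ranking.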

Timeline

Key dates.

  1. Competition Opens: April 2026
  2. Submission Deadline: August 2026
  3. Winners Announced: September 2026
  4. Workshop Presentation: October 2026

How to enter

Build, package, submit.

Submissions are code-based: you write a small Python interface, bundle it into a ZIP, and upload it to the competition platform.
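The bundling step can be scripted with the standard library. The sketch below assumes that files sit at the root of the archive and that the output name is free to choose; neither detail is confirmed here, so check the platform's instructions. The helper name build_submission is hypothetical.

```python
import zipfile
from pathlib import Path


def build_submission(source_dir, zip_path, optional=("labeling.py", "models.txt")):
    """Write model.py (required) and any optional files present into a ZIP."""
    source = Path(source_dir)
    if not (source / "model.py").is_file():
        raise FileNotFoundError("model.py is required")
    names = ["model.py"] + [n for n in optional if (source / n).is_file()]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname places each file at the archive root (an assumption).
            zf.write(source / name, arcname=name)
    return names


# Example call (paths are placeholders):
# build_submission("my_entry", "submission.zip")
```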

Prerequisites

  • Python 3.13
  • torch_measure (latest)
  • A Codabench account

Install

pip install torch_measure

Baselines provided

  • Rasch (1PL) IRT model
  • 2PL IRT model
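For readers new to item response theory, the two baseline families differ only in their response function. The sketch below uses the standard textbook forms, with theta as the AI model's ability, b as the item's difficulty, and a as the item's discrimination. It is not the torch_measure API, and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rasch_prob(theta, b):
    """1PL (Rasch): P(correct) depends only on ability minus difficulty."""
    return sigmoid(theta - b)


def two_pl_prob(theta, a, b):
    """2PL: discrimination `a` scales how sharply P(correct) changes around b."""
    return sigmoid(a * (theta - b))


p_rasch = rasch_prob(theta=1.0, b=0.0)
p_2pl = two_pl_prob(theta=1.0, a=2.0, b=0.0)
```

With a fixed at 1 for every item, the 2PL model reduces to the Rasch model.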

The submission interface

A ZIP file containing model.py with train() and predict() functions. predict() returns a list of dictionaries, each with model_id, item_id, and predicted_probability fields. Optionally include labeling.py with score_pairs() for adaptive labeling.

# model.py - required interface
def train(train_dir: str) -> None:
    """Train your model on the visible response matrix."""
    ...

def predict(train_dir: str, test_dir: str,
            labeled_pairs: list[dict]) -> list[dict]:
    """Return predictions for held-out (model_id, item_id) pairs."""
    # Placeholder: fill with (model_id, item_id, probability) tuples from your model.
    predictions = []
    return [
        {"model_id": m, "item_id": i, "predicted_probability": p}
        for m, i, p in predictions
    ]

# Optional: labeling.py - adaptive labeling
def score_pairs(test_pairs: list[dict],
                train_dir: str = "",
                test_dir: str = "") -> list[float]:
    """Score candidate pairs by expected information gain."""
    ...
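Before uploading, it can help to check that the output of predict() has the documented shape. The checker below is a hypothetical local helper, not part of the competition tooling. It assumes that extra keys are tolerated and that probabilities must lie in [0, 1]; the platform's own validation may be stricter or looser.

```python
import math

REQUIRED_KEYS = {"model_id", "item_id", "predicted_probability"}


def check_predictions(predictions):
    """Raise ValueError if predictions do not match the documented shape."""
    if not isinstance(predictions, list):
        raise ValueError("predict() must return a list")
    for n, row in enumerate(predictions):
        if not isinstance(row, dict):
            raise ValueError(f"entry {n} is not a dict")
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"entry {n} is missing {sorted(missing)}")
        p = row["predicted_probability"]
        if isinstance(p, bool) or not isinstance(p, (int, float)):
            raise ValueError(f"entry {n}: probability is not a number")
        if math.isnan(p) or not 0.0 <= p <= 1.0:
            raise ValueError(f"entry {n}: probability {p} is outside [0, 1]")
    return len(predictions)


n_valid = check_predictions(
    [{"model_id": "model_a", "item_id": "item_1", "predicted_probability": 0.3}]
)
```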

Rules

What to know before you submit.

  1. Submissions are code-based (ZIP with model.py).
  2. Daily and total submission limits apply per team.
  3. Only approved API providers are reachable at runtime.
  4. HuggingFace models must be declared in models.txt and are pre-downloaded.
  5. Top-ranked participants must share code and a method description to remain eligible for prizes.
  6. One account per person. Duplicates will be disqualified.
  7. Pre-trained models are allowed but must be disclosed.

Recognition

Prizes.

Cash Prizes

Awards for top submissions on the final leaderboard.

Co-Authorship

Participants who submit a technical report are invited to co-author a competition summary paper.

Workshop Presentation

Top participants are invited to present their approaches at the AI Measurement Science Workshop, co-located with COLM 2026.

FAQ

Questions, answered.

General

Who can participate?

Anyone, regardless of affiliation. Students, researchers, and industry practitioners are all welcome.

Is there a registration fee?

No. Participation is free. You only need a Codabench account to submit.

What is the submission format?

Submissions are code-based. You upload a ZIP file containing model.py with train() and predict() functions, plus optional files like labeling.py, models.txt, or bundled artifacts. The platform runs your code in a sandboxed container; you do not upload a CSV of predictions.

Technical

What programming languages are supported?

Python only. Your code runs in a sandboxed container with pre-installed dependencies.

Can I use pre-trained models?

Yes, but disclose them in your method description. The focus is on measurement methodology, not training large models from scratch.

What compute resources are available?

Participants use their own compute for development. Codabench routes submissions to appropriate GPU tiers based on the HuggingFace models declared in models.txt.

Can my code access the internet?

Only approved API providers are reachable at runtime. HuggingFace is blocked at runtime; models must be declared in models.txt and are pre-downloaded before execution.

What is adaptive labeling?

An optional feature via score_pairs() in labeling.py. Your code can score candidate (model, item) pairs by expected information gain, letting the platform prioritize which pairs to evaluate next.
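One simple proxy for expected information gain is the entropy of the predicted outcome, which is highest when the model is least certain (p near 0.5). The sketch below follows the score_pairs() signature shown in the interface, but several details are assumptions: that each pair is a dict with model_id and item_id keys, that higher scores mean higher priority, and the estimate_probability helper, which is a hypothetical stand-in for your trained model.

```python
import math


def bernoulli_entropy(p):
    """Entropy in nats of a binary outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def estimate_probability(pair):
    """Hypothetical stand-in for your trained model's P(correct)."""
    toy = {("model_a", "item_1"): 0.95, ("model_a", "item_2"): 0.5}
    return toy.get((pair["model_id"], pair["item_id"]), 0.5)


def score_pairs(test_pairs: list[dict],
                train_dir: str = "",
                test_dir: str = "") -> list[float]:
    """Score each candidate pair by predictive entropy (higher = more uncertain)."""
    return [bernoulli_entropy(estimate_probability(pair)) for pair in test_pairs]


pairs = [
    {"model_id": "model_a", "item_id": "item_1"},
    {"model_id": "model_a", "item_id": "item_2"},
]
scores = score_pairs(pairs)
```

Here the second pair scores higher because its outcome is a coin flip under the toy model, while the first is almost certainly correct and so a label for it would teach little.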

Submissions

How are submissions scored?

Log-loss is the primary metric used for leaderboard ranking. AUC-ROC is reported as a secondary metric.

What happens if my submission fails?

Codabench will show an error message with details. Failed submissions don't count toward your daily limit. Use the smoke-test bundle to validate your submission locally before uploading.

How does the leaderboard work?

There is a single, continuously scored phase. Each submission is evaluated on a fresh random sample of hidden items, so repeated submissions cannot reverse-engineer the test set. Top-scoring submissions undergo manual review after the deadline.

Ready to compete?

Open the competition to register and submit, or get an email when results and new baselines land.


Questions? aims-competition@stanford.edu