Competition
Can you predict how an AI system will respond to a question it has never seen?
The challenge
Most benchmarks report aggregate scores and move on. This competition reframes evaluation as a prediction problem: given partial observations of how models perform, fill in the missing entries and produce reliable ability scores, even when observations are sparse and item coverage is uneven.
Observe
Sparse responses across models and evaluation items
Predict
Fill in missing entries with your model
Score
Evaluate predictions on hidden items via log-loss
The task
You receive a sparse response matrix of AI models evaluated across multiple benchmarks. Most entries are missing. Predict how each model responds to hidden test items sampled per submission.
Log-loss (primary): negative log-likelihood of predicted probabilities on held-out binary responses.
AUC-ROC (secondary): area under the ROC curve for predicted response probabilities.
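To make the two metrics concrete, here is a minimal pure-Python sketch of how they can be computed for binary responses. It is an illustration, not the platform's scoring code: the clipping constant `eps` and the tie handling in `auc_roc` are assumptions, and the official scorer may differ in both.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs (eps is an assumption)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def auc_roc(y_true, p_pred):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Lower log-loss is better, and it penalizes confident wrong predictions heavily. AUC-ROC only looks at ranking, so a model can have a high AUC and still be poorly calibrated.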
Timeline
01
Competition Opens
April 2026
02
Submission Deadline
August 2026
03
Winners Announced
September 2026
04
Workshop Presentation
October 2026
How to enter
Submissions are code-based: you write a small Python interface, bundle it into a ZIP, and upload it to the competition platform.
pip install torch_measure
Submit a ZIP file containing model.py with train() and predict() functions. predict() returns a list of dictionaries, each with model_id, item_id, and predicted_probability fields. Optionally include labeling.py with score_pairs() for adaptive labeling.
# model.py - required interface
def train(train_dir: str) -> None:
    """Train your model on the visible response matrix."""
    ...

def predict(train_dir: str, test_dir: str,
            labeled_pairs: list[dict]) -> list[dict]:
    """Return predictions for held-out (model_id, item_id) pairs."""
    # Placeholder: fill with (model_id, item_id, probability) triples from your model.
    predictions = []
    return [
        {"model_id": m, "item_id": i, "predicted_probability": p}
        for m, i, p in predictions
    ]
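As a rough idea of what could sit behind this interface, the sketch below predicts each model's smoothed success rate for every pair. It is only an assumed baseline, not a reference solution. It works on in-memory (model_id, item_id, response) triples because the file layout of train_dir is not described here. The helper names `fit_model_rates` and `predict_pairs` and the `prior_strength` parameter are hypothetical.

```python
from collections import defaultdict

def fit_model_rates(observations, prior_strength=2.0):
    """Estimate a smoothed success rate per model.

    observations: iterable of (model_id, item_id, response) with response in {0, 1}.
    prior_strength: hypothetical pseudo-count that shrinks sparse models toward the global rate.
    """
    observations = list(observations)
    n = len(observations)
    global_rate = sum(r for _, _, r in observations) / n if n else 0.5
    counts = defaultdict(lambda: [0, 0])  # model_id -> [successes, trials]
    for model_id, _, response in observations:
        counts[model_id][0] += response
        counts[model_id][1] += 1
    rates = {
        model_id: (s + prior_strength * global_rate) / (t + prior_strength)
        for model_id, (s, t) in counts.items()
    }
    return rates, global_rate

def predict_pairs(pairs, rates, global_rate):
    """Build rows in the required output format; unseen models fall back to the global rate."""
    return [
        {"model_id": m, "item_id": i,
         "predicted_probability": rates.get(m, global_rate)}
        for m, i in pairs
    ]
```

This baseline ignores item difficulty entirely, so it is best read as a floor: anything that models items as well as models should beat it on log-loss.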
# Optional: labeling.py - adaptive labeling
def score_pairs(test_pairs: list[dict],
                train_dir: str = "",
                test_dir: str = "") -> list[float]:
    """Score candidate pairs by expected information gain."""
    ...

Rules
Recognition
Awards for top submissions on the final leaderboard.
Participants who submit a technical report are invited to co-author a competition summary paper.
Top participants are invited to present their approaches at the AI Measurement Science Workshop, co-located with COLM 2026.
FAQ
Anyone can participate, regardless of affiliation. Students, researchers, and industry practitioners are all welcome.
Participation is free. You only need a Codabench account to submit.
Submissions are code-based, not a CSV upload. You upload a ZIP file containing model.py with train() and predict() functions, plus optional files like labeling.py, models.txt, or bundled artifacts. Your code runs in a sandboxed container.
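One way to build such a ZIP is with Python's standard zipfile module, as sketched below. The helper name `build_submission` is hypothetical, and placing the files at the archive root is an assumption; check the platform's instructions for the layout it actually expects.

```python
import zipfile
from pathlib import Path

def build_submission(src_dir, out_zip, optional=("labeling.py", "models.txt")):
    """Bundle model.py plus any optional files present in src_dir into out_zip."""
    src = Path(src_dir)
    if not (src / "model.py").is_file():
        raise FileNotFoundError("model.py is required in the submission")
    names = ["model.py"] + [n for n in optional if (src / n).is_file()]
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname keeps each file at the archive root (assumed layout)
            zf.write(src / name, arcname=name)
    return names
```

Calling `build_submission("my_entry", "submission.zip")` returns the list of files that were written, which makes it easy to confirm that nothing was silently left out.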
Submissions must be written in Python. Your code runs in a sandboxed container with pre-installed dependencies.
Yes, but disclose them in your method description. The focus is on measurement methodology, not training large models from scratch.
Participants use their own compute for development. Codabench routes submissions to appropriate GPU tiers based on the HuggingFace models declared in models.txt.
Only approved API providers are reachable at runtime. HuggingFace is blocked at runtime; models must be declared in models.txt and are pre-downloaded before execution.
Adaptive labeling is an optional feature, enabled by implementing score_pairs() in labeling.py. Your code scores candidate (model, item) pairs by expected information gain, letting the platform prioritize which pairs to evaluate next.
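A simple stand-in for expected information gain is the entropy of your predicted probability: pairs near 0.5 are the ones whose outcome you know least about. The sketch below uses that proxy, which is not an exact information-gain computation. The `PREDICTED` cache, the model_id and item_id keys on each pair, and the convention that higher scores mean higher priority are all assumptions.

```python
import math

# Hypothetical cache filled by your own model, keyed by (model_id, item_id).
PREDICTED = {}

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome; 0 when the outcome is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def score_pairs(test_pairs, train_dir="", test_dir=""):
    """Score each candidate pair by predictive entropy (an uncertainty proxy)."""
    scores = []
    for pair in test_pairs:
        key = (pair["model_id"], pair["item_id"])  # assumed dict keys
        p = PREDICTED.get(key, 0.5)  # pairs with no prediction count as maximally uncertain
        scores.append(binary_entropy(p))
    return scores
```

A fuller treatment would estimate how much each label is expected to change the model's ability or difficulty estimates, but predictive entropy is cheap and often a reasonable first pass.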
Log-loss is the primary metric used for leaderboard ranking. AUC-ROC is reported as a secondary metric.
If a submission fails, Codabench shows an error message with details. Failed submissions don't count toward your daily limit. Use the smoke-test bundle to validate your submission locally before uploading.
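Alongside the smoke-test bundle, a quick format check on the output of predict() can catch common mistakes early. The sketch below is not the official smoke test; the function name `validate_predictions` is hypothetical and it only checks the three documented fields and that each probability lies in [0, 1].

```python
import math

REQUIRED_KEYS = {"model_id", "item_id", "predicted_probability"}

def validate_predictions(predictions):
    """Return a list of problems found; an empty list means the format looks right."""
    if not isinstance(predictions, list):
        return ["predict() should return a list"]
    problems = []
    for idx, row in enumerate(predictions):
        if not isinstance(row, dict):
            problems.append(f"row {idx}: not a dict")
            continue
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            problems.append(f"row {idx}: missing {sorted(missing)}")
            continue
        p = row["predicted_probability"]
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not is_number or math.isnan(p) or not 0.0 <= p <= 1.0:
            problems.append(f"row {idx}: probability {p!r} not in [0, 1]")
    return problems
```

Running this on a handful of predictions before zipping takes seconds and avoids spending an upload on a formatting error.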
There is a single, continuously scored phase. Each submission is evaluated on a fresh random sample of hidden items, so repeated submissions cannot reverse-engineer the test set. Top-scoring submissions undergo manual review after the deadline.
Open the competition to register and submit, or get an email when results and new baselines land.
Questions? aims-competition@stanford.edu