Trustworthy ability estimation under contamination
Produce a single scalar ability score for each language model from evaluation data that mixes genuine responses with artificially inflated results (the product of benchmark contamination). Your scoring method must detect and account for the contamination so that scores reflect true model capabilities and yield trustworthy rankings.
Benchmark contamination occurs when a model has been exposed to test data during training, leading to artificially inflated performance on those specific items. The training data for this track explicitly includes such contaminated responses. Your method must identify which responses are likely genuine and which are inflated, producing scores that would match the models' true abilities on clean data.
Primary metric: rank correlation (Kendall's tau) between predicted scores and the ground-truth ability ordering.
Also evaluated: reliability of scores across different subsets of evaluation items.
Multiple strategies exist for handling contaminated evaluation data. Here are some starting points.
Model each response as coming from either a genuine or contaminated distribution. Use EM or variational inference to identify and down-weight contaminated responses.
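As a minimal sketch, assuming binary correctness outcomes arrive as a models-by-items matrix (the variable names and starting values below are illustrative assumptions, not part of the track):

```python
# A sketch of EM for a two-component Bernoulli mixture over binary
# correctness outcomes. The data layout (a models-by-items 0/1 matrix)
# and the starting values are assumptions, not part of the track.
import numpy as np

def em_contamination(correct, n_iters=100, tol=1e-6):
    """correct: (n_models, n_items) binary matrix of outcomes."""
    p_genuine = correct.mean(axis=1, keepdims=True)  # per-model genuine rate
    p_inflated = 0.95   # contaminated responses succeed almost always
    pi = 0.1            # prior probability that a response is contaminated
    for _ in range(n_iters):
        # E-step: responsibility that each response is contaminated
        lik_c = pi * p_inflated**correct * (1 - p_inflated)**(1 - correct)
        lik_g = (1 - pi) * p_genuine**correct * (1 - p_genuine)**(1 - correct)
        r = lik_c / (lik_c + lik_g + 1e-12)
        # M-step: re-estimate parameters, weighting genuine evidence by 1 - r
        w = 1.0 - r
        p_genuine = (w * correct).sum(axis=1, keepdims=True) / (
            w.sum(axis=1, keepdims=True) + 1e-12)
        p_inflated = (r * correct).sum() / (r.sum() + 1e-12)
        new_pi = r.mean()
        if abs(new_pi - pi) < tol:
            pi = new_pi
            break
        pi = new_pi
    # p_genuine is a contamination-corrected per-model success rate;
    # r gives per-response contamination weights for down-weighting.
    return p_genuine.ravel(), r
```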
Leverage release dates and benchmark publication timelines. Models released after a benchmark was published are more likely to have contaminated results on that benchmark.
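One way to encode this as a prior, with hypothetical metadata (substitute whatever release and publication dates the track actually provides):

```python
# A sketch of a temporal contamination prior. The dictionaries and
# field names below are hypothetical placeholders.
from datetime import date

model_release = {"model_001": date(2024, 3, 1)}
bench_published = {"benchmark_A": date(2023, 6, 12)}

def contamination_prior(model_id, benchmark_id, base=0.05, exposed=0.5):
    """Raise the prior contamination probability when the model was
    released after the benchmark became public."""
    released = model_release[model_id]
    published = bench_published[benchmark_id]
    return exposed if released > published else base
```

Such a prior can replace the flat pi in the EM sketch above, so that responses from post-publication models need less evidence to be flagged.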
Identify items where specific models perform suspiciously well compared to expectations from their overall ability profile. Flag and adjust for these anomalies.
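A simple residual-based version of this idea, again assuming a binary models-by-items matrix (the additive margins model and the threshold are illustrative choices):

```python
# A sketch of residual-based anomaly flagging against an additive
# margins model. The z-score threshold and clipping range are illustrative.
import numpy as np

def flag_anomalies(correct, z_thresh=2.0):
    """correct: (n_models, n_items) binary matrix of outcomes."""
    model_skill = correct.mean(axis=1, keepdims=True)  # row margins
    item_ease = correct.mean(axis=0, keepdims=True)    # column margins
    grand = correct.mean()
    # Expected success probability from the overall profiles alone
    expected = np.clip(model_skill + item_ease - grand, 0.01, 0.99)
    resid = (correct - expected) / np.sqrt(expected * (1 - expected))
    # Large positive residuals: the model did suspiciously well here
    return resid > z_thresh
```

Single binary outcomes carry little evidence on their own, so aggregating residuals over groups of items (per benchmark, say) before thresholding gives a more reliable signal.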
Extend classical IRT models to explicitly model contamination as a latent variable, jointly estimating ability, item parameters, and contamination indicators.
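A compact sketch of one such extension, a Rasch model with a per-response contamination mixture (the mixing weight pi and leak probability p_leak are held fixed here for brevity; a fuller treatment would re-estimate them in the M-step):

```python
# A sketch of a contamination-aware Rasch model, fit by alternating an
# E-step (contamination responsibilities) with gradient steps on
# abilities and difficulties.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_contaminated_rasch(correct, pi=0.1, p_leak=0.95, lr=0.1, n_iters=200):
    """correct: (n_models, n_items) binary matrix of outcomes."""
    n_models, n_items = correct.shape
    theta = np.zeros((n_models, 1))  # abilities
    b = np.zeros((1, n_items))       # difficulties
    for _ in range(n_iters):
        p = sigmoid(theta - b)
        lik_g = (1 - pi) * p**correct * (1 - p)**(1 - correct)
        lik_c = pi * p_leak**correct * (1 - p_leak)**(1 - correct)
        w_g = lik_g / (lik_g + lik_c + 1e-12)  # P(genuine | outcome)
        # The marginal log-likelihood gradient is the genuine
        # responsibility times the usual Rasch gradient (x - p).
        grad = w_g * (correct - p)
        theta += lr * grad.mean(axis=1, keepdims=True)
        b -= lr * grad.mean(axis=0, keepdims=True)
        b -= b.mean()  # anchor the difficulty scale
    return theta.ravel(), b.ravel()
```

Here theta serves directly as the submitted ability score: contaminated wins are absorbed by the mixture component instead of inflating the estimate.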
Submit a CSV file containing a scalar ability score for each model in the evaluation set.
model_id,ability_score
model_001,2.453
model_002,1.872
model_003,3.105
model_004,0.614
...
Note: Scores can be on any scale. Only the ranking matters for the primary metric (Kendall's tau). However, well-calibrated scores on a meaningful scale may receive additional recognition.
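For reference, a short snippet that writes the submission file and sanity-checks rank agreement locally (the score values shown are placeholders):

```python
# A sketch of writing the submission CSV; the scores are placeholders.
# The commented line shows a local check of rank agreement between two
# scoring runs using SciPy's Kendall's tau.
import csv
from scipy.stats import kendalltau

scores = {"model_001": 2.453, "model_002": 1.872}  # your estimated abilities

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model_id", "ability_score"])
    for model_id, score in sorted(scores.items()):
        writer.writerow([model_id, f"{score:.3f}"])

# tau, _ = kendalltau(scores_run_a, scores_run_b)  # same model order in both
```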