Trustworthy ability estimation under contamination
Produce a single scalar ability score for each language model from evaluation data that mixes genuine responses with artificially inflated results (the product of benchmark contamination). Your scoring method must detect and account for the contamination so that scores reflect true model capabilities and yield trustworthy rankings.
Benchmark contamination occurs when a model has been exposed to test data during training, leading to artificially inflated performance on those specific items. The training data for this track explicitly includes such contaminated responses. Your method must identify which responses are likely genuine and which are inflated, producing scores that would match the models' true abilities on clean data.
Primary metric: rank correlation (Kendall's tau) between predicted scores and the ground-truth ability ordering.
Also evaluated: reliability of scores across different subsets of evaluation items.
Multiple strategies exist for handling contaminated evaluation data. Here are some starting points.
Model each response as coming from either a genuine or contaminated distribution. Use EM or variational inference to identify and down-weight contaminated responses.
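As a minimal sketch, assuming binary correctness outcomes arrive as a models-by-items matrix (the variable names and starting values below are illustrative assumptions, not part of the track):

```python
# A sketch of EM for a two-component Bernoulli mixture over binary
# correctness outcomes. The data layout (a models-by-items 0/1 matrix)
# and the starting values are assumptions, not part of the track.
import numpy as np

def em_contamination(correct, n_iters=100, tol=1e-6):
    """correct: (n_models, n_items) binary matrix of outcomes."""
    p_genuine = correct.mean(axis=1, keepdims=True)  # per-model genuine rate
    p_inflated = 0.95   # contaminated responses succeed almost always
    pi = 0.1            # prior probability that a response is contaminated
    for _ in range(n_iters):
        # E-step: responsibility that each response is contaminated
        lik_c = pi * p_inflated**correct * (1 - p_inflated)**(1 - correct)
        lik_g = (1 - pi) * p_genuine**correct * (1 - p_genuine)**(1 - correct)
        r = lik_c / (lik_c + lik_g + 1e-12)
        # M-step: re-estimate parameters, weighting genuine evidence by 1 - r
        w = 1.0 - r
        p_genuine = (w * correct).sum(axis=1, keepdims=True) / (
            w.sum(axis=1, keepdims=True) + 1e-12)
        p_inflated = (r * correct).sum() / (r.sum() + 1e-12)
        new_pi = r.mean()
        if abs(new_pi - pi) < tol:
            pi = new_pi
            break
        pi = new_pi
    # p_genuine is a contamination-corrected per-model success rate;
    # r gives per-response contamination weights for down-weighting.
    return p_genuine.ravel(), r
```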
Leverage release dates and benchmark publication timelines. Models released after a benchmark was published are more likely to have contaminated results on that benchmark.
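One way to encode this as a prior, with hypothetical metadata (substitute whatever release and publication dates the track actually provides):

```python
# A sketch of a temporal contamination prior. The dictionaries and
# field names below are hypothetical placeholders.
from datetime import date

model_release = {"model_001": date(2024, 3, 1)}
bench_published = {"benchmark_A": date(2023, 6, 12)}

def contamination_prior(model_id, benchmark_id, base=0.05, exposed=0.5):
    """Raise the prior contamination probability when the model was
    released after the benchmark became public."""
    released = model_release[model_id]
    published = bench_published[benchmark_id]
    return exposed if released > published else base
```

Such a prior can replace the flat pi in the EM sketch above, so that responses from post-publication models need less evidence to be flagged.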
Identify items where specific models perform suspiciously well compared to expectations from their overall ability profile. Flag and adjust for these anomalies.
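A simple residual-based version of this idea, again assuming a binary models-by-items matrix (the additive margins model and the threshold are illustrative choices):

```python
# A sketch of residual-based anomaly flagging against an additive
# margins model. The z-score threshold and clipping range are illustrative.
import numpy as np

def flag_anomalies(correct, z_thresh=2.0):
    """correct: (n_models, n_items) binary matrix of outcomes."""
    model_skill = correct.mean(axis=1, keepdims=True)  # row margins
    item_ease = correct.mean(axis=0, keepdims=True)    # column margins
    grand = correct.mean()
    # Expected success probability from the overall profiles alone
    expected = np.clip(model_skill + item_ease - grand, 0.01, 0.99)
    resid = (correct - expected) / np.sqrt(expected * (1 - expected))
    # Large positive residuals: the model did suspiciously well here
    return resid > z_thresh
```

Single binary outcomes carry little evidence on their own, so aggregating residuals over groups of items (per benchmark, say) before thresholding gives a more reliable signal.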
Extend classical IRT models to explicitly model contamination as a latent variable, jointly estimating ability, item parameters, and contamination indicators.
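A compact sketch of one such extension, a Rasch model with a per-response contamination mixture (the mixing weight pi and leak probability p_leak are held fixed here for brevity; a fuller treatment would re-estimate them in the M-step):

```python
# A sketch of a contamination-aware Rasch model, fit by alternating an
# E-step (contamination responsibilities) with gradient steps on
# abilities and difficulties.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_contaminated_rasch(correct, pi=0.1, p_leak=0.95, lr=0.1, n_iters=200):
    """correct: (n_models, n_items) binary matrix of outcomes."""
    n_models, n_items = correct.shape
    theta = np.zeros((n_models, 1))  # abilities
    b = np.zeros((1, n_items))       # difficulties
    for _ in range(n_iters):
        p = sigmoid(theta - b)
        lik_g = (1 - pi) * p**correct * (1 - p)**(1 - correct)
        lik_c = pi * p_leak**correct * (1 - p_leak)**(1 - correct)
        w_g = lik_g / (lik_g + lik_c + 1e-12)  # P(genuine | outcome)
        # The marginal log-likelihood gradient is the genuine
        # responsibility times the usual Rasch gradient (x - p).
        grad = w_g * (correct - p)
        theta += lr * grad.mean(axis=1, keepdims=True)
        b -= lr * grad.mean(axis=0, keepdims=True)
        b -= b.mean()  # anchor the difficulty scale
    return theta.ravel(), b.ravel()
```

Here theta serves directly as the submitted ability score: contaminated wins are absorbed by the mixture component instead of inflating the estimate.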
Submit a CSV file containing a scalar ability score for each model in the evaluation set.
model_id,ability_score
model_001,2.453
model_002,1.872
model_003,3.105
model_004,0.614
...
Note: Scores can be on any scale. Only the ranking matters for the primary metric (Kendall's tau). However, well-calibrated scores on a meaningful scale may receive additional recognition.
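For reference, a short snippet that writes the submission file and sanity-checks rank agreement locally (the score values shown are placeholders):

```python
# A sketch of writing the submission CSV; the scores are placeholders.
# The commented line shows a local check of rank agreement between two
# scoring runs using SciPy's Kendall's tau.
import csv
from scipy.stats import kendalltau

scores = {"model_001": 2.453, "model_002": 1.872}  # your estimated abilities

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model_id", "ability_score"])
    for model_id, score in sorted(scores.items()):
        writer.writerow([model_id, f"{score:.3f}"])

# tau, _ = kendalltau(scores_run_a, scores_run_b)  # same model order in both
```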