Getting Started

From installation to your first submission in minutes

Step 1: Install torch_measure

torch_measure is an open-source PyTorch toolkit for measurement models (Item Response Theory, factor analysis, and more) applied to AI evaluation. It provides datasets, baselines, and utilities for both competition tracks.

# Install from PyPI
pip install torch_measure

# Or install from source for the latest features
git clone https://github.com/YOUR_REPO/torch_measure.git
cd torch_measure
pip install -e .

Requirements: Python 3.8+, PyTorch 1.12+. All tasks are solvable on CPU. A single GPU will speed up training but is not required.
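
Before moving on, it can be worth confirming that your interpreter meets the Python 3.8+ requirement. A minimal stdlib check (the helper name here is ours, not part of torch_measure):

```python
import sys

# torch_measure requires Python 3.8 or newer
MIN_VERSION = (3, 8)

def check_python_version(current=sys.version_info, minimum=MIN_VERSION):
    """Return True if the running interpreter meets the minimum version."""
    return tuple(current[:2]) >= minimum

if not check_python_version():
    raise RuntimeError(
        f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ required, "
        f"found {sys.version_info.major}.{sys.version_info.minor}"
    )
```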

Step 2: Load the Data

Datasets are available through torch_measure.datasets and also hosted on HuggingFace for direct download.

from torch_measure.datasets import load_competition_data

# Load Track 1 data (response prediction)
track1_data = load_competition_data("track1")
print(f"Response matrix shape: {track1_data.response_matrix.shape}")
print(f"Observed entries: {track1_data.observed_mask.sum()}")
print(f"Held-out entries: {track1_data.held_out_mask.sum()}")

# Load Track 2 data (robust scoring)
track2_data = load_competition_data("track2")
print(f"Models: {track2_data.n_models}")
print(f"Items: {track2_data.n_items}")
print("Contains contaminated data: True")

Alternative: Datasets are also available on HuggingFace. See the torch_measure documentation for direct download links.
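
To build intuition for what the response matrix and its masks represent, here is a pure-Python toy (variable names are illustrative, not the torch_measure API): a binary model-by-item matrix with a random subset of entries held out for prediction.

```python
import random

# Toy binary response matrix: rows = models, columns = items.
# None marks held-out entries that competitors must predict.
random.seed(0)
n_models, n_items = 4, 6

full = [[random.randint(0, 1) for _ in range(n_items)] for _ in range(n_models)]

# Hold out roughly 25% of entries uniformly at random
held_out = {(m, i) for m in range(n_models) for i in range(n_items)
            if random.random() < 0.25}

observed = [[None if (m, i) in held_out else full[m][i]
             for i in range(n_items)] for m in range(n_models)]

n_observed = sum(v is not None for row in observed for v in row)
print(f"Observed entries: {n_observed}")
print(f"Held-out entries: {len(held_out)}")
```

Every entry is either observed or held out, which is exactly the relationship between `observed_mask` and `held_out_mask` in the loaded data.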

Step 3: Run a Baseline

Several baseline models are included to help you get started and benchmark your approach.

Available Baselines

Model | Description | Track
----- | ----------- | -----
Rasch (1PL) | One-parameter logistic IRT model with ability and difficulty parameters | 1 & 2
2PL | Two-parameter logistic model adding item discrimination | 1 & 2
3PL | Three-parameter logistic model adding a guessing parameter | 1 & 2
Beta-IRT | IRT with beta-distributed ability scores for bounded estimation | 1 & 2
Amortized IRT | Neural amortized inference for scalable IRT parameter estimation | 1 & 2
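
For reference, the logistic IRT family above follows a simple functional form: the Rasch model scores P(correct) = σ(ability − difficulty), and the 2PL/3PL variants add discrimination and guessing terms. A minimal sketch of the probability functions (illustrative, not the torch_measure implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rasch_prob(ability, difficulty):
    """P(model answers item correctly) under the 1PL (Rasch) model."""
    return sigmoid(ability - difficulty)

def two_pl_prob(ability, difficulty, discrimination):
    """2PL adds a per-item discrimination (slope) parameter."""
    return sigmoid(discrimination * (ability - difficulty))

def three_pl_prob(ability, difficulty, discrimination, guessing):
    """3PL adds a lower asymptote for guessing."""
    return guessing + (1 - guessing) * two_pl_prob(ability, difficulty, discrimination)

# A model whose ability exceeds an item's difficulty answers it
# correctly with probability above 0.5
print(rasch_prob(1.0, 0.0))
```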

Example: Running the Rasch Baseline

from torch_measure.models import RaschModel
from torch_measure.datasets import load_competition_data
from torch_measure.evaluation import evaluate_track1

# Load data
data = load_competition_data("track1")

# Fit model
model = RaschModel(n_models=data.n_models, n_items=data.n_items)
model.fit(data.response_matrix, data.observed_mask, epochs=100)

# Predict held-out entries
predictions = model.predict(data.held_out_indices)

# Evaluate
results = evaluate_track1(predictions, data.held_out_labels)
print(f"AUC-ROC: {results['auc_roc']:.4f}")
print(f"Log-Loss: {results['log_loss']:.4f}")
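
To make the Track 1 metrics concrete, here are pure-Python reference versions of log-loss and AUC-ROC (a sketch of the standard definitions; the library's evaluate_track1 may differ in details such as clipping):

```python
import math

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of binary labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc_roc(y_true, y_prob):
    """Probability that a random positive is ranked above a random
    negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```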

Example: Running a Track 2 Baseline

from torch_measure.models import RaschModel
from torch_measure.datasets import load_competition_data
from torch_measure.evaluation import evaluate_track2

# Load data (includes contaminated responses)
data = load_competition_data("track2")

# Fit model (basic -- does not handle contamination)
model = RaschModel(n_models=data.n_models, n_items=data.n_items)
model.fit(data.response_matrix, data.observed_mask, epochs=100)

# Extract ability scores
ability_scores = model.get_ability_scores()

# Evaluate against ground truth ranking
results = evaluate_track2(ability_scores, data.ground_truth_ranking)
print(f"Kendall's tau: {results['kendall_tau']:.4f}")
print(f"Spearman's rho: {results['spearman_rho']:.4f}")
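
Kendall's tau, the Track 2 ranking metric, compares concordant versus discordant pairs of models. A reference implementation of the tie-free tau-a variant (the library may use a tie-corrected form):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Assumes no tied values."""
    assert len(x) == len(y)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A perfect ranking scores 1.0, a fully reversed one −1.0, so a single swapped pair among four models already costs about 0.33.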

Step 4: Format & Submit

Prepare your predictions in CSV format and submit via Codabench.

Track 1: Response Prediction

# Generate submission CSV
import pandas as pd

# Held-out model/item IDs come from data.held_out_indices (see Step 3)
submission = pd.DataFrame({
    'model_id': held_out_model_ids,
    'item_id': held_out_item_ids,
    'predicted_probability': predictions
})
submission.to_csv('track1_submission.csv', index=False)

Track 2: Robust Scoring

# Generate submission CSV
import pandas as pd

# One ability score per model, from model.get_ability_scores() (see Step 3)
submission = pd.DataFrame({
    'model_id': model_ids,
    'ability_score': ability_scores
})
submission.to_csv('track2_submission.csv', index=False)

Submission limits: 2 submissions per day during the competition period, 2 per week post-competition. Make sure your submission CSV is properly formatted before uploading.
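
With only two submissions per day, a quick local check can save a wasted upload. This validator is our own stdlib sketch (`validate_submission` is not a torch_measure function); the expected column names follow the examples above:

```python
import csv
import io

REQUIRED_COLUMNS = {
    "track1": ["model_id", "item_id", "predicted_probability"],
    "track2": ["model_id", "ability_score"],
}

def validate_submission(csv_text, track):
    """Check the header and, for Track 1, that probabilities lie in
    [0, 1]. Returns a list of problems; an empty list means OK."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = REQUIRED_COLUMNS[track]
    if reader.fieldnames != expected:
        problems.append(f"header {reader.fieldnames} != expected {expected}")
        return problems
    for line_no, row in enumerate(reader, start=2):
        if track == "track1":
            try:
                p = float(row["predicted_probability"])
            except ValueError:
                problems.append(f"line {line_no}: non-numeric probability")
                continue
            if not 0.0 <= p <= 1.0:
                problems.append(f"line {line_no}: probability {p} outside [0, 1]")
    return problems
```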

Submission Tracks

Default: CSV Upload

Upload your predicted probabilities (Track 1) or ability scores (Track 2) as a CSV file. The simplest way to participate.

Advanced: Code Submission

Submit a Docker container with your full pipeline. Two-stage air-gapped evaluation ensures reproducibility and fairness.

LLM Featurizer

Upload your model to HuggingFace and submit a config with a pinned commit hash. No trust_remote_code; safetensors only.

Tips for Success

  • Start simple: Run the baselines first to understand the data format and establish benchmarks.
  • Leverage metadata: Model and item metadata (size, dates, categories, embeddings) can substantially improve predictions beyond pure matrix completion.
  • Cross-validate carefully: The held-out pattern may differ from random missing entries. Consider structured splits.
  • For Track 2: Simple IRT without contamination handling gives a lower bound. The gap between naive and contamination-aware methods is where points are earned.
  • Ensembles work: Combining multiple approaches (IRT + matrix factorization + neural models) often outperforms any single method.
  • CPU is fine: All baselines and most competitive solutions can train on CPU in minutes to hours.
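
The ensemble tip above can be as simple as a (weighted) average of each method's predicted probabilities for the held-out entries. A minimal sketch (the function name is illustrative):

```python
def ensemble_mean(prediction_lists, weights=None):
    """Weighted average of predicted probabilities from several methods.
    prediction_lists: one list of per-entry probabilities per method."""
    n_methods = len(prediction_lists)
    if weights is None:
        weights = [1.0 / n_methods] * n_methods  # uniform by default
    n_entries = len(prediction_lists[0])
    return [
        sum(w * preds[j] for w, preds in zip(weights, prediction_lists))
        for j in range(n_entries)
    ]

# Blending an overconfident and a more conservative predictor
blended = ensemble_mean([[0.9, 0.1], [0.7, 0.3]])
print(blended)  # approximately [0.8, 0.2]
```

Weights can be tuned on a validation split; uniform averaging is a strong default when the component methods are comparably accurate.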