From installation to your first submission in minutes
torch_measure is an open-source PyTorch toolkit for measurement models
(Item Response Theory, factor analysis, and more) applied to AI evaluation.
It provides datasets, baselines, and utilities for both competition tracks.
```bash
# Install from PyPI
pip install torch_measure

# Or install from source for the latest features
git clone https://github.com/YOUR_REPO/torch_measure.git
cd torch_measure
pip install -e .
```
Requirements: Python 3.8+, PyTorch 1.12+. All tasks are solvable on CPU. A single GPU will speed up training but is not required.
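A quick environment sanity check (standard library only; the `torch` import is guarded, so this runs even before PyTorch is installed):

```python
import sys

# Verify the Python requirement (3.8+)
print("Python OK:", sys.version_info >= (3, 8))

# PyTorch is required for torch_measure; a GPU is optional
try:
    import torch
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed -- run: pip install torch")
```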
Datasets are available through `torch_measure.datasets` and are also hosted on HuggingFace for direct download; see the torch_measure documentation for download links.
```python
from torch_measure.datasets import load_competition_data

# Load Track 1 data (response prediction)
track1_data = load_competition_data("track1")
print(f"Response matrix shape: {track1_data.response_matrix.shape}")
print(f"Observed entries: {track1_data.observed_mask.sum()}")
print(f"Held-out entries: {track1_data.held_out_mask.sum()}")

# Load Track 2 data (robust scoring). Unlike Track 1, the Track 2
# responses include deliberately contaminated data.
track2_data = load_competition_data("track2")
print(f"Models: {track2_data.n_models}")
print(f"Items: {track2_data.n_items}")
```
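To build intuition for the response-matrix layout, here is a toy, dependency-free sketch: a small binary matrix with `None` marking unobserved cells, scored by the naive per-model mean accuracy that IRT baselines improve upon. The matrix values are illustrative, not competition data.

```python
# Toy 3-model x 4-item binary response matrix; None marks unobserved cells.
responses = [
    [1, 0, 1, None],
    [1, 1, None, 1],
    [0, 0, 1, 0],
]

def mean_accuracy(row):
    """Naive model score: fraction of correct answers among observed items."""
    observed = [r for r in row if r is not None]
    return sum(observed) / len(observed)

scores = [mean_accuracy(row) for row in responses]
print(scores)  # [0.6666666666666666, 1.0, 0.25]
```

This ignores item difficulty entirely, which is exactly what the IRT baselines below account for.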
Several baseline models are included to help you get started and benchmark your approach.
| Model | Description | Track |
|---|---|---|
| Rasch (1PL) | One-parameter logistic IRT model with ability and difficulty parameters | 1 & 2 |
| 2PL | Two-parameter logistic model adding item discrimination | 1 & 2 |
| 3PL | Three-parameter logistic model adding guessing parameter | 1 & 2 |
| Beta-IRT | IRT with beta-distributed ability scores for bounded estimation | 1 & 2 |
| Amortized IRT | Neural amortized inference for scalable IRT parameter estimation | 1 & 2 |
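The logistic family in the table shares one core equation. Under the Rasch (1PL) model, a model with ability θ answers an item of difficulty b correctly with probability σ(θ − b); 2PL multiplies the difference by an item discrimination a. A minimal sketch of the 1PL prediction:

```python
import math

def rasch_prob(theta, b):
    """P(correct) under the Rasch (1PL) model: sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_prob(0.0, 0.0))  # 0.5: ability equals difficulty
print(rasch_prob(2.0, 0.0))  # ~0.88: strong model, average item
print(rasch_prob(0.0, 2.0))  # ~0.12: average model, hard item
```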
```python
from torch_measure.models import RaschModel
from torch_measure.datasets import load_competition_data
from torch_measure.evaluation import evaluate_track1

# Load data
data = load_competition_data("track1")

# Fit model
model = RaschModel(n_models=data.n_models, n_items=data.n_items)
model.fit(data.response_matrix, data.observed_mask, epochs=100)

# Predict held-out entries
predictions = model.predict(data.held_out_indices)

# Evaluate
results = evaluate_track1(predictions, data.held_out_labels)
print(f"AUC-ROC: {results['auc_roc']:.4f}")
print(f"Log-Loss: {results['log_loss']:.4f}")
```
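Of the two Track 1 metrics, log-loss is the one that rewards calibrated probabilities, not just correct ranking. A self-contained reference implementation (standard library only; `evaluate_track1` presumably computes something equivalent, but this sketch makes the definition concrete):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(round(log_loss([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # 0.1446
```

Note the clipping: a confident wrong prediction of exactly 0.0 or 1.0 would otherwise yield infinite loss, so avoid submitting hard 0/1 probabilities.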
```python
from torch_measure.models import RaschModel
from torch_measure.datasets import load_competition_data
from torch_measure.evaluation import evaluate_track2

# Load data (includes contaminated responses)
data = load_competition_data("track2")

# Fit model (basic -- does not handle contamination)
model = RaschModel(n_models=data.n_models, n_items=data.n_items)
model.fit(data.response_matrix, data.observed_mask, epochs=100)

# Extract ability scores
ability_scores = model.get_ability_scores()

# Evaluate against ground truth ranking
results = evaluate_track2(ability_scores, data.ground_truth_ranking)
print(f"Kendall's tau: {results['kendall_tau']:.4f}")
print(f"Spearman's rho: {results['spearman_rho']:.4f}")
```
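Kendall's tau measures how well your ability scores order the models: it counts pairs ranked in the same direction as the ground truth minus pairs ranked in the opposite direction, normalized by the number of pairs. A minimal tau-a sketch (no tie correction; `evaluate_track2`'s exact variant may differ):

```python
from itertools import combinations

def kendall_tau(scores, ground_truth):
    """Kendall's tau-a: (concordant - discordant) pairs / total pairs."""
    n = len(scores)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores[i] - scores[j]) * (ground_truth[i] - ground_truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([3.0, 1.0, 2.0], [3.0, 1.0, 2.0]))  # 1.0: perfect agreement
print(kendall_tau([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -1.0: fully reversed
```

Because tau depends only on pairwise order, the absolute scale of your ability scores is irrelevant; only the induced ranking matters.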
Prepare your predictions in CSV format and submit via Codabench.
```python
# Generate the Track 1 submission CSV. held_out_model_ids,
# held_out_item_ids, and predictions come from your prediction
# pipeline (see the Track 1 baseline example above).
import pandas as pd

submission = pd.DataFrame({
    'model_id': held_out_model_ids,
    'item_id': held_out_item_ids,
    'predicted_probability': predictions
})
submission.to_csv('track1_submission.csv', index=False)
```
```python
# Generate the Track 2 submission CSV. model_ids and ability_scores
# come from your fitted model (see the Track 2 baseline example above).
import pandas as pd

submission = pd.DataFrame({
    'model_id': model_ids,
    'ability_score': ability_scores
})
submission.to_csv('track2_submission.csv', index=False)
```
Submission limits: 2 submissions per day during the competition period, 2 per week post-competition. Make sure your submission CSV is properly formatted before uploading.
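With a daily submission limit, a local format check before uploading is worth the few lines. A sketch for Track 1, assuming the column names from the snippet above (`validate_track1_csv` is a hypothetical helper, not part of torch_measure):

```python
import csv

def validate_track1_csv(path):
    """Check required columns and that every probability lies in [0, 1]."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        required = {"model_id", "item_id", "predicted_probability"}
        assert required <= set(reader.fieldnames), f"missing columns: {required - set(reader.fieldnames)}"
        for row in reader:
            p = float(row["predicted_probability"])
            assert 0.0 <= p <= 1.0, f"probability out of range: {p}"
    return True

# Example: write and validate a tiny submission
with open("track1_submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model_id", "item_id", "predicted_probability"])
    writer.writerow(["m0", "i0", "0.73"])

print(validate_track1_csv("track1_submission.csv"))  # True
```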
Three submission methods are supported:

- **CSV upload:** Upload your predicted probabilities (Track 1) or ability scores (Track 2) as a CSV file. The simplest way to participate.
- **Docker container:** Submit a Docker container with your full pipeline. Two-stage air-gapped evaluation ensures reproducibility and fairness.
- **HuggingFace model:** Upload your model to HuggingFace and submit a config with a pinned commit hash. No `trust_remote_code`; safetensors only.