measurement-db

A curated data bank of AI evaluation results, standardized for measurement.

146 benchmarks curated

1M+ unique evaluation items

248M+ model–item responses

measurement-db standardizes evaluation results from 146 benchmarks into (model × item) response matrices: the raw observations from which item response theory and other psychometric models are estimated. It is the data layer beneath torch_measure.
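
To make the matrix idea concrete, here is a minimal sketch of a binary (model × item) response matrix, together with the Rasch (one-parameter logistic) model, one of the simplest item response theory models. Nothing here is taken from measurement-db or torch_measure: the model names, the four items, and every value are assumptions made up for illustration.

```python
import math

# Hypothetical binary response matrix: rows are models, columns are items.
# 1 = correct, 0 = incorrect. Names and values are illustrative only.
models = ["model_a", "model_b", "model_c"]
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]


def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch (1PL) probability of a correct response.

    The probability depends only on the gap between the model's ability
    and the item's difficulty, passed through a logistic function.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def row_means(matrix):
    """Mean of each row: the fraction of items each model answered correctly."""
    return [sum(row) / len(row) for row in matrix]


def column_means(matrix):
    """Mean of each column: the fraction of models that answered each item correctly."""
    n_rows = len(matrix)
    return [sum(col) / n_rows for col in zip(*matrix)]


# Simple descriptive summaries that an IRT fit would refine.
model_accuracy = dict(zip(models, row_means(responses)))
item_pass_rate = column_means(responses)
```

Row and column means are only crude summaries. An IRT fit would instead estimate one ability per model and one difficulty per item so that `rasch_probability` matches the observed matrix as closely as possible.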

Coverage spans reasoning, coding, agentic, multimodal, safety, and human-preference benchmarks. Matrices come in binary (correct / incorrect) and continuous (scores, rewards) form. Each benchmark ships a reproducible build script, and the bank is moving to a long-form, registry-backed schema so every response carries its model, item, benchmark, and test condition.
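
As a rough illustration of how long-form records relate to response matrices, the sketch below pivots a handful of records into a (model × item) matrix for one benchmark and one test condition. The `Response` class, its field names, the `to_matrix` helper, and the sample records are all hypothetical. They are not the actual measurement-db schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Response:
    """One long-form record (hypothetical field names)."""
    model: str
    item: str
    benchmark: str
    condition: str
    score: float  # 0.0 / 1.0 for binary results, any float for continuous ones


# Made-up records for illustration only.
records = [
    Response("model_a", "q1", "bench_x", "zero_shot", 1.0),
    Response("model_a", "q2", "bench_x", "zero_shot", 0.0),
    Response("model_b", "q1", "bench_x", "zero_shot", 1.0),
    Response("model_b", "q2", "bench_x", "few_shot", 1.0),
    Response("model_a", "q1", "bench_y", "zero_shot", 0.37),
]


def to_matrix(
    rows: List[Response], benchmark: str, condition: str
) -> Tuple[List[str], List[str], List[List[Optional[float]]]]:
    """Pivot long-form records into a (model x item) matrix.

    Only records matching the given benchmark and condition are used.
    Cells with no matching record are left as None.
    """
    selected = [
        r for r in rows if r.benchmark == benchmark and r.condition == condition
    ]
    model_names = sorted({r.model for r in selected})
    item_names = sorted({r.item for r in selected})
    row_index = {name: i for i, name in enumerate(model_names)}
    col_index = {name: j for j, name in enumerate(item_names)}

    matrix: List[List[Optional[float]]] = [
        [None] * len(item_names) for _ in model_names
    ]
    for r in selected:
        matrix[row_index[r.model]][col_index[r.item]] = r.score
    return model_names, item_names, matrix


model_names, item_names, matrix = to_matrix(records, "bench_x", "zero_shot")
```

Because each record carries its own benchmark and test condition, the same long-form data can yield a different matrix per condition. Here `model_b` has no `zero_shot` record for `q2`, so that cell stays `None` rather than being silently treated as incorrect.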