146
Benchmarks curated
Software · Data
A curated data bank of AI evaluation results, standardized for measurement.
146
Benchmarks curated
1M+
Unique evaluation items
248M+
Model–item responses
measurement-db standardizes evaluation results from 146 benchmarks into (model × item) response matrices — the raw observations that item response theory and other psychometric models are estimated from. It is the data layer beneath torch_measure.
Coverage spans reasoning, coding, agentic, multimodal, safety, and human-preference benchmarks. Matrices come in binary (correct / incorrect) and continuous (scores, rewards) form. Each benchmark ships a reproducible build script, and the bank is moving to a long-form, registry-backed schema so every response carries its model, item, benchmark, and test condition.