Software & Data

Open-source tools and curated data for rigorous AI measurement — a PyTorch library, a standardized evaluation data bank, and interactive apps for probing benchmarks and the evaluation ecosystem.

The measurement stack

Tools and data for careful measurement.

From estimation to inspection — each piece stands on its own and reinforces the others.

Library

torch_measure

A PyTorch library for measurement science. Includes IRT models (Rasch, 2PL, 3PL), computerized adaptive testing, psychometric metrics, and GPU-accelerated estimation.
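
For readers new to the IRT models named above, here is a minimal sketch of the standard three-parameter logistic (3PL) item response function in plain Python. It does not use torch_measure, whose API is not shown on this page; the function name `irt_3pl` and its parameter names are illustrative assumptions. The 2PL model is the special case with guessing `c = 0`, and Rasch additionally fixes discrimination `a = 1`.

```python
import math

def irt_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: ability of the test taker (here, e.g., an AI model)
    a: item discrimination, b: item difficulty, c: guessing floor
    Illustrative helper only, not part of torch_measure.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Rasch: a = 1, c = 0. Ability equal to difficulty gives an even chance.
p_rasch = irt_3pl(theta=0.5, b=0.5)

# 2PL: free discrimination, no guessing.
p_2pl = irt_3pl(theta=1.0, a=2.0, b=0.0)

# 3PL: a four-option multiple-choice item with a 0.25 guessing floor.
p_3pl = irt_3pl(theta=0.0, a=2.0, b=0.0, c=0.25)
```

Estimation runs the other way: given observed right/wrong responses, fit `theta`, `a`, `b`, and `c` so that these probabilities match the data.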

Data

measurement-db

A curated data bank of AI evaluation results, standardized for measurement.

Validity Analyzer

Benchmark Caliper

An interactive validity analyzer for benchmarks — inspect item behavior, reliability, and what a score actually measures, right in the browser.

Ecosystem Explorer

AI Evarium

An interactive explorer for the AI evaluation ecosystem — browse benchmarks, models, and results across the evaluation landscape.