3  Learning

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Derive the log-likelihood function for the Rasch model and explain the role of person and item parameters.
  2. Implement maximum likelihood estimation (MLE) for IRT models using gradient descent and L-BFGS optimization.
  3. Explain the identifiability problem in IRT and describe standard solutions (sum-to-zero, fixed anchor).
  4. Distinguish between joint MLE, conditional MLE, and marginal MLE, and articulate when each is appropriate.
  5. Implement the Expectation-Maximization (EM) algorithm for Rasch model estimation and explain the E-step and M-step.
  6. Describe Bayesian inference for IRT models and specify appropriate priors for ability and item parameters.
  7. Implement MAP estimation and MCMC sampling for IRT models.
  8. Explain regularization in IRT as a Bayesian prior and apply cross-validation for hyperparameter selection.
  9. Apply MLE and Bayesian methods to real AI benchmark data and compare their efficiency.
  10. Evaluate the generalization of learned factor models under various masking schemes (entry-wise, row holdout, column holdout).

This chapter can be covered in 3 lectures (75-90 minutes each):

Lecture 1: Foundations of Estimation

  • Why learning matters for AI measurement (15 min)
  • Likelihood and log-likelihood for Rasch model (20 min)
  • Gradient derivation and interpretation (20 min)
  • Hands-on: MLE with synthetic data (20 min)

Lecture 2: Advanced Estimation Methods

  • Identifiability and conditional vs marginal MLE (20 min)
  • EM algorithm for IRT (30 min)
  • Hands-on: EM implementation (25 min)

Lecture 3: Bayesian Approaches and Generalization

  • Prior specification for IRT (15 min)
  • MAP estimation and MCMC (30 min)
  • Regularization and cross-validation (15 min)
  • Generalization experiments and masking schemes (15 min)

Notation

This chapter introduces estimation notation: \(\ell(\theta, \beta)\) (log-likelihood), \(\hat{\theta}_{\text{MLE}}\) and \(\hat{\theta}_{\text{MAP}}\) (point estimates), \(\pi(\theta)\) (priors), and \(\eta\) (learning rate). See ?sec-notation for the complete notation reference.

Video Overview

A visual tour of the key concepts in this chapter, from maximum likelihood estimation and the EM algorithm to Bayesian inference and generalization.

3.1 Why Learning Matters for AI Measurement

Chapter 1 introduced the measurement models (Rasch, 2PL, factor models) that describe how latent abilities generate observed responses. But knowing the form of a model is not enough. To actually use these models for AI evaluation, we must estimate their parameters from data.

The Central Learning Problem in AI Measurement

Given a response matrix \(Y \in \{0,1\}^{N \times M}\) where \(Y_{ij} = 1\) indicates model \(i\) answered question \(j\) correctly:

\[ \text{Find } \hat{\theta}, \hat{\beta} = \arg\max_{\theta, \beta} P(Y \mid \theta, \beta) \]

This optimization problem underlies all psychometric estimation and forms the foundation for trustworthy AI evaluation.

Parameter estimation serves several critical purposes in AI measurement:

  1. Fair comparison: Calibrated item difficulties allow us to compare models tested on different question subsets. If we know that question A is harder than question B, we can appropriately weight their contributions to the final score.

  2. Uncertainty quantification: Estimation procedures provide not just point estimates but standard errors, telling us how confident we should be in our measurements.

  3. Adaptive testing: Once we have calibrated item parameters, we can select the most informative questions for each model, dramatically reducing evaluation costs (see the next chapter for a full treatment of Computerized Adaptive Testing).

  4. Prediction: With learned parameters, we can predict how a model will perform on questions it has never seen, enabling efficient evaluation of new benchmarks.

This chapter focuses on passive learning: given a fixed dataset, estimate all parameters simultaneously. This includes maximum likelihood estimation (MLE), expectation-maximization (EM), and Bayesian inference. We also introduce generalization experiments that evaluate how well learned models transfer to unseen data. The design of which items to include in a benchmark, including active item selection, is the subject of the next chapter.

3.2 Maximum Likelihood Estimation

Maximum likelihood estimation is the foundation of parameter estimation in IRT. The principle is simple: find the parameter values that make the observed data most probable.

3.2.1 The Likelihood Function

Recall from Chapter 1 that the Rasch model specifies the probability of a correct response as:

\[ P(Y_{ij} = 1 \mid \theta_i, \beta_j) = \sigma(\theta_i - \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}} \tag{3.1}\]

where \(\theta_i\) is the ability of model \(i\) and \(\beta_j\) is the difficulty of item \(j\).

Under the assumption of local independenceβ€”that responses are conditionally independent given the latent parametersβ€”the likelihood of the entire response matrix is:

\[ L(\theta, \beta \mid Y) = \prod_{i=1}^N \prod_{j=1}^M P(Y_{ij} \mid \theta_i, \beta_j)^{Y_{ij}} [1 - P(Y_{ij} \mid \theta_i, \beta_j)]^{1-Y_{ij}} \tag{3.2}\]

Taking the logarithm (for computational stability and mathematical convenience):

\[ \ell(\theta, \beta) = \sum_{i=1}^N \sum_{j=1}^M \left[ Y_{ij}(\theta_i - \beta_j) - \log(1 + e^{\theta_i - \beta_j}) \right] \tag{3.3}\]

This is the objective function we want to maximize.
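As a quick sanity check, Equation 3.3 can be evaluated in a numerically stable way using the identity \(\log(1 + e^{\eta}) = \mathrm{softplus}(\eta)\). A minimal sketch (the function name and tensor layout are illustrative):

```python
import torch
import torch.nn.functional as F

def rasch_loglik(Y, theta, beta):
    """Rasch log-likelihood of Eq. 3.3, computed stably via softplus."""
    eta = theta[:, None] - beta[None, :]     # N x M matrix of logits theta_i - beta_j
    return (Y * eta - F.softplus(eta)).sum()
```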

3.2.2 Gradient Derivation

To optimize the log-likelihood, we need its gradients. Taking partial derivatives:

\[ \frac{\partial \ell}{\partial \theta_i} = \sum_{j=1}^M \left[ Y_{ij} - \sigma(\theta_i - \beta_j) \right] \tag{3.4}\]

\[ \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^N \left[ \sigma(\theta_i - \beta_j) - Y_{ij} \right] \tag{3.5}\]

Intuitive Interpretation of the Gradient

The gradient \(\frac{\partial \ell}{\partial \theta_i} = \sum_j [Y_{ij} - P_{ij}]\) has a beautiful interpretation:

  • \(Y_{ij}\) is the observed response (0 or 1)
  • \(P_{ij} = \sigma(\theta_i - \beta_j)\) is the predicted probability

The gradient is simply the sum of residuals: observed minus predicted. If model \(i\) performs better than expected (more correct answers than predicted), the residuals are positive, and we increase \(\theta_i\). If it performs worse than expected, we decrease \(\theta_i\). This is the essence of gradient ascent.

3.2.3 Implementation with Gradient Descent

Let us implement MLE via gradient descent on synthetic data. First, we generate a response matrix from known parameters:
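A minimal sketch of that generation step (the sizes and the seed are illustrative choices):

```python
import torch

torch.manual_seed(0)
N, M = 50, 100                       # models x items

# True parameters: abilities and difficulties drawn from standard normals
theta_true = torch.randn(N)
beta_true = torch.randn(M)
beta_true -= beta_true.mean()        # center difficulties (identifiability)

# Response probabilities under the Rasch model, then simulated 0/1 responses
P = torch.sigmoid(theta_true[:, None] - beta_true[None, :])
Y = torch.bernoulli(P)
```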

Now we implement MLE via gradient descent:
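A sketch of what that implementation might look like, following the gradients in Equations 3.4 and 3.5; the step size, iteration count, and the re-centering used to impose the sum-to-zero constraint are implementation choices:

```python
import torch

def fit_rasch_gd(Y, n_iters=500, lr=0.05):
    """Joint MLE for the Rasch model via gradient ascent on Eq. 3.3."""
    N, M = Y.shape
    theta = torch.zeros(N)
    beta = torch.zeros(M)
    for _ in range(n_iters):
        P = torch.sigmoid(theta[:, None] - beta[None, :])
        resid = Y - P                           # observed minus predicted
        theta = theta + lr * resid.sum(dim=1)   # Eq. 3.4 (ascent step)
        beta = beta - lr * resid.sum(dim=0)     # Eq. 3.5 (note the sign flip)
        c = beta.mean()                         # re-center: shifting both theta and
        theta, beta = theta - c, beta - c       # beta leaves the likelihood unchanged
    return theta, beta
```

The re-centering line anchors the scale each step without changing the fit, since the likelihood depends only on the differences \(\theta_i - \beta_j\).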

3.2.4 The Identifiability Problem

The Identifiability Problem

The Rasch model has a fundamental identifiability issue: if we add a constant \(c\) to all abilities and all difficulties, the likelihood is unchanged:

\[ P(Y_{ij}=1 \mid \theta_i + c, \beta_j + c) = \sigma((\theta_i + c) - (\beta_j + c)) = \sigma(\theta_i - \beta_j) \]

The parameters are only identified up to an additive constant. This means infinitely many parameter values produce the same likelihood.

Common Solutions:

  1. Sum-to-zero constraint: Set \(\sum_i \theta_i = 0\) or \(\sum_j \beta_j = 0\)
  2. Fixed anchor: Set one parameter (e.g., \(\beta_1 = 0\)) as reference
  3. Prior constraint: Use Bayesian priors centered at zero

For AI benchmarks, we typically use sum-to-zero: a model with \(\theta = 0\) has β€œaverage” ability relative to the calibration sample.

Without addressing identifiability, gradient descent may drift indefinitely. The re-centering step in our implementation ensures parameters remain anchored.

3.2.5 L-BFGS Optimization

While gradient descent is intuitive, quasi-Newton methods like L-BFGS converge much faster by approximating second-order information:
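A sketch using PyTorch's `torch.optim.LBFGS` (hyperparameters are illustrative; the closure returns the negative log-likelihood because LBFGS minimizes):

```python
import torch
import torch.nn.functional as F
from torch.optim import LBFGS

def fit_rasch_lbfgs(Y, max_iter=100):
    """Joint MLE for the Rasch model via the L-BFGS quasi-Newton method."""
    N, M = Y.shape
    theta = torch.zeros(N, requires_grad=True)
    beta = torch.zeros(M, requires_grad=True)
    opt = LBFGS([theta, beta], lr=1.0, max_iter=max_iter,
                line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        eta = theta[:, None] - beta[None, :]
        nll = -(Y * eta - F.softplus(eta)).sum()  # negative of Eq. 3.3
        nll.backward()
        return nll

    opt.step(closure)
    with torch.no_grad():            # anchor the scale: sum-to-zero difficulties
        c = beta.mean()
        beta -= c
        theta -= c                   # same shift keeps theta - beta unchanged
    return theta.detach(), beta.detach()
```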

3.3 Joint, Conditional, and Marginal MLE

The MLE approach we have discussed so far is called joint maximum likelihood estimation (JMLE). It treats both person parameters \(\theta\) and item parameters \(\beta\) as fixed unknowns to be estimated. However, JMLE has theoretical limitations that motivate alternative approaches.

3.3.1 Joint MLE (JMLE)

JMLE simultaneously estimates all parameters by maximizing Equation 3.3. While intuitive, JMLE suffers from the incidental parameter problem: as the number of items \(M\) remains fixed and the number of persons \(N\) grows, the item parameter estimates \(\hat{\beta}\) are inconsistent; they do not converge to the true values.

This happens because each person parameter \(\theta_i\) is estimated from only \(M\) observations (their responses to \(M\) items), and these "incidental" person parameters introduce bias into the item estimates.

For AI benchmarks with many items (typically \(M > 100\)), this bias is small in practice. But for smaller tests, JMLE can be problematic.

3.3.2 Conditional MLE (CMLE)

Georg Rasch discovered an elegant solution to the incidental parameter problem. For the Rasch model specifically, the sum score \(S_i = \sum_j Y_{ij}\) is a sufficient statistic for \(\theta_i\). This means all information about \(\theta_i\) in the data \(Y_i\) is captured by \(S_i\).

By conditioning on the sufficient statistics, we can eliminate the person parameters entirely:

\[ P(Y_i \mid S_i, \beta) = \frac{\exp(-\sum_j Y_{ij}\beta_j)}{\gamma_{S_i}(\beta)} \tag{3.6}\]

where \(\gamma_r(\beta) = \sum_{A: |A|=r} \exp(-\sum_{j \in A} \beta_j)\) is the elementary symmetric function of order \(r\), summing over all subsets \(A\) of items of size \(r\).

The conditional likelihood depends only on \(\beta\), so we can estimate item parameters without any person parameters. This produces consistent estimates of \(\beta\) regardless of how \(N\) grows.
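The \(\gamma_r\) functions look expensive (a sum over \(\binom{M}{r}\) subsets), but they satisfy a simple recurrence: incorporating item \(j\) updates \(\gamma_r \leftarrow \gamma_r + e^{-\beta_j}\gamma_{r-1}\). A sketch of the computation (function name is illustrative):

```python
import torch

def elementary_symmetric(beta):
    """gamma_r(beta) for r = 0..M, via the one-item-at-a-time recurrence."""
    eps = torch.exp(-beta)                    # epsilon_j = exp(-beta_j)
    gamma = torch.zeros(len(beta) + 1, dtype=beta.dtype)
    gamma[0] = 1.0
    for e in eps:
        # Subsets of size r either exclude the new item (gamma[r])
        # or include it (e * gamma[r-1])
        gamma = torch.cat([gamma[:1], gamma[1:] + e * gamma[:-1]])
    return gamma
```

For \(\beta = 0\) the \(\gamma_r\) reduce to binomial coefficients, which makes a handy correctness check.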

Rasch’s Insight

The sufficiency of sum scores is unique to the Rasch model. For the 2PL or 3PL models, sum scores are not sufficient, and CMLE cannot be applied. This mathematical property is one reason the Rasch model holds special status in measurement theory.

3.3.3 Marginal MLE (MMLE)

An alternative approach is to treat person parameters as random variables from a population distribution:

\[ \theta_i \sim \mathcal{N}(\mu_\theta, \sigma^2_\theta) \]

The marginal likelihood integrates out the person parameters:

\[ L(\beta) = \prod_{i=1}^N \int P(Y_i \mid \theta, \beta) \, p(\theta) \, d\theta \tag{3.7}\]

This approach:

  • Treats item parameters as fixed and person parameters as random
  • Produces consistent estimates of \(\beta\) as \(N \to \infty\)
  • Naturally extends to more complex IRT models (2PL, 3PL)
  • Forms the basis for the EM algorithm (next section)
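The integral in Equation 3.7 has no closed form, but with a standard normal prior it is well approximated by Gauss-Hermite quadrature. A sketch (the number of quadrature points is an illustrative choice):

```python
import numpy as np
import torch
import torch.nn.functional as F

def marginal_loglik(Y, beta, n_quad=21):
    """Marginal Rasch log-likelihood (Eq. 3.7) under a N(0,1) ability prior."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = torch.tensor(np.sqrt(2.0) * x).float()      # quadrature abilities
    logw = torch.tensor(w / np.sqrt(np.pi)).float().log()
    eta = nodes[:, None] - beta[None, :]                # K x M logits
    # logp[i, k] = log P(Y_i | theta = nodes[k], beta)
    logp = Y @ F.logsigmoid(eta).T + (1 - Y) @ F.logsigmoid(-eta).T
    return torch.logsumexp(logp + logw, dim=1).sum()    # integrate out theta
```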
Comparison of MLE Approaches

| Method | Person Parameters  | Item Parameters    | Consistency              | Applicability |
|--------|--------------------|--------------------|--------------------------|---------------|
| JMLE   | Estimated directly | Estimated directly | Inconsistent for fixed M | Any IRT model |
| CMLE   | Conditioned out    | Estimated directly | Consistent               | Rasch only    |
| MMLE   | Integrated out     | Estimated directly | Consistent               | Any IRT model |

For AI benchmarks with many questions (\(M > 100\)), JMLE works well in practice. For smaller tests or when statistical properties are important, CMLE or MMLE is preferred.

3.4 The EM Algorithm

The Expectation-Maximization (EM) algorithm is a general method for maximum likelihood estimation with latent variables. In IRT, the latent variables are the person abilities \(\theta\).

3.4.1 The EM Framework

The EM algorithm iterates between two steps:

E-step (Expectation): Compute the expected value of the complete-data log-likelihood, given the observed data and current parameter estimates:

\[ Q(\beta \mid \beta^{(t)}) = \mathbb{E}_{\theta \mid Y, \beta^{(t)}} \left[ \log P(Y, \theta \mid \beta) \right] \]

M-step (Maximization): Find the parameter values that maximize the expected log-likelihood:

\[ \beta^{(t+1)} = \arg\max_\beta Q(\beta \mid \beta^{(t)}) \]

The EM algorithm guarantees that the marginal likelihood increases (or stays the same) at each iteration, converging to a local maximum.

3.4.2 EM for the Rasch Model

For the Rasch model with a standard normal prior on abilities, the EM algorithm takes a specific form:

E-step: For each person \(i\), compute the posterior distribution of \(\theta_i\) given their responses \(Y_i\) and current item parameters \(\beta^{(t)}\):

\[ p(\theta_i \mid Y_i, \beta^{(t)}) \propto p(Y_i \mid \theta_i, \beta^{(t)}) \cdot p(\theta_i) \]

This posterior is not available in closed form, so we use numerical integration (Gauss-Hermite quadrature).

M-step: Update each item parameter by solving:

\[ \sum_{i=1}^N \mathbb{E}_{\theta_i}[\sigma(\theta_i - \beta_j)] = \sum_{i=1}^N Y_{ij} \]

The left side is the expected number of correct responses to item \(j\); the right side is the observed number. We equate these.
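Putting the two steps together for the Rasch model, with the E-step posterior approximated on a Gauss-Hermite grid and the M-step solved by a few Newton iterations (a sketch; grid size and iteration counts are illustrative):

```python
import numpy as np
import torch
import torch.nn.functional as F

def em_rasch(Y, n_quad=21, n_iters=50):
    """EM estimation of Rasch item difficulties with a N(0,1) ability prior."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = torch.tensor(np.sqrt(2.0) * x).float()
    log_prior = torch.tensor(w / np.sqrt(np.pi)).float().log()
    Y = Y.float()
    beta = torch.zeros(Y.shape[1])
    for _ in range(n_iters):
        # E-step: posterior weight of each quadrature node for each person
        eta = nodes[:, None] - beta[None, :]                        # K x M
        logp = Y @ F.logsigmoid(eta).T + (1 - Y) @ F.logsigmoid(-eta).T
        post = torch.softmax(logp + log_prior, dim=1)               # N x K
        # M-step: match expected to observed correct counts per item
        nk = post.sum(dim=0)      # expected number of persons at each node
        rj = Y.sum(dim=0)         # observed correct responses per item
        for _ in range(5):        # Newton iterations on each beta_j
            Pkj = torch.sigmoid(nodes[:, None] - beta[None, :])
            f = (nk[:, None] * Pkj).sum(dim=0) - rj
            fp = -(nk[:, None] * Pkj * (1 - Pkj)).sum(dim=0)
            beta = beta - f / fp
    return beta
```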

3.4.3 Multidimensional Extension: The Logistic Factor Model

The methods above focused on the Rasch model, which assumes a single latent dimension. For AI benchmarks that measure multiple capabilities, we extend to the Logistic Factor Model:

\[ P(Y_{ij} = 1 \mid U_i, V_j, Z_j) = \sigma(U_i^\top V_j + Z_j) \]

where:

  • \(U_i \in \mathbb{R}^K\) is the \(K\)-dimensional latent ability vector for model \(i\)
  • \(V_j \in \mathbb{R}^K\) is the factor loading vector for item \(j\)
  • \(Z_j \in \mathbb{R}\) is the item intercept (capturing overall difficulty)

When \(K=1\) and \(V_j = 1\) for all \(j\), this reduces to the Rasch model.

3.4.3.1 Implementation

import torch
import torch.nn as nn
from torch.optim import LBFGS
import torch.nn.functional as F

class LogisticFM(nn.Module):
    """Logistic Factor Model for binary response data."""
    def __init__(self, N, M, K):
        super().__init__()
        self.U = nn.Parameter(torch.randn(N, K))  # Model abilities
        self.V = nn.Parameter(torch.randn(M, K))  # Item loadings
        self.Z = nn.Parameter(torch.randn(M, 1))  # Item intercepts

    def forward(self):
        return torch.sigmoid(self.U @ self.V.T + self.Z.T)

Interpretation

  • \(U_i\): latent ability vector of model \(i\) (position in \(K\)-dimensional capability space)
  • \(V_j\): latent property vector of item \(j\) (which capabilities the item measures)
  • \(Z_j\): overall item difficulty (independent of capability dimensions)
  • \(\sigma\): sigmoid function ensuring probabilities in \([0,1]\)

3.4.3.2 Training with LBFGS

We train the model by minimizing binary cross-entropy loss:

# Training setup
N, M = Y.shape
K = 2  # Number of latent dimensions
model = LogisticFM(N, M, K)

opt = LBFGS(
    model.parameters(),
    lr=0.1,
    max_iter=20,
    history_size=10,
    line_search_fn="strong_wolfe"
)

def closure():
    opt.zero_grad()
    probs = model()
    loss = F.binary_cross_entropy(probs[train_mask.bool()], Y[train_mask.bool()].float())
    loss.backward()
    return loss

# Training loop
for iteration in range(20):
    loss = opt.step(closure)

The model learns to decompose the response matrix into latent factors that capture the underlying structure of model capabilities and item characteristics.

3.5 Bayesian Inference

Bayesian inference provides an alternative to maximum likelihood that naturally incorporates prior information and quantifies uncertainty. Instead of finding a single point estimate, we characterize the entire posterior distribution over parameters.

3.5.1 Prior Specification

The first step in Bayesian inference is specifying prior distributions that encode our beliefs before seeing the data:

Standard Priors for IRT

For abilities (persons/models): \[\theta_i \sim \mathcal{N}(0, \sigma^2_\theta), \quad \sigma_\theta = 1 \text{ (standard choice)}\]

For difficulties (items/questions): \[\beta_j \sim \mathcal{N}(0, \sigma^2_\beta), \quad \sigma_\beta = 1\text{-}2 \text{ (depending on expected range)}\]

For discrimination (2PL model): \[a_j \sim \text{LogNormal}(0, 0.5) \text{ or } a_j \sim \text{Gamma}(2, 0.5)\]

These priors are weakly informative: they regularize estimates without dominating the data. They encode the belief that most abilities and difficulties are within a few units of zero, which is appropriate when the scale is defined by convention.
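A quick prior predictive check makes the "weakly informative" claim concrete: sampling \(\theta\) and \(\beta\) from these priors and pushing them through the model shows the range of success probabilities the priors consider plausible (a sketch; the sample size is illustrative):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(10_000)          # theta ~ N(0, 1)
beta = torch.randn(10_000)           # beta  ~ N(0, 1)
p = torch.sigmoid(theta - beta)      # implied P(correct) under the priors

print(f"median P(correct): {p.median().item():.2f}")
print(f"central 90% interval: [{p.quantile(0.05).item():.2f}, "
      f"{p.quantile(0.95).item():.2f}]")
```

The implied probabilities span most of \([0,1]\) without piling up at the extremes, which is what "regularize without dominating the data" means in practice.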

3.5.2 Posterior Computation

Bayes’ theorem gives us the posterior distribution:

\[ p(\theta, \beta \mid Y) \propto p(Y \mid \theta, \beta) \cdot p(\theta) \cdot p(\beta) \tag{3.8}\]

The posterior combines the likelihood (data) with the priors (beliefs). Unfortunately, this posterior is not available in closed form, so we need computational methods.

3.5.3 MAP Estimation

The simplest Bayesian approach is maximum a posteriori (MAP) estimation, which finds the mode of the posterior:

\[ \hat{\theta}_{\text{MAP}}, \hat{\beta}_{\text{MAP}} = \arg\max_{\theta, \beta} \left[ \ell(\theta, \beta \mid Y) + \log p(\theta) + \log p(\beta) \right] \tag{3.9}\]

With Gaussian priors, this is equivalent to L2-regularized MLE:

\[ \hat{\theta}_{\text{MAP}}, \hat{\beta}_{\text{MAP}} = \arg\max_{\theta, \beta} \left[ \ell(\theta, \beta) - \frac{1}{2\sigma^2_\theta}\sum_i \theta_i^2 - \frac{1}{2\sigma^2_\beta}\sum_j \beta_j^2 \right] \]
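A sketch of MAP fitting via this penalized objective, reusing L-BFGS; the prior standard deviations are the tuning knobs (function name and defaults are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.optim import LBFGS

def fit_rasch_map(Y, sigma_theta=1.0, sigma_beta=1.0):
    """MAP estimation for the Rasch model via the penalized objective (Eq. 3.9)."""
    N, M = Y.shape
    theta = torch.zeros(N, requires_grad=True)
    beta = torch.zeros(M, requires_grad=True)
    opt = LBFGS([theta, beta], max_iter=100, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        eta = theta[:, None] - beta[None, :]
        nll = -(Y * eta - F.softplus(eta)).sum()
        # Gaussian priors act as L2 penalties (and also identify the scale)
        penalty = (theta ** 2).sum() / (2 * sigma_theta ** 2) \
                + (beta ** 2).sum() / (2 * sigma_beta ** 2)
        loss = nll + penalty
        loss.backward()
        return loss

    opt.step(closure)
    return theta.detach(), beta.detach()
```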

Bayesian Shrinkage

Notice that MAP estimates have smaller variance than MLE estimates. This is shrinkage toward the prior mean (zero).

For extreme scores, where a model answers all questions correctly or incorrectly, MLE gives infinite or very large estimates. MAP regularizes these to finite, sensible values. This is crucial for AI benchmarks where some models may achieve near-perfect scores on easy subsets.

The amount of shrinkage is controlled by the prior variance: smaller \(\sigma^2\) means stronger shrinkage toward zero.

3.5.4 MCMC Sampling

To characterize the full posterior distribution (not just its mode), we use Markov Chain Monte Carlo (MCMC) sampling. The Metropolis-Hastings algorithm is a simple but effective approach:
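A minimal random-walk Metropolis-Hastings sketch (in practice one would update parameters blockwise or use a gradient-based sampler such as NUTS; the step size and chain length here are illustrative):

```python
import torch
import torch.nn.functional as F

def mh_rasch(Y, n_samples=2000, step=0.05, sigma=1.0):
    """Random-walk Metropolis-Hastings over the joint Rasch posterior (Eq. 3.8)."""
    N, M = Y.shape
    theta, beta = torch.zeros(N), torch.zeros(M)

    def log_post(th, be):
        eta = th[:, None] - be[None, :]
        ll = (Y * eta - F.softplus(eta)).sum()
        return ll - (th ** 2).sum() / (2 * sigma ** 2) \
                  - (be ** 2).sum() / (2 * sigma ** 2)

    lp = log_post(theta, beta)
    theta_draws = []
    for _ in range(n_samples):
        th_prop = theta + step * torch.randn(N)   # symmetric Gaussian proposal
        be_prop = beta + step * torch.randn(M)
        lp_prop = log_post(th_prop, be_prop)
        # Accept with probability min(1, posterior ratio)
        if torch.log(torch.rand(1)).item() < (lp_prop - lp).item():
            theta, beta, lp = th_prop, be_prop, lp_prop
        theta_draws.append(theta.clone())
    return torch.stack(theta_draws)               # n_samples x N ability draws
```

Posterior summaries come from the draws after discarding burn-in, e.g. `draws[500:].mean(dim=0)` and `draws[500:].std(dim=0)`.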

The posterior standard deviations quantify our uncertainty about each parameter. Parameters with more information (e.g., items answered by many models, models that answered many questions) have smaller posterior uncertainty.

3.6 Regularization and Model Selection

3.6.1 L2 Regularization as Bayesian Prior

We have seen that MAP estimation with Gaussian priors is equivalent to L2 regularization. The regularization strength \(\lambda\) relates to the prior variance as \(\lambda = 1/\sigma^2\).

The regularized objective is:

\[ \ell_{\text{reg}}(\theta, \beta) = \ell(\theta, \beta) - \frac{\lambda_\theta}{2}\|\theta\|^2 - \frac{\lambda_\beta}{2}\|\beta\|^2 \]

Regularization prevents overfitting, especially when:

  • Some persons have few responses (sparse data)
  • Some items have extreme difficulty (near 0% or 100% pass rates)
  • The model is complex (many parameters relative to data)

3.6.2 Cross-Validation for Hyperparameter Selection

How do we choose the regularization strength? Cross-validation provides a principled answer: we hold out some data, train on the rest, and evaluate prediction performance.
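One way to sketch this for the Rasch model: hold out a random set of entries, fit with each candidate \(\lambda\), and score by held-out log-likelihood (the grid, holdout fraction, and optimizer settings are all illustrative choices):

```python
import torch
import torch.nn.functional as F

def map_fit(Y, mask, lam, n_iters=300, lr=0.1):
    """L2-penalized Rasch fit using only the entries where mask == 1."""
    N, M = Y.shape
    theta, beta = torch.zeros(N), torch.zeros(M)
    for _ in range(n_iters):
        P = torch.sigmoid(theta[:, None] - beta[None, :])
        resid = (Y - P) * mask
        theta = theta + lr * (resid.sum(dim=1) - lam * theta)
        beta = beta + lr * (-resid.sum(dim=0) - lam * beta)
    return theta, beta

def cv_select_lambda(Y, lambdas, holdout=0.2):
    """Choose the regularization strength by held-out log-likelihood."""
    test = torch.bernoulli(torch.full_like(Y, holdout))
    train = 1 - test
    scores = {}
    for lam in lambdas:
        theta, beta = map_fit(Y, train, lam)
        eta = theta[:, None] - beta[None, :]
        ll = ((Y * eta - F.softplus(eta)) * test).sum() / test.sum()
        scores[lam] = ll.item()      # average held-out log-likelihood
    return max(scores, key=scores.get), scores
```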

3.7 Generalization Experiments

The estimation methods developed in this chapter produce learned factor models with latent parameters \((U, V, Z)\). But how well do these models generalize? To evaluate the robustness and transferability of learned factor models, we train and test them under various masking schemes, each representing a different notion of generalization. These masks determine which parts of the response matrix \(Y\) are visible during training and which are held out for evaluation.

3.7.1 Masking Schemes for Evaluation

| Masking Type              | Train Set                        | Test Set                 | Purpose                               |
|---------------------------|----------------------------------|--------------------------|---------------------------------------|
| Entry-wise random         | 80% random entries               | 20% random entries       | Interpolation under missing-at-random |
| Row holdout (random)      | 80% of models, all items         | 20% of models, all items | Generalization to unseen models       |
| Row holdout (shifted)     | Slice of models (small → large)  | Disjoint slice           | Covariate-shift generalization        |
| Column holdout (random)   | All models, 80% of items         | All models, 20% of items | Generalization to unseen items        |
| Column holdout (shifted)  | Subset of benchmarks             | Held-out benchmarks      | Cross-domain transfer                 |
| Row-column block (L-mask) | \(R_{tr} \times C_{tr}\)         | \(R_{te} \times C_{te}\) | Compositional generalization          |
| Temporal split            | Models before cutoff             | Models after cutoff      | Temporal generalization               |
These settings parallel psychometric validation tests where new examinees, items, or contexts probe the invariance of latent constructs.

3.7.2 Implementation of Masking Functions

import torch

def random_mask(data_idtor, pct=0.8):
    """Entry-wise random masking: each observed entry enters the train set with probability pct."""
    train_idtor = torch.bernoulli(data_idtor.float() * pct).int()
    test_idtor = data_idtor.int() - train_idtor
    return train_idtor, test_idtor

def model_mask(data_idtor, pct_models=0.8, exposure_rate=0.3):
    """Row holdout: hold out unseen models."""
    train_row_mask = torch.bernoulli(torch.ones(data_idtor.shape[0]) * pct_models).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    train_idtor[train_row_mask, :] = data_idtor[train_row_mask, :]
    train_idtor[~train_row_mask, :], _ = random_mask(data_idtor[~train_row_mask, :], pct=exposure_rate)
    test_idtor = data_idtor - train_idtor
    return train_idtor, test_idtor

def item_mask(data_idtor, pct_items=0.8, exposure_rate=0.3):
    """Column holdout: hold out unseen items."""
    train_col_mask = torch.bernoulli(torch.ones(data_idtor.shape[1]) * pct_items).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    train_idtor[:, train_col_mask] = data_idtor[:, train_col_mask]
    train_idtor[:, ~train_col_mask], _ = random_mask(data_idtor[:, ~train_col_mask], pct=exposure_rate)
    test_idtor = data_idtor - train_idtor
    return train_idtor, test_idtor

def L_mask(data_idtor, pct_models=0.8, pct_items=0.8):
    """Row-column block (L-mask): compositional generalization."""
    train_row_mask = torch.bernoulli(torch.ones(data_idtor.shape[0]) * pct_models).bool()
    train_col_mask = torch.bernoulli(torch.ones(data_idtor.shape[1]) * pct_items).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    # Chained indexing (x[rows][:, cols] = ...) writes to a copy and is lost;
    # index the training block with a boolean outer product instead.
    block = train_row_mask[:, None] & train_col_mask[None, :]
    train_idtor[block] = data_idtor[block].int()
    test_idtor = data_idtor.int() - train_idtor
    test_idtor[train_row_mask, :] = 0
    test_idtor[:, train_col_mask] = 0
    return train_idtor, test_idtor

3.7.3 Two-Stage Training for Holdout Generalization

To avoid data contamination in row and column holdout experiments, we use a two-stage training procedure:

3.7.3.1 Row Holdout: Estimating Parameters for Unseen Models

When testing generalization to unseen models, we:

  1. Stage 1: Train on known models to learn item parameters \((V, Z)\)
  2. Stage 2: Freeze \((V, Z)\) and estimate ability parameters \(U\) for held-out models using their limited exposed responses

This ensures item parameters are learned without information from test models.

# Stage 1: Train on known models
test_row = test_idtor.max(dim=1).values.bool()  # Identify held-out models
model_stage1 = train_model(Y[~test_row, :], mask=train_idtor[~test_row, :])

# Freeze V, Z from Stage 1
V_frozen = model_stage1.V.detach()
Z_frozen = model_stage1.Z.detach()

# Stage 2: Estimate U for unseen models with frozen item parameters
model_stage2 = train_model(Y[test_row, :], mask=train_idtor[test_row, :],
                           V_fixed=V_frozen, Z_fixed=Z_frozen)

3.7.3.2 Column Holdout: Estimating Parameters for Unseen Items

When testing generalization to unseen items, we:

  1. Stage 1: Train on known items to learn model parameters \(U\)
  2. Stage 2: Freeze \(U\) and estimate item parameters \((V, Z)\) for held-out items

# Stage 1: Train on known items
test_col = test_idtor.max(dim=0).values.bool()  # Identify held-out items
model_stage1 = train_model(Y[:, ~test_col], mask=train_idtor[:, ~test_col])

# Freeze U from Stage 1
U_frozen = model_stage1.U.detach()

# Stage 2: Estimate V, Z for unseen items with frozen model parameters
model_stage2 = train_model(Y[:, test_col], mask=train_idtor[:, test_col],
                           U_fixed=U_frozen)

Why Two-Stage Training?

The two-stage procedure prevents information leakage:

  • Row holdout: Item parameters learned from training models should not contain information about test models
  • Column holdout: Model parameters learned from training items should not contain information about test items

This mirrors the real-world scenario where we want to evaluate new models on pre-calibrated items, or calibrate new items using established models.

3.7.4 Evaluation Across Masking Schemes

For each masking scheme, we compute AUC on the held-out entries:

from torchmetrics import AUROC

masking_schemes = {
    "entry_random": random_mask,
    "row_holdout": model_mask,
    "col_holdout": item_mask,
    "L_mask": L_mask,
}

results = {}
auroc = AUROC(task="binary")

for name, mask_fn in masking_schemes.items():
    train_mask, test_mask = mask_fn(data_idtor)

    # Train model (with two-stage for row/col holdout)
    model = train_with_appropriate_stages(Y, train_mask, test_mask, name)

    # Evaluate on held-out entries
    P_hat = model().detach()
    auc = auroc(P_hat[test_mask.bool()], Y[test_mask.bool()])
    results[name] = auc.item()
    print(f"{name}: AUC = {auc:.3f}")

The factor model typically achieves AUC of 92-97% on random masking across benchmarks, demonstrating strong predictive power. Performance on row and column holdout tests the model's ability to generalize to new models and new items, respectively.

Application: Item Response Scaling Laws

The separability of model ability from item difficulty, the core property of IRT, has a powerful application to scaling laws. Truong et al. (2025) show that by embedding IRT within the scaling law framework, one can factorize scaling law estimation from \(O(M \times N)\) to \(O(M + N)\), where \(M\) is the number of models (or checkpoints) and \(N\) is the number of questions.

Their key finding is that the IRT ability parameter \(\theta\) scales linearly with the logarithm of pre-training compute: \(\theta \approx a \cdot \log(\text{FLOP}) + b\). Combined with calibrated item parameters, this yields per-question scaling predictions: \(\hat{R}_{ij}(x) = \sigma(d_j(\theta_i(x) - z_j))\). Because item parameters transfer across benchmarks that share the same measurement objective, ability estimated on one benchmark can predict performance on another, directly validating the cross-benchmark transfer tested in the masking experiments above.

In a study of 6,612 model checkpoints and 37,682 questions, this approach achieves comparable or superior decision accuracy to traditional scaling laws using only 50 questions per benchmark, a 99.9% reduction in queries. The approach uses Beta-IRT, which models empirical probability responses (token probabilities, pass rates) rather than binary correctness, capturing richer scaling signals.

3.8 Discussion Questions

  1. Identifiability and Interpretation: In AI evaluation, should we anchor the ability scale by fixing one model (e.g., GPT-4 = 0) or by centering all models? What are the implications for interpreting ability scores over time as new models are released?

  2. Bayesian vs Frequentist: When is Bayesian inference preferred over MLE for AI benchmark analysis? Consider scenarios with limited data, extreme scores, or the need for uncertainty quantification.

  3. From Learning to Design: This chapter focuses on estimating parameters from a fixed dataset. The next chapter considers how to design the evaluation: choosing which items to include, how to allocate testing effort, and how to ensure robustness. What aspects of the estimation methods learned here would inform good benchmark design?

  4. Transfer of Item Parameters: If we calibrate item difficulties on one set of models (e.g., 2023 models), can we use these parameters to evaluate 2024 models? What assumptions does this require, and when might they fail?

  5. Multidimensional Extensions: The chapter focused on unidimensional models (single ability). How would the learning procedures change for multidimensional factor models? What additional challenges arise?

3.9 Bibliographic Notes

3.9.1 Maximum Likelihood Estimation

The theory of maximum likelihood for IRT models is developed comprehensively in Lord and Novick (1968) and Birnbaum (1968). The joint MLE approach and its limitations (incidental parameter problem) are discussed in Neyman and Scott (1948). For modern computational approaches, see Baker and Kim (2004).

3.9.2 Conditional and Marginal MLE

Conditional MLE for the Rasch model was developed by Andersen (1970), who proved consistency and derived the elementary symmetric functions needed for computation. Marginal MLE was introduced by Bock and Lieberman (1970) and popularized by Bock and Aitkin (1981) using the EM algorithm.

3.9.3 EM Algorithm

The general EM algorithm was formalized by Dempster, Laird, and Rubin (1977). Its application to IRT is detailed in Bock and Aitkin (1981). For modern treatments, see McLachlan and Krishnan (2007).

3.9.4 Bayesian IRT

Bayesian approaches to IRT were pioneered by Swaminathan and Gifford (1982) and advanced using Gibbs sampling by Albert (1992). Modern references include Fox (2010) and the documentation for Stan (Carpenter et al. 2017).

3.9.5 Beta-IRT and Continuous Responses

Standard IRT models assume binary (correct/incorrect) responses, but language models provide richer signals: token probabilities in pre-training and empirical pass rates in repeated sampling. Truong et al. (2025) introduce Beta-IRT, which replaces the Bernoulli loss with a Beta loss parameterized by the IRT logistic mean \(\sigma(d_j(\theta_i - z_j))\) and a precision parameter \(\phi\). This achieves reliable calibration with as few as 2 test takers (30–60\(\times\) fewer than Binary-IRT), making it practical for scaling law studies where the number of model checkpoints far exceeds what binary IRT requires. The connection between per-problem exponential scaling and aggregate power-law scaling is formalized by Schaeffer et al. (2025), who show the power-law exponent is controlled by the left-tail shape of the success probability distribution, which is precisely the item difficulty distribution in IRT terms.

3.9.6 Optimization Methods

L-BFGS is described in Nocedal and Wright (2006). For deep learning optimizers applied to psychometric models, see Kingma and Ba (2015) for Adam.

3.10 Exercises

3.10.1 Theoretical Exercises

Exercise 3.1 (\(\star\)): Derive the gradient of the Rasch model log-likelihood with respect to \(\theta_i\). Show that it equals the sum of residuals: \(\frac{\partial \ell}{\partial \theta_i} = \sum_j (Y_{ij} - P_{ij})\).

Exercise 3.2 (\(\star\star\)): Prove that the Hessian matrix of the Rasch log-likelihood is negative semi-definite, ensuring the log-likelihood is concave.

Exercise 3.3 (\(\star\star\)): Show that for the Rasch model, the Fisher information for item \(j\) at ability \(\theta\) is \(I_j(\theta) = P_j(1-P_j)\), and that this is maximized when \(\theta = \beta_j\).

Exercise 3.4 (\(\star\star\star\)): Derive the EM algorithm for the 2PL model. What additional complications arise compared to the Rasch model due to the discrimination parameters?

Exercise 3.5 (\(\star\star\)): Show that L2 regularization on the parameters is equivalent to MAP estimation with Gaussian priors. What is the relationship between the regularization strength \(\lambda\) and the prior variance \(\sigma^2\)?

3.10.2 Computational Exercises

Exercise 3.6 (\(\star\star\)): Implement conditional MLE for the Rasch model. Use the fact that the conditional likelihood depends only on item parameters and can be computed using elementary symmetric functions.

Exercise 3.7 (\(\star\star\star\)): Implement a Gibbs sampler for the Rasch model that alternates between:

  • Sampling \(\theta_i \mid Y, \beta\) for each person (using slice sampling)
  • Sampling \(\beta_j \mid Y, \theta\) for each item

Compare the posterior estimates to those from Metropolis-Hastings.

Exercise 3.8 (\(\star\star\star\)): Implement marginal MLE using numerical quadrature to integrate out the ability parameters. Compare convergence and parameter recovery to the EM approach from this chapter.

Exercise 3.9 (\(\star\star\)): Implement Beta-IRT estimation. Given a response matrix \(P_{ij} \in [0, 1]\) of empirical probabilities (e.g., token probabilities or pass rates), maximize the Beta log-likelihood \(\sum_{i,j} \log p(P_{ij}; \mu_{ij}, \phi)\) where \(\mu_{ij} = \sigma(d_j(\theta_i - z_j))\) is the IRT-predicted mean and \(\phi > 0\) is a precision parameter. (a) Derive the gradient with respect to \(\theta_i\) and \(z_j\). (b) Simulate data with \(M = 50\) models and \(N = 200\) items, where \(P_{ij} \sim \text{Beta}(\mu_{ij} \phi, (1 - \mu_{ij})\phi)\) with \(\phi = 10\). (c) Compare parameter recovery of Beta-IRT vs. binary IRT (where responses are thresholded at 0.5) as a function of \(M\). Verify the finding of Truong et al. (2025) that Beta-IRT achieves reliable calibration with far fewer models than Binary-IRT.

3.10.3 Discussion Exercises

Exercise 3.10: Compare the convergence of gradient descent, L-BFGS, and Adam on a Rasch model estimation problem. Which converges fastest? Which is most robust to different initializations?

Exercise 3.11: Investigate the sensitivity of MLE and Bayesian estimation to model misspecification. Generate data from a 2PL model but fit a Rasch model. How do the estimated abilities compare? When does the misspecification matter most?

Exercise 3.12: Implement a cross-validation procedure for selecting between the Rasch, 2PL, and factor models. Apply it to benchmark data with different numbers of items and models. When does the additional complexity of the 2PL or factor model improve out-of-sample prediction?

References

Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Schaeffer, Rylan, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. 2025. "How Do Large Language Monkeys Get Their Power (Laws)?" arXiv preprint arXiv:2502.17578.
Truong, Sang, Yuheng Tu, Rylan Schaeffer, and Sanmi Koyejo. 2025. "Item Response Scaling Laws." arXiv preprint.