5  Reliability

NoteIntended Learning Outcomes

By the end of this chapter, the reader will be able to:

  1. Decompose a benchmark’s response variance into person, item, person-by-item, and residual components, derived directly from the IRT response model.
  2. Define reliability as the person (signal) share of that variance—the generalizability coefficient—and distinguish it from validity.
  3. Estimate the components and the reliability they imply (ANOVA/method of moments, marginal MLE, bias-corrected plug-in), and compute the classical score-based coefficients (Cronbach’s \(\alpha\), McDonald’s \(\omega\), split-half, test–retest, SEM), linked by Spearman–Brown.
  4. Diagnose the sources of variability in AI evaluation—prompt sensitivity, sampling stochasticity, rater disagreement, item sampling—and map each to the component it feeds.
  5. Apply generalizability theory to multi-facet designs, using G- and D-studies to size items and raters to a target reliability, including LLM-as-a-judge protocols.
  6. Explain conditional reliability (precision varies with ability) and why systematic bias, such as a judge’s position bias, escapes reliability analysis.
  7. Select a model’s facet structure—deciding which terms a design can identify—and design evaluations that control the dominant sources of variability.
  8. Diagnose individual items with differential item functioning, dimensionality assessment, item-fit statistics, and the multitrait-multimethod matrix, connecting each to the same variance decomposition and to the validity threats catalogued in Chapter 1.

This chapter can be covered in 4 lectures (75-90 minutes each):

Lecture 1: From responses to variance components (§5.1–5.5)

  • Leaderboard flicker and the sources of variability (15 min)
  • The response model and the four-term decomposition; reliability as the generalizability coefficient (20 min)
  • Three estimators of the components: plug-in and its bias, Bayesian / empirical Bayes, and method of moments (30 min)
  • Hands-on: variance attribution, and why the interaction needs replication (25 min)

Lecture 2: Designing the evaluation (§5.6–5.7)

  • Generalizability theory and the D-study: sizing items and raters to a target reliability (35 min)
  • Model selection: which facets a design can identify; held-out likelihood and WAIC (30 min)
  • Hands-on: cost-optimal D-study and held-out model selection (25 min)

Lecture 3: Reliability across model families (§5.8–5.11)

  • Factor models and McDonald’s \(\omega\); the \(\alpha\)\(\omega\) gap (20 min)
  • Ordinal responses, LLM-as-a-judge, and agreement coefficients (\(\kappa\), weighted \(\kappa\), Krippendorff’s \(\alpha\)) (25 min)
  • Classical test theory: \(\alpha\), split-half, SEM, Spearman–Brown (15 min)
  • Design principles for reliable AI evaluation (10 min)
  • Hands-on: \(\omega\) vs \(\alpha\), and the agreement coefficients (20 min)

Lecture 4: Item-level diagnostics and building trustworthy evaluations

  • DIF as an item-by-group interaction; Mantel-Haenszel, logistic-regression, and IRT-based tests (25 min)
  • Dimensionality assessment via parallel analysis; item-fit and contamination detection (25 min)
  • The multitrait-multimethod matrix; principled item construction and synthetic data (30 min)
NoteNotation

This chapter is built on the response-model variance decomposition \(\operatorname{Var}(Y) = V_{\text{person}} + V_{\text{item}} + V_{\text{interaction}} + V_{\text{residual}}\), with reliability \(\rho = V_{\text{person}} / \operatorname{Var}(Y)\)—the generalizability coefficient. The same components appear as the random-effects variances \(\sigma^2_p, \sigma^2_i, \sigma^2_{pi}\) (plus a rater facet \(\sigma^2_r\)), and collapse to classical test theory’s \(X = T + E\) with reliability \(\rho_{XX'}\), Cronbach’s \(\alpha\), and the standard error of measurement (SEM). Other symbols: \(\rho(\theta)\) and test information \(I(\theta)\) for conditional reliability, and \(\kappa\) (Cohen’s kappa) for rater agreement. The item-level diagnostics of Section 5.11 add their own symbols—\(\alpha_{MH}\) (the Mantel-Haenszel odds ratio), dimensionality eigenvalues \(\lambda_k\), and item-fit mean squares—each introduced where the tool is developed. See the notation appendix for the complete reference.

TipVideo Overview

A visual tour of the key concepts in this chapter — from why leaderboards flicker and the variance decomposition to estimating reliability, designing with G- and D-studies, LLM-judge agreement, and the classical coefficients.

5.1 Sources of Variability in AI Measurement

Imagine ten language models evaluated on a reasoning benchmark, with the rankings reported to a team. The next day, a colleague re-runs the same evaluation and gets a different ranking. The top two models swap places. A model that was fourth is now second. Nothing about the models changed—only the measurement procedure differed in small ways: a slightly different prompt template, a different random seed for sampling, a different subset of annotators scoring the open-ended responses.

This is the reliability problem. Before we can ask whether a benchmark measures what it claims to measure (the validity question, addressed in Chapter 1), we must first ask a more basic question: does the evaluation give the same answer when applied to the same thing twice?

Reliability is not accuracy. A thermometer that consistently reads two degrees too high is reliable—it gives the same answer every time—even though it is not accurate. Conversely, a thermometer that fluctuates randomly between readings is unreliable, and therefore cannot be accurate in any useful sense. In measurement science, reliability refers to the consistency or reproducibility of a measurement procedure.

ImportantReliability is Necessary for Validity

If an evaluation procedure produces different results every time it is applied to the same model, it cannot be measuring anything about the model. Unreliable measurements are noise, regardless of how carefully the benchmark was designed. Chapter 1 addresses validity—whether we are measuring the right thing. This chapter addresses the prior question: are we measuring anything at all?

The sources of inconsistency in AI evaluation are diverse: prompt formatting, temperature sampling, annotator disagreement, benchmark item selection, scoring rubric ambiguity, and even API version drift. Each source contributes variability that can distort model rankings. This chapter provides the theoretical tools to quantify, decompose, and control these sources of variability, drawing on Classical Test Theory and Generalizability Theory from the measurement sciences, applied throughout to the specific challenges of evaluating AI systems.

These swings have identifiable causes, and which one dominates decides the cure—adding items does nothing if the problem is rater disagreement. Four mechanisms account for most of the variability in AI evaluation.

Prompt sensitivity. Even at temperature 0, a model gives different responses under different prompt formats: “Let’s think step by step,” a multiple-choice versus open-ended answer format, or a reordering of the few-shot examples can shift rankings (Mizrahi et al. 2024). This is the test-format effect of educational testing—the same knowledge assessed differently by different instruments. If model \(A\) beats \(B\) on one template but not another, their standing is prompt-dependent, not ability-dependent.

Sampling stochasticity. At temperature \(> 0\) the same model gives different outputs to the same prompt on each run—the closest analogue to classical test–retest error. For binary items this is usually minimal (most models answer deterministically), but open-ended generations (summaries, code, essays) can earn different scores across runs. Replicated trials make it possible to measure this run-to-run noise and average it down.

Annotator and judge disagreement. For tasks that cannot be scored automatically—creative writing, safety, open-ended reasoning—human annotators or LLM judges assess quality, and they disagree. This is a distinct source that neither item difficulty nor model ability captures, so it needs its own treatment, which generalizability theory provides (Section 5.6); the special case of LLM judges—quantifying their agreement (Cohen’s \(\kappa\)) and detecting their systematic biases—is the subject of Section 5.9.

Benchmark sampling. The benchmark is itself a sample from a larger domain: a 200-item coding benchmark is one draw from the space of all coding problems, and another draw would rank models somewhat differently. Longer tests sample the domain more thoroughly and are more reliable—the Spearman–Brown effect (Section 5.10)—and the D-optimal item selection of Chapter 6 minimizes this variability by choosing a maximally informative sample.

To see these mechanisms bite, simulate many evaluation runs that perturb item difficulty (a prompt change), model ability (stochasticity), and the scores themselves (rater disagreement), then slice the resulting score variance by knob.

Most of that variability falls on models with similar abilities, so their ranks reshuffle from run to run—the flicker this section opened with, reproduced.

Equivalently, the 95% confidence interval around each model’s mean score overlaps its neighbors’—marking exactly the comparisons a leaderboard cannot support.

Rater disagreement is the dominant fixable source here, while prompt and item perturbations largely cancel across models on a fixed, long benchmark and stochasticity is the run-to-run noise that replication would tame. Which mechanism dominates—and how to spend a budget of items and raters against it—is what the rest of the chapter quantifies: first by decomposing the variance (Section 5.2) and estimating its parts, then by turning the decomposition into a design tool (Section 5.6).

5.2 From the Response Model to Variance Components

In Chapter 3 we modeled a single response as a sigmoid of a person term and an item term, \(P_{ij} = \sigma(\theta_i - \beta_j)\). Based on this model, in addition to predictive measurement, we can ask another question: when a benchmark score varies, where does the variance come from? The answer turns out to be the entire vocabulary of this chapter—person variance, item variance, their interaction, and an irreducible residual. When we score a model on a benchmark, we rarely care about its performance on these particular items, graded by this judge, on this day; we care about how it would do across the whole universe of items, judges, and occasions we might have used. The observed test is only a sample from that universe, and a score is dependable only insofar as it generalizes from the sample back to the universe. Each source of variation is a way the sample can misrepresent the universe, so attributing variance to these sources is how we quantify the threat to generalization. The generalizability coefficient is the share of score variance carried by the thing we want to generalize about—genuine differences between subjects—rather than by the luck of which items or judges we happened to draw.

Following the Rasch model of Chapter 3, let \(Y_{ij} \in \{0,1\}\) be the response of person \(i \in \{1,\dots,N\}\) on item \(j \in \{1,\dots,M\}\), with ability \(\theta_i\) and difficulty \(\beta_j\). We add one a person-by-item interaction \(\gamma_{ij}\) to the model. The interaction \(\gamma_{ij}\) lets a particular person be unusually good (or bad) on a particular item, over and above what overall ability and difficulty predict: a model that has memorized one specific question, or a question that is idiosyncratically easy for one model family, lives in \(\gamma_{ij}\). We then write the log-odds of a correct response as function of the two main effects and an interaction:

\[ \operatorname{logit} P(Y_{ij} = 1) = \theta_i - \beta_j + \gamma_{ij} \iff P_{ij} \;=\; P(Y_{ij} = 1 \mid \theta_i, \beta_j, \gamma_{ij}) \;=\; \sigma\!\big(\theta_i - \beta_j + \gamma_{ij}\big). \tag{5.1}\]

Subjects and items are not fixed but drawn from populations: \[ \theta_i \sim \mathcal{N}(0, \sigma_p^2), \qquad \beta_j \sim \mathcal{N}(0, \sigma_i^2), \qquad \gamma_{ij} \sim \mathcal{N}(0, \sigma_{pi}^2), \tag{5.2}\]

and treat the population variances—\(\sigma_p^2\) across subjects, \(\sigma_i^2\) across items, and \(\sigma_{pi}^2\) for the person-by-item term—as the quantities to attribute, and ultimately to estimate. On the log-odds scale, the attribution is immediate, because independent effects add:

\[ \operatorname{Var}(\theta_I - \beta_J + \gamma_{IJ}) = \sigma_p^2 + \sigma_i^2 + \sigma_{pi}^2 . \tag{5.3}\]

The person share of this latent variance is the reliability, or one-facet generalizability coefficient,

\[ \rho_\eta = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_i^2 + \sigma_{pi}^2}, \tag{5.4}\]

the fraction of variation that reflects genuine differences between subjects rather than item difficulty or idiosyncratic person–item effects. This is the idealized reliability—what we could achieve if we observed the log-odds \(\eta\) directly. A real benchmark gives only binary responses, whose Bernoulli sampling adds a further, irreducible error; carrying the decomposition through the sigmoid onto that observed scale, and watching the error shrink as items accumulate, is the subject of Section 5.5.

The decomposition is not limited to two facets. An evaluation that also varies raters—and, across runs, occasions—adds their main effects and interactions; because independent effects still add on the log-odds scale, each facet brings its own variance components. For models (\(p\)) answering items (\(i\)) scored by raters (\(r\)),

\[ \operatorname{Var}(\eta) = \sigma_p^2 + \sigma_i^2 + \sigma_r^2 + \sigma_{pi}^2 + \sigma_{pr}^2 + \sigma_{ir}^2 + \sigma_{pir}^2 , \tag{5.5}\]

one component per facet (\(\sigma_p^2\) model, \(\sigma_i^2\) item, \(\sigma_r^2\) rater) and one per interaction; the first three terms are Equation 5.3, the rest are what the rater facet adds. Every mechanism of Section 5.1 is one of these components—rater disagreement is \(\sigma_r^2\) and \(\sigma_{pr}^2\), run-to-run drift adds an occasion facet—and generalizability theory (Section 5.6) estimates them and spends a fixed budget of items and raters to drive the error down.

One caveat attaches to the rater facet. When a rater returns a categorical verdict—a label, or which of two responses is better—rather than a number, its facet is conventionally summarized not through \(\sigma_r^2\) directly but through a chance-corrected agreement coefficient, Cohen’s \(\kappa\), which plays the role of a generalizability coefficient for raters. We develop it where it is used, with LLM judges (Section 5.9).

With the estimand defined, the next three sections estimate it—plug-in from fixed effects (Section 5.3), Bayesian (Section 5.4), and method of moments (Section 5.5), in increasing distance from the response model we started with. Generalizability theory (Section 5.6) then turns the decomposition into a design tool for reliable AI evaluation.

5.3 Variance Estimation via Plug-in Method

We now have the estimand—the components \(\sigma_p^2, \sigma_i^2, \sigma_{pi}^2\) of Equation 5.2 and the reliability they define—but a finite benchmark only lets us estimate them. Three methods do so, and we take them in order of distance from the foundations of the earlier chapters. The most direct simply reuses the IRT machinery of Chapter 3 and Chapter 4: fit each model’s ability as a fixed effect, then ask how spread out the fitted abilities are.

Fit the Rasch abilities \(\hat\theta_1, \dots, \hat\theta_N\) by maximum likelihood with the item difficulties held fixed, and take their empirical variance as an estimate of \(\sigma_p^2\). Without correction, this is a biased estimation. Write each fitted ability as the truth plus estimation error, \(\hat\theta_i = \theta_i + e_i\), with \(\theta_i \sim \mathcal{N}(0, \sigma_p^2)\) the quantity of interest and \(e_i\) the Rasch fit’s error—mean zero, variance \(\operatorname{Var}(e_i) = \operatorname{SE}(\hat\theta_i)^2\). The fit’s error depends on the responses, not on the value of \(\theta_i\), so the two are independent and their variances add:

\[ \operatorname{Var}(\hat\theta_i) = \operatorname{Var}(\theta_i) + \operatorname{Var}(e_i) = \sigma_p^2 + \operatorname{SE}(\hat\theta_i)^2 . \]

The plug-in estimator is the empirical mean of these per-model variances, so (up to the usual \(\tfrac{N-1}{N}\) finite-sample factor) it targets their average, not \(\sigma_p^2\) alone:

\[ \mathbb{E}\Big[\tfrac{1}{N}\textstyle\sum_i (\hat\theta_i - \bar{\hat\theta})^2\Big] \;\approx\; \sigma_p^2 + \overline{\operatorname{SE}^2}, \qquad \overline{\operatorname{SE}^2} = \tfrac{1}{N}\textstyle\sum_i \operatorname{SE}(\hat\theta_i)^2 . \]

The inflation \(\overline{\operatorname{SE}^2}\) is the average sampling variance of the per-model estimates: noisy abilities spread out more than the true ones, so estimation error masquerades as between-model variance—the same confounding of signal and noise that makes the interaction and residual inseparable without replicated trials (Section 5.5). Subtracting it gives the bias-corrected estimate,

\[ \hat\sigma_p^2 = \tfrac{1}{N}\textstyle\sum_i (\hat\theta_i - \bar{\hat\theta})^2 - \overline{\operatorname{SE}^2}, \]

which is exactly the method-of-moments estimator of Section 5.5 seen on the logit scale. Each standard error is read from the Fisher information of the Rasch fit,

\[ \operatorname{SE}(\hat\theta_i) = \Big(\textstyle\sum_{j=1}^M \hat P_{ij}\,(1 - \hat P_{ij})\Big)^{-1/2} = I(\hat\theta_i)^{-1/2}, \qquad \hat P_{ij} = \sigma(\hat\theta_i - \beta_j), \]

the inverse square root of the test information \(I(\theta)\) of Section 6.2.1—each item contributing \(\hat P_{ij}(1 - \hat P_{ij})\), most when it sits near the model’s ability. The bias therefore shrinks like \(1/M\): information accumulates with items, so \(\overline{\operatorname{SE}^2} \sim 1/M\), and on long benchmarks the naive and corrected estimates of \(\sigma_p^2\) nearly agree—but for \(\sigma_{pi}^2\) they never do without replication.

The simulation plants abilities of known variance \(\sigma_p^2 = 1\), draws one binary trial per cell, fits the abilities, and compares the naive and bias-corrected estimates as the benchmark lengthens.

The naive estimate overshoots by exactly the average squared standard error; subtracting it recovers \(\sigma_p^2\). The correction is cheap, but the plug-in still treats each ability as a free constant, ignoring that the abilities are themselves a sample from a population—a limitation the Bayesian view of Section 5.4 removes by modelling that population directly.

The standard error of the plug-in, \(\operatorname{SE}(\hat\theta_i) = I(\hat\theta_i)^{-1/2}\), depends on ability through the information \(I(\theta)\): a model is pinned down precisely only where the items inform it. Reliability inherits that dependence. The single \(\rho\) of Equation 5.4 is a population average; the same response model gives a conditional reliability at each ability \(\theta\),

\[ \rho(\theta) = 1 - \frac{1}{I(\theta)\,\sigma_p^2}, \qquad I(\theta) = \sum_{j=1}^M P_j(\theta)\,\big(1 - P_j(\theta)\big), \tag{5.6}\]

where \(\sigma_p^2\) is the spread of abilities and \(I(\theta)\) is the test information of Section 6.2.1. Note that \(I(\theta)\) is nothing but the per-item residual \(P_j(1 - P_j)\)—the irreducible Bernoulli noise of Equation 5.11—summed at a fixed ability: the same variance that limits a single response is exactly the information that response carries.

For AI evaluation the consequence is sharp. A benchmark can post a high overall \(\rho\) (or Cronbach’s \(\alpha\)) yet be unreliable for the models that matter most. Once the frontier passes the difficulty of every item, \(P_j(\theta) \to 1\), the information \(I(\theta) \to 0\), and \(\rho(\theta)\) collapses—benchmark saturation that a single reliability number hides.

The fix is the design side of efficiency (Chapter 6): pick items whose difficulty matches the ability range of interest, or select them adaptively per model (Section 6.2.1, Section 6.2.2). Targeted items hold \(\rho(\theta)\) high exactly where models must be told apart.

5.4 Variance Estimation via Bayesian Method

The random-effects model Equation 5.2 is already a Bayesian hierarchical model: the distribution \(\theta_i \sim \mathcal{N}(0, \sigma_p^2)\) is a prior on each model’s ability, and the variance components \(\sigma_p^2, \sigma_i^2, \sigma_{pi}^2\) are its hyperparameters. Estimation then comes in two flavors, depending on what we do with those hyperparameters.

Empirical Bayes (marginal maximum likelihood). Treat the abilities as latent and integrate them out, leaving the marginal likelihood—the evidence—as a function of the hyperparameter, then pick the value that maximizes it. Writing the evidence at a candidate variance \(\boldsymbol{s}^2\),

\[ \mathcal{L}(\boldsymbol{s}^2) = \int \Big[\textstyle\prod_{i,j} P_{ij}^{\,Y_{ij}}(1 - P_{ij})^{1 - Y_{ij}}\Big]\, p(\theta, \beta, \gamma \mid \boldsymbol{s}^2)\; d\theta\, d\beta\, d\gamma, \qquad P_{ij} = \sigma(\theta_i - \beta_j + \gamma_{ij}), \]

with the population (prior) density \(p(\theta, \beta, \gamma \mid \boldsymbol{s}^2) = \prod_i \mathcal{N}(\theta_i; 0, s_p^2)\,\prod_j \mathcal{N}(\beta_j; 0, s_i^2)\,\prod_{ij}\mathcal{N}(\gamma_{ij}; 0, s_{pi}^2)\), the estimate is the maximizer

\[ \hat{\boldsymbol{\sigma}}^2 = \arg\max_{\boldsymbol{s}^2}\, \log \mathcal{L}(\boldsymbol{s}^2), \]

a search for the prior scale that best explains the data—a hyperparameter point estimate. The Bernoulli noise sits inside the likelihood, so no separate residual correction is needed and the estimates are logit-scale. With item difficulties fixed and only abilities random, the evidence is a product of one-dimensional integrals,

\[ \mathcal{L}(s^2) = \prod_{i=1}^N \int_{\mathbb{R}} \prod_{j=1}^M \sigma(\theta - \beta_j)^{Y_{ij}}\big(1 - \sigma(\theta - \beta_j)\big)^{1 - Y_{ij}}\, \mathcal{N}(\theta; 0, s^2)\, d\theta, \]

which a grid evaluates exactly.

Full Bayes. Rather than point-estimate the hyperparameter, place a hyperprior \(p(\boldsymbol{s}^2)\) on it and report the whole posterior,

\[ p(\boldsymbol{s}^2 \mid Y) \;\propto\; \mathcal{L}(\boldsymbol{s}^2)\, p(\boldsymbol{s}^2), \]

summarized by its mean and a 95% credible interval. Empirical Bayes is then the special case that keeps only the peak of this posterior (the evidence maximizer, under a flat hyperprior); full Bayes keeps the spread, which is the honest uncertainty about \(\sigma_p^2\) from a finite sample. For the multi-facet model the posterior is high-dimensional and is sampled by MCMC (NumPyro or PyMC); for the one-facet case with difficulties fixed, the same one-dimensional grid that gave the evidence gives the posterior directly.

The simulation evaluates the evidence over candidate ability spreads \(s\), marks the empirical-Bayes maximizer, and overlays the full posterior (evidence times a flat hyperprior, normalized) with its mean and 95% credible interval.

The two views answer different questions: empirical Bayes returns the single most-supported \(\sigma_p\), while full Bayes returns the range the data leave open—wide here, because one binary response per cell is weak evidence. Both, like the plug-in, work on the latent scale. The last method drops the model entirely and reads the components off the observed scores.

5.5 Variance Estimation via Method of Moments

The plug-in and Bayesian methods fit the response model; the third estimator fits nothing, reading the components off the observed scores by method of moments. Because the scores are binary, the components it exposes live on the probability scale, which we develop first. Section 5.2 worked on the log-odds scale, where the three components add cleanly. A benchmark, though, produces only binary responses \(Y \sim \text{Bernoulli}(P_{IJ})\)—not the latent predictor. We now carry the decomposition through the sigmoid to \(\operatorname{Var}(Y)\): the nonlinear link recasts the three terms onto the probability scale \(P_{ij}\), where the decomposition stays exact and orthogonal, and adds a fourth—the Bernoulli noise. We work on the probability scale because it yields that residual for free. (The orthogonal decomposition that follows needs only that persons and items are sampled independently, not that they are Gaussian.) The recipe is analysis of variance (ANOVA): split the total variance into additive, non-overlapping contributions, one per source—here persons, items, their interaction, and noise—by measuring how far each person’s and each item’s average sits from the overall average. Carried out on the cell-probability function \(P_{ij}\) rather than on raw scores, the version we use is the functional ANOVA, or Hoeffding–Sobol, decomposition.

We introduce some notation \[ \bar P = \mathbb{E}_{I,J}[P_{IJ}], \qquad P_{i\bullet} = \mathbb{E}_J[P_{iJ}], \qquad P_{\bullet j} = \mathbb{E}_I[P_{Ij}], \tag{5.7}\]

and define the person main effect \(P_{i\bullet} - \bar P\), the item main effect \(P_{\bullet j} - \bar P\), and the interaction \(P_{ij} - P_{i\bullet} - P_{\bullet j} + \bar P\), so that

\[ P_{ij} = \bar P + \underbrace{(P_{i\bullet} - \bar P)}_{\text{person}} + \underbrace{(P_{\bullet j} - \bar P)}_{\text{item}} + \underbrace{(P_{ij} - P_{i\bullet} - P_{\bullet j} + \bar P)}_{\text{interaction}} . \tag{5.8}\]

Each term removes progressively more conditioning: the person effect is how much person \(i\) shifts the success probability after averaging over all items, the item effect is the analogous shift for item \(j\), and the interaction is whatever remains—the part of the cell that neither margin explains. By construction these effects are mean-zero and mutually orthogonal (each averages to zero over any index it retains), so every cross-covariance vanishes and the variance of the success probability splits with no cross terms:

\[ \operatorname{Var}(P_{IJ}) = \underbrace{\operatorname{Var}_I\!\big(P_{I\bullet}\big)}_{V_{\text{person}}} \;+\; \underbrace{\operatorname{Var}_J\!\big(P_{\bullet J}\big)}_{V_{\text{item}}} \;+\; \underbrace{\mathbb{E}\big[(P_{IJ} - P_{I\bullet} - P_{\bullet J} + \bar P)^2\big]}_{V_{\text{interaction}}} . \tag{5.9}\]

The three terms above account for variance in the probability of a correct response, but the graded outcome \(Y\) is a Bernoulli draw from that probability. The law of total variance, conditioning on the sampled cell \((I,J)\), supplies the last piece:

\[ \operatorname{Var}(Y) = \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid I,J]\big)}_{\operatorname{Var}(P_{IJ})\,=\,V_{\text{person}} + V_{\text{item}} + V_{\text{interaction}}} \;+\; \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid I,J)\big]}_{V_{\text{residual}}} . \tag{5.10}\]

Since \(\operatorname{Var}(Y \mid I,J) = P_{IJ}\big(1 - P_{IJ}\big)\), the residual is the average Bernoulli noise,

\[ V_{\text{residual}} = \mathbb{E}\big[P_{IJ}\,(1 - P_{IJ})\big], \tag{5.11}\]

the irreducible measurement noise of a single graded trial—largest near \(P_{ij} = 0.5\), vanishing as items become trivially easy or impossibly hard. Collecting the four pieces gives the decomposition that organizes the rest of this chapter:

\[ \operatorname{Var}(Y) = V_{\text{person}} + V_{\text{item}} + V_{\text{interaction}} + V_{\text{residual}} \tag{5.12}\]

Dividing through by \(\operatorname{Var}(Y)\) turns each term into a share of total variance. The person share is the observed-scale reliability—the analogue of the idealized \(\rho_\eta\) (Equation 5.4), now with the Bernoulli residual in the denominator, which a single graded response cannot escape:

\[ \rho \;=\; \frac{V_{\text{person}}}{V_{\text{person}} + V_{\text{item}} + V_{\text{interaction}} + V_{\text{residual}}} . \tag{5.13}\]

Equation 5.12 is the variance-component decomposition of a one-facet Generalizability study, derived here from the response model rather than postulated. Each probability-scale term names a classical G-theory component—the score-scale image of a latent variance from Equation 5.2:

Term in Equation 5.12 G-theory component What it measures
\(V_{\text{person}}\) universe-score (person) variance real differences between test-takers (the signal)
\(V_{\text{item}}\) item variance some items are just harder than others
\(V_{\text{interaction}}\) person\(\times\)item variance a person is unusually strong/weak on specific items
\(V_{\text{residual}}\) residual irreducible Bernoulli grading noise
\(\rho\) (Equation 5.13) generalizability coefficient / ICC reliability

The same logic extends to more facets—raters, occasions—each adding its own variance components, the multi-facet decomposition of Equation 5.5; generalizability theory (Section 5.6) estimates those components and designs around them.

ImportantWhy replicated trials are needed

With a single graded response per cell, the estimated cell probability \(\hat P_{ij}\) is just \(0\) or \(1\), and the interaction \(V_{\text{interaction}}\) is perfectly confounded with the Bernoulli residual \(V_{\text{residual}}\)—there is no way to tell an idiosyncratic person-item effect apart from a lucky guess. Separating them requires replication: multiple trials per person-item cell, so that \(\hat P_{ij}\) is a within-cell average. This is the statistical reason a long-form evaluation record carries a trial field alongside the person and item identifiers, and it is exactly the move from a one-facet to a two-facet G-study.

These observed-scale components are estimated straight from the scores by the classic G-study route—analysis of variance / method of moments. For a crossed person-by-item design with \(n_p\) persons, \(n_i\) items, and one trial per cell, write the person (row), item (column), and grand means of the observed responses,

\[ \bar Y_{i\bullet} = \frac{1}{n_i}\sum_{j} Y_{ij}, \qquad \bar Y_{\bullet j} = \frac{1}{n_p}\sum_{i} Y_{ij}, \qquad \bar Y_{\bullet\bullet} = \frac{1}{n_p n_i}\sum_{i,j} Y_{ij}, \tag{5.14}\]

and form the person, item, and interaction mean squares

\[ MS_p = \frac{n_i}{n_p - 1}\sum_{i}\big(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}\big)^2, \quad MS_i = \frac{n_p}{n_i - 1}\sum_{j}\big(\bar Y_{\bullet j} - \bar Y_{\bullet\bullet}\big)^2, \quad MS_{pi} = \frac{\sum_{i,j}\big(Y_{ij} - \bar Y_{i\bullet} - \bar Y_{\bullet j} + \bar Y_{\bullet\bullet}\big)^2}{(n_p - 1)(n_i - 1)} , \tag{5.15}\]

the average squared person, item, and residual-cell deviations, scaled by their degrees of freedom. Treating persons and items as randomly sampled (the only assumption behind Equation 5.9), their expectations are

\[ \mathbb{E}[MS_p] = (V_{\text{interaction}} + V_{\text{residual}}) + n_i\,V_{\text{person}}, \quad \mathbb{E}[MS_i] = (V_{\text{interaction}} + V_{\text{residual}}) + n_p\,V_{\text{item}}, \quad \mathbb{E}[MS_{pi}] = V_{\text{interaction}} + V_{\text{residual}} , \tag{5.16}\]

so equating observed to expected and solving gives \(\hat V_{\text{person}} = (MS_p - MS_{pi})/n_i\) and \(\hat V_{\text{item}} = (MS_i - MS_{pi})/n_p\), while \(MS_{pi}\) estimates the interaction and residual together—fused until trials are replicated (Section 5.6). This last step is what names the method: a moment is a summary statistic whose expected value the model predicts—a mean, a variance, or here a mean square—and the method of moments estimates parameters by setting each observed moment equal to its predicted expectation and solving the resulting equations. Unlike the plug-in and Bayesian routes it maximizes no likelihood and assumes no prior, which makes it the cheapest of the three but also the bluntest—it uses only these low-order summaries of the data. The empirical cell proportions \(\hat P_{ij}\) give the same components through the functional ANOVA of Equation 5.9.

The simulation below builds the response model, reads off the four population components, and then asks two questions: can we recover them from finite data, and how does the interaction erode reliability? We take it in three short steps.

First, construct the cell-probability matrix from the model and compute the four components directly from it.

Estimating the components from finite data recovers the main effects, but with a single trial per cell the interaction is inseparable from Bernoulli noise: its estimate is badly inflated, and only converges to the truth once each cell is replicated.

Finally, the interaction is what erodes reliability: as the person-by-item term grows, more of the variance is idiosyncratic, and the person share—the generalizability coefficient—falls.

These estimates use the full response matrix; when only total scores are available, the classical test theory coefficients of Section 5.10 recover the same reliability directly.

5.6 Design for Generalizability

Equation 5.5 wrote the variance of a multi-facet design as a sum of components. A G-study estimates those components from data—by the method-of-moments or REML routes of Section 5.5 and Section 5.4. Generalizability theory’s distinctive step is the D-study (Cronbach et al. 1972): given the components, design the evaluation—how many items and raters buy a target reliability. That, and the design principles it implies, close the chapter.

The generalizability coefficient is the ratio of true score variance to true-plus-relative-error variance:

\[ G = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{\text{rel}}} \]

where the relative error variance for a design with \(n_i\) items and \(n_r\) raters is:

\[ \sigma^2_{\text{rel}} = \frac{\sigma^2_{pi}}{n_i} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{pir,e}}{n_i \cdot n_r} \]

The D-study plugs the estimated components into this coefficient for a planned design and asks how many items and raters it takes to reach, say, \(G \geq 0.90\). The key insight is that facets can be traded off against each other. If rater variance dominates (\(\sigma^2_{pr}\) is large), increasing \(n_r\) is more cost-effective than increasing \(n_i\). If item sampling variance dominates (\(\sigma^2_{pi}\) is large), more items help more than more raters. The D-study makes these tradeoffs explicit and quantitative. Take the seven variance components of a simulated model×item×rater design as given (a G-study estimates them; Section 5.5, Section 5.4) and read off the breakdown.

The D-study plugs those components into the generalizability coefficient for a planned design of \(n_i\) items and \(n_r\) raters. Sweeping both traces the design frontier—and shows that items and raters substitute for each other only up to a point.

Add costs—items are cheap (an API call), raters expensive (annotation)—and the D-study becomes a constrained optimization: for each budget, find the \((n_i, n_r)\) that maximizes \(G\).

5.7 Model Selection: Which Facets to Control

The D-study fixed how many items and raters to collect, but it took the facet structure as given—person and item main effects, a person-by-item interaction, perhaps a rater facet. Choosing that structure is the other half of design. The chapter opened with the slogan that controlling more sources of variation yields more dependable conclusions, and that is true for bias: a richer decomposition stops misattributing variance. But every added term costs identifiability and estimation variance, so more is not free. Deciding which terms earn their place is model selection.

Identifiability comes first. A term that cannot be separated from another cannot be selected, no matter how clever the criterion. The leading example is already in hand: with one trial per cell, the person-by-item interaction \(\sigma^2_{pi}\) is perfectly confounded with the Bernoulli residual (Section 5.5)—no statistic distinguishes an idiosyncratic person–item effect from an unlucky draw. So model selection and design are coupled: the design decides which models are even askable. To test for an interaction, the D-study must budget replicated trials per cell; otherwise the question is unanswerable and the honest move is to fuse interaction and residual into one term.

Two interactions, two held-out tests. “Do we need the interaction?” hides two very different questions. As an iid effect \(\gamma_{ij}\sim\mathcal N(0,\sigma^2_{pi})\), the interaction says nothing about an unseen \((\text{model},\text{item})\) cell—its posterior there is the prior—so it is testable only by holding out replicated trials of seen cells. As a structured effect—unequal item discriminations (the 2PL of Chapter 3) or extra latent dimensions—it changes predictions for unseen cells, and is tested by holding out cells with the masking schemes of Section 4.7.1. The second is usually the real question: “is Rasch enough, or do we need 2PL/multidimensional structure?” is prescriptive-vs-descriptive model choice in disguise.

The criterion. The natural starting point—comparing likelihood on held-out data—is correct, with three refinements. (i) Use the marginal predictive likelihood, integrating out the random effects, not the conditional likelihood at point estimates, which always flatters the richer model. (ii) Hold out the right thing (cells vs. trials, per the previous paragraph). (iii) WAIC and PSIS-LOO (Watanabe 2010; Vehtari et al. 2017) approximate that held-out likelihood cheaply and handle the random-effect integration; a likelihood-ratio test of \(\sigma^2_{pi}=0\) is a boundary test (null on the edge of the parameter space), so its reference law is a \(\tfrac12\chi^2_0+\tfrac12\chi^2_1\) mixture rather than a plain \(\chi^2_1\)—easy to misuse, another reason to prefer cross-validation. Pair any of these with residual diagnostics—Rasch infit/outfit, or the inter-item correlations of the item analysis (Section 3.3)—which show where the main-effects model misfits, not merely that it does.

The simulation generates replicated responses from an additive-plus-interaction model and compares two fits on held-out trials—a main-effects model (person and item effects only) and a saturated model (a free probability per cell, the interaction taken to the limit)—as replication grows.

What actually decides it. Held-out likelihood answers “does the term improve prediction,” but the chapter’s framing points at the sharper question: does the term change the conclusion of interest—the reliability number, the ranking? Adding \(\sigma^2_{pi}\) lowers bias in the reliability estimate but raises its variance, and in the small-data regime the simpler model can yield a more dependable conclusion even while slightly biased. A term that is statistically “significant” yet leaves the ranking untouched has not earned its place; one that drops reliability from \(0.9\) to \(0.6\) plainly has. Select the model against the decision, not against fit in the abstract—which is exactly why model selection belongs next to design rather than apart from it.

5.8 Generalizability for Factor Models

Every reliability so far has assumed items are equally diagnostic: in the Rasch model behind Equation 5.4 each item relates to ability with the same unit slope, differing only in difficulty. Real benchmark items violate this—a few separate strong models from weak ones sharply, while many barely discriminate at all. The factor model (Section 3.2.3) captures that heterogeneity by giving each item its own loading, and its reliability coefficient is McDonald’s omega. Rather than quote the formula, we derive it from a generative model.

Give model \(i\) a latent ability \(\theta_i \sim \mathcal{N}(0, \sigma_\theta^2)\), and let item \(j\) load on it with its own slope \(\lambda_j\)—the loading, the linear cousin of an IRT discrimination—around a baseline \(\nu_j\) (the cousin of difficulty):

\[ X_{ij} = \nu_j + \lambda_j\,\theta_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \psi_j), \quad \theta \perp \varepsilon . \tag{5.17}\]

This is the one-factor (congeneric) model: a single common factor \(\theta\) generates every item, but with item-specific strength \(\lambda_j\) and item-specific noise variance \(\psi_j\) (the uniqueness). For graded or continuous scores—rubric ratings, reward-model outputs—it holds directly; for binary items it is the linear approximation to the logistic model of Chapter 3. Equal loadings \(\lambda_j \equiv \lambda\) recover the Rasch/tau-equivalent case behind Equation 5.4.

Summing Equation 5.17 across items for a subject \(i\),

\[ X_i = \underbrace{\textstyle\sum_j \nu_j}_{\text{constant}} \;+\; \Big(\textstyle\sum_j \lambda_j\Big)\theta_i \;+\; \textstyle\sum_j \varepsilon_{ij} . \]

The baselines \(\sum_j \nu_j\) are identical for every model, so they shift all scores equally and add nothing to between-model variance. With \(\theta \perp \varepsilon\) and independent errors, the total-score variance splits cleanly into a common-factor (true) part and an error part,

\[ \sigma_X^2 = \Big(\textstyle\sum_j \lambda_j\Big)^2 \sigma_\theta^2 \;+\; \textstyle\sum_j \psi_j , \]

and reliability is the true share of it—McDonald’s omega:

\[ \omega = \frac{\big(\sum_j \lambda_j\big)^2 \sigma_\theta^2}{\big(\sum_j \lambda_j\big)^2 \sigma_\theta^2 + \sum_j \psi_j} . \tag{5.18}\]

(Fixing \(\sigma_\theta^2 = 1\) for identification gives the textbook form.) The signal is the squared sum of loadings—items reinforce one another, so loadings add before squaring—while the noise is the plain sum of uniquenesses.

Heterogeneous loadings are the rule on AI benchmarks—a handful of items separate frontier models while most are saturated—so \(\alpha\) systematically understates reliability and \(\omega\), which models each item’s diagnosticity, is the coefficient to prefer. A bifactor refinement, \(\omega_h\), isolates the share carried by a single general factor, quantifying how nearly unidimensional a benchmark is—the question \(\alpha\) cannot answer (Section 5.10).

5.9 Generalizability for Ordinal Response

Whenever scoring cannot be automated—creative writing, safety, instruction following—a rater supplies the score, and the rater facet of Section 5.6 turns from a footnote into the dominant source of variability. Two rater settings recur in modern AI evaluation: an LLM judge issuing categorical or pairwise verdicts, and a human or model assigning an ordinal rating on a scale. The generalizability machinery of this chapter governs both; each adds its own practical wrinkles—systematic biases for the judge, scale usage and unequal spacing for the rating.

Judges need not return a binary or pairwise verdict. Much of modern AI evaluation is ordinal: a judge rates helpfulness on a 1–5 scale, a rubric scores an answer 0–4, MT-Bench grades a response from 1 to 10, a human marks fluency poor / fair / good / excellent. The categories are ordered but not equally spaced—the gap from poor to fair need not equal the gap from good to excellent. Reliability for such scores needs one extra idea, because the obvious move—average the numbers and decompose their variance—silently assumes the spacing is uniform. Two routes handle ordinal data, trading rigor for convenience.

Treating the scale as interval. Classical generalizability theory was built for numeric ratings, so the simplest route takes the categories at face value: assign them scores \(1, \dots, K\), treat the gaps as equal, and run the variance decomposition of Section 5.6 unchanged. For models (\(p\)) scored by judges (\(r\)),

\[ X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr} + \varepsilon, \tag{5.19}\]

with the generalizability coefficient \(G = \sigma_p^2 / (\sigma_p^2 + \sigma_\delta^2)\) exactly as before. This is what most G-studies do in practice, and it is usually defensible when \(K\) is moderate (\(K \ge 5\)) and the categories are roughly evenly spaced. The cost is the equal-spacing assumption, and a specific artifact it invites: a judge who simply uses the high end of the scale—awarding 4s and 5s where another gives 2s and 3s—inflates the judge variance \(\sigma_r^2\) as pure scale-usage bias, even when the two judges rank models identically. Such a main effect hurts absolute generalizability (the level of a score) but not relative generalizability (the ranking), and the D-study (Section 5.6) reads off how many judges wash it out.

A latent threshold model. The model-based route keeps the latent decomposition of Section 5.2 and adds ordered cut points. Let the latent quality be \(\eta_{ij} = \theta_i - \beta_j + \gamma_{ij}\) (plus rater and occasion facets), with \(\operatorname{Var}(\eta) = \sigma_p^2 + \sigma_i^2 + \cdots\) exactly as on the log-odds scale before. A set of ordered thresholds \(\tau_1 < \cdots < \tau_{K-1}\) then carves \(\eta\) (plus logistic noise) into categories, giving the graded-response or cumulative-link model (Samejima 1969; McCullagh 1980):

\[ P(Y_{ij} \ge k) = \sigma(\eta_{ij} - \tau_k), \qquad k = 1, \dots, K-1. \tag{5.20}\]

The thresholds are where a rater draws the line between a 3 and a 4, so a judge reluctant to award 5s simply has high upper thresholds—scale usage now lives in \(\tau\), a property of the rater, instead of contaminating the merit \(\theta\). Reliability is the same signal share, \(\rho = \sigma_p^2 / \operatorname{Var}(\eta)\), computed on the latent scale, and the multinomial spread across categories is the residual—the ordinal analog of the Bernoulli residual of Section 5.5. (The Rasch family has its own ordinal members, the partial-credit and rating-scale models (Masters 1982), for thresholds shared across items.) This route is more work but respects ordinality and isolates scale usage cleanly; Route 1 is its fast approximation.

Agreement for ordinal verdicts. The plain \(\kappa\) of Section 5.9 ignores how far apart two ratings are: a 1-vs-2 disagreement counts the same as a 1-vs-5. For ordered categories use weighted kappa (Cohen 1968), which penalizes each disagreement by its distance on the scale (linear or quadratic weights). Quadratic-weighted kappa is, conveniently, an estimate of the intraclass correlation—so it is again the rater-facet generalizability coefficient, now respecting order—and Krippendorff’s \(\alpha\) with an ordinal metric does the same for many raters and missing data. Unweighted \(\kappa\) understates ordinal agreement; the weighted version recovers it.

For open-ended tasks—summarization, safety, instruction following—the dominant modern rater is an LLM judge (Zheng et al. 2023), scoring outputs against a rubric or making pairwise preference calls. Everything above applies unchanged: a judge is a rater with its own leniency (\(\gamma_r\)), model-specific quirks (\(\sigma^2_{pr}\)), and run-to-run variability, so aggregating \(n_r > 1\) judges buys reliability at the \(1/n_r\) rate of the D-study. Two things, though, are special to judges.

Agreement is categorical. Judge outputs are labels or pairwise winners. As flagged with the rater facet in Section 5.2, their agreement is summarized by a chance-corrected agreement coefficient rather than a variance ratio. Three questions recur—do different judges agree (inter-judge), does a judge agree with humans (human–LLM), is a judge self-consistent across repeats (intra-judge)—and each is such a coefficient between the relevant raters.

NoteDefinition: Cohen’s Kappa

For two raters labeling the same items, let \(p_o\) be the proportion on which they agree and \(p_e\) the agreement expected if each labeled at random from their own marginal frequencies. Cohen’s kappa (Cohen 1960) is the share of the non-chance agreement that is actually achieved:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} . \]

Chance agreement \(p_e\) is the noise floor and perfect agreement (\(p_o = 1\)) the ceiling, so \(\kappa\) runs from \(0\) (no better than chance) to \(1\) (perfect)—the agreement counterpart of the generalizability coefficient. Negative values signal systematic disagreement.

Simulate five judges of differing quality on 200 pairwise comparisons, and their pairwise \(\kappa\) shows how much they agree beyond chance.

One coefficient for many raters. The matrix above already shows the limit of Cohen’s \(\kappa\): with \(R\) judges it has \(\binom{R}{2}\) entries and no single summary. Two coefficients collapse it to one number while turning the knobs Cohen’s \(\kappa\) holds fixed. Fleiss’s \(\kappa\) (Fleiss 1971) keeps the chance-corrected form but averages agreement over all rater pairs per item, so it handles any number of raters at once. Krippendorff’s \(\alpha\) (Krippendorff 2011) is the general case: writing the observed and expected disagreement under a category-distance metric \(\delta\),

\[ \alpha = 1 - \frac{D_o}{D_e}, \tag{5.21}\]

it recovers the rest by the choice of \(\delta\) (nominal \(\to\) unweighted; interval or ordinal \(\to\) weighted), accepts any number of raters, and—because it sums over whatever rating pairs are actually present—is the only one that survives missing data. All four are the same coefficient with three knobs:

Coefficient Raters Category metric Missing data
Cohen’s \(\kappa\) 2 nominal no
Weighted \(\kappa\) 2 ordinal / interval no
Fleiss’s \(\kappa\) any nominal no
Krippendorff’s \(\alpha\) any any (\(\delta\)) yes

Fleiss’s \(\kappa\) and nominal Krippendorff’s \(\alpha\) land together; the interval metric lifts the coefficient by crediting near-misses (the multi-rater echo of weighted \(\kappa\)); and only Krippendorff’s \(\alpha\) still returns a number once ratings go missing—which is why it is the default for messy, multi-annotator evaluation data.

Aggregating judges. Because a judge is a rater, the D-study remedy applies directly: pool several independent judges by majority vote and accuracy climbs at the \(1/n_r\) rate, just as adding raters raised \(G\) above.

Bias is not noise. Everything so far—CTT, G-theory, agreement coefficients—measures random error, which averages out with more data. Judges also have systematic errors that do not: position bias (favoring the response shown first (Wang et al. 2023)), length preference, self-preference. These shift the verdict, not its variance, so they are invisible to reliability.

ImportantSystematic vs. random error

Reliability analysis detects only random error—zero-mean, averaging away with replication. A systematic error like position bias is consistent, so a biased judge can be perfectly reliable yet invalid, and high inter-judge agreement does not rescue it: two judges can share the same bias. On a balanced benchmark such bias is even invisible to aggregate accuracy, because favoring “first” helps exactly as often as it hurts. Detecting it requires the validity tools of Chapter 1, or a design trick: swap the presentation order. A position-biased judge then contradicts itself, and the contradiction rate rises with the bias.

Designing a judge protocol. The remedies combine the D-study with bias control: aggregate several independent judges; randomize the presentation order and average, cancelling position bias into noise; score at temperature 0 to remove sampling stochasticity; write concrete rubrics, which shrink \(\sigma^2_r\) and \(\sigma^2_{ir}\); and calibrate against a human gold set, since agreement alone cannot catch a bias the judges share.

5.10 Generalizability for CTT

The estimators so far—plug-in (Section 5.3), Bayesian (Section 5.4), and method of moments (Section 5.5)—read the variance components off the full response matrix. Classical test theory (CTT), their early-twentieth-century antecedent, asks for less: it recovers reliability from total scores alone, with no fitted item parameters. It keeps the same decomposition but bundles every non-person term into a single error, writing the observed score \(X = T + E\) with true score \(T_i = \mathbb{E}[X_i]\), \(\sigma_T^2 = V_{\text{person}}\), and \(\sigma_E^2 = V_{\text{item}} + V_{\text{interaction}} + V_{\text{residual}}\). Reliability is the true-score share \(\rho_{XX'} = \sigma_T^2 / \sigma_X^2\)—for a single graded response exactly the person share of Equation 5.13, the same object the response model and generalizability theory estimate, now read off the scores rather than fitted.

Because true scores are unobservable, \(\rho_{XX'}\) must be inferred from how scores covary, and the classical coefficients are just different cuts of that one quantity. They also settle a puzzle from Section 5.5: a single graded response had reliability near \(0.09\), yet hundred-item benchmarks rank models stably. The reconciliation is aggregation, made exact by the Spearman–Brown length law—a test of \(K\) parallel length-\(M\) forms with per-form reliability \(\rho\) has reliability

\[ \rho_K = \frac{K \rho}{1 + (K-1)\rho}, \tag{5.22}\]

so reliability climbs predictably as items accumulate (a 50-item benchmark at \(\rho = 0.80\) reaches \(0.89\) at 100 items and \(0.94\) at 200): averaging many noisy items multiplies signal over noise.

Three coefficients estimate the same \(\rho_{XX'}\) from different data structures, each trading what it assumes for what it requires:

  • Test–retest correlates scores from two occasions—re-running at temperature \(> 0\), or re-scoring the same outputs—so the Pearson correlation estimates \(\rho_{XX'}\), but folds every occasion-to-occasion source into one number.
  • Split-half correlates two half-tests and rescales the half-length correlation by Spearman–Brown, \(\rho_{XX'} = 2 r_{12}/(1 + r_{12})\), needing no re-run.
  • Cronbach’s \(\alpha\) averages over all splits at once,

\[ \alpha = \frac{M}{M - 1}\left(1 - \frac{\sum_{j=1}^M \sigma^2_j}{\sigma^2_X}\right), \tag{5.23}\]

with \(\sigma^2_j\) the variance of item \(j\) and \(\sigma^2_X\) the total-score variance—the internal-consistency coefficient, and (from Section 5.8) the tau-equivalent special case of McDonald’s \(\omega\).

WarningWhat Cronbach’s Alpha Does and Does Not Reveal

Alpha measures how much items covary relative to total variance. High alpha means the items are measuring something in common. But alpha does not indicate: (a) whether that common thing is unidimensional, (b) whether that thing is the intended construct, or (c) whether the test is free of systematic bias.

A benchmark with \(\alpha = 0.95\) might be reliably measuring training data contamination rather than the intended capability. High reliability is necessary but not sufficient for good measurement.

Simulate a 100-model, 100-item Rasch benchmark: split-half correlates two half-test scores, then Spearman–Brown rescales to the full length.

Cronbach’s \(\alpha\) averages over all possible splits; as items accumulate it climbs along the Spearman–Brown curve predicted from a short test.

From reliability to an error bar. A coefficient earns its keep once turned into uncertainty on each score. The standard error of measurement does exactly that, shrinking the spread of observed scores by the fraction that is noise:

\[ \text{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}. \]

A benchmark with score spread \(\sigma_X = 5\) points and \(\rho_{XX'} = 0.90\) has \(\text{SEM} \approx 1.58\) points, so a 95% interval around any model spans \(\pm 2\,\text{SEM} \approx \pm 3.2\) points: two models closer than that may swap ranks on a replication—the usual situation on crowded leaderboards where dozens of models cluster within a few points. As reliability rises with test length, the SEM shrinks in step.

5.11 Item-Level Diagnostics

The diagnostics in this section read structure out of the same variance decomposition that defines reliability. Recall the person-by-item interaction \(\gamma_{ij}\) of Section 5.2, whose canonical example was a question “idiosyncratically easy for one model family”: once that interaction is indexed by a meaningful group, the same term becomes a bias to detect rather than a variance to average down—which is exactly what differential item functioning does. The tools below—DIF, dimensionality assessment, item fit, and the multitrait-multimethod matrix—operationalize the validity threats catalogued conceptually in Section 1.4: each is a structured reading of how items depart from a single, shared construct.

5.11.1 Differential Item Functioning

Differential Item Functioning (DIF) occurs when an item is systematically easier or harder for one subgroup of examinees than for another, after controlling for the overall ability level. In classical psychometrics, DIF analysis compares demographic groups; in AI evaluation, the “groups” might be model families, architectural types, or training paradigms.

Formally, an item \(i\) exhibits DIF if:

\[ P(X_{ij} = 1 \mid \theta_j, g = 1) \neq P(X_{ij} = 1 \mid \theta_j, g = 0) \tag{5.24}\]

where \(\theta_j\) is model \(j\)’s ability and \(g\) is group membership. The conditioning on \(\theta_j\) is critical: we are asking whether the item behaves differently for equally able models from different groups.

DIF comes in two forms:

  • Uniform DIF: The item is consistently easier (or harder) for one group across all ability levels. The ICCs for the two groups are shifted but do not cross.
  • Non-uniform DIF: The item favors one group at some ability levels and the other group at different ability levels. The ICCs cross.

The Mantel-Haenszel (MH) procedure (Holland and Wainer 1993) is the most widely used DIF detection method. It stratifies examinees by ability (using total score as a proxy), computes odds ratios at each stratum, and combines them into a summary statistic:

\[ \alpha_{MH} = \frac{\sum_k A_k D_k / N_k}{\sum_k B_k C_k / N_k} \tag{5.25}\]

where \(A_k, B_k, C_k, D_k\) are the entries of the \(2 \times 2\) table (correct/incorrect \(\times\) group) at ability stratum \(k\), and \(N_k\) is the stratum size. An odds ratio of 1 indicates no DIF; values significantly different from 1 indicate DIF.

AI example. Consider a mathematics benchmark comparing open-source and proprietary models. Some items might rely on LaTeX-formatted mathematical expressions that proprietary models handle better due to training data composition, independent of mathematical reasoning ability. These items would exhibit DIF: the formatting advantage inflates scores for one group without reflecting the target construct.

5.11.2 DIF Analysis

The Mantel-Haenszel procedure described in Section 5.11.1 provides a nonparametric DIF test. A complementary approach uses logistic regression (Zumbo 1999):

\[ \log \frac{P(X_{ij}=1)}{P(X_{ij}=0)} = \beta_0 + \beta_1 \theta_j + \beta_2 g_j + \beta_3 (\theta_j \times g_j) \tag{5.26}\]

where \(\theta_j\) is the ability proxy (total score), \(g_j\) is group membership, and \(\beta_3\) captures the interaction. Testing \(\beta_2 = 0\) detects uniform DIF; testing \(\beta_3 = 0\) detects non-uniform DIF.

The following simulation demonstrates DIF analysis on a benchmark with planted DIF items.

The left panel shows the Mantel-Haenszel log-odds ratio for each item. Items near zero show no DIF; items with large absolute values are flagged. The planted DIF items (red) cluster at the extremes, while non-DIF items (blue) remain near zero. The right panel compares Item Characteristic Curves: for a non-DIF item, the two groups share the same ICC; for a DIF item, Group 1’s curve is shifted leftward (the item is easier for them at every ability level), indicating uniform DIF.

In practice, DIF analysis for AI benchmarks would compare model families (e.g., autoregressive vs. encoder-decoder), training paradigms (e.g., RLHF vs. base models), or architectural features (e.g., mixture-of-experts vs. dense transformers).

5.11.3 Dimensionality Assessment

A benchmark that claims to measure a single construct (e.g., “reasoning”) should exhibit a unidimensional factor structure: a single latent factor should explain the bulk of the covariance among items. If multiple factors are needed, the benchmark is actually measuring multiple constructs, and a single summary score conflates them.

The primary tool for dimensionality assessment is eigenvalue analysis of the item correlation matrix. In a unidimensional benchmark, the first eigenvalue should be substantially larger than the rest. However, raw eigenvalues can be misleading because even random data produces a non-trivial first eigenvalue.

Parallel analysis (Horn 1965) addresses this by comparing the observed eigenvalues to those from random data of the same dimensions. A factor is retained only if its eigenvalue exceeds the corresponding random-data eigenvalue. This is connected to the factor models discussed in Chapter 3: if the data require \(k\) factors, the benchmark is measuring \(k\) distinct constructs, not one.

The scree plot (left) shows two eigenvalues clearly exceeding the parallel analysis threshold (red dashed line), confirming two-dimensional structure. The loading plot (right) reveals the two-factor structure: verbal items (blue, 1–10) load on the first component, while quantitative items (red, 11–20) load on the second. A benchmark reporting a single “reasoning” score would conflate these two distinct abilities.

This analysis connects directly to the factor models introduced in Chapter 3. When dimensionality assessment reveals multiple factors, the appropriate response is to either (a) report subscores for each factor rather than a single composite, or (b) revise the benchmark to focus on a single construct.

5.11.4 Item Fit and Contamination Detection

Item-fit statistics from Rasch modeling (Wright and Masters 1982) measure how well individual items conform to the measurement model. Items that misfit—those whose response patterns deviate from model predictions—may indicate contamination, construct-irrelevant features, or other validity problems.

The two standard fit statistics are:

  • Infit (information-weighted mean square): Sensitive to unexpected responses near an item’s difficulty level. Infit emphasizes on-target responses and is less affected by extreme scores.
  • Outfit (unweighted mean square): Sensitive to unexpected responses far from an item’s difficulty level. Large outfit values indicate that the item behaves erratically for very high- or very low-ability models.

For a Rasch model with predicted probability \(P_{ij} = \text{expit}(\theta_j - \beta_i)\), the standardized residual is:

\[ z_{ij} = \frac{X_{ij} - P_{ij}}{\sqrt{P_{ij}(1 - P_{ij})}} \]

The outfit mean square for item \(i\) is:

\[ \text{Outfit}_i = \frac{1}{N} \sum_{j=1}^{N} z_{ij}^2 \]

and infit weights by the variance:

\[ \text{Infit}_i = \frac{\sum_{j=1}^{N} (X_{ij} - P_{ij})^2}{\sum_{j=1}^{N} P_{ij}(1 - P_{ij})} \]

Expected value under the model is 1.0. Values above 1.3 suggest underfit (more noise than the model predicts, possibly multidimensionality or noise). Values below 0.7 suggest overfit (too predictable, possibly item redundancy or dependency). Very high values (above 2.0) indicate serious misfit.

Contamination leaves a distinctive item-fit signature: contaminated items become much easier for models that have memorized them (producing unexpectedly correct responses from low-ability models) while remaining at their true difficulty for uncontaminated models. This inflates outfit statistics because the misfitting responses come from models far from the item’s difficulty.

The left panel shows the infit-outfit scatter: clean items (blue) cluster near the expected value of 1.0, while contaminated items (red) show inflated fit statistics, especially outfit. This is because contaminated models produce unexpectedly correct responses on these items, creating large residuals. The right panel shows the ICC for a single contaminated item: the observed proportions (red dots) consistently exceed the Rasch-expected curve (blue line), particularly at lower ability levels where the memorization effect is most visible.

Item-fit analysis is a powerful diagnostic because it does not require knowing which models are contaminated or which items are compromised. The Rasch model’s prediction residuals reveal anomalies regardless of their source.

5.11.5 The Multitrait-Multimethod Matrix

The Multitrait-Multimethod (MTMM) matrix, introduced by Campbell and Fiske (1959), provides a systematic framework for assessing both convergent and discriminant validity simultaneously. The idea is to measure multiple traits (constructs) using multiple methods (measurement formats) and examine the resulting correlation structure.

In AI evaluation:

  • Traits are the constructs of interest: reasoning, knowledge retrieval, creativity, instruction following, etc.
  • Methods are the benchmark formats: multiple-choice questions, open-ended generation, pairwise comparison, code completion, etc.

The MTMM matrix organizes correlations into four blocks:

  1. Monotrait-heteromethod (same trait, different methods): These should be high—this is convergent validity. If “reasoning” measured by MCQ and by open-ended generation produces highly correlated scores, the construct is robust across methods.
  2. Heterotrait-monomethod (different traits, same method): These should be lower than monotrait-heteromethod correlations—this is discriminant validity. If “reasoning-MCQ” correlates as highly with “knowledge-MCQ” as with “reasoning-open-ended,” then method variance dominates trait variance.
  3. Heterotrait-heteromethod (different traits, different methods): These should be lowest—neither shared trait nor shared method.
  4. Reliability diagonals (same trait, same method, repeated): These should be highest—they set the ceiling for validity.

Campbell and Fiske’s criteria for construct validity require:

  1. Convergent validity coefficients (monotrait-heteromethod) should be significantly different from zero and large enough to warrant further investigation.
  2. Convergent validity coefficients should be higher than the values in the same row and column of the heterotrait-heteromethod block.
  3. Convergent validity coefficients should be higher than the heterotrait-monomethod correlations.
  4. The pattern of trait intercorrelations should be consistent across method blocks.

The heatmap (left) shows the full MTMM correlation matrix with trait blocks separated by black lines. Strong convergent validity appears as high correlations along the off-diagonal within each trait block (same trait measured by different methods). The bar chart (right) confirms the expected ordering: convergent validity coefficients (green) are highest, discriminant-monomethod correlations (orange) are moderate (reflecting shared method variance), and discriminant-heteromethod correlations (red) are lowest. This ordering satisfies the Campbell-Fiske criteria for construct validity.

For AI evaluation, constructing an MTMM study requires measuring the same constructs with genuinely different methods. If all benchmarks use the same multiple-choice format, discriminant validity cannot be assessed because method variance is confounded with trait variance.

5.12 Building Trustworthy AI Evaluations

Having developed tools to detect validity problems (Section 5.11), we turn to building benchmarks with validity designed in from the start. Those tools double as construction aids: the multitrait-multimethod analysis (Section 5.11.5) supplies evidence for the nomological network by testing whether the expected convergent and discriminant relationships hold, and the item-level diagnostics of Section 5.11.2 and Section 5.11.4 confirm that individual items behave as the construct theory predicts.

5.12.1 Principled Item Construction and Revision

Constructing a valid benchmark requires more than assembling a large set of items. The process should be systematic:

  1. Define the construct: Write an explicit construct definition specifying what the benchmark is intended to measure and what it is not intended to measure. This definition serves as the foundation for content validity.

  2. Specify the domain: Create a detailed content specification (sometimes called a “test blueprint” or “table of specifications”) that enumerates the sub-domains, difficulty levels, and item types that should be represented.

  3. Generate candidate items: Items can be written by domain experts, sourced from existing materials, or generated synthetically using LLMs. For synthetic generation, the construct definition and difficulty targets serve as prompts.

  4. Pilot and analyze: Administer items to a representative sample of models. Compute item statistics: difficulty (\(p\)-value or IRT \(\beta\)), discrimination (\(a\) parameter or point-biserial correlation), and fit (infit/outfit from Section 5.11.4).

  5. Screen for validity threats: Run DIF analysis (Section 5.11.2) to check for group bias. Run dimensionality analysis (Section 5.11.3) to verify unidimensionality (if claimed). Check for contamination using fit statistics and public/private splits.

  6. Revise and iterate: Remove or revise items with poor statistics. Replace them with new items and re-pilot. This cycle continues until the item pool meets quality standards.

The following simulation illustrates the pilot-and-analyze phase.

The three panels show the item screening workflow. The difficulty-discrimination plot (left) reveals that low-discrimination items (red) cluster below the 0.2 threshold, providing little measurement information, while contaminated items (orange) show inflated proportion-correct values. The infit plot (center) highlights items that deviate from model predictions. The summary (right) shows the yield: how many candidate items survive screening. Items that are flagged should be revised or replaced before the benchmark is finalized.

5.12.2 Item-Design Principles

We close with six design principles that synthesize the concepts developed in this chapter.

1. Start with the construct, not the data. Define what is being measured before collecting items. Write an explicit construct definition and content specification. This prevents construct underrepresentation and makes content validity assessable.

2. Use multiple methods. Measure each construct with at least two different formats (e.g., MCQ and open-ended). This enables MTMM analysis and separates trait variance from method variance. A construct that is robust across methods has stronger validity evidence.

3. Check dimensionality. Before reporting a single composite score, verify that the benchmark is unidimensional (or report subscores for each dimension). Parallel analysis is cheap and can prevent misleading composites.

4. Screen for DIF. Whenever the model population includes distinct subgroups (architectural families, training paradigms, model sizes), run DIF analysis to detect items that function differently across groups.

5. Monitor for contamination. Maintain a held-out private item pool. Compare performance on public and private items. Use item-fit statistics to flag items where some models perform unexpectedly well.

6. Document validity evidence. Follow the claim-centered approach: for each claim the benchmark is intended to support, document the available validity evidence and its limitations. This transparency allows users to judge whether the benchmark supports their intended interpretation.

Worked example: evaluating summarization ability. Suppose the goal is to build a benchmark for evaluating how well language models summarize documents.

  • Construct definition: The ability to produce concise, accurate, and coherent summaries of source documents, preserving key information while omitting irrelevant details.
  • Content specification: Documents from 5 domains (news, scientific, legal, medical, conversational), 3 lengths (short, medium, long), 2 genres (informative, narrative).
  • Multiple methods: Evaluate summaries with (a) ROUGE against reference summaries, (b) LLM-as-judge on a rubric, (c) human expert ratings.
  • Dimensionality check: Factor analysis reveals two dimensions: factual accuracy and coherence. Report subscores for each.
  • DIF analysis: Compare models fine-tuned for summarization vs. general-purpose models. Flag items where domain-specific vocabulary creates an irrelevant advantage.
  • Contamination check: Use held-out documents not available online. Compare private vs. public document performance gaps.

5.13 Design principles

The theory developed in this chapter yields concrete design principles:

  1. Identify the dominant source of variability (Section 5.6). Run a G-study before committing to a design. If rater variability dominates, invest in more raters or clearer rubrics; if item-sampling variability dominates, use more items or D-optimal selection.

  2. Match items to the ability range of interest (Section 5.3). Conditional reliability varies across ability levels; ranking frontier models needs items hard enough to discriminate among them.

  3. Use multiple judges when scoring is subjective (Section 5.9). The D-study quantifies the gain: three independent judges with majority vote are substantially more reliable than one, and often cheaper than more items.

  4. Report confidence intervals, not just point estimates (Section 5.10). Two models whose scores overlap within \(\pm 2\,\text{SEM}\) may not be meaningfully different; leaderboards should mark which ranking differences are reliable.

  5. Control what can be controlled (Section 5.1). Score at temperature 0, randomize and average over prompt formats, standardize rubrics—these eliminate sources of variability rather than averaging over them.

  6. Separate reliability from validity (Chapter 1). High reliability is necessary but not sufficient: a benchmark can be reliable while systematically measuring the wrong thing (contamination, shortcut features, length bias).

  7. Audit items, don’t just score them (Section 3.3). Reliability diagnostics flag broken benchmark items—negative item–total correlations signal flipped keys, ambiguity, or grading bugs.

5.14 Bibliographic Notes

Classical test theory. The foundational reference is Lord and Novick (1968), which formalized CTT axiomatically. Cronbach (1951) introduced coefficient alpha, the most widely used reliability statistic. The Spearman–Brown formula dates to Spearman (1910) and Brown (1910) independently. For the connection between CTT and IRT-based reliability, see Chapter 7 of Hambleton and Swaminathan (1985).

Variance decomposition and component estimation. The decomposition of a function’s variance into additive main-effect and interaction terms—the functional ANOVA this chapter builds on—was introduced for \(U\)-statistics by Hoeffding (1948) and, in the global-sensitivity literature, by Sobol (2001); Efron and Stein (1981) give the closely related variance bound. Estimating the resulting variance components from data is the subject of Searle et al. (1992): the method-of-moments (expected mean squares) route traces to Henderson (1953), and restricted maximum likelihood (REML)—the modern default for unbalanced designs—to Patterson and Thompson (1971).

Generalizability theory. G-theory was developed by Cronbach et al. (1972), building on earlier work by Cronbach on the multiple sources of measurement error. Brennan (2001) provides the comprehensive modern treatment. Shavelson and Webb (1991) offers an accessible primer. The connection between G-theory and mixed-effects models makes modern software (e.g., lme4 in R) directly applicable to G-studies.

Inter-rater reliability. Cohen (1960) introduced Cohen’s kappa. Fleiss (1971) extended it to multiple raters. Krippendorff (2011) provides Krippendorff’s alpha, which handles missing data, ordinal scales, and any number of raters. For a comprehensive treatment with practical guidance, see Gwet (2014).

Model selection. For choosing a facet structure by predictive accuracy, Watanabe (2010) introduced the widely applicable information criterion (WAIC) and Vehtari et al. (2017) the practical PSIS-LOO cross-validation estimate; both approximate held-out likelihood while integrating out random effects (Section 5.7).

Factor-analytic reliability. McDonald’s \(\omega\) derives from the common-factor model (McDonald 1999); Revelle and Zinbarg (2009) compare \(\alpha\), \(\omega\), and the greatest lower bound, and develop \(\omega_h\) for general-factor saturation. The lower-bound family in which Cronbach’s \(\alpha\) is one member (\(\lambda_3\)) traces to Guttman (1945) (Section 5.8).

Ordinal and graded responses. The graded-response model is due to Samejima (1969); McCullagh (1980) developed the cumulative-link (proportional-odds) regression that underlies it, and Masters (1982) the partial-credit Rasch variant. Weighted kappa for ordered categories is Cohen (1968); with quadratic weights it estimates the intraclass correlation, linking ordinal agreement back to the generalizability coefficient (Section 5.9).

LLM-as-a-judge. Zheng et al. (2023) introduced the LLM-as-a-judge paradigm and the MT-Bench evaluation framework. Position bias was documented by Wang et al. (2023). Shankar et al. (2024) addresses the question of validating LLM judges against human preferences. For prompt sensitivity in evaluation, see Mizrahi et al. (2024). Biderman et al. (2024) discusses reproducibility challenges in language model evaluation more broadly.

Item-level diagnostics. The Multitrait-Multimethod matrix was introduced by Campbell and Fiske (1959) and remains one of the most systematic approaches to convergent and discriminant validity. Differential Item Functioning methods are covered comprehensively by Holland and Wainer (1993) and Zumbo (1999). Item-fit statistics for Rasch models are developed by Wright and Masters (1982), and parallel analysis for dimensionality assessment by Horn (1965).

5.15 Exercises

The first six exercises are conceptual; the rest involve derivation or implementation, marked \(\star\) (straightforward) to \(\star\star\star\) (challenging).

  1. Determinism and reliability. At temperature 0, a language model gives the same output every time. Does this mean the evaluation has perfect test–retest reliability? What sources of unreliability remain even with deterministic models?

  2. Reliability vs. number of items. Spearman–Brown predicts that longer tests are more reliable, and AI benchmarks often have thousands of items—far more than typical educational tests. Does this make them automatically reliable? What assumptions might be violated?

  3. LLM-as-a-judge tradeoffs. Using an LLM as a judge is cheaper than human annotation but introduces model-specific biases. Under what conditions is a less accurate but more reliable judge preferable? How does generalizability theory formalize the tradeoff?

  4. Reliability across model generations. When a new generation of models is released, a benchmark’s reliability can change as item difficulties shift relative to model abilities. How should benchmark developers monitor and maintain reliability over time?

  5. The reliability–validity tension. A benchmark of 1000 copies of the same easy question would have extremely high internal consistency but near-zero validity. How does this illustrate the distinction, and what design principles prevent it?

  6. The item-removal paradox. Spearman–Brown says shortening a test lowers reliability, yet dropping an item that correlates poorly with the rest can raise Cronbach’s \(\alpha\). Reconcile these: when does removing an item help reliability on net, and what does that reveal about the equal-quality (parallel-items) assumption behind Spearman–Brown? (The item-quality diagnostics that flag such items are developed in Section 3.2.2.)

  7. Deriving reliability (\(\star\)). Starting from the CTT model \(X = T + E\) with the standard assumptions (\(\mathbb{E}[E] = 0\), \(\operatorname{Cov}(T, E) = 0\)), derive \(\rho_{XX'} = \sigma^2_T / \sigma^2_X\), and show reliability equals the squared correlation between observed and true scores, \(\rho_{XX'} = r^2_{XT}\).

  8. Spearman–Brown (\(\star\star\)). Derive the prophecy formula: a test of length \(KM\) built from \(K\) parallel forms of a length-\(M\) test with reliability \(\rho\) has reliability \(\rho_K = K\rho / (1 + (K-1)\rho)\). What happens as \(K \to \infty\)?

  9. Generalizability coefficient (\(\star\star\)). In a \(p \times i \times r\) design, derive the generalizability coefficient \(G\), and show that increasing \(n_r\) reduces the contribution of rater variance but not item variance, and vice versa.

  10. Cronbach’s \(\alpha\) as a mean split-half (\(\star\star\)). Show that \(\alpha\) is the mean of all possible split-half reliabilities (after Spearman–Brown correction) for an \(M\)-item test. Under what conditions is \(\alpha\) a lower bound for reliability?

  11. G-study implementation (\(\star\star\)). Implement a G-study for a model\(\times\)item\(\times\)rater design: given a 3-way array \(X[p,i,r]\), estimate all variance components by expected mean squares, and verify on simulated data that the estimates recover the truth.

  12. LLM-judge simulation (\(\star\star\)). Simulate an LLM-as-a-judge evaluation with 5 judges of varying quality (agreement with a gold standard from 0.70 to 0.90). Compute (a) pairwise Cohen’s \(\kappa\) and (b) majority-vote accuracy for 1, 3, and 5 judges. How does reliability change if all judges share a systematic bias, such as preferring longer responses?

  13. D-study optimizer (\(\star\star\star\)). Given variance components and costs per item (\(c_i\)) and per rater–item pair (\(c_r\)), find the cheapest design \((n_i, n_r)\) achieving \(G \geq G^\star\). Solve it as a constrained optimization and visualize the cost surface.

  14. Chatbot Arena reliability (\(\star\star\)). The Chatbot Arena uses a single human preference judgment per comparison. Using generalizability theory, analyze the reliability of the resulting Elo ratings: how many comparisons per model pair would reach \(G \geq 0.90\), and what are the cost implications?

  15. Ordinal reliability: interval vs. latent (\(\star\star\)). Simulate 1–5 judge ratings from the graded-response model of Equation 5.20 with a judge scale-usage effect. Compute the generalizability coefficient (a) by treating the scale as interval (Section 5.9) and (b) from the fitted latent threshold model. When do the two diverge, and how does the gap depend on the number of categories \(K\) and the size of the scale-usage effect?

  16. Weighted kappa and the ICC (\(\star\star\)). Show that quadratic-weighted kappa equals the intraclass correlation for two raters on an interval scale. Then simulate ordinal ratings and verify numerically that unweighted kappa understates agreement while quadratic-weighted kappa tracks the Pearson correlation between raters. Why does this make quadratic-weighted kappa the natural rater-facet generalizability coefficient for ordinal data?

  17. McDonald’s \(\omega\) from first principles (\(\star\star\)). Starting from the congeneric one-factor model \(X_{ij} = \nu_j + \lambda_j\theta_i + \varepsilon_{ij}\) (Equation 5.17), derive the total-score reliability \(\omega = (\sum_j \lambda_j)^2\sigma_\theta^2 / [(\sum_j \lambda_j)^2\sigma_\theta^2 + \sum_j \psi_j]\). Simulate items with unequal loadings and verify numerically that \(\alpha < \omega \approx\) the true reliability; then set all loadings equal and confirm \(\omega\) collapses to the Spearman–Brown step-up of the single-item reliability (Equation 5.4).

  18. Does the interaction earn its place? (\(\star\star\star\)). Generate replicated binary responses from a model with a person-by-item interaction (\(\sigma^2_{pi} > 0\)) and from one without. For each, fit a main-effects model and a saturated (per-cell) model and compare them by held-out log-likelihood on (a) masked cells and (b) masked trials. Show that the iid interaction is selectable only from held-out trials, and only once replication is sufficient (Section 5.7). Why does a held-out-cell comparison fail to detect it?

  19. Empirical Bayes vs. full Bayes (\(\star\star\)). With one binary trial per cell and abilities \(\theta_i \sim \mathcal{N}(0, \sigma_p^2)\), estimate \(\sigma_p\) two ways: by marginal maximum likelihood (the evidence peak) and by reporting the full posterior mean and 95% credible interval under a flat hyperprior (Section 5.4). Explain why the two point estimates differ and why the credible interval is wide when there is only one trial per cell.

Item-level diagnostics. These accompany the diagnostics of Section 5.11.

  1. Prove that if an item exhibits uniform DIF between two groups (i.e., the item is uniformly easier for one group at all ability levels), the standard Rasch model \(P(X_{ij}=1) = \text{expit}(\theta_j - \beta_i)\) cannot hold simultaneously for both groups with the same item parameter \(\beta_i\). What does this imply about the meaning of “item difficulty” when DIF is present?

  2. Consider a 2-trait \(\times\) 2-method MTMM design. Show that if all convergent validity coefficients equal 1 (perfect convergent validity) and all heterotrait correlations equal 0 (perfect discriminant validity), then the method factors contribute zero variance. What does this imply about the interpretability of scores in practice, where method variance is nonzero?

  3. Define infit mean square as \(\text{Infit}_i = \sum_j (X_{ij} - P_{ij})^2 / \sum_j P_{ij}(1-P_{ij})\). Suppose a fraction \(\gamma\) of models have memorized item \(i\) and respond correctly with probability 1 regardless of ability. Derive the expected infit as a function of \(\gamma\) and the item difficulty \(\beta_i\). For what values of \(\gamma\) does the infit exceed the conventional 1.3 threshold?

  4. Download a publicly available AI benchmark dataset (e.g., MMLU, HellaSwag, or ARC). Divide the models into two groups (e.g., models above and below 7B parameters). Conduct a Mantel-Haenszel DIF analysis and report which items, if any, show significant DIF. Interpret the results: does the DIF reflect genuine ability differences or construct-irrelevant features?

  5. Simulate a benchmark with 3 underlying factors (e.g., verbal, quantitative, and spatial reasoning) using a factor model from Chapter 3. Generate responses for 300 models on 30 items (10 per factor). Implement parallel analysis to determine the correct dimensionality. How does the accuracy of parallel analysis depend on the factor correlations and sample size?

  6. Using the MTMM simulation framework from Section 5.11.5, vary the relative strength of trait loadings vs. method loadings. At what ratio do the Campbell-Fiske criteria begin to fail? Plot the convergent and discriminant validity coefficients as a function of the trait-to-method variance ratio.

  7. Design an MTMM study for evaluating “mathematical reasoning” in large language models. Specify the traits, methods, and the expected pattern of convergent and discriminant correlations.

References

Biderman, Stella et al. 2024. “Lessons from the Trenches on Reproducible Evaluation of Language Models.” arXiv Preprint arXiv:2405.14782.
Brennan, Robert L. 2001. Generalizability Theory. Springer.
Brown, William. 1910. “Some Experimental Results in the Correlation of Mental Abilities.” British Journal of Psychology 3 (3): 296–322.
Campbell, Donald T., and Donald W. Fiske. 1959. “Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix.” Psychological Bulletin 56 (2): 81–105.
Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46.
Cohen, Jacob. 1968. “Weighted Kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit.” Psychological Bulletin 70 (4): 213–20.
Cronbach, Lee J. 1951. “Coefficient Alpha and the Internal Structure of Tests.” Psychometrika 16 (3): 297–334.
Cronbach, Lee J., Goldine C. Gleser, Harinder Nanda, and Nageswari Rajaratnam. 1972. The Dependability of Behavioral Measurements. Wiley.
Efron, Bradley, and Charles Stein. 1981. “The Jackknife Estimate of Variance.” The Annals of Statistics 9 (3): 586–96.
Fleiss, Joseph L. 1971. “Measuring Nominal Scale Agreement Among Many Raters.” Psychological Bulletin 76 (5): 378–82.
Guttman, Louis. 1945. “A Basis for Analyzing Test-Retest Reliability.” Psychometrika 10 (4): 255–82.
Gwet, Kilem Li. 2014. Handbook of Inter-Rater Reliability. 4th ed. Advanced Analytics.
Hambleton, Ronald K., and Hariharan Swaminathan. 1985. Item Response Theory: Principles and Applications. Kluwer-Nijhoff.
Henderson, Charles R. 1953. “Estimation of Variance and Covariance Components.” Biometrics 9 (2): 226–52.
Hoeffding, Wassily. 1948. “A Class of Statistics with Asymptotically Normal Distribution.” The Annals of Mathematical Statistics 19 (3): 293–325.
Holland, Paul W., and Howard Wainer. 1993. Differential Item Functioning. Lawrence Erlbaum Associates.
Horn, John L. 1965. “A Rationale and Test for the Number of Factors in Factor Analysis.” Psychometrika 30 (2): 179–85.
Krippendorff, Klaus. 2011. “Computing Krippendorff’s Alpha-Reliability.” Departmental Papers (ASC).
Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Addison-Wesley.
Masters, Geoff N. 1982. “A Rasch Model for Partial Credit Scoring.” Psychometrika 47 (2): 149–74.
McCullagh, Peter. 1980. “Regression Models for Ordinal Data.” Journal of the Royal Statistical Society: Series B (Methodological) 42 (2): 109–27.
McDonald, Roderick P. 1999. Test Theory: A Unified Treatment. Lawrence Erlbaum Associates.
Mizrahi, Moran et al. 2024. “State of What Art? A Call for Multi-Prompt LLM Evaluation.” Transactions of the Association for Computational Linguistics.
Patterson, H. Desmond, and Robin Thompson. 1971. “Recovery of Inter-Block Information When Block Sizes Are Unequal.” Biometrika 58 (3): 545–54.
Revelle, William, and Richard E. Zinbarg. 2009. “Coefficients Alpha, Beta, Omega, and the Glb: Comments on Sijtsma.” Psychometrika 74 (1): 145–54.
Samejima, Fumiko. 1969. “Estimation of Latent Ability Using a Response Pattern of Graded Scores.” Psychometrika Monograph Supplement 34 (4, Pt. 2): 1–97.
Searle, Shayle R., George Casella, and Charles E. McCulloch. 1992. Variance Components. Wiley.
Shankar, Shreya et al. 2024. “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.” arXiv Preprint arXiv:2404.12272.
Shavelson, Richard J., and Noreen M. Webb. 1991. Generalizability Theory: A Primer. Sage.
Sobol, Ilya M. 2001. “Global Sensitivity Indices for Nonlinear Mathematical Models and Their Monte Carlo Estimates.” Mathematics and Computers in Simulation 55 (1–3): 271–80.
Spearman, Charles. 1910. “Correlation Calculated from Faulty Data.” British Journal of Psychology 3 (3): 271–95.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
Wang, Peiyi et al. 2023. “Large Language Models Are Not Fair Evaluators.” arXiv Preprint arXiv:2305.17926.
Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11: 3571–94.
Wright, Benjamin D., and Geofferey N. Masters. 1982. Rating Scale Analysis: Rasch Measurement. MESA Press.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Advances in Neural Information Processing Systems.
Zumbo, Bruno D. 1999. A Handbook on the Theory and Methods of Differential Item Functioning (DIF). National Defense Headquarters.