5 Reliability
By the end of this chapter, you will be able to:
- Define reliability in the context of AI evaluation and distinguish it from validity.
- Apply Classical Test Theory to decompose observed scores into signal and noise.
- Compute reliability coefficients (Cronbach’s alpha, split-half, test-retest) for AI benchmarks.
- Explain Generalizability Theory and use it to quantify multiple sources of measurement error (items, raters, occasions).
- Analyze sources of noise in AI evaluation: annotation variability, prompt sensitivity, model stochasticity, and benchmark sampling.
- Evaluate the reliability of LLM-as-a-judge evaluation protocols.
- Design evaluation procedures that control for identified sources of unreliability.
This chapter can be covered in 3 lectures (75-90 minutes each):
Lecture 1: Signal and Noise in AI Measurement
- When leaderboards flicker: motivating reliability (15 min)
- Classical Test Theory: true score + error (25 min)
- Reliability coefficients for AI benchmarks (25 min)
- Hands-on: split-half reliability and Cronbach’s alpha (10 min)
Lecture 2: Generalizability Theory
- Sources of noise: taxonomy for AI evaluation (15 min)
- From CTT to G-theory: multiple error facets (25 min)
- G-studies and D-studies for benchmark design (25 min)
- Hands-on: variance decomposition and D-study optimization (10 min)
Lecture 3: Reliability in Practice
- LLM-as-a-judge reliability (25 min)
- Inter-rater agreement: kappa and beyond (15 min)
- Reliability under IRT: conditional precision (20 min)
- Design principles for reliable AI evaluation (15 min)
This chapter introduces Classical Test Theory notation: \(X_{ij} = T_i + E_{ij}\) (observed = true + error), \(\rho_{XX'}\) (reliability), \(\alpha\) (Cronbach’s alpha), \(\sigma^2_p, \sigma^2_i, \sigma^2_r\) (G-theory variance components), and \(\kappa\) (Cohen’s kappa). See the notation appendix for the complete notation reference.
5.1 When Leaderboards Flicker
Imagine you evaluate ten language models on a reasoning benchmark. You report the rankings to your team. The next day, a colleague re-runs the same evaluation and gets a different ranking. The top two models swap places. A model that was fourth is now second. Nothing about the models changed—only the measurement procedure differed in small ways: a slightly different prompt template, a different random seed for sampling, a different subset of annotators scoring the open-ended responses.
This is the reliability problem. Before we can ask whether a benchmark measures what it claims to measure (the validity question, addressed in Chapter 6), we must first ask a more basic question: does the evaluation give the same answer when applied to the same thing twice?
Reliability is not accuracy. A thermometer that consistently reads two degrees too high is reliable—it gives the same answer every time—even though it is not accurate. Conversely, a thermometer that fluctuates randomly between readings is unreliable, and therefore cannot be accurate in any useful sense. In measurement science, reliability refers to the consistency or reproducibility of a measurement procedure.
If an evaluation procedure produces different results every time it is applied to the same model, it cannot be measuring anything about the model. Unreliable measurements are noise, regardless of how carefully the benchmark was designed. Chapter 6 addresses validity—whether we are measuring the right thing. This chapter addresses the prior question: are we measuring anything at all?
The sources of inconsistency in AI evaluation are diverse: prompt formatting, temperature sampling, annotator disagreement, benchmark item selection, scoring rubric ambiguity, and even API version drift. Each source contributes noise that can distort model rankings. This chapter provides the theoretical tools to quantify, decompose, and control these sources of noise, drawing on Classical Test Theory and Generalizability Theory from the measurement sciences, applied throughout to the specific challenges of evaluating AI systems.
5.2 Classical Test Theory for AI Evaluation
5.2.1 The True Score Model
Recall from Chapter 2 the decomposition \(X = T + E\): every observed score is the sum of a true score and an error. We now develop this formally.
Let \(X_{ij}\) be the observed score for model \(i\) on measurement occasion \(j\) (where “occasion” might mean a particular run, a particular prompt template, or a particular annotator). The true score \(T_i\) is defined as the expectation over hypothetical replications:
\[ T_i = \mathbb{E}_j[X_{ij}] \]
The error \(E_{ij} = X_{ij} - T_i\) satisfies three axioms:
- Zero mean: \(\mathbb{E}_j[E_{ij}] = 0\) for all \(i\)
- Uncorrelated with true scores: \(\text{Cov}(T_i, E_{ij}) = 0\)
- Uncorrelated across occasions: \(\text{Cov}(E_{ij}, E_{ik}) = 0\) for \(j \neq k\)
These assumptions give us the fundamental variance decomposition:
\[ \sigma^2_X = \sigma^2_T + \sigma^2_E \]
Reliability is the proportion of observed score variance that is true score variance:
\[ \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E} = 1 - \frac{\sigma^2_E}{\sigma^2_X} \]
Equivalently, reliability is the squared correlation between observed scores and true scores: \(\rho_{XX'} = r^2_{XT}\).
In AI evaluation, the “true score” \(T_i\) represents the stable capability of model \(i\) on the construct being measured. The “error” \(E_{ij}\) encompasses everything that causes the observed score to deviate from this stable capability: prompt variation, sampling randomness, annotator disagreement, item selection, and so on.
This connects directly to the reliability formula used in Chapter 3’s CAT stopping rule: \(R = 1 - \sigma^2_{\text{post}} / \sigma^2_{\text{prior}}\). The IRT version and CTT version express the same idea—both measure the fraction of variance that is signal rather than noise.
5.2.2 Standard Error of Measurement
The standard error of measurement (SEM) translates reliability into a confidence interval around individual scores:
\[ \text{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}} \]
For AI benchmarks, SEM has direct practical implications. Suppose a benchmark reports accuracy scores with a standard deviation of \(\sigma_X = 5\) percentage points and reliability \(\rho_{XX'} = 0.90\). Then:
\[ \text{SEM} = 5 \sqrt{1 - 0.90} \approx 1.58 \text{ points} \]
A 95% confidence interval around any model’s score spans approximately \(\pm 2 \times \text{SEM} \approx \pm 3.2\) points. Two models whose scores differ by less than this margin may not be meaningfully different—their ranking could easily reverse on a replication. This is a common situation on crowded leaderboards where dozens of models cluster within a few percentage points of each other.
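The worked example above is a two-line computation (the score of 82 around which the interval is centered is an arbitrary illustrative value):

```python
import math

def sem(sigma_x: float, reliability: float) -> float:
    """Standard error of measurement: sigma_X * sqrt(1 - rho)."""
    return sigma_x * math.sqrt(1 - reliability)

# The worked example from the text: sd = 5 points, reliability = 0.90
s = sem(5.0, 0.90)
lo, hi = 82.0 - 2 * s, 82.0 + 2 * s  # ~95% band around a hypothetical score of 82
print(f"SEM = {s:.2f}, 95% CI ~ [{lo:.1f}, {hi:.1f}]")
```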
5.2.3 Reliability Coefficients
Since true scores are unobservable, reliability must be estimated from data. Three classical approaches correspond to different experimental designs, each suited to different AI evaluation scenarios.
Test-retest reliability. Administer the same benchmark to the same models on two occasions and correlate the scores. For AI: re-run the evaluation with temperature \(> 0\) (different random seeds), or have the same annotators score the same outputs on two occasions. The Pearson correlation between the two sets of scores estimates \(\rho_{XX'}\). This method is simple but conflates all sources of occasion-to-occasion variability into a single number.
Split-half reliability. Randomly divide the benchmark items into two halves, score each half separately, and correlate the half-scores. The Spearman-Brown correction adjusts for the fact that each half is only half as long as the full test:
\[ \rho_{XX'} = \frac{2 r_{12}}{1 + r_{12}} \]
where \(r_{12}\) is the correlation between the two half-scores. For AI evaluation, this is easy to compute from existing data—no re-running required.
Internal consistency (Cronbach’s alpha). The most widely used reliability coefficient generalizes split-half reliability by averaging over all possible splits:
\[ \alpha = \frac{M}{M - 1} \left(1 - \frac{\sum_{j=1}^M \sigma^2_j}{\sigma^2_X}\right) \tag{5.1}\]
where \(M\) is the number of items, \(\sigma^2_j\) is the variance of item \(j\) across models, and \(\sigma^2_X\) is the variance of total scores.
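Equation 5.1 is straightforward to compute from a models-by-items score matrix. A minimal sketch, using a tiny illustrative dataset (population variances throughout, applied consistently):

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha (Equation 5.1) for a models x items score matrix.

    scores[n][j] is the score of model n on item j (e.g. 0/1 correctness).
    """
    m = len(scores[0])  # number of items
    item_vars = [statistics.pvariance([row[j] for row in scores])
                 for j in range(m)]
    totals = [sum(row) for row in scores]  # total score per model
    total_var = statistics.pvariance(totals)
    return (m / (m - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: 4 models x 3 items, where stronger models tend to solve more items
X = [[0, 0, 0],
     [1, 1, 0],
     [1, 1, 1],
     [1, 1, 1]]
print(f"alpha = {cronbach_alpha(X):.3f}")
```

For this toy matrix the item variances sum to 0.625 while the total-score variance is 1.5, giving \(\alpha = 0.875\): the items covary strongly relative to their individual noise.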
Alpha measures how much items covary relative to total variance. High alpha means the items are measuring something in common. But alpha does not tell you: (a) whether that common thing is unidimensional, (b) whether that thing is the construct you intended, or (c) whether the test is free of systematic bias.
A benchmark with \(\alpha = 0.95\) might be reliably measuring training data contamination rather than the intended capability. High reliability is necessary but not sufficient for good measurement.
The Spearman-Brown prophecy formula predicts how reliability changes with test length. If a test of \(M\) items has reliability \(\rho\), a test of \(KM\) items (made of \(K\) parallel forms) has reliability:
\[ \rho_K = \frac{K \rho}{1 + (K-1)\rho} \tag{5.2}\]
This formula is practically important: it tells you how many items you need to achieve a target reliability. If your 50-item benchmark has \(\rho = 0.80\), you need \(K = 2\) (100 items) to reach \(\rho = 0.89\), or \(K = 4\) (200 items) to reach \(\rho = 0.94\).
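Equation 5.2 and its inversion (solving for the length multiplier needed to hit a target reliability) can be sketched as follows, reproducing the worked numbers above:

```python
def prophecy(rho: float, k: float) -> float:
    """Spearman-Brown: reliability of a test lengthened by factor k (Eq. 5.2)."""
    return k * rho / (1 + (k - 1) * rho)

def length_factor_for(rho: float, target: float) -> float:
    """Invert Eq. 5.2: the length multiplier needed to reach a target reliability."""
    return target * (1 - rho) / (rho * (1 - target))

# The worked example from the text: a 50-item benchmark with rho = 0.80
print(f"K=2 -> {prophecy(0.80, 2):.3f}")   # 100 items
print(f"K=4 -> {prophecy(0.80, 4):.3f}")   # 200 items
print(f"need K = {length_factor_for(0.80, 0.90):.2f} to reach rho = 0.90")
```

Doubling the test takes \(\rho\) from 0.80 to \(8/9 \approx 0.889\); reaching 0.90 requires a length factor of 2.25, i.e. about 113 items.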
5.3 Sources of Noise in AI Evaluation
Classical Test Theory lumps all error into a single term \(E\). To improve evaluation reliability, we need to understand where the noise comes from. This section provides a taxonomy of noise sources specific to AI evaluation.
5.3.1 A Taxonomy of Error Sources
We organize noise sources by the level at which they operate:
| Source | Level | CTT Interpretation | Example |
|---|---|---|---|
| Sampling stochasticity | Within-model | Test-retest error | Different outputs at temperature > 0 |
| Prompt sensitivity | Within-model | Format effect | Different prompt templates yield different scores |
| Item sampling | Within-benchmark | Parallel forms error | Different subsets of questions |
| Annotator disagreement | Within-scoring | Rater error | Different humans score the same output differently |
| LLM judge variability | Within-scoring | Rater error | Different LLM judges disagree |
| Rubric ambiguity | Within-scoring | Systematic + random | Vague criteria interpreted differently |
| API/version drift | Across-time | Test-retest error | Model weights updated silently |
Each source contributes to \(\sigma^2_E\), but they do so in different ways and require different remedies. Prompt sensitivity and annotator disagreement may be the dominant sources for open-ended evaluation, while item sampling variability dominates for fixed-format benchmarks.
5.3.2 Prompt Sensitivity
Even deterministic models (temperature = 0) produce different responses under different prompt formats. Adding “Let’s think step by step” to a reasoning prompt, changing the answer format from multiple choice to open-ended, or even rearranging the few-shot examples can shift model rankings (Mizrahi et al. 2024). This is analogous to test format effects in educational testing—the same knowledge is assessed differently by a multiple-choice exam versus a free-response exam.
Prompt sensitivity threatens reliability because the “true score” depends on which prompt template is used. If model \(A\) outperforms model \(B\) on one template but not another, the models’ relative standing is prompt-dependent, not ability-dependent.
5.3.3 Sampling Stochasticity
When temperature \(> 0\), the same model gives different outputs to the same prompt on each run. This is the closest analogue to classical test-retest error: the model’s response is a random draw from its distribution, and different draws yield different scores. The within-model variance across runs directly inflates \(\sigma^2_E\).
For binary items (correct/incorrect), this source is minimal—most models give the same answer deterministically. But for open-ended generation (summaries, code, essays), stochastic outputs can receive different quality scores across runs.
5.3.4 Annotator and Judge Variability
For tasks where responses cannot be automatically scored—creative writing, safety evaluation, open-ended reasoning—human annotators or LLM judges must assess quality. Different raters often disagree, and this disagreement is a major source of unreliability.
For two raters scoring the same set of items on a categorical scale, Cohen’s kappa (Cohen 1960) adjusts for chance agreement:
\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]
where \(p_o\) is the observed proportion of agreement and \(p_e\) is the expected proportion of agreement by chance. \(\kappa = 1\) indicates perfect agreement; \(\kappa = 0\) indicates agreement no better than chance; \(\kappa < 0\) indicates systematic disagreement.
For more than two raters, Fleiss’s kappa (Fleiss 1971) extends this to multiple annotators. Krippendorff’s alpha (Krippendorff 2011) provides a more general measure that handles missing data, ordinal scales, and any number of raters.
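A minimal implementation of Cohen's kappa, with a small illustrative pass/fail annotation example:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (categorical)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)

# Two annotators labelling 10 outputs as "pass"/"fail" (illustrative data)
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.3f}")
```

Here the raters agree on 8 of 10 items (\(p_o = 0.8\)), but with both marginals at 60/40 the chance agreement is \(p_e = 0.52\), so \(\kappa \approx 0.58\): noticeably less impressive than raw agreement suggests.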
5.3.5 Benchmark Sampling Variability
The benchmark itself is a sample from a larger domain of possible items. Different samples yield different model rankings. A 200-item coding benchmark is a sample from the space of all possible coding problems; another sample of 200 problems from the same domain would likely produce somewhat different scores.
This is the item sampling component of reliability, and it is the source that Equation 5.2 addresses: longer tests sample the domain more thoroughly and are therefore more reliable. It also connects to Chapter 3’s design perspective: D-optimal item selection minimizes this variability by choosing a maximally informative sample.
5.4 Generalizability Theory
5.4.1 From One Error to Many
CTT gives a single reliability number, collapsing all sources of error into \(\sigma^2_E\). This is useful for a quick summary but insufficient when you need to improve reliability—you cannot fix what you have not diagnosed. If your evaluation is unreliable because annotators disagree, adding more items will not help. If it is unreliable because of item sampling variability, adding more annotators will not help.
Generalizability Theory (G-theory), developed by Cronbach et al. (1972), addresses this limitation by modeling multiple facets of measurement simultaneously. Where CTT asks “how reliable is this test?”, G-theory asks “how much variance is attributable to each source, and how can we design a measurement procedure that minimizes error?”
5.4.2 The G-Theory Framework
Consider an evaluation where \(N\) models are tested on \(M\) items, each scored by \(R\) raters. The observed score for model \(p\), item \(i\), rater \(r\) is decomposed as:
\[ X_{pir} = \mu + \alpha_p + \beta_i + \gamma_r + (\alpha\beta)_{pi} + (\alpha\gamma)_{pr} + (\beta\gamma)_{ir} + \epsilon_{pir} \]
where \(\mu\) is the grand mean and each Greek letter represents a random effect:
- \(\alpha_p\): model effect (the signal we want to measure), variance \(\sigma^2_p\)
- \(\beta_i\): item effect (some items are harder), variance \(\sigma^2_i\)
- \(\gamma_r\): rater effect (some raters are lenient), variance \(\sigma^2_r\)
- \((\alpha\beta)_{pi}\): model-by-item interaction (item \(i\) is especially hard for model \(p\)), variance \(\sigma^2_{pi}\)
- \((\alpha\gamma)_{pr}\): model-by-rater interaction (rater \(r\) is especially harsh on model \(p\)), variance \(\sigma^2_{pr}\)
- \((\beta\gamma)_{ir}\): item-by-rater interaction, variance \(\sigma^2_{ir}\)
- \(\epsilon_{pir}\): residual (everything else), variance \(\sigma^2_{pir,e}\)
The AI evaluation translation is direct:
| G-theory term | AI evaluation analogue |
|---|---|
| Person (\(p\)) | Model being evaluated |
| Item (\(i\)) | Benchmark question or task |
| Rater (\(r\)) | Human annotator or LLM judge |
| Occasion (\(o\)) | Run, random seed, or prompt variant |
The generalizability coefficient is the ratio of true score variance to true-plus-relative-error variance:
\[ G = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{\text{rel}}} \]
where the relative error variance for a design with \(n_i\) items and \(n_r\) raters is:
\[ \sigma^2_{\text{rel}} = \frac{\sigma^2_{pi}}{n_i} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{pir,e}}{n_i \cdot n_r} \]
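Once the variance components are estimated, the generalizability coefficient is a one-line computation. The component values below are illustrative, not from a real G-study:

```python
def g_coefficient(var_p, var_pi, var_pr, var_res, n_i, n_r):
    """Generalizability coefficient for a p x i x r design.

    var_p: model variance (signal); var_pi, var_pr: model-by-item and
    model-by-rater interaction variances; var_res: residual variance.
    """
    rel_error = var_pi / n_i + var_pr / n_r + var_res / (n_i * n_r)
    return var_p / (var_p + rel_error)

# Illustrative variance components: 100 items scored by 3 raters
g = g_coefficient(var_p=4.0, var_pi=6.0, var_pr=1.0, var_res=8.0,
                  n_i=100, n_r=3)
print(f"G = {g:.3f}")
```

Note how the item facet is divided by 100 but the rater facet only by 3: with these components, rater variance is the dominant surviving error term despite being the smallest raw component.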
5.4.3 G-Studies and D-Studies
G-theory separates the estimation and design phases:
G-study (Generalizability study): Collect data from a fully or partially crossed design and estimate all variance components. This tells you where the noise is—the diagnostic step. Variance components are typically estimated using the method of expected mean squares from ANOVA or via restricted maximum likelihood (REML).
D-study (Decision study): Given the variance components from a G-study, compute the generalizability coefficient for a planned evaluation design. This answers questions like: “How many items and raters do I need to achieve \(G \geq 0.90\)?”
The key insight is that you can trade off facets against each other. If rater variance dominates (\(\sigma^2_{pr}\) is large), increasing \(n_r\) is more cost-effective than increasing \(n_i\). If item sampling variance dominates (\(\sigma^2_{pi}\) is large), more items help more than more raters. The D-study makes these tradeoffs explicit and quantitative.
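A D-study of this kind can be run as a simple grid search. The sketch below assumes an illustrative cost model (a fixed cost per item plus a cost per item-rater pair) and illustrative variance components; it exploits the fact that \(G\) is increasing in \(n_i\) for fixed \(n_r\):

```python
def g_coeff(var_p, var_pi, var_pr, var_res, n_i, n_r):
    """Generalizability coefficient for a p x i x r design."""
    rel = var_pi / n_i + var_pr / n_r + var_res / (n_i * n_r)
    return var_p / (var_p + rel)

def cheapest_design(var_p, var_pi, var_pr, var_res,
                    cost_item, cost_rating, target=0.90,
                    max_items=2000, max_raters=20):
    """Grid-search D-study: cheapest (n_i, n_r) achieving G >= target."""
    best = None
    for n_r in range(1, max_raters + 1):
        for n_i in range(1, max_items + 1):
            if g_coeff(var_p, var_pi, var_pr, var_res, n_i, n_r) >= target:
                cost = n_i * cost_item + n_i * n_r * cost_rating
                if best is None or cost < best[0]:
                    best = (cost, n_i, n_r)
                break  # G increases with n_i, so the first feasible n_i is cheapest
    return best

best = cheapest_design(var_p=4.0, var_pi=6.0, var_pr=1.0, var_res=8.0,
                       cost_item=1.0, cost_rating=0.5)
print(f"cost={best[0]:.1f}, items={best[1]}, raters={best[2]}")
```

With these components, one or two raters can never reach \(G \geq 0.90\) no matter how many items are added (the rater term \(\sigma^2_{pr}/n_r\) alone exceeds the error budget), which is exactly the kind of constraint a D-study makes visible.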
5.5 LLM-as-a-Judge Reliability
5.5.1 The Rise of Automated Scoring
As AI systems are increasingly evaluated on open-ended tasks—summarization, creative writing, instruction following, safety—human annotation becomes the bottleneck. LLMs are now widely used as judges, scoring model outputs on rubric-based criteria or making pairwise preference judgments (Zheng et al. 2023). This creates a new measurement instrument whose reliability properties must be understood.
In the G-theory framework, an LLM judge is a rater. It has its own biases (\(\gamma_r\)), interacts differently with different models (\(\alpha\gamma_{pr}\)), and may score the same output differently on repeated calls (if temperature \(> 0\)). The reliability analysis from the previous section applies directly, with the LLM judge playing the role of the human annotator.
5.5.2 Inter-Rater Agreement
Three questions characterize LLM-as-a-judge reliability:
Inter-judge reliability: Do different LLM judges agree with each other? If GPT-4, Claude, and Gemini all judge the same outputs, do they produce the same rankings?
Human-LLM agreement: Does the LLM judge agree with human annotators? High agreement suggests the LLM captures human preferences; low agreement means the LLM measures something different.
Intra-judge consistency: Does the same LLM judge produce the same score when asked twice? At temperature 0 this should be deterministic, but prompt-order effects and other sources of instability can create variability.
5.5.3 Position Bias and Systematic Error
LLM judges exhibit position bias: in pairwise comparisons, they tend to prefer the response presented first (or last, depending on the model) (Wang et al. 2023). This is a systematic error, not random noise.
CTT’s error term \(E\) captures random error: zero mean, uncorrelated across occasions. Systematic errors—like position bias or a judge that always prefers longer responses—violate the zero-mean assumption, so they are absorbed into the true score rather than the error term. A biased judge can be highly reliable (consistent in its bias) yet invalid.
Reliability analysis detects random error. Detecting systematic error requires the validity tools from Chapter 6. A high inter-rater agreement between two LLM judges does not mean they are correct—they might share the same systematic bias.
5.5.4 Designing Reliable Judge Protocols
Based on the reliability framework, practical strategies for improving LLM-as-a-judge reliability include:
Multiple judges with aggregation. Using \(n_r > 1\) judges and taking the majority vote or average score reduces rater-related error variance by a factor of \(1/n_r\), as the G-theory D-study makes explicit. Three independent judges are substantially more reliable than one.
Position randomization. Present responses in both orders and average the judgments. This eliminates position bias, converting a systematic error into cancelled noise.
Deterministic scoring. Use temperature 0 to eliminate sampling stochasticity. This removes one source of \(\sigma^2_E\) entirely.
Rubric specificity. Vague rubrics like “which response is better?” produce more disagreement than specific criteria like “which response correctly follows the instruction and provides accurate information?” Specific rubrics reduce \(\sigma^2_r\) and \(\sigma^2_{ir}\).
Calibration against gold standards. Before deploying an LLM judge, measure its agreement with human annotations on a calibration set. This provides an estimate of systematic bias (not just random error).
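The effect of position randomization can be demonstrated with a toy judge whose choice is nudged toward whichever response is shown first (the skill and bias values are illustrative assumptions, not measured quantities):

```python
import random

random.seed(1)

def toy_judge(better_shown_first, skill=0.75, bias=0.15):
    """Picks the truly better response with prob `skill`, plus a
    first-position bias. Returns True if the better response wins."""
    p_pick_better = skill + (bias if better_shown_first else -bias)
    return random.random() < p_pick_better

N = 20_000
# Always showing the better response first conflates skill with bias...
one_order = sum(toy_judge(True) for _ in range(N)) / N
# ...while judging both orders and averaging cancels the bias.
both_orders = sum(toy_judge(True) + toy_judge(False) for _ in range(N)) / (2 * N)
print(f"one fixed order: {one_order:.3f}  both orders averaged: {both_orders:.3f}")
```

With a single fixed order the judge appears to agree with ground truth about 90% of the time; averaging over both orders recovers the true 75% skill. The bias has been converted into noise that cancels in expectation.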
5.6 Reliability Under IRT
5.6.1 Conditional Reliability
CTT provides a single reliability number for the entire test. But from Chapter 2 and Chapter 4, we know that measurement precision varies across the ability range. A benchmark designed for mid-range models may be highly reliable at \(\theta = 0\) but unreliable at the extremes.
Under IRT, the precision of measurement at ability \(\theta\) is captured by the test information function (introduced in Section 4.2.1):
\[ I(\theta) = \sum_{j=1}^M P_j(\theta)(1 - P_j(\theta)) \]
The conditional reliability at \(\theta\) is:
\[ \rho(\theta) = 1 - \frac{1}{I(\theta) \cdot \sigma^2_\theta} \]
where \(\sigma^2_\theta\) is the variance of abilities in the population. This formula makes a key point: reliability is not a fixed property of the test—it depends on where in the ability range you are measuring and how spread out the population is.
5.6.2 The Spearman-Brown Prophecy Under IRT
The classical Spearman-Brown formula (Equation 5.2) says reliability increases with test length. Under IRT, this is approximately true when items are well-targeted to the ability range of interest, but breaks down when items are poorly matched.
Adding 50 easy items to a benchmark that already has 50 hard items does little to improve reliability for high-ability models—the easy items provide almost no information in that range. This is precisely the insight behind adaptive testing (Section 4.2.2): by selecting items matched to each model’s ability, CAT achieves the same reliability with fewer items.
The practical implication is clear: a single reliability number can be misleading. A benchmark might report \(\alpha = 0.92\) overall, but have conditional reliability below 0.80 for the frontier models that matter most—because all the items are too easy for them. The IRT framework provides the tools to diagnose and fix this problem, connecting directly to the efficient measurement methods of Chapter 4.
5.7 Designing Reliable AI Evaluations
The theory developed in this chapter yields concrete design principles for AI evaluation:
Know your dominant noise source. Run a G-study before committing to an evaluation design. If annotator variability dominates, invest in more raters or clearer rubrics. If item sampling variability dominates, use more items or D-optimal item selection.
Match items to the ability range of interest. Conditional reliability shows that measurement precision varies across ability levels. If the goal is to rank frontier models, the benchmark needs items that are difficult enough to discriminate among them.
Use multiple judges when scoring is subjective. The G-theory D-study quantifies the improvement from adding judges. A panel of three independent judges with majority vote is substantially more reliable than a single judge, and often more cost-effective than adding more items.
Report confidence intervals, not just point estimates. The SEM provides a natural uncertainty band. Two models whose scores overlap within \(\pm 2 \times \text{SEM}\) may not be meaningfully different. Leaderboards should indicate which ranking differences are statistically reliable.
Control what you can. Use temperature 0 for deterministic scoring. Randomize prompt formats and average across them. Standardize rubrics. These measures eliminate sources of \(\sigma^2_E\) rather than averaging over them.
Separate reliability from validity. High reliability is necessary but not sufficient. A benchmark can be highly reliable while systematically measuring the wrong thing (contamination, shortcut features, length bias). The tools for detecting these problems are in Chapter 6.
Audit items using reliability diagnostics. Individual items can degrade overall reliability. Items with negative item-total correlations reduce Cronbach’s \(\alpha\) — removing them improves reliability. Truong et al. (2025) demonstrate that this classical item analysis principle is a powerful tool for finding benchmark bugs: items with negative tetrachoric correlations or low Mokken scalability coefficients often have incorrect answer keys, ambiguous wording, or grading errors. On nine benchmarks, reliability-based flagging achieves up to 84% precision at the top-50 flagged items. The intuition is simple: under the Rasch model (or any unidimensional model), all items should correlate positively with each other and with the total score. Violations indicate items that are measuring something different from the rest of the benchmark — whether due to multidimensionality or outright errors.
5.8 Discussion Questions
Determinism and reliability. At temperature 0, a language model gives the same output every time. Does this mean the evaluation has perfect test-retest reliability? What sources of unreliability remain even with deterministic models?
Reliability vs. number of items. The Spearman-Brown formula predicts that longer tests are more reliable. AI benchmarks often have thousands of items—far more than typical educational tests. Does this mean AI benchmarks automatically have high reliability? What assumptions might be violated?
LLM-as-a-judge tradeoffs. Using GPT-4 as a judge is cheaper than human annotation but introduces model-specific biases. Under what conditions would you prefer a less accurate but more reliable judge? How does G-theory help formalize this tradeoff?
Reliability across model generations. When a new generation of models is released, the reliability of an existing benchmark may change (because item difficulties shift relative to model abilities). How should benchmark developers monitor and maintain reliability over time?
The reliability-validity tension. A benchmark consisting of 1000 copies of the same easy question would have extremely high internal consistency but near-zero validity. How does this pathological example illustrate the distinction between reliability and validity? What design principles prevent this?
Item-level diagnostics and benchmark maintenance. Truong et al. (2025) find that removing items with negative item-total correlations improves both reliability and validity of AI benchmarks. But removing items also shortens the test, which the Spearman-Brown formula predicts will reduce reliability. Under what conditions does the net effect of item removal improve reliability? How would you design an iterative item screening procedure that balances these competing pressures?
5.9 Bibliographic Notes
5.9.1 Classical Test Theory
The foundational reference is Lord and Novick (1968), which formalized CTT axiomatically. Cronbach (1951) introduced coefficient alpha, the most widely used reliability statistic. The Spearman-Brown formula dates to Spearman (1910) and Brown (1910) independently. For the connection between CTT and IRT-based reliability, see Chapter 7 of Hambleton and Swaminathan (1985).
5.9.2 Generalizability Theory
G-theory was developed by Cronbach et al. (1972), building on earlier work by Cronbach on the multiple sources of measurement error. Brennan (2001) provides the comprehensive modern treatment. Shavelson and Webb (1991) offers an accessible primer. The connection between G-theory and mixed-effects models makes modern software (e.g., lme4 in R) directly applicable to G-studies.
5.9.3 Inter-Rater Reliability
Cohen (1960) introduced Cohen’s kappa. Fleiss (1971) extended it to multiple raters. Krippendorff (2011) provides Krippendorff’s alpha, which handles missing data, ordinal scales, and any number of raters. For a comprehensive treatment with practical guidance, see Gwet (2014).
5.9.4 LLM-as-a-Judge
Zheng et al. (2023) introduced the LLM-as-a-judge paradigm and the MT-Bench evaluation framework. Position bias was documented by Wang et al. (2023). Shankar et al. (2024) addresses the question of validating LLM judges against human preferences. For prompt sensitivity in evaluation, see Mizrahi et al. (2024). Biderman et al. (2024) discusses reproducibility challenges in language model evaluation more broadly.
5.9.5 Item-Level Reliability Diagnostics
Classical item analysis — computing item-total correlations, inter-item correlations, and Mokken scalability coefficients to identify misfitting items — is a standard step in educational test development. Truong et al. (2025) apply these techniques to AI benchmarks at scale, showing that items flagged by negative tetrachoric correlations or low scalability coefficients frequently contain genuine errors (incorrect keys, ambiguous wording, grading bugs). Their framework is grounded in the Rasch model’s sufficiency property (Section 2.3.1): if sum scores are sufficient statistics for ability, all inter-item correlations must be non-negative, making violations a principled diagnostic signal. They recommend evaluating with 60–80 LLMs from at least 10 organizations for reliable detection.
5.10 Exercises
5.10.1 Theoretical Exercises
Exercise 5.1 (\(\star\)): Starting from the CTT model \(X = T + E\) with the standard assumptions (\(\mathbb{E}[E] = 0\), \(\text{Cov}(T, E) = 0\)), derive that \(\rho_{XX'} = \sigma^2_T / \sigma^2_X\). Show that reliability equals the squared correlation between observed and true scores: \(\rho_{XX'} = r^2_{XT}\).
Exercise 5.2 (\(\star\star\)): Derive the Spearman-Brown prophecy formula. If a test of length \(M\) has reliability \(\rho\), show that a test of length \(KM\) (made of \(K\) parallel forms) has reliability \(\rho_K = K\rho / (1 + (K-1)\rho)\). What happens as \(K \to \infty\)?
Exercise 5.3 (\(\star\star\)): In a G-theory \(p \times i \times r\) design, derive the formula for the generalizability coefficient \(G\). Show that increasing \(n_r\) reduces the contribution of rater variance but not item variance, and vice versa.
Exercise 5.4 (\(\star\star\)): Show that Cronbach’s alpha is the mean of all possible split-half reliabilities (after Spearman-Brown correction) for a test with \(M\) items. Under what conditions is alpha a lower bound for reliability?
5.10.2 Computational Exercises
Exercise 5.5 (\(\star\star\)): Implement a G-study for a model \(\times\) item \(\times\) rater design. Given a 3-way data array \(X[p, i, r]\), estimate all variance components using expected mean squares. Apply your implementation to simulated data and verify that the estimates recover the true variance components.
Exercise 5.6 (\(\star\star\)): Simulate an LLM-as-a-judge evaluation with 5 judges of varying quality (agreement with gold standard ranging from 0.70 to 0.90). Compute: (a) pairwise Cohen’s kappa for all judge pairs, (b) accuracy with majority vote for 1, 3, and 5 judges. How does reliability change if all judges share the same systematic bias (e.g., preferring longer responses)?
Exercise 5.7 (\(\star\star\star\)): Implement a D-study optimizer. Given variance components and cost per item (\(c_i\)) and cost per rater-item pair (\(c_r\)), find the cheapest evaluation design \((n_i, n_r)\) that achieves \(G \geq G^*\). Solve this as a constrained optimization problem and visualize the cost surface.
5.10.3 Discussion Exercises
Exercise 5.8: The Chatbot Arena uses a single human preference judgment per comparison. Using G-theory, analyze the reliability of Elo ratings derived from this design. How many comparisons per model pair would be needed to achieve \(G \geq 0.90\)? What are the cost implications?
Exercise 5.9 (\(\star\star\)): Implement the item-level diagnostic pipeline from Truong et al. (2025). Simulate a Rasch benchmark with \(N = 80\) models and \(M = 200\) items. Introduce 20 “buggy” items: 10 with flipped answer keys (replace \(Y_{ij}\) with \(1 - Y_{ij}\)) and 10 with random responses (independent of \(\theta\)). Compute: (a) item-total correlations, (b) average tetrachoric correlations, and (c) Mokken scalability coefficients \(H_j\) for all items. Rank items by each statistic. What is the precision@20 and recall@20 for each method? How does performance change as you vary the number of LLMs from 10 to 100?