6 Validity
By the end of this chapter, you will be able to:
- Define validity in the context of AI evaluation and explain why it cannot be reduced to a single statistic.
- Distinguish between content validity, criterion validity, construct validity, external validity, and consequential validity for AI benchmarks.
- Apply Borsboom’s realist framework and Salaudeen et al.’s claim-centered framework to evaluate whether a benchmark measures what it claims to measure.
- Identify common threats to validity in AI evaluation: benchmark contamination, construct-irrelevant variance, construct underrepresentation, and differential item functioning.
- Use diagnostic tools (DIF analysis, dimensionality assessment, item-fit statistics) to detect validity threats.
- Design instrument construction and revision procedures, including the use of synthetic data for item generation.
- Evaluate the validity of existing AI benchmarks using the frameworks presented in this chapter.
This chapter can be covered in 3 lectures (75-90 minutes each):
Lecture 1: Validity Frameworks and Evidence
- Borsboom’s realist framework: validity as truth, warrant inference, semantic indeterminacy (20 min)
- From reliability to validity: motivating the distinction (10 min)
- A taxonomy of validity evidence: content, criterion, construct, external, consequential (25 min)
- Threats to validity in AI evaluation (20 min)
Lecture 2: Diagnostic Tools
- Differential Item Functioning (DIF) analysis (25 min)
- Dimensionality assessment and parallel analysis (25 min)
- Item-fit statistics and contamination detection (20 min)
- Hands-on: running diagnostics on simulated data (5 min)
Lecture 3: Building Valid Benchmarks
- The Multitrait-Multimethod matrix (25 min)
- Principled item construction and revision (20 min)
- Nomological networks and the claim-centered framework (15 min)
- Design exercise: planning a valid evaluation (15 min)
This chapter introduces validity-specific notation: \(g\) (group membership for DIF), \(\alpha_{MH}\) (Mantel-Haenszel odds ratio), \(\lambda_k\) (eigenvalues for dimensionality), \(\text{MNSQ}_i\) (item fit statistics), and \(r_{ij}\) (MTMM correlations). See the notation reference for the complete list of symbols.
6.1 Borsboom’s Warrant Inference Framework
Before we can assess whether an evaluation is valid, we must understand what validity means. This seemingly philosophical question has profound practical implications. If we do not have a clear conception of validity, we cannot evaluate whether our benchmarks actually measure what we intend.
The Dutch psychometrician Denny Borsboom has developed the most influential contemporary framework for understanding measurement validity. His approach, which we call the realist framework, provides the philosophical foundation for the AIMS approach to AI evaluation.
6.1.1 Validity as Truth, Not Evidence
Traditional approaches to validity, following Cronbach and Messick, treat validity as a matter of evidence accumulation. Under this view, a test is valid to the extent that we have gathered evidence supporting its intended interpretation. Validity becomes a matter of degree: more evidence means more validity.
Borsboom rejects this view. He argues that validity is fundamentally about truth, not evidence:
A test is valid for measuring an attribute if and only if:
- The attribute exists, and
- Variations in the attribute causally produce variations in the measurement outcomes.
This is a yes/no property: either the attribute causes the test responses, or it does not. Evidence is relevant to our knowledge of validity, but validity itself is about the causal structure of the world.
This definition has several important implications:
Existence requirement. The attribute being measured must actually exist. If we claim to measure “general intelligence” but there is no such thing—if intelligence is better understood as a collection of independent abilities—then no test can validly measure it. The existence question is empirical, not definitional.
Causation requirement. The attribute must cause variation in test responses. It is not enough for test scores to be correlated with the attribute; the attribute must be the reason for the variation. This rules out tests that are merely predictive of outcomes without measuring the underlying construct.
Truth vs. evidence distinction. We can have strong evidence that a test is valid and yet be wrong. Conversely, a test might be valid even if we have limited evidence. This distinction matters because it separates the epistemological question (what do we know?) from the ontological question (what is true?).
6.1.2 The Warrant Inference Problem
Measurement involves an inference from observed data to latent constructs:
\[ \text{Observed: } Y_{ij} \quad \xrightarrow{\text{inference}} \quad \text{Latent: } \theta_j \]
This inference requires a warrant: a justified belief that the test measures what it claims to measure. The warrant connects the measurement procedure (administering test items, recording responses) to the theoretical construct (ability, intelligence, reasoning).
Following Toulmin’s model of argumentation, a measurement argument has the structure:
- Claim: “Model \(j\) has ability \(\theta_j = 2.3\)”
- Data: “Model \(j\) answered 47 of 60 questions correctly”
- Warrant: “The test measures the ability construct, and the scoring procedure accurately converts responses to ability estimates”
- Backing: “The test items were written by domain experts, the psychometric model fits the data, ability estimates are stable across different item subsets”
The warrant is the critical element. Without it, we have no basis for interpreting test scores as measurements of the intended construct. The backing provides evidence for the warrant but does not replace it.
6.1.3 Semantic Indeterminacy
Borsboom identifies a fundamental problem in measurement: semantic indeterminacy. The meaning of a test score depends on which measurement system we adopt, but there is no compelling argument for any particular system.
Consider three measurement frameworks:
Classical Test Theory (CTT): A test score \(X = T + E\) consists of a true score \(T\) plus random error \(E\). The true score is defined as the expected value of the test score over hypothetical replications.
Item Response Theory (IRT): Test responses are generated by a latent ability \(\theta\) through a probabilistic model \(P(Y_{ij} = 1 \mid \theta_j, \beta_i)\). The ability parameter is a property of the person (or model) that exists independently of any particular test.
Network Models: There is no latent variable. Test items are causally connected to each other, and correlations arise from these direct connections rather than a common cause.
These frameworks make different claims about what test scores mean:
| Framework | What does the score represent? |
|---|---|
| CTT | Expected value over test replications |
| IRT | Position on a latent continuum |
| Network | Summary of a network state |
The frameworks are not merely different parameterizations of the same model—they make different ontological commitments about what exists and what causes what. Yet we often cannot empirically distinguish between them.
When we say “GPT-4 has reasoning ability of 2.3 logits,” what do we mean? The answer depends on our measurement framework:
- CTT interpretation: If we tested GPT-4 many times on parallel forms, its average score would correspond to 2.3 logits.
- IRT interpretation: GPT-4 possesses an underlying reasoning capacity that, when combined with item difficulties, generates the observed response pattern.
- Network interpretation: GPT-4’s responses to reasoning questions form a pattern that we summarize with the number 2.3, but there is no single “reasoning ability” being measured.
These interpretations have different implications for how we should use and trust the measurement.
6.1.4 Construct Validity and the Nomological Network
If a construct cannot be directly observed, how do we know it exists? Cronbach and Meehl proposed that constructs are defined by their place in a nomological network—a web of theoretical relationships connecting the construct to other constructs and observable indicators.
For example, “reasoning ability” might be defined by relationships like:
- Higher reasoning ability \(\to\) better performance on logic puzzles
- Higher reasoning ability \(\to\) better performance on mathematical proofs
- Higher reasoning ability \(\to\) higher scores on measures of general intelligence
- More education or training \(\to\) higher reasoning ability
The construct gains meaning through these relationships. If a test score behaves as the theory predicts—if it correlates with the right things and not with the wrong things—we have evidence that it measures the intended construct.
For AI evaluation, this suggests we need theoretical frameworks that specify:
- What capabilities should be related to benchmark performance
- What capabilities should be independent of benchmark performance
- How capabilities should develop with model scale or training
- How capabilities should transfer across domains
Without such frameworks, we have benchmark scores without meaning.
6.2 From Reliability to Validity
Consider a coding benchmark that produces highly consistent scores: every time you evaluate a model, it gets roughly the same accuracy. The split-half reliability is 0.95, Cronbach’s alpha is 0.93, and the G-coefficient from Chapter 5 is excellent. You might conclude that the benchmark is working well.
But suppose you investigate further and discover that the benchmark items were scraped from a popular programming tutorial site. The models with the highest scores turn out to be those whose training data included that site. The benchmark is not measuring coding ability—it is measuring memorization of specific solutions. The scores are perfectly consistent, but they do not mean what you think they mean.
This is the validity problem. Reliability asks: does the evaluation give the same answer twice? Validity asks the deeper question: does the evaluation measure what it claims to measure? A perfectly reliable instrument can be completely invalid—like a bathroom scale that always reads 70 kg regardless of who steps on it. Conversely, validity requires reliability: if measurements fluctuate randomly, they cannot track any underlying attribute.
In Section 6.1, we introduced Borsboom’s realist definition of validity: a test is valid for measuring an attribute if and only if (a) the attribute exists, and (b) variation in the attribute causally produces variation in test scores (Borsboom 2005). This is a demanding standard. It requires not just statistical association between scores and the target construct, but a causal mechanism connecting the two.
More recently, Salaudeen et al. (2025) proposed a claim-centered framework for AI evaluation validity. Their key insight is that benchmarks are not valid or invalid in the abstract—they are valid for specific claims. A benchmark might validly support the claim “Model A scores higher than Model B on these 500 multiple-choice questions” while failing to support the broader claim “Model A reasons better than Model B.” Validity is about the relationship between the measurement instrument, the intended interpretation, and the population of systems being evaluated.
A benchmark is not “valid” or “invalid” in isolation. Validity depends on:
- The construct: what attribute are we claiming to measure?
- The population: which models or systems are being evaluated?
- The interpretation: what conclusions are drawn from the scores?
- The context: in what setting will the scores be used?
The same benchmark may be valid for one purpose and invalid for another. Validity is a property of the inference, not the instrument.
This chapter develops the theory and tools for assessing and building validity in AI evaluation. We begin with a taxonomy of validity evidence (Section 6.3), then examine the major threats to validity in AI benchmarks (Section 6.4). We develop diagnostic tools for detecting these threats (Section 6.5) and close with methods for constructing valid evaluation instruments (Section 6.6).
6.3 A Taxonomy of Validity Evidence
The modern view of validity, following Messick (1995) and the Standards for Educational and Psychological Testing, treats validity as a unitary concept: the degree to which evidence supports the intended interpretation of test scores. However, the evidence for validity comes in several distinguishable forms. Following Salaudeen et al. (2025) and the classical framework, we organize validity evidence into five categories.
6.3.1 Content Validity
Content validity concerns whether the benchmark adequately represents the construct domain it claims to measure. The key questions are: Does the benchmark cover the full range of the construct? Are the items relevant? Are important aspects of the construct missing?
For AI evaluation, content validity failures are pervasive. A benchmark labeled “reasoning” might contain only arithmetic word problems, missing logical reasoning, causal reasoning, analogical reasoning, and spatial reasoning entirely. A “coding ability” benchmark might test only Python function completion, neglecting debugging, system design, documentation, and code review.
Content validity is established through expert judgment rather than statistical analysis. Domain experts review the construct definition, examine the item pool, and assess coverage. In AI evaluation, this step is frequently skipped: benchmarks are constructed from convenience samples of existing data rather than from principled domain specifications.
AI example. Suppose a benchmark claims to measure “scientific reasoning.” Content validity requires asking: Does it include hypothesis generation? Experimental design? Data interpretation? Statistical inference? Causal reasoning? If it tests only factual recall of scientific knowledge, it has poor content validity for the “reasoning” claim, regardless of how reliable the scores are.
6.3.2 Criterion Validity
Criterion validity asks whether benchmark scores predict or correlate with an external criterion that independently captures the construct. There are two subtypes:
- Concurrent validity: the benchmark correlates with a currently available gold standard. For example, an automated coding benchmark should correlate with expert human evaluation of the same code.
- Predictive validity: the benchmark predicts future outcomes of interest. A benchmark for coding assistants should predict how useful the assistant is in actual developer workflows over time.
Criterion validity provides some of the most compelling evidence, but it requires the existence of a trustworthy external criterion—which is often the fundamental problem. If we already had a perfect measure of “reasoning ability,” we would not need the benchmark. The circularity of criterion validation is a recognized challenge in psychometrics (Cronbach and Meehl 1955) and becomes especially acute in AI evaluation, where the construct of interest (e.g., “general intelligence”) may not have any agreed-upon gold standard.
AI example. The Chatbot Arena (Zheng et al. 2023) collects pairwise human preferences as a criterion for model quality. A benchmark has concurrent criterion validity if its scores correlate with Arena Elo ratings. But the Arena itself has validity assumptions—are crowdworker preferences a valid criterion for “model quality”?
6.3.3 Construct Validity
Construct validity is the central and most encompassing form of validity evidence. It asks: does the benchmark actually measure the theoretical construct it claims to measure? This goes beyond content coverage and criterion correlation to the internal structure of the measurement.
The classical approach to construct validity uses two kinds of evidence:
- Convergent validity: scores on the benchmark should correlate positively with scores on other measures of the same or closely related constructs. If two different “reasoning” benchmarks produce uncorrelated scores, at least one of them has poor construct validity.
- Discriminant validity: scores on the benchmark should not correlate strongly with measures of different constructs. If a “reasoning” benchmark correlates as highly with a “memorization” benchmark as with another reasoning benchmark, the construct is not well separated.
The systematic study of convergent and discriminant validity was formalized by Campbell and Fiske (1959) through the Multitrait-Multimethod (MTMM) matrix, which we develop in Section 6.6.1.
Under Borsboom’s causal framework, construct validity requires a specific causal claim: variation in the latent attribute (reasoning ability) causally produces variation in benchmark scores. This is testable—if we could intervene to increase a model’s reasoning ability while holding everything else constant, valid benchmark scores should increase. In practice, such clean interventions are rare, but the causal framing clarifies what we are asking.
AI example. Consider two benchmarks that both claim to measure “mathematical reasoning”: GSM8K (grade-school math word problems) and MATH (competition-level problems). If they measure the same construct, models that excel on one should tend to excel on the other (convergent validity). If “mathematical reasoning” is distinct from “commonsense knowledge,” then GSM8K scores should correlate less with commonsense benchmarks than with MATH (discriminant validity).
6.3.4 External Validity
External validity concerns the generalizability of benchmark results beyond the specific conditions of the evaluation. Does performance on the benchmark predict performance in other contexts, with other populations, or at other times?
For AI systems, external validity questions include:
- Does performance on English-language benchmarks predict multilingual performance?
- Do results on curated, clean test items generalize to noisy, real-world inputs?
- Do benchmark scores from today predict performance after the model is updated?
- Do results generalize across deployment contexts (e.g., from research settings to production)?
External validity connects closely to the problem of distribution shift, which we examine formally in Chapter 7. When the benchmark distribution differs from the target distribution, strong benchmark performance may not transfer.
AI example. A model scores 95% on a medical question-answering benchmark derived from textbook questions. But in a clinical setting, the questions are noisier, contextually embedded, and may involve ambiguous information. The benchmark has poor external validity if the 95% score does not predict clinically useful performance.
6.3.5 Consequential Validity
Consequential validity, introduced by Messick (1995), asks whether the social consequences of using benchmark scores are appropriate. This is the most controversial form of validity evidence, as it extends validity beyond measurement science into ethics and policy.
In AI evaluation, consequential validity is increasingly important:
- Goodhart’s law: When a benchmark becomes a target, models are optimized for it, and the benchmark ceases to measure the original construct. Training on benchmark-like data improves scores without improving capability.
- Development distortion: Benchmarks that are easy to optimize attract disproportionate effort, even if they measure less important capabilities.
- Misuse of rankings: Leaderboard positions are used to make deployment decisions, marketing claims, and policy arguments that go far beyond what the scores support.
Consequential validity does not mean that benchmarks are “invalid” whenever they have negative consequences. Rather, it means that the consequences of score use are relevant evidence when evaluating whether the measurement system is serving its intended purpose.
AI example. A safety benchmark becomes widely used for regulatory compliance. Model developers learn to optimize specifically for the benchmark items, achieving high scores while leaving genuine safety risks unaddressed. The consequential validity of the benchmark is undermined: the scores are being used to support claims (“this model is safe”) that they do not actually support.
6.3.6 Summary
Table 6.1 summarizes the five forms of validity evidence, their key questions, and how they manifest in AI evaluation.
| Form | Key Question | AI Example | Typical Evidence |
|---|---|---|---|
| Content | Does the benchmark cover the construct domain? | “Reasoning” benchmark tests only arithmetic | Expert review, domain specification |
| Criterion | Do scores predict an external criterion? | Benchmark vs. Arena Elo correlation | Correlation with gold standard |
| Construct | Does the benchmark measure the intended construct? | Convergent/discriminant evidence across benchmarks | MTMM matrix, factor analysis |
| External | Do results generalize beyond test conditions? | English benchmark → multilingual performance | Cross-context replication |
| Consequential | Are the social consequences appropriate? | Goodhart’s law on safety benchmarks | Impact analysis, misuse audit |
6.4 Threats to Validity in AI Evaluation
Even well-designed benchmarks face systematic threats that can undermine the validity of the inferences they support. We identify four major categories.
6.4.1 Construct-Irrelevant Variance
Construct-irrelevant variance (CIV) occurs when systematic factors other than the target construct influence benchmark scores. Unlike random noise (which reduces reliability), CIV is systematic and can inflate or deflate scores in predictable ways.
Sources of CIV in AI evaluation include:
- Prompt formatting sensitivity: The same question presented with different formatting (bullet points vs. paragraphs, numbered vs. lettered options) can change model responses substantially (Mizrahi et al. 2024). If the construct is “reasoning,” but scores depend on formatting, the format contributes construct-irrelevant variance.
- Multiple-choice position bias: Many language models show systematic preferences for certain answer positions (e.g., option A or the last option). This inflates scores for items where the correct answer happens to be in the preferred position.
- Tokenization artifacts: Model performance can depend on how text is tokenized, which is an artifact of the model’s vocabulary rather than its ability.
- Language and cultural bias: Items that assume specific cultural knowledge or linguistic patterns may be easier for models trained predominantly on certain data, independent of the target construct.
CIV is particularly insidious because it is systematic: unlike random error, it does not average out with more items. A benchmark full of items with position bias has high reliability (the bias is consistent) but poor validity (the scores reflect position preference, not just the target ability).
6.4.2 Construct Underrepresentation
Construct underrepresentation occurs when the benchmark is too narrow, sampling only a limited aspect of the construct it claims to measure. This is the content validity threat from Section 6.3.1 viewed through a different lens.
Examples in AI evaluation:
- Coding ability tested only through function-completion tasks, missing debugging, architecture design, code review, documentation, and refactoring.
- Language understanding tested only with formal, well-edited text, missing colloquial language, code-switching, and domain-specific jargon.
- Safety tested only through adversarial prompts in English, missing multilingual attacks, multi-turn manipulation, and system-prompt override attempts.
Construct underrepresentation is often a consequence of convenience sampling: benchmarks are built from data that is easy to collect and annotate, not from a principled specification of the construct domain.
6.4.3 Benchmark Contamination
Benchmark contamination occurs when evaluation items appear in the model’s training data, allowing the model to retrieve memorized answers rather than demonstrating the target capability. This is arguably the most discussed validity threat in current AI evaluation.
Contamination can be direct (exact match between training and test items) or indirect (paraphrases, derivatives, or items from the same source that share structural patterns). Detection approaches include:
- Canary strings: Embedding unique identifiers in benchmark items and testing whether models can reproduce them (Jacovi et al. 2023).
- Membership inference: Statistical tests for whether a model has seen specific items during training.
- Chronological splits: Comparing performance on items created before vs. after the model’s training data cutoff.
- Public/private splits: Items released publicly tend to have inflated scores compared to held-out private items—the gap measures contamination effects.
- Performance discontinuities: If a model’s accuracy on “seen” items is dramatically higher than on matched “unseen” items, contamination is likely.
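The public/private comparison can be run with nothing more than per-item correctness records. A minimal sketch on simulated data, assuming difficulty-matched public and private item sets (the accuracy levels and the simple two-sample test are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Simulated per-item correctness for one model on matched item sets:
# public items inflated by contamination, private items held out.
public_correct = rng.binomial(1, 0.82, size=150)
private_correct = rng.binomial(1, 0.70, size=150)

gap = public_correct.mean() - private_correct.mean()
stat, pval = stats.ttest_ind(public_correct, private_correct)
print(f"public-private gap: {gap:.3f} (p = {pval:.4f})")
# A significant positive gap on difficulty-matched items is evidence of contamination.
```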
Under Borsboom’s causal framework, contamination is a validity threat because it introduces an alternative causal path: the model’s response is caused by memory rather than by the target ability. The score no longer reflects the construct it claims to measure.
6.4.4 Differential Item Functioning
Differential Item Functioning (DIF) occurs when an item is systematically easier or harder for one subgroup of examinees than for another, after controlling for the overall ability level. In classical psychometrics, DIF analysis compares demographic groups; in AI evaluation, the “groups” might be model families, architectural types, or training paradigms.
Formally, an item \(i\) exhibits DIF if:
\[ P(X_{ij} = 1 \mid \theta_j, g = 1) \neq P(X_{ij} = 1 \mid \theta_j, g = 0) \tag{6.1}\]
where \(\theta_j\) is model \(j\)’s ability and \(g\) is group membership. The conditioning on \(\theta_j\) is critical: we are asking whether the item behaves differently for equally able models from different groups.
DIF comes in two forms:
- Uniform DIF: The item is consistently easier (or harder) for one group across all ability levels. The ICCs for the two groups are shifted but do not cross.
- Non-uniform DIF: The item favors one group at some ability levels and the other group at different ability levels. The ICCs cross.
The Mantel-Haenszel (MH) procedure (Holland and Wainer 1993) is the most widely used DIF detection method. It stratifies examinees by ability (using total score as a proxy), computes odds ratios at each stratum, and combines them into a summary statistic:
\[ \alpha_{MH} = \frac{\sum_k A_k D_k / N_k}{\sum_k B_k C_k / N_k} \tag{6.2}\]
where \(A_k, B_k, C_k, D_k\) are the entries of the \(2 \times 2\) table (correct/incorrect \(\times\) group) at ability stratum \(k\), and \(N_k\) is the stratum size. An odds ratio of 1 indicates no DIF; values significantly different from 1 indicate DIF.
AI example. Consider a mathematics benchmark comparing open-source and proprietary models. Some items might rely on LaTeX-formatted mathematical expressions that proprietary models handle better due to training data composition, independent of mathematical reasoning ability. These items would exhibit DIF: the formatting advantage inflates scores for one group without reflecting the target construct.
6.5 Diagnostic Tools for Validity
This section develops three computational tools for detecting the validity threats described above.
6.5.1 DIF Analysis
The Mantel-Haenszel procedure described in Section 6.4.4 provides a nonparametric DIF test. A complementary approach uses logistic regression (Zumbo 1999):
\[ \log \frac{P(X_{ij}=1)}{P(X_{ij}=0)} = \beta_0 + \beta_1 \theta_j + \beta_2 g_j + \beta_3 (\theta_j \times g_j) \tag{6.3}\]
where \(\theta_j\) is the ability proxy (total score), \(g_j\) is group membership, and \(\beta_3\) captures the interaction. Testing \(\beta_2 = 0\) detects uniform DIF; testing \(\beta_3 = 0\) detects non-uniform DIF.
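Equation 6.3 translates directly into a regression fit. A minimal sketch using statsmodels on simulated data (the group labels, effect size, and single planted item are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_models = 200
theta = rng.normal(size=n_models)        # ability proxy (total score in practice)
g = rng.integers(0, 2, size=n_models)    # group membership (e.g., model family)

# Simulate one item with uniform DIF: 0.8 logits easier for group 1
p = 1 / (1 + np.exp(-(theta - 0.2 + 0.8 * g)))
y = rng.binomial(1, p)

# Design matrix [1, theta, g, theta*g]; beta_2 tests uniform DIF,
# beta_3 tests non-uniform DIF
X = sm.add_constant(np.column_stack([theta, g, theta * g]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)   # [beta_0, beta_1, beta_2, beta_3]
print(fit.pvalues)  # small p-value on beta_2 flags uniform DIF
```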
The following simulation demonstrates DIF analysis on a benchmark with planted DIF items.
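A minimal version of that simulation, using only NumPy (the number of models, the one-logit DIF shift, and the \(|\log \alpha_{MH}| > \log 1.5\) flagging rule are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_items, n_dif = 400, 30, 5
theta = rng.normal(size=n_models)
beta = rng.normal(size=n_items)
group = rng.integers(0, 2, size=n_models)

# Plant uniform DIF: the first n_dif items are 1 logit easier for group 1
shift = np.zeros(n_items)
shift[:n_dif] = 1.0
logits = theta[:, None] - beta[None, :] + group[:, None] * shift[None, :]
X = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Stratify by total score (the ability proxy) into five bands
total = X.sum(axis=1)
strata = np.digitize(total, np.quantile(total, [0.2, 0.4, 0.6, 0.8]))

def mh_odds_ratio(item):
    """Mantel-Haenszel common odds ratio (Equation 6.2) for one item."""
    num = den = 0.0
    for k in np.unique(strata):
        idx = strata == k
        x, grp = X[idx, item], group[idx]
        a = np.sum((x == 1) & (grp == 0))  # reference group, correct
        b = np.sum((x == 0) & (grp == 0))  # reference group, incorrect
        c = np.sum((x == 1) & (grp == 1))  # focal group, correct
        d = np.sum((x == 0) & (grp == 1))  # focal group, incorrect
        num += a * d / idx.sum()
        den += b * c / idx.sum()
    return num / den

log_odds = np.log([mh_odds_ratio(i) for i in range(n_items)])
flagged = np.where(np.abs(log_odds) > np.log(1.5))[0]
print("flagged items:", flagged)  # the planted items 0-4 should dominate
```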
The left panel shows the Mantel-Haenszel log-odds ratio for each item. Items near zero show no DIF; items with large absolute values are flagged. The planted DIF items (red) cluster at the extremes, while non-DIF items (blue) remain near zero. The right panel compares Item Characteristic Curves: for a non-DIF item, the two groups share the same ICC; for a DIF item, Group 1’s curve is shifted leftward (the item is easier for them at every ability level), indicating uniform DIF.
In practice, DIF analysis for AI benchmarks would compare model families (e.g., autoregressive vs. encoder-decoder), training paradigms (e.g., RLHF vs. base models), or architectural features (e.g., mixture-of-experts vs. dense transformers).
6.5.2 Dimensionality Assessment
A benchmark that claims to measure a single construct (e.g., “reasoning”) should exhibit a unidimensional factor structure: a single latent factor should explain the bulk of the covariance among items. If multiple factors are needed, the benchmark is actually measuring multiple constructs, and a single summary score conflates them.
The primary tool for dimensionality assessment is eigenvalue analysis of the item correlation matrix. In a unidimensional benchmark, the first eigenvalue should be substantially larger than the rest. However, raw eigenvalues can be misleading because even random data produces a non-trivial first eigenvalue.
Parallel analysis (Horn 1965) addresses this by comparing the observed eigenvalues to those from random data of the same dimensions. A factor is retained only if its eigenvalue exceeds the corresponding random-data eigenvalue. If the data require \(k\) factors, the benchmark is measuring \(k\) distinct constructs, not one.
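A minimal implementation of parallel analysis, assuming continuous item scores for simplicity (binary responses would call for tetrachoric correlations); the two-factor simulation mirrors the verbal/quantitative structure described below:

```python
import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Retain factors whose eigenvalues exceed the 95th percentile
    of eigenvalues from random data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.normal(size=(n, p))
        rand[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(rand, 95, axis=0)
    return obs, threshold, int(np.sum(obs > threshold))

# Two-factor benchmark: 10 verbal + 10 quantitative items, 300 models
rng = np.random.default_rng(1)
verbal = rng.normal(size=(300, 1))
quant = rng.normal(size=(300, 1))
items = np.hstack([
    0.7 * verbal + 0.5 * rng.normal(size=(300, 10)),
    0.7 * quant + 0.5 * rng.normal(size=(300, 10)),
])
obs, threshold, n_factors = parallel_analysis(items)
print("retained factors:", n_factors)  # expect 2
```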
The scree plot (left) shows two eigenvalues clearly exceeding the parallel analysis threshold (red dashed line), confirming two-dimensional structure. The loading plot (right) reveals the two-factor structure: verbal items (blue, 1–10) load on the first component, while quantitative items (red, 11–20) load on the second. A benchmark reporting a single “reasoning” score would conflate these two distinct abilities.
This analysis connects directly to the factor models introduced in Chapter 2. When dimensionality assessment reveals multiple factors, the appropriate response is to either (a) report subscores for each factor rather than a single composite, or (b) revise the benchmark to focus on a single construct.
6.5.3 Item Fit and Contamination Detection
Item-fit statistics from Rasch modeling (Wright and Masters 1982) measure how well individual items conform to the measurement model. Items that misfit—those whose response patterns deviate from model predictions—may indicate contamination, construct-irrelevant features, or other validity problems.
The two standard fit statistics are:
- Infit (information-weighted mean square): Sensitive to unexpected responses near an item’s difficulty level. Infit emphasizes on-target responses and is less affected by extreme scores.
- Outfit (unweighted mean square): Sensitive to unexpected responses far from an item’s difficulty level. Large outfit values indicate that the item behaves erratically for very high- or very low-ability models.
For a Rasch model with predicted probability \(P_{ij} = \text{expit}(\theta_j - \beta_i)\), the standardized residual is:
\[ z_{ij} = \frac{X_{ij} - P_{ij}}{\sqrt{P_{ij}(1 - P_{ij})}} \]
The outfit mean square for item \(i\) is:
\[ \text{Outfit}_i = \frac{1}{N} \sum_{j=1}^{N} z_{ij}^2 \]
and infit weights by the variance:
\[ \text{Infit}_i = \frac{\sum_{j=1}^{N} (X_{ij} - P_{ij})^2}{\sum_{j=1}^{N} P_{ij}(1 - P_{ij})} \]
Expected value under the model is 1.0. Values above 1.3 suggest underfit (more variation than the model predicts, possibly from multidimensionality or construct-irrelevant influences). Values below 0.7 suggest overfit (responses are too predictable, possibly from item redundancy or local dependence). Values above 2.0 indicate serious misfit.
Contamination leaves a distinctive item-fit signature: contaminated items become much easier for models that have memorized them (producing unexpectedly correct responses from low-ability models) while remaining at their true difficulty for uncontaminated models. This inflates outfit statistics because the misfitting responses come from models far from the item’s difficulty.
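A minimal simulation of this signature (for simplicity, fit statistics are computed against the generating parameters; in practice you would first estimate \(\theta\) and \(\beta\) from the data, and the 30% memorization rate is an illustrative choice):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(7)
n_models, n_items = 300, 25
theta = rng.normal(size=n_models)
beta = rng.normal(size=n_items)
P = expit(theta[:, None] - beta[None, :])
X = rng.binomial(1, P)

# Contaminate items 0-4: 30% of models have memorized them
contaminated = np.arange(5)
memorizers = rng.random(n_models) < 0.3
X[np.ix_(memorizers, contaminated)] = 1

resid_sq = (X - P) ** 2
var = P * (1 - P)
outfit = (resid_sq / var).mean(axis=0)          # unweighted mean square
infit = resid_sq.sum(axis=0) / var.sum(axis=0)  # information-weighted

for i in np.where((infit > 1.3) | (outfit > 1.3))[0]:
    print(f"item {i}: infit = {infit[i]:.2f}, outfit = {outfit[i]:.2f}")
```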
The left panel shows the infit-outfit scatter: clean items (blue) cluster near the expected value of 1.0, while contaminated items (red) show inflated fit statistics, especially outfit. This is because contaminated models produce unexpectedly correct responses on these items, creating large residuals. The right panel shows the ICC for a single contaminated item: the observed proportions (red dots) consistently exceed the Rasch-expected curve (blue line), particularly at lower ability levels where the memorization effect is most visible.
Item-fit analysis is a powerful diagnostic because it does not require knowing which models are contaminated or which items are compromised. The Rasch model’s prediction residuals reveal anomalies regardless of their source.
6.6 Building Valid Benchmarks
The diagnostic tools above detect validity problems in existing benchmarks. This section turns to constructing benchmarks with validity built in from the start.
6.6.1 The Multitrait-Multimethod Matrix
The Multitrait-Multimethod (MTMM) matrix, introduced by Campbell and Fiske (1959), provides a systematic framework for assessing both convergent and discriminant validity simultaneously. The idea is to measure multiple traits (constructs) using multiple methods (measurement formats) and examine the resulting correlation structure.
In AI evaluation:
- Traits are the constructs of interest: reasoning, knowledge retrieval, creativity, instruction following, etc.
- Methods are the benchmark formats: multiple-choice questions, open-ended generation, pairwise comparison, code completion, etc.
The MTMM matrix organizes correlations into four blocks:
- Monotrait-heteromethod (same trait, different methods): These should be high—this is convergent validity. If “reasoning” measured by MCQ and by open-ended generation produces highly correlated scores, the construct is robust across methods.
- Heterotrait-monomethod (different traits, same method): These should be lower than monotrait-heteromethod correlations—this is discriminant validity. If “reasoning-MCQ” correlates as highly with “knowledge-MCQ” as with “reasoning-open-ended,” then method variance dominates trait variance.
- Heterotrait-heteromethod (different traits, different methods): These should be lowest—neither shared trait nor shared method.
- Reliability diagonals (same trait, same method, repeated): These should be highest—they set the ceiling for validity.
Campbell and Fiske’s criteria for construct validity require:
- Convergent validity coefficients (monotrait-heteromethod) should be significantly different from zero and large enough to warrant further investigation.
- Convergent validity coefficients should be higher than the values in the same row and column of the heterotrait-heteromethod block.
- Convergent validity coefficients should be higher than the heterotrait-monomethod correlations.
- The pattern of trait intercorrelations should be consistent across method blocks.
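A minimal simulation of the MTMM logic (three traits by two methods; the trait and method loadings of 0.8 and 0.4 are illustrative choices that should satisfy the criteria above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_traits, n_methods = 500, 3, 2
traits = rng.normal(size=(n, n_traits))
methods = rng.normal(size=(n, n_methods))

# Each score = trait signal + method signal + noise
scores = np.empty((n, n_traits * n_methods))
labels = []
for t in range(n_traits):
    for m in range(n_methods):
        scores[:, len(labels)] = (0.8 * traits[:, t] + 0.4 * methods[:, m]
                                  + 0.4 * rng.normal(size=n))
        labels.append((t, m))

R = np.corrcoef(scores, rowvar=False)
conv, disc_mono, disc_hetero = [], [], []
for i, (t1, m1) in enumerate(labels):
    for j, (t2, m2) in enumerate(labels):
        if i >= j:
            continue
        if t1 == t2:
            conv.append(R[i, j])         # monotrait-heteromethod
        elif m1 == m2:
            disc_mono.append(R[i, j])    # heterotrait-monomethod
        else:
            disc_hetero.append(R[i, j])  # heterotrait-heteromethod

print(f"convergent:                 {np.mean(conv):.2f}")
print(f"discriminant, monomethod:   {np.mean(disc_mono):.2f}")
print(f"discriminant, heteromethod: {np.mean(disc_hetero):.2f}")
```

With these loadings the convergent coefficients land near 0.67, the monomethod discriminant correlations near 0.17, and the heteromethod discriminant correlations near zero, reproducing the ordering the Campbell-Fiske criteria require.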
The heatmap (left) shows the full MTMM correlation matrix with trait blocks separated by black lines. Strong convergent validity appears as high correlations along the off-diagonal within each trait block (same trait measured by different methods). The bar chart (right) confirms the expected ordering: convergent validity coefficients (green) are highest, discriminant-monomethod correlations (orange) are moderate (reflecting shared method variance), and discriminant-heteromethod correlations (red) are lowest. This ordering satisfies the Campbell-Fiske criteria for construct validity.
For AI evaluation, constructing an MTMM study requires measuring the same constructs with genuinely different methods. If all benchmarks use the same multiple-choice format, discriminant validity cannot be assessed because method variance is confounded with trait variance.
6.6.2 Principled Item Construction and Revision
Constructing a valid benchmark requires more than assembling a large set of items. The process should be systematic:
Define the construct: Write an explicit construct definition specifying what the benchmark is intended to measure and what it is not intended to measure. This definition serves as the foundation for content validity.
Specify the domain: Create a detailed content specification (sometimes called a “test blueprint” or “table of specifications”) that enumerates the sub-domains, difficulty levels, and item types that should be represented.
Generate candidate items: Items can be written by domain experts, sourced from existing materials, or generated synthetically using LLMs. For synthetic generation, the construct definition and difficulty targets serve as prompts.
Pilot and analyze: Administer items to a representative sample of models. Compute item statistics: difficulty (the classical \(p\)-value, i.e., proportion correct, or IRT \(\beta\)), discrimination (IRT \(a\) parameter or point-biserial correlation), and fit (infit/outfit from Section 6.5.3).
Screen for validity threats: Run DIF analysis (Section 6.5.1) to check for group bias. Run dimensionality analysis (Section 6.5.2) to verify unidimensionality (if claimed). Check for contamination using fit statistics and public/private splits.
Revise and iterate: Remove or revise items with poor statistics. Replace them with new items and re-pilot. This cycle continues until the item pool meets quality standards.
The following simulation illustrates the pilot-and-analyze phase.
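A minimal sketch of the difficulty and discrimination screening step (the full workflow would add the infit and contamination checks from Section 6.5.3; the 0.2 discrimination threshold and 0.05/0.95 difficulty bounds are conventional but adjustable):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(5)
n_models, n_items = 200, 40
theta = rng.normal(size=n_models)
beta = rng.uniform(-2, 2, size=n_items)
disc = rng.choice([0.3, 1.0], size=n_items, p=[0.2, 0.8])  # 20% weak items
X = rng.binomial(1, expit(disc[None, :] * (theta[:, None] - beta[None, :])))

p_correct = X.mean(axis=0)  # classical difficulty (proportion correct)
total = X.sum(axis=1)
# Point-biserial discrimination: correlation of each item with the rest-score
pbis = np.array([np.corrcoef(X[:, i], total - X[:, i])[0, 1]
                 for i in range(n_items)])

flagged = np.where((pbis < 0.2) | (p_correct < 0.05) | (p_correct > 0.95))[0]
print(f"{len(flagged)} of {n_items} items flagged for revision:", flagged)
```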
The three panels show the item screening workflow. The difficulty-discrimination plot (left) reveals that low-discrimination items (red) cluster below the 0.2 threshold, providing little measurement information, while contaminated items (orange) show inflated proportion-correct values. The infit plot (center) highlights items that deviate from model predictions. The summary (right) shows the yield: how many candidate items survive screening. Items that are flagged should be revised or replaced before the benchmark is finalized.
6.6.3 Nomological Networks
The concept of the nomological network was introduced by Cronbach and Meehl (1955) as the broader theoretical context in which a construct’s validity is established. A construct does not exist in isolation—it is defined by its lawful relationships with other constructs, observable behaviors, and experimental outcomes.
For AI evaluation, a nomological network specifies:
- What the construct should predict: “Mathematical reasoning” should predict success on novel math problems, performance in tutoring scenarios, and ability to verify proofs.
- What should predict the construct: Training on diverse mathematical data, chain-of-thought prompting, and model scale should all increase mathematical reasoning scores.
- What the construct should be distinct from: Mathematical reasoning should be separable from memorization of specific problem-solution pairs, linguistic fluency, and pattern matching on surface features.
Salaudeen et al. (2025) operationalize this through their claim-centered framework: each benchmark should be accompanied by explicit claims about what the scores mean, and validity evidence should be gathered for each claim. This is more practical than the full nomological network, which requires specifying all theoretical relationships, but serves the same purpose of grounding scores in a web of meaning.
The MTMM analysis from Section 6.6.1 provides empirical evidence for the nomological network by testing whether the expected convergent and discriminant relationships actually hold. Item-level diagnostics from Section 6.5.1 and Section 6.5.3 ensure that individual items function as predicted by the construct theory.
Borsboom’s realist framework asks: does the attribute exist, and does it causally produce score variation? This is a strong ontological claim that requires causal evidence.
Salaudeen et al.’s claim-centered framework asks: what specific claims does this benchmark support, and what evidence exists for each? This is more pragmatic, acknowledging that validity comes in degrees and depends on use.
The two frameworks are complementary. Borsboom’s framework sets the philosophical standard; Salaudeen et al.’s framework operationalizes it for practical AI evaluation. They differ on whether validity is binary: for Borsboom it is a yes/no property of the world’s causal structure, while the claim-centered view admits degrees of evidential support. Both agree that no single statistic can establish validity and that multiple forms of evidence are needed.
6.7 Designing Valid AI Evaluations
We close with six design principles that synthesize the concepts developed in this chapter.
1. Start with the construct, not the data. Define what you are measuring before collecting items. Write an explicit construct definition and content specification. This prevents construct underrepresentation and makes content validity assessable.
2. Use multiple methods. Measure each construct with at least two different formats (e.g., MCQ and open-ended). This enables MTMM analysis and separates trait variance from method variance. A construct that is robust across methods has stronger validity evidence.
3. Check dimensionality. Before reporting a single composite score, verify that the benchmark is unidimensional (or report subscores for each dimension). Parallel analysis is cheap and can prevent misleading composites.
4. Screen for DIF. Whenever the model population includes distinct subgroups (architectural families, training paradigms, model sizes), run DIF analysis to detect items that function differently across groups.
5. Monitor for contamination. Maintain a held-out private item pool. Compare performance on public and private items. Use item-fit statistics to flag items where some models perform unexpectedly well.
6. Document validity evidence. Follow the claim-centered approach: for each claim the benchmark is intended to support, document the available validity evidence and its limitations. This transparency allows users to judge whether the benchmark supports their intended interpretation.
Worked example: evaluating summarization ability. Suppose you want to build a benchmark for evaluating how well language models summarize documents.
- Construct definition: The ability to produce concise, accurate, and coherent summaries of source documents, preserving key information while omitting irrelevant details.
- Content specification: Documents from 5 domains (news, scientific, legal, medical, conversational), 3 lengths (short, medium, long), 2 genres (informative, narrative).
- Multiple methods: Evaluate summaries with (a) ROUGE against reference summaries, (b) LLM-as-judge on a rubric, (c) human expert ratings.
- Dimensionality check: Factor analysis reveals two dimensions: factual accuracy and coherence. Report subscores for each.
- DIF analysis: Compare models fine-tuned for summarization vs. general-purpose models. Flag items where domain-specific vocabulary creates an irrelevant advantage.
- Contamination check: Use held-out documents not available online. Compare private vs. public document performance gaps.
6.8 Discussion Questions
1. A coding benchmark produces scores with test-retest reliability of 0.95 and Cronbach’s alpha of 0.93. Is this sufficient evidence for validity? What additional evidence would you want, and why?
2. How does benchmark contamination differ from construct-irrelevant variance? Can you describe a scenario where training data overlap is actually construct-relevant rather than a validity threat?
3. Design an MTMM study for evaluating “mathematical reasoning” in large language models. Specify the traits, methods, and the expected pattern of convergent and discriminant correlations.
4. Goodhart’s law suggests that any benchmark used as an optimization target will eventually lose validity. Does this make consequential validity fundamentally different from the other four forms? Or is it a special case of construct-irrelevant variance?
5. DIF analysis identifies items that function differently across groups after controlling for ability. In AI evaluation, when should DIF be treated as a validity problem to be fixed, and when might it reflect a genuine group difference that the benchmark should capture?
6.9 Bibliographic Notes
The conceptual foundations of validity have evolved substantially over the past century. Cronbach and Meehl (1955) introduced the notion of construct validity and the nomological network, arguing that validity requires embedding a construct in a web of theoretical relationships. Messick (1995) unified the previously separate types of validity into a single framework, arguing that all validity is construct validity, with content, criterion, and consequential evidence as facets of a unified concept. Kane (2006) formalized the argument-based approach to validation, which requires specifying the chain of inferences from observed scores to intended interpretations and gathering evidence for each link.
Borsboom (2005) challenged the orthodox framework with a realist account: a test is valid if and only if the target attribute exists and causally produces score variation. This causal perspective connects validity to the structural causal models discussed in Chapter 7.
The Multitrait-Multimethod matrix was introduced by Campbell and Fiske (1959) and remains one of the most systematic approaches to convergent and discriminant validity. Differential Item Functioning methods are covered comprehensively by Holland and Wainer (1993) and Zumbo (1999). Item-fit statistics for Rasch models are developed by Wright and Masters (1982). Parallel analysis for dimensionality assessment was proposed by Horn (1965).
For AI-specific validity, Salaudeen et al. (2025) propose a claim-centered framework that adapts classical validity concepts for benchmark evaluation. Kiela et al. (2021) argue for dynamic benchmarking to combat contamination and Goodhart’s law. Jacovi et al. (2023) address the contamination problem directly, proposing practical strategies for protecting test data. The broader challenges of AI evaluation validity are discussed by Shankar et al. (2024) and Biderman et al. (2024). Truong et al. (2025) demonstrate that item-level validity threats — incorrect answer keys, ambiguous wording, grading bugs — can be detected at scale using reliability diagnostics derived from the Rasch model’s sufficiency property. Their framework achieves up to 84% precision at the top-50 flagged items across nine benchmarks, illustrating how measurement theory yields operational tools for benchmark quality assurance.
The relationship between reliability and validity was previewed in Chapter 5, where we established that reliability is a necessary but not sufficient condition for validity. The causal aspects of validity—particularly the relationship between Borsboom’s causal definition and structural causal models—are developed further in Chapter 7.
6.10 Exercises
Theoretical
1. Prove that if an item exhibits uniform DIF between two groups (i.e., the item is uniformly easier for one group at all ability levels), the standard Rasch model \(P(X_{ij}=1) = \text{expit}(\theta_j - \beta_i)\) cannot hold simultaneously for both groups with the same item parameter \(\beta_i\). What does this imply about the meaning of “item difficulty” when DIF is present?
2. Consider a 2-trait \(\times\) 2-method MTMM design. Show that if all convergent validity coefficients equal 1 (perfect convergent validity) and all heterotrait correlations equal 0 (perfect discriminant validity), then the method factors contribute zero variance. What does this imply about the interpretability of scores in practice, where method variance is nonzero?
3. Define infit mean square as \(\text{Infit}_i = \sum_j (X_{ij} - P_{ij})^2 / \sum_j P_{ij}(1-P_{ij})\). Suppose a fraction \(\gamma\) of models have memorized item \(i\) and respond correctly with probability 1 regardless of ability. Derive the expected infit as a function of \(\gamma\) and the item difficulty \(\beta_i\). For what values of \(\gamma\) does the infit exceed the conventional 1.3 threshold?
4. Using Borsboom’s causal definition of validity, argue whether “prompt sensitivity” (a model’s score changes when the prompt template is varied, holding the question content constant) is a validity threat or a reliability threat. Under what conditions might it be both?
Computational
1. Download a publicly available AI benchmark dataset (e.g., MMLU, HellaSwag, or ARC). Divide the models into two groups (e.g., models above and below 7B parameters). Conduct a Mantel-Haenszel DIF analysis and report which items, if any, show significant DIF. Interpret the results: does the DIF reflect genuine ability differences or construct-irrelevant features?
2. Simulate a benchmark with 3 underlying factors (e.g., verbal, quantitative, and spatial reasoning) using a factor model from Chapter 2. Generate responses for 300 models on 30 items (10 per factor). Implement parallel analysis to determine the correct dimensionality. How does the accuracy of parallel analysis depend on the factor correlations and sample size?
3. Using the MTMM simulation framework from Section 6.6.1, vary the relative strength of trait loadings vs. method loadings. At what ratio do the Campbell-Fiske criteria begin to fail? Plot the convergent and discriminant validity coefficients as a function of the trait-to-method variance ratio.
Discussion
- Salaudeen et al. (2025) argue that validity should be assessed relative to specific claims rather than as a global property of the benchmark. Borsboom (2005) argues that validity is fundamentally about whether the target attribute exists and causally produces score variation. Compare these two frameworks. Under what conditions do they agree? Under what conditions might they disagree? Which framework is more useful for guiding practical benchmark design, and why?