7 Causality and Distribution Shift
By the end of this chapter, you will be able to:
- Formulate causal models for AI evaluation and distinguish causal from associational claims about benchmark performance.
- Apply structural causal models (SCMs) to represent the data-generating process behind evaluation data, including the roles of training data, model architecture, and benchmark design.
- Identify when distribution shift—between training, calibration, and deployment populations—threatens the validity of evaluation conclusions, and characterize shift types (covariate, label, concept).
- Use interventional and counterfactual reasoning to diagnose construct-irrelevant variance: does the benchmark score change because ability changed, or because something else did?
- Explain the connection between Borsboom’s causal theory of validity and modern causal inference: a benchmark is valid if and only if the target construct causally produces variation in scores.
- Apply doubly robust estimation to correct for selection bias in adaptive and non-representative evaluations, connecting importance weighting, model-based imputation, and prediction-powered inference.
- Evaluate when benchmark results generalize across deployment contexts and when they do not, using transportability criteria and conformal inference.
This chapter can be covered in 3 lectures (75-90 minutes each):
Lecture 1: Causal Models for Evaluation
- From correlation to causation in benchmark scores (20 min)
- Structural causal models and the evaluation DAG (25 min)
- Interventional reasoning: diagnosing contamination and CIV (20 min)
- Borsboom’s causal validity revisited (10 min)
Lecture 2: Distribution Shift and Transportability
- Types of distribution shift: covariate, label, concept (20 min)
- When do benchmark results generalize? Transportability (25 min)
- The evaluation-as-bandit framing (20 min)
- Hands-on: shift visualization (10 min)
Lecture 3: Robust Estimation and Conformal Prediction
- Three estimators: DM, IPW, doubly robust (25 min)
- Connection to prediction-powered inference (15 min)
- Conformal prediction under distribution shift (20 min)
- Hands-on: DR estimation and weighted conformal (15 min)
This chapter introduces causal and distributional notation: \(\text{do}(X=x)\) (intervention), \(P^{(s)}/P^{(t)}\) (source/target distributions), \(\pi_0/\pi\) (logging/target policies), \(w(x)\) (importance weights), \(\hat{V}_{\text{DR}}\) (doubly robust estimator), and \(C_\alpha(x)\) (conformal prediction sets). See ?sec-notation for the complete notation reference.
7.1 From Association to Causation in Benchmark Scores
Model A outperforms Model B on a popular reasoning benchmark. What can we conclude? The association is clear: \(\mathbb{E}[Y \mid \text{Model} = A] > \mathbb{E}[Y \mid \text{Model} = B]\). But does this mean Model A has greater reasoning ability? Or did Model A’s training data happen to include problems similar to the benchmark items, giving it a memorization advantage that has nothing to do with reasoning?
This distinction—between associational and causal claims—is the central concern of this chapter. In Chapter 2, we introduced Borsboom’s realist definition of validity: a test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variation in the attribute causally produces variation in test scores (Borsboom 2005). In Chapter 6, we developed the practical implications of this definition through content, criterion, construct, external, and consequential validity. This chapter provides the formal causal machinery to make these ideas precise.
7.1.1 The Causal Hierarchy
Pearl (2009) distinguishes three levels of causal reasoning, each requiring progressively stronger assumptions:
Association \(P(Y \mid X)\): What does observing \(X\) tell us about \(Y\)? Example: “Models with more parameters tend to score higher on MMLU.” This is a correlation that could arise from many causal structures.
Intervention \(P(Y \mid \text{do}(X = x))\): What happens to \(Y\) if we set \(X\) to \(x\), regardless of what would have occurred naturally? Example: “If we increase the context window from 4K to 128K tokens, does the reasoning score improve?” This requires knowing the causal structure, not just the joint distribution.
Counterfactual \(P(Y_x \mid X = x')\): What would have happened to \(Y\) if \(X\) had been \(x\), given that we actually observed \(X = x'\)? Example: “Would this model have scored lower on the benchmark if its training data had not included the benchmark’s source documents?” Counterfactuals reason about individual cases, not populations.
Most benchmark analyses operate at Level 1. A leaderboard reports \(P(Y \mid \text{Model})\)—the association between model identity and score. But the claims we want to make are at Level 2 or 3: “This model has better reasoning ability” is a causal claim about what produces the score.
7.1.2 Structural Causal Models
A Structural Causal Model (SCM) provides the formal language for causal reasoning (Pearl 2009).
An SCM is a tuple \(\mathcal{M} = (U, V, F, P(U))\) where:
- \(U\) is a set of exogenous (background) variables
- \(V = \{V_1, \ldots, V_n\}\) is a set of endogenous variables
- \(F = \{f_1, \ldots, f_n\}\) is a set of structural equations, \(V_i = f_i(\text{pa}(V_i), U_i)\), where \(\text{pa}(V_i) \subseteq V \setminus \{V_i\}\)
- \(P(U)\) is a distribution over exogenous variables
Each SCM induces a directed acyclic graph (DAG) where edges point from \(\text{pa}(V_i)\) to \(V_i\). The DAG encodes conditional independence structure: \(V_i \perp\!\!\!\perp V_j \mid S\) if \(S\) d-separates \(V_i\) and \(V_j\) in the DAG.
7.1.3 The Evaluation DAG
We now construct an SCM for the AI evaluation process. The key endogenous variables are:
- \(D\): Training data (corpus composition, size, quality)
- \(A\): Architecture (transformer variant, parameter count, training recipe)
- \(\theta\): Latent ability (the construct the benchmark claims to measure)
- \(B\): Benchmark design (item pool, selection criteria, format)
- \(\beta\): Item properties (difficulty, discrimination, content)
- \(F\): Formatting (prompt template, few-shot examples, system prompt)
- \(Y\): Observed score (the benchmark outcome)
The causal structure is:
\[ D \to \theta, \quad A \to \theta, \quad \theta \to Y, \quad \beta \to Y, \quad B \to \beta, \quad F \to Y \]
When the benchmark is valid in Borsboom’s sense, the primary path from \(D\) and \(A\) to \(Y\) passes through \(\theta\): training data and architecture determine ability, and ability determines the score. The item properties \(\beta\) moderate this relationship (harder items produce lower scores for the same ability), but do not confound it.
Validity threats as DAG pathologies. The threats identified in Section 6.4 correspond to specific DAG structures:
- Contamination (Section 6.4.3): A direct path \(D \to Y\) that bypasses \(\theta\). The model’s training data includes benchmark items, so the score reflects memorization rather than ability. This is a confounding path that inflates the observed association between \(\theta\) and \(Y\).
- Construct-irrelevant variance (Section 6.4.1): A path \(F \to Y\) that does not pass through \(\theta\). Prompt formatting affects the score independent of ability. This is an additional cause of \(Y\) that is not part of the target construct.
- Differential item functioning (Section 6.4.4): A direct effect \(A \to Y\) that is not mediated by \(\theta\). Certain architectures have advantages on certain items independent of their ability level.
Under Borsboom’s causal framework, a benchmark is valid if and only if the only systematic path from the data-generating process to the observed score passes through the target construct \(\theta\). Every unblocked non-\(\theta\) path is a validity threat. The diagnostic tools from Section 6.5 — DIF analysis, dimensionality assessment, item-fit statistics — are empirical tests for whether such paths exist.
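These pathologies can be simulated directly. The sketch below uses hypothetical linear structural equations for the evaluation DAG and an assumed contamination effect; it shows how a direct \(D \to Y\) path inflates contaminated models' scores beyond what their ability difference explains:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 2000

# Hypothetical linear structural equations for the evaluation DAG
D = rng.normal(size=n_models)                      # training data quality
A = rng.normal(size=n_models)                      # architecture quality
theta = 0.6 * D + 0.4 * A + rng.normal(scale=0.3, size=n_models)  # latent ability

# Half the models are contaminated: a direct D -> Y path bypasses theta
contaminated = rng.random(n_models) < 0.5
Y = theta + rng.normal(scale=0.5, size=n_models)   # valid path: theta -> Y
Y = Y + np.where(contaminated, 0.8 * np.maximum(D, 0.0), 0.0)

score_gap = Y[contaminated].mean() - Y[~contaminated].mean()
ability_gap = theta[contaminated].mean() - theta[~contaminated].mean()
print(f"score gap {score_gap:.2f} vs ability gap {ability_gap:.2f}")
```

The two groups have essentially the same average ability, yet the contaminated group's average score is noticeably higher; that excess is exactly the contribution of the non-\(\theta\) path.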
7.2 Interventional and Counterfactual Diagnostics
The evaluation DAG from Section 7.1.3 allows us to reason about interventions—what happens when we deliberately change one variable while holding others fixed.
7.2.1 The Back-Door Adjustment
The back-door criterion (Pearl 2009) identifies when observational data suffices to estimate causal effects. A set of variables \(Z\) satisfies the back-door criterion relative to an ordered pair \((X, Y)\) if: (i) no node in \(Z\) is a descendant of \(X\), and (ii) \(Z\) blocks every path between \(X\) and \(Y\) that contains an arrow into \(X\).
When the back-door criterion is satisfied:
\[ P(Y \mid \text{do}(X = x)) = \sum_z P(Y \mid X = x, Z = z) P(Z = z) \tag{7.1}\]
This adjustment formula removes confounding by averaging over the confounder distribution. For AI evaluation, this means: to estimate the causal effect of ability \(\theta\) on score \(Y\), we must adjust for all confounders—variables that cause both \(\theta\) and \(Y\) through non-\(\theta\) paths.
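A minimal simulation, with an assumed binary confounder \(Z\) and made-up probabilities, illustrates how the adjustment formula in Equation 7.1 recovers a causal effect that the naive conditional contrast overstates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Confounder Z (e.g., training-data overlap) causes both X and Y
Z = rng.random(n) < 0.5
X = rng.random(n) < np.where(Z, 0.8, 0.2)    # P(X=1 | Z)
# True causal effect of X on Y is +0.1; Z adds +0.3 through a non-X path
pY = 0.2 + 0.1 * X + 0.3 * Z
Y = rng.random(n) < pY

# Naive (associational) contrast is confounded
naive = Y[X].mean() - Y[~X].mean()

# Back-door adjustment, Eq. 7.1: sum_z P(Y | X=x, Z=z) P(Z=z)
def adjusted(x):
    return sum(Y[(X == x) & (Z == z)].mean() * (Z == z).mean()
               for z in (False, True))

effect = adjusted(True) - adjusted(False)
print(f"naive: {naive:.3f}, adjusted: {effect:.3f}  (truth: 0.100)")
```

The naive contrast is roughly triple the true effect because \(X = 1\) models disproportionately have \(Z = 1\); averaging the within-stratum contrasts over \(P(Z)\) removes the bias.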
7.2.2 Diagnosing Contamination
The contamination diagnostic from Section 6.4.3 can be formalized as an interventional question. Let \(D_{\text{overlap}}\) denote the subset of training data that overlaps with benchmark items. The causal question is:
\[ P(Y \mid \text{do}(D_{\text{overlap}} = \emptyset)) \stackrel{?}{=} P(Y \mid D_{\text{overlap}} = \emptyset) \]
If these are equal, removing the overlapping data does not change the score beyond what we would expect from the reduced training set. If they differ, the overlap was providing a direct \(D \to Y\) path (memorization) rather than contributing to \(\theta\) (genuine learning).
In practice, we cannot perform this intervention on already-trained models. But we can approximate it through:
- Chronological splits: Items created after the training data cutoff cannot be memorized. Comparing scores on pre- vs. post-cutoff items estimates the contamination effect.
- Canary detection: Embedding unique strings in benchmark items and testing whether models reproduce them provides evidence for the \(D \to Y\) path (Jacovi et al. 2023).
- Item-fit analysis: The infit/outfit statistics from Section 6.5.3 detect items where some models perform unexpectedly well—a signature of the memorization path.
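The chronological-split diagnostic can be sketched under a Rasch model with an assumed memorization boost on pre-cutoff items; the pre- vs. post-cutoff accuracy gap is the contamination signal:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_correct(theta, beta):
    return 1 / (1 + np.exp(-(theta - beta)))   # Rasch model

n_items = 4000
beta = rng.normal(size=n_items)                # item difficulties
pre_cutoff = np.arange(n_items) < n_items // 2 # items that existed at training time

theta = 0.0
p = p_correct(theta, beta)
# Assumed memorization boost on items the model could have seen
p = np.where(pre_cutoff, np.minimum(p + 0.25, 1.0), p)
resp = rng.random(n_items) < p

acc_pre = resp[pre_cutoff].mean()
acc_post = resp[~pre_cutoff].mean()
print(f"pre-cutoff: {acc_pre:.2f}, post-cutoff: {acc_post:.2f}")
```

Since item difficulties are drawn from the same distribution in both halves, a large pre/post gap cannot be explained by \(\theta\) or \(\beta\) and points to the direct \(D \to Y\) path.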
7.2.3 Counterfactual Reasoning for DIF
Differential Item Functioning (Section 6.4.4) has a natural counterfactual interpretation. An item \(i\) exhibits DIF between groups \(g = 0\) and \(g = 1\) if:
\[ P(Y_{i,g=1} = 1 \mid \theta) \neq P(Y_{i,g=0} = 1 \mid \theta) \]
This is a counterfactual statement: “For a model with ability \(\theta\), would the response to item \(i\) have been different if the model belonged to the other group, holding ability fixed?” The “group” variable \(g\) (e.g., model architecture family) has a direct effect on the item response that is not mediated by \(\theta\).
Under the evaluation DAG, DIF corresponds to a path \(A \to Y\) that does not pass through \(\theta\)—an architectural advantage on specific items that reflects something other than the target construct. The Mantel-Haenszel procedure from Section 6.5.1 estimates this direct effect by stratifying on \(\theta\) (using total score as a proxy), which blocks the \(A \to \theta \to Y\) path and isolates the \(A \to Y\) direct effect.
A score difference \(Y_A - Y_B\) between two models reflects a genuine ability difference only if all paths from the data-generating process to \(Y\) that do not pass through \(\theta\) are blocked. Common unblocked paths in AI evaluation:
- Training data overlap (\(D \to Y\) bypassing \(\theta\)): memorization inflates scores
- Prompt formatting (\(F \to Y\)): format preferences affect scores independent of ability
- Evaluation order (position bias in LLM-as-judge): systematic bias from presentation order
The causal diagnostics in this section and the statistical diagnostics in Section 6.5 are complementary approaches to detecting these unblocked paths.
7.3 Distribution Shift and Transportability
In Section 6.3.4, we raised the question of whether benchmark results generalize beyond the specific conditions of the evaluation. This section provides the formal framework.
7.3.1 Setup
Let \(P^{(s)}(X, Y)\) denote the source (benchmark) joint distribution over items \(X\) and responses \(Y\), and \(P^{(t)}(X, Y)\) the target (deployment) distribution. The fundamental question is: when does a performance estimate computed under \(P^{(s)}\) remain valid under \(P^{(t)}\)?
The joint distribution factorizes as \(P(X, Y) = P(Y \mid X) P(X)\), giving rise to a taxonomy of shift types based on which component changes.
Covariate shift: \(P^{(s)}(X) \neq P^{(t)}(X)\) but \(P^{(s)}(Y \mid X) = P^{(t)}(Y \mid X)\). The item distribution changes but the model’s conditional behavior is stable. Example: a benchmark oversamples easy items; deployment sees a harder mix.
Label shift (prior probability shift): \(P^{(s)}(Y) \neq P^{(t)}(Y)\) but \(P^{(s)}(X \mid Y) = P^{(t)}(X \mid Y)\). The base rate of outcomes changes. Example: a benchmark has 50% correct base rate; deployment tasks have 10%.
Concept drift: \(P^{(s)}(Y \mid X) \neq P^{(t)}(Y \mid X)\). The relationship between items and responses changes. Example: a coding benchmark becomes outdated as programming conventions evolve, so the correct answers shift.
7.3.2 Covariate Shift in AI Evaluation
Covariate shift is the most common and most correctable form of shift. In AI evaluation, it arises whenever the benchmark item pool is not representative of the deployment item distribution.
Consider a benchmark designed to measure mathematical reasoning. If the item pool is drawn disproportionately from algebra (easy) rather than from combinatorics, number theory, and analysis (hard), then the benchmark’s item distribution \(P^{(s)}(X)\) is shifted toward easy items relative to the deployment distribution \(P^{(t)}(X)\). The model’s behavior on any given item—its conditional response \(P(Y \mid X)\)—is the same in both settings. But the overall accuracy computed under \(P^{(s)}\) overestimates deployment accuracy because easy items are overweighted.
This connects directly to the content validity discussion in Section 6.3.1: a benchmark with poor content validity (non-representative item sampling) is a benchmark with covariate shift relative to the construct domain.
7.3.3 Concept Drift in AI Evaluation
Concept drift is more dangerous because the model’s conditional behavior itself changes. In AI evaluation, concept drift occurs when:
- Model updates: A model is updated via RLHF, fine-tuning, or system prompt changes. The IRT parameters calibrated on the old model no longer apply.
- Prompt template changes: Switching from a zero-shot to a few-shot template changes \(P(Y \mid X)\) even for the same items.
- Temporal evolution: Programming languages evolve, scientific knowledge advances, and social norms shift. A benchmark’s “correct” answers may become outdated.
Under concept drift, no reweighting of the source distribution can recover the target performance. The model must be re-evaluated on data drawn from the target distribution, or the concept drift must be modeled explicitly.
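A small simulation makes this point concrete. Under assumed Gaussian item-difficulty distributions and a hypothetical drift in the item characteristic curve (a difficulty shift caused by, say, a template change), importance weighting corrects the item mix but still misses the target accuracy:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50_000

def icc(theta, beta, shift=0.0):
    # `shift` models concept drift: the same item is harder under the new template
    return 1 / (1 + np.exp(-(theta - beta - shift)))

theta = 0.5
beta_src = rng.normal(-1.0, 1.0, n)     # benchmark items (easier)
beta_tgt = rng.normal(0.0, 1.0, n)      # deployment items

y_src = rng.random(n) < icc(theta, beta_src)
acc_tgt_true = icc(theta, beta_tgt, shift=0.8).mean()   # drifted conditional

# Importance weights fix P(X) (N(-1,1) -> N(0,1)) but not the changed P(Y|X)
w = np.exp(-0.5 * beta_src**2) / np.exp(-0.5 * (beta_src + 1.0) ** 2)
acc_iw = np.sum(w * y_src) / np.sum(w)
print(f"reweighted source {acc_iw:.3f} vs true target {acc_tgt_true:.3f}")
```

The reweighted estimate converges to target-distribution accuracy under the *old* conditional, which is not the quantity of interest once \(P(Y \mid X)\) has changed.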
7.3.4 Transportability
When can we transport causal conclusions from one setting to another? Bareinboim and Pearl (2016) formalize this through selection diagrams: DAGs augmented with special “selection” nodes \(S\) that indicate which mechanisms differ between source and target.
The key result (stated informally): a causal effect is transportable from source to target if the differences between environments, as encoded by \(S\), can be “adjusted away” using the causal structure. Specifically, the target-domain causal effect is identifiable from the source data plus knowledge of which mechanisms differ.
For AI evaluation, this means: a benchmark result is transportable to a new deployment context if we can identify what changed between the benchmark setting and deployment, and the evaluation DAG provides a path to adjust for those changes. If the only change is the item distribution (covariate shift), importance weighting suffices. If the model’s behavior changes (concept drift), we need additional assumptions or new data from the target domain.
The three panels illustrate the shift taxonomy. Covariate shift (left): the benchmark item pool is concentrated at easy difficulties (blue), while deployment items span a broader range (red), but the item characteristic curve is the same in both settings. Label shift (center): the base rate of correct responses differs between benchmark (50%) and deployment (10%). Concept drift (right): the relationship between ability and response probability itself changes—the target ICC (dashed red) has different discrimination and difficulty than the source (blue), so no reweighting of items can correct the estimate.
7.4 Off-Policy Evaluation and Doubly Robust Estimation
The distribution shift framework tells us when benchmark results are biased. This section develops the tools to correct that bias, drawing on the off-policy evaluation literature from contextual bandits.
7.4.1 The Evaluation-as-Bandit Framing
We reframe AI evaluation as an off-policy estimation problem. The key mapping is:
| Bandit concept | AI evaluation translation |
|---|---|
| Context \(x\) | Evaluation item (content, format, difficulty) |
| Action \(a\) | Model’s response |
| Reward \(r\) | Correctness score (\(0\) or \(1\), or a continuous quality rating) |
| Logging policy \(\pi_0(a \mid x)\) | The benchmark design process—how items were selected and administered |
| Target policy \(\pi(a \mid x)\) | The model’s behavior under deployment conditions |
Given logged data \(\{(x_t, a_t, r_t)\}_{t=1}^n\) collected under logging policy \(\pi_0\), the off-policy evaluation problem is to estimate the value of a target policy \(\pi\):
\[ V(\pi) = \mathbb{E}_{x \sim P, a \sim \pi(\cdot \mid x)}[r(x, a)] \]
without deploying \(\pi\) to collect new data.
Why does this framing matter? Whenever items are selected non-uniformly—by adaptive testing (Section 4.2.2), by convenience sampling, or by any design process that makes some items more likely to appear than others—we are observing rewards under \(\pi_0\) but want to estimate performance under \(\pi\). Naive accuracy (the simple average of rewards) is biased because \(\pi_0 \neq \pi\).
7.4.2 Three Estimators
Three classical estimators address this problem, each with different bias-variance tradeoffs.
The Direct Method (DM). Build a reward model \(\hat{r}(x, a)\) and use it to predict rewards under the target policy:
\[ \hat{V}_{\text{DM}} = \frac{1}{n} \sum_{t=1}^{n} \hat{r}(x_t, \pi) \tag{7.2}\]
where \(\hat{r}(x_t, \pi) = \sum_a \pi(a \mid x_t) \hat{r}(x_t, a)\) integrates the reward model over the target policy’s action distribution. In the AIMS context, \(\hat{r}\) is precisely the IRT model from Chapter 2: \(\hat{r}(x, a) = P(Y = 1 \mid \theta, \beta_x)\). The direct method is what we have been doing throughout Chapters 1–3: using the fitted IRT model to predict performance.
The direct method is biased when the reward model is misspecified. If the IRT model is wrong—perhaps the true data-generating process has interactions the model does not capture—the predictions will be systematically off.
Inverse Propensity Weighting (IPW). Instead of modeling the reward, model the selection process and reweight observations:
\[ \hat{V}_{\text{IPW}} = \frac{1}{n} \sum_{t=1}^{n} \frac{\pi(a_t \mid x_t)}{\pi_0(a_t \mid x_t)} r_t \tag{7.3}\]
Each observation is weighted by the ratio of how likely the target policy was to take that action versus how likely the logging policy was. IPW is unbiased when the propensities are known, but can have very high variance when \(\pi_0\) and \(\pi\) differ substantially (the weights become large).
In computerized adaptive testing (Section 4.2.2), items are selected to maximize Fisher information at the current ability estimate. This means the item selection probability \(\pi_0(x_t \mid \hat{\theta}_t)\) is non-uniform: items near the model’s estimated ability are heavily oversampled, while very easy and very hard items are undersampled.
Naive accuracy on CAT-selected items is biased because the item pool is deliberately non-representative. IPW corrects this by upweighting items that were unlikely to be selected by the adaptive algorithm. The propensities \(\pi_0\) are known from the CAT algorithm’s selection rule: a greedy CAT selects the maximum-information item with probability 1, while a stochastic variant selects items with probability proportional to information. Note that the greedy rule assigns zero propensity to unselected items, violating the overlap condition IPW requires; a stochastic selection rule keeps all propensities positive.
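The correction can be sketched for a stochastic CAT variant. The selection rule, ability values, and item pool below are illustrative assumptions, and a single fixed (noisy) ability estimate stands in for the adaptive update:

```python
import numpy as np

rng = np.random.default_rng(8)

def icc(theta, beta):
    return 1 / (1 + np.exp(-(theta - beta)))   # Rasch item characteristic curve

theta_true = 0.5
beta = np.linspace(-3, 3, 61)                  # hypothetical item pool

# Stochastic CAT stand-in: select items with probability proportional to
# Fisher information at a fixed noisy ability estimate (no adaptive update)
theta_hat = theta_true + 0.3
p_hat = icc(theta_hat, beta)
info = p_hat * (1 - p_hat)
pi0 = info / info.sum()                        # known selection propensities

n = 20_000
items = rng.choice(len(beta), size=n, p=pi0)
y = rng.random(n) < icc(theta_true, beta[items])

pi_tgt = np.full(len(beta), 1 / len(beta))     # target: uniform over the pool
w = pi_tgt[items] / pi0[items]

acc_naive = y.mean()
acc_ipw = np.sum(w * y) / np.sum(w)            # self-normalized IPW
acc_true = icc(theta_true, beta).mean()        # true uniform-pool accuracy
print(f"naive {acc_naive:.3f} | IPW {acc_ipw:.3f} | true {acc_true:.3f}")
```

Naive accuracy is pulled toward 50% because the selector oversamples items near the ability estimate; IPW upweights the rarely selected easy and hard items and recovers the uniform-pool accuracy.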
The Doubly Robust Estimator (DR). The doubly robust estimator (Dudík, Langford, and Li 2011; Robins, Rotnitzky, and Zhao 1994) combines both approaches:
\[ \hat{V}_{\text{DR}} = \frac{1}{n} \sum_{t=1}^{n} \left[ \hat{r}(x_t, \pi) + \frac{\pi(a_t \mid x_t)}{\pi_0(a_t \mid x_t)} \bigl(r_t - \hat{r}(x_t, a_t)\bigr) \right] \tag{7.4}\]
The DR estimator augments the direct method prediction \(\hat{r}(x_t, \pi)\) with a propensity-weighted correction term \(\frac{\pi}{\pi_0}(r_t - \hat{r})\). It is doubly robust: consistent if either the reward model \(\hat{r}\) or the propensity model \(\pi_0\) is correctly specified (but not necessarily both).
- When \(\hat{r}\) is correct: the residual \(r_t - \hat{r}(x_t, a_t)\) has zero expectation, so the correction vanishes and \(\hat{V}_{\text{DR}} \approx \hat{V}_{\text{DM}}\).
- When \(\hat{r}\) is wrong but \(\pi_0\) is correct: the IPW term corrects the model’s bias, yielding an unbiased estimate.
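The three estimators can be compared in a toy bandit with known policies and a deliberately misspecified reward model (it omits the \(x \cdot a\) interaction); all numbers are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.integers(0, 2, n)                    # context
pi0 = np.where(x == 1, 0.9, 0.3)             # logging policy: P(a=1 | x)
pi = np.full(n, 0.5)                         # target policy: uniform
a = (rng.random(n) < pi0).astype(int)

def r_true(x, a):                            # true expected reward
    return 0.2 + 0.5 * a + 0.2 * x * a

r = (rng.random(n) < r_true(x, a)).astype(int)

def r_hat(x, a):                             # misspecified model: misses x*a term
    return 0.2 + 0.5 * a

# DM: average model prediction under the target policy (Eq. 7.2)
V_dm = np.mean(0.5 * r_hat(x, 0) + 0.5 * r_hat(x, 1))
# IPW: reweight logged rewards by pi/pi0 for the observed action (Eq. 7.3)
prop = np.where(a == 1, pi0, 1 - pi0)
w = np.where(a == 1, pi, 1 - pi) / prop
V_ipw = np.mean(w * r)
# DR: DM prediction plus propensity-weighted residual (Eq. 7.4)
V_dr = V_dm + np.mean(w * (r - r_hat(x, a)))

V_true = np.mean(0.5 * r_true(x, 0) + 0.5 * r_true(x, 1))
print(f"truth {V_true:.3f} | DM {V_dm:.3f} | IPW {V_ipw:.3f} | DR {V_dr:.3f}")
```

DM is off by exactly the omitted interaction term, while IPW and DR both land near the truth; with a correct reward model the DR correction term would instead shrink toward zero.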
Proof of double robustness. We verify that \(\mathbb{E}[\hat{V}_{\text{DR}}] = V(\pi)\) when either model is correct. Taking expectations:
\[ \mathbb{E}[\hat{V}_{\text{DR}}] = \mathbb{E}\!\left[\hat{r}(x, \pi)\right] + \mathbb{E}\!\left[\frac{\pi(a \mid x)}{\pi_0(a \mid x)}(r - \hat{r}(x, a))\right] \]
For the second term, conditioning on \(x\):
\[ \mathbb{E}\!\left[\frac{\pi(a \mid x)}{\pi_0(a \mid x)}(r - \hat{r}(x, a)) \;\middle|\; x\right] = \sum_a \pi_0(a \mid x) \frac{\pi(a \mid x)}{\pi_0(a \mid x)} \bigl(\mathbb{E}[r \mid x, a] - \hat{r}(x, a)\bigr) \]
\[ = \sum_a \pi(a \mid x) \bigl(r^*(x, a) - \hat{r}(x, a)\bigr) \]
where \(r^*(x, a) = \mathbb{E}[r \mid x, a]\) is the true reward function. If \(\hat{r} = r^*\), this term vanishes and \(\mathbb{E}[\hat{V}_{\text{DR}}] = \mathbb{E}[\hat{r}(x, \pi)] = V(\pi)\). If instead the propensities \(\pi_0\) used in the weights match the true logging probabilities (so the conditioning step above is exact), then since \(\mathbb{E}[\hat{r}(x, \pi)] = \mathbb{E}_x[\sum_a \pi(a \mid x) \hat{r}(x, a)]\), the \(\hat{r}\) terms cancel:
\[ \mathbb{E}[\hat{V}_{\text{DR}}] = \mathbb{E}\!\left[\hat{r}(x, \pi)\right] + \mathbb{E}_x\!\left[\sum_a \pi(a \mid x)(r^*(x,a) - \hat{r}(x,a))\right] = \mathbb{E}_x\!\left[\sum_a \pi(a \mid x) r^*(x, a)\right] = V(\pi) \]
7.4.3 Connection to Prediction-Powered Inference
Prediction-Powered Inference (PPI) (Angelopoulos et al. 2023) is a recently developed framework that is structurally equivalent to doubly robust estimation. PPI uses a small labeled dataset \(\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n\) (expensive: actual model evaluations) and a large unlabeled dataset \(\mathcal{D}_N = \{(X_i, \hat{Y}_i)\}_{i=1}^N\) (cheap: model predictions or automated scores) to estimate a population quantity \(\mu = \mathbb{E}[Y]\):
\[ \hat{\mu}_{\text{PPI}} = \frac{1}{N} \sum_{i \in \mathcal{D}_N} \hat{Y}_i + \frac{1}{n} \sum_{i \in \mathcal{D}_n} (Y_i - \hat{Y}_i) \tag{7.5}\]
The first term is the direct method estimate using the cheap predictions. The second term is the bias correction: it estimates the systematic error in \(\hat{Y}\) using the labeled subsample. This is exactly the DR structure: imputation + correction.
The connection to the cold-start pipeline in Section 2.6 is direct. The PPE (Prediction-Powered Evaluation) approach from Chapter 2 learns a mapping from model embeddings and metadata to IRT parameters, producing predictions \(\hat{Y}_{ij}\) for unseen model-item pairs. PPI provides the debiasing step: use a small set of actual evaluations to correct the PPE predictions, yielding valid confidence intervals for model performance. The PPE predictions serve as the imputation model \(\hat{r}\); the actual evaluations provide the labeled correction \(Y - \hat{Y}\).
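A minimal sketch of Equation 7.5, with an assumed systematic bias of \(+0.05\) in the cheap predictions, shows the labeled rectifier removing the bias of naive imputation:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 50_000, 500                 # cheap predictions vs. expensive labels

# Hypothetical setup: true scores Y and a systematically biased predictor
Y_all = rng.normal(loc=0.70, scale=0.1, size=N)
Yhat_all = Y_all + 0.05 + rng.normal(scale=0.05, size=N)   # +0.05 bias

# Simplification: label a random subsample of the same pool
labeled = rng.choice(N, size=n, replace=False)

mu_naive = Yhat_all.mean()                                 # biased imputation
rect = (Y_all[labeled] - Yhat_all[labeled]).mean()         # rectifier
mu_ppi = Yhat_all.mean() + rect                            # Eq. 7.5
print(f"naive {mu_naive:.3f} | PPI {mu_ppi:.3f} | truth ~0.700")
```

Only 500 expensive labels are needed to debias 50,000 cheap predictions; the rectifier's standard error, not the predictor's bias, drives the width of the resulting confidence interval.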
7.4.4 Importance Weighting for Covariate Shift
When the shift between benchmark and deployment is purely covariate shift (\(P^{(s)}(Y \mid X) = P^{(t)}(Y \mid X)\), \(P^{(s)}(X) \neq P^{(t)}(X)\)), the importance-weighted estimator corrects the bias (Shimodaira 2000):
\[ \hat{\mu}_{\text{IW}} = \frac{\sum_{i=1}^n w(x_i) Y_i}{\sum_{i=1}^n w(x_i)}, \quad w(x_i) = \frac{P^{(t)}(x_i)}{P^{(s)}(x_i)} \tag{7.6}\]
This is the self-normalized variant, which is more stable than the unnormalized version. The weights \(w(x_i)\) upweight items that are underrepresented in the benchmark relative to deployment, and downweight overrepresented items.
Practical challenges:
- Density ratio estimation: The weights require knowing or estimating the density ratio \(P^{(t)}/P^{(s)}\). Methods include logistic regression on a domain classifier, kernel mean matching, and moment matching (Sugiyama, Suzuki, and Kanamori 2012).
- Weight clipping: Extreme weights inflate variance. Clipping weights at a maximum value \(M\) (e.g., \(w_i \leftarrow \min(w_i, M)\)) trades bias for variance reduction.
- Effective sample size: The effective sample size \(n_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2\) measures how much information the reweighted sample contains. When \(n_{\text{eff}} \ll n\), the importance-weighted estimate is unreliable.
AI example. A reasoning benchmark oversamples algebra items (60% algebra, 20% geometry, 20% combinatorics). The deployment distribution is uniform across topics. Importance weighting reweights: algebra items get weight \(w = 1/3 \div 3/5 = 5/9\), while geometry and combinatorics items get weight \(w = 1/3 \div 1/5 = 5/3\). The corrected accuracy better reflects deployment performance.
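This worked example can be checked numerically. The per-topic accuracies below are assumptions for illustration; the weights are exactly the \(5/9\) and \(5/3\) computed above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30_000

p_src = np.array([0.6, 0.2, 0.2])          # benchmark mix: algebra, geom, comb
p_tgt = np.array([1/3, 1/3, 1/3])          # deployment mix: uniform
acc_by_topic = np.array([0.9, 0.6, 0.5])   # assumed per-topic accuracies

t = rng.choice(3, size=n, p=p_src)          # topic index of each benchmark item
y = rng.random(n) < acc_by_topic[t]

w = p_tgt[t] / p_src[t]                     # 5/9 for algebra, 5/3 for the others
acc_naive = y.mean()
acc_iw = np.sum(w * y) / np.sum(w)          # Eq. 7.6, self-normalized
n_eff = w.sum() ** 2 / np.sum(w ** 2)       # effective sample size
print(f"naive {acc_naive:.3f} | reweighted {acc_iw:.3f} | n_eff {n_eff:.0f}/{n}")
```

Naive accuracy is inflated by the oversampled (easy) algebra items; the reweighted estimate recovers the uniform-topic average, at the cost of a modest reduction in effective sample size.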
The simulation demonstrates the off-policy estimation problem under CAT-like adaptive item selection. The naive estimator (simple average of responses on adaptively-selected items) is biased because the CAT algorithm oversamples items near the model’s ability level. The Direct Method (IRT prediction with noisy ability estimate) has low bias when the model is approximately correct but is sensitive to misspecification. IPW corrects the selection bias but has higher variance due to extreme weights. The Doubly Robust estimator combines the strengths of both, achieving the lowest RMSE by using the IRT model for imputation and IPW for bias correction.
The left panel shows the source (benchmark) and target (deployment) item difficulty distributions. The benchmark oversamples easy items (blue, centered at \(\beta = -1\)), while deployment items span a broader range (red, centered at \(\beta = 0\)). The right panel shows the consequence: naive accuracy computed on benchmark items (blue) systematically overestimates deployment accuracy (red). Importance weighting (green) corrects the bias by upweighting hard items that are underrepresented in the benchmark, bringing the estimates close to the true deployment accuracy.
7.5 Conformal Prediction Under Distribution Shift
The off-policy estimators in Section 7.4 correct point estimates of model performance. But practitioners also need uncertainty quantification: not just “the model’s deployment accuracy is 0.78” but “the model’s deployment accuracy is in \([0.72, 0.84]\) with 90% confidence.” Conformal prediction provides this guarantee with minimal distributional assumptions.
Given a calibration set \(\{(X_i, Y_i)\}_{i=1}^n\) and a new input \(X_{n+1}\), a conformal prediction set \(C_\alpha(X_{n+1})\) satisfies:
\[ P(Y_{n+1} \in C_\alpha(X_{n+1})) \geq 1 - \alpha \]
under the assumption that the calibration and test data are exchangeable (informally: drawn from the same distribution in no particular order). No parametric assumptions on the data distribution are required.
7.5.1 Split Conformal Prediction
The split conformal algorithm (Vovk, Gammerman, and Shafer 2005) is the simplest conformal method:
1. Fit a model \(\hat{f}\) on a training set.
2. On a held-out calibration set \(\{(X_i, Y_i)\}_{i=1}^n\), compute nonconformity scores \(s_i = |Y_i - \hat{f}(X_i)|\).
3. Let \(\hat{q}\) be the \(\lceil (1-\alpha)(n+1) \rceil / n\) empirical quantile of \(\{s_1, \ldots, s_n\}\).
4. For a new input \(X_{n+1}\), predict \(C_\alpha(X_{n+1}) = [\hat{f}(X_{n+1}) - \hat{q}, \hat{f}(X_{n+1}) + \hat{q}]\).
The coverage guarantee \(P(Y_{n+1} \in C_\alpha) \geq 1 - \alpha\) holds for any model \(\hat{f}\), any data distribution, and any sample size—as long as the calibration and test data are exchangeable.
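A minimal implementation on simulated data, using a cubic polynomial as a stand-in for \(\hat{f}\):

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate(n):
    x = rng.uniform(-2, 2, n)
    y = np.sin(x) + rng.normal(scale=0.2, size=n)
    return x, y

# Step 1: fit a model on a training split
x_tr, y_tr = simulate(2000)
coef = np.polyfit(x_tr, y_tr, 3)

def f(x):
    return np.polyval(coef, x)

# Steps 2-3: calibration scores and the (1-alpha)(n+1)/n quantile
alpha = 0.1
x_cal, y_cal = simulate(1000)
s = np.abs(y_cal - f(x_cal))
n = len(s)
q = np.quantile(s, min(1.0, np.ceil((1 - alpha) * (n + 1)) / n))

# Step 4: check marginal coverage on exchangeable test data
x_te, y_te = simulate(10_000)
covered = np.abs(y_te - f(x_te)) <= q
print(f"empirical coverage: {covered.mean():.3f} (nominal {1 - alpha})")
```

Empirical coverage lands near the nominal 90% even though the cubic model is only an approximation of the true regression function; that is the distribution-free guarantee at work.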
7.5.2 When Exchangeability Fails
Under distribution shift, exchangeability fails: the calibration data is drawn from \(P^{(s)}\) while the test data comes from \(P^{(t)}\). This causes the conformal prediction set to lose its coverage guarantee. Specifically, if the target distribution concentrates on regions where the model is less accurate (harder items), the calibration-based quantile \(\hat{q}\) underestimates the nonconformity scores at test time, leading to undercoverage.
7.5.3 Weighted Conformal Prediction
Tibshirani et al. (2019) show that coverage can be restored under covariate shift by replacing the uniform quantile with an importance-weighted quantile. The key insight is the same as for importance weighting in Section 7.4.4: reweight the calibration scores to match the target distribution.
The algorithm modifies step 3: instead of the uniform quantile, compute the weighted quantile of \(\{s_1, \ldots, s_n, +\infty\}\) with weights \(\{w_1, \ldots, w_n, w_{n+1}\}\) where \(w_i = P^{(t)}(X_i) / P^{(s)}(X_i)\) and \(w_{n+1} = P^{(t)}(X_{n+1}) / P^{(s)}(X_{n+1})\). This yields:
\[ P^{(t)}(Y_{n+1} \in C_\alpha(X_{n+1})) \geq 1 - \alpha \]
The coverage guarantee now holds under the target distribution, provided the importance weights are correct. Barber et al. (2023) extend this beyond covariate shift to more general forms of distribution shift.
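A sketch of the weighted procedure under covariate shift, with assumed Gaussian source/target covariates and heteroscedastic scores. For simplicity the test-point weight \(w_{n+1}\) is dropped, which is a large-\(n\) approximation rather than the exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(7)

def scores(x):
    # Nonconformity grows with item difficulty x (assumed heteroscedastic model)
    scale = 0.2 + 0.3 * np.maximum(x + 2.0, 0.0)
    return np.abs(rng.normal(size=len(x))) * scale

mu_s, mu_t = -1.0, 0.5                      # source vs. target covariate means
x_cal = rng.normal(mu_s, 1.0, 4000)
x_te = rng.normal(mu_t, 1.0, 10_000)
s_cal, s_te = scores(x_cal), scores(x_te)

alpha = 0.1
q_naive = np.quantile(s_cal, 1 - alpha)     # ignores the shift

def weighted_quantile(s, w, level):
    order = np.argsort(s)
    cum = np.cumsum(w[order]) / np.sum(w)
    return s[order][np.searchsorted(cum, level)]

# Covariate-shift weights: ratio of Gaussian densities (equal variances)
w_cal = np.exp(-0.5 * (x_cal - mu_t) ** 2 + 0.5 * (x_cal - mu_s) ** 2)
q_w = weighted_quantile(s_cal, w_cal, 1 - alpha)

cov_naive = (s_te <= q_naive).mean()
cov_w = (s_te <= q_w).mean()
print(f"naive coverage {cov_naive:.3f} | weighted coverage {cov_w:.3f}")
```

The naive quantile, calibrated on easier items, undercovers on the harder target distribution; reweighting the calibration scores restores coverage near the nominal level, at the price of a wider interval and a smaller effective calibration sample.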
The left panel shows coverage calibration: the diagonal represents perfect calibration (empirical coverage = nominal coverage). Naive conformal prediction (red) falls below the diagonal—it promises 90% coverage but delivers less, because the calibration items (easy) are not representative of the test items (hard). Weighted conformal prediction (green) restores coverage by reweighting the calibration scores to match the test distribution. The right panel shows the source of the problem: the calibration distribution (blue) is shifted left (easier items) relative to the test distribution (red).
7.6 Putting It All Together: A Causal Audit
We close with a worked example that composes all tools from this chapter. Consider “CodeReason,” a hypothetical benchmark claiming to measure coding reasoning ability in language models.
Step 1: Draw the evaluation DAG (Section 7.1.3). We identify: Training Data \(D\) (code repositories, documentation), Architecture \(A\) (transformer variant), Latent Coding Ability \(\theta\), Benchmark Items \(\beta\) (coding problems), Prompt Template \(F\) (instruction format), and Observed Score \(Y\). We suspect two validity-threatening paths: (i) \(D \to Y\) bypassing \(\theta\) (some models trained on CodeReason’s source repository), and (ii) \(F \to Y\) (the benchmark uses a specific instruction format that favors certain models).
Step 2: Check for distribution shift (Section 7.3). The benchmark items are predominantly Python function-completion tasks at medium difficulty. The deployment context requires debugging, multi-file reasoning, and code review across Python, JavaScript, and Rust. This is covariate shift: the item distribution \(P^{(s)}(\beta)\) is narrower than \(P^{(t)}(\beta)\). We estimate density ratios using a domain classifier trained to distinguish benchmark items from deployment items.
Step 3: Apply doubly robust estimation (Section 7.4.2). We combine two components: (a) An IRT model fitted on the benchmark data provides \(\hat{r}(x, a)\)—the direct method estimate of each model’s accuracy on any item. (b) The density ratio weights from Step 2 provide the importance weights. The DR estimator corrects the IRT predictions using the importance-weighted residuals. The result: an accuracy estimate that accounts for both the item selection bias and any IRT misspecification.
Step 4: Construct prediction intervals (Section 7.5.3). Using the importance weights from Step 2, we apply weighted conformal prediction to the DR-corrected estimates. The result: “Model X’s deployment accuracy is in [0.62, 0.71] with 90% coverage.” The interval is wider than it would be under no shift, reflecting the additional uncertainty from extrapolating to a different item distribution.
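The reweighting at the heart of weighted conformal prediction is a weighted quantile of the calibration nonconformity scores. The sketch below isolates that mechanism on a toy problem (a milder shift than the figure's, with known closed-form weights; the score function and all constants are illustrative). It omits the extra test-point weight term of the full Tibshirani et al. procedure, so it demonstrates the idea rather than the exact guarantee:

```python
import numpy as np

rng = np.random.default_rng(2)

def weighted_quantile(scores, weights, q):
    """Smallest score whose normalized cumulative weight reaches q."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cum = np.cumsum(w) / np.sum(w)
    return s[np.searchsorted(cum, q)]

# Toy nonconformity scores that grow with item difficulty x.
def scores(x, rng):
    return np.abs(x) + rng.normal(0, 0.3, len(x))

x_cal = rng.normal(-0.5, 0.8, 2000)   # easy calibration items
x_tst = rng.normal(0.5, 1.0, 2000)    # harder test items
s_cal, s_tst = scores(x_cal, rng), scores(x_tst, rng)

# Density-ratio weights P_t(x)/P_s(x), known in closed form here.
def log_norm(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)
w_cal = np.exp(log_norm(x_cal, 0.5, 1.0) - log_norm(x_cal, -0.5, 0.8))

alpha = 0.10
q_naive = weighted_quantile(s_cal, np.ones_like(s_cal), 1 - alpha)
q_weighted = weighted_quantile(s_cal, w_cal, 1 - alpha)

print(f"naive coverage:    {np.mean(s_tst <= q_naive):.3f}")
print(f"weighted coverage: {np.mean(s_tst <= q_weighted):.3f}")
```

The unweighted quantile under-covers because easy calibration items produce small scores; reweighting toward the test distribution raises the threshold and restores coverage, at the cost of a smaller effective calibration sample.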
Step 5: Diagnose contamination (Section 7.2.2). We compare model performance on items created before vs. after the training data cutoff (chronological split). Models with suspected contamination show a 15-point accuracy gap between pre- and post-cutoff items. We run item-fit analysis (Section 6.5.3) and find that the pre-cutoff items have inflated outfit statistics for these models. Conclusion: the \(D \to Y\) path is active for these models. We flag them and report both contamination-adjusted and unadjusted rankings.
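The chronological-split diagnostic reduces to one statistic: the accuracy gap between pre- and post-cutoff items. A minimal simulation sketch (Rasch responses with a hypothetical memorization boost on pre-cutoff items; item counts and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy item pool: the first 1200 of 2000 items predate the training
# cutoff. A contaminated model gets a memorization boost on them.
n_items = 2000
pre_cutoff = np.arange(n_items) < 1200
beta = rng.normal(0.0, 1.0, n_items)

def simulate(theta, boost, rng):
    """Rasch responses, with an additive success-probability boost
    on pre-cutoff items (clipped to [0, 1])."""
    p = 1.0 / (1.0 + np.exp(beta - theta))
    p = np.clip(p + boost * pre_cutoff, 0, 1)
    return rng.binomial(1, p)

def cutoff_gap(responses):
    """Accuracy gap: pre-cutoff minus post-cutoff items."""
    return responses[pre_cutoff].mean() - responses[~pre_cutoff].mean()

clean = simulate(0.5, 0.0, rng)
contaminated = simulate(0.5, 0.15, rng)

print(f"clean gap:        {cutoff_gap(clean):+.3f}")
print(f"contaminated gap: {cutoff_gap(contaminated):+.3f}")
```

In practice the gap should be reported with a standard error (difference of two binomial proportions) and cross-checked against the item-fit statistics, since difficulty drift over time can also produce a nonzero gap.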
This five-step causal audit integrates the tools from this chapter with the diagnostic methods from Section 6.5, providing a systematic approach to evaluating whether benchmark results are trustworthy enough to inform deployment decisions.
7.7 Discussion Questions
A benchmark uses Computerized Adaptive Testing (Section 4.2.2) to select items. A critic argues that the resulting accuracy is “biased because models only answered items matched to their ability level.” Using the off-policy evaluation framework, explain whether this criticism is valid and how you would correct the estimate if needed.
Two benchmarks both claim to measure “reasoning.” On Benchmark A, Model X outperforms Model Y. On Benchmark B, the ranking reverses. Draw a DAG that explains this reversal (hint: consider Simpson’s paradox). What causal assumptions would make the Benchmark A ranking the correct one?
Prediction-powered inference assumes that the model predictions \(\hat{Y}\) are informative about the true labels \(Y\). When might this assumption fail in AI evaluation? What are the consequences for the DR estimator’s consistency, and how would you detect the failure?
A company evaluates its model on an internal benchmark (source) and wants to predict performance on a customer-facing deployment (target). What information is needed to determine whether the benchmark results are transportable? Which assumptions are untestable from the source data alone?
Conformal prediction guarantees marginal coverage: \(P(Y \in C_\alpha) \geq 1 - \alpha\) on average over the test distribution. Why is this weaker than conditional coverage (\(P(Y \in C_\alpha \mid X = x) \geq 1 - \alpha\) for each \(x\)), and when does the distinction matter for AI evaluation?
7.8 Bibliographic Notes
The mathematical framework for causal reasoning used in this chapter is due to Pearl (2009), who developed structural causal models, the do-calculus, and the graphical criteria for identifiability. An accessible introduction is Peters, Janzing, and Schölkopf (2017). The transportability theory—when causal conclusions can be transferred across settings—is developed by Bareinboim and Pearl (2016).
The taxonomy of distribution shift draws on a large literature. Covariate shift and importance weighting were formalized by Shimodaira (2000). Sugiyama, Suzuki, and Kanamori (2012) provide a comprehensive treatment of density ratio estimation methods. Quiñonero-Candela et al. (2009) survey the broader landscape of dataset shift in machine learning.
Doubly robust estimation originated in the biostatistics literature with Robins, Rotnitzky, and Zhao (1994), who introduced the augmented inverse probability weighted estimator. Dudík, Langford, and Li (2011) adapted the DR framework to contextual bandits and off-policy evaluation, which is the formulation we use here. The connection between DR estimation and semiparametric efficiency theory is developed in the biostatistics literature; for machine learning applications, see the survey by Dudík, Erhan, Langford, and Li (2014).
Prediction-Powered Inference was introduced by Angelopoulos et al. (2023), who showed how to construct valid confidence intervals by combining a small labeled dataset with a large set of model predictions. The structural connection between PPI and doubly robust estimation is noted in their paper and further developed in follow-up work on PPI++.
Conformal prediction was introduced by Vovk, Gammerman, and Shafer (2005). Tibshirani et al. (2019) extended conformal methods to handle covariate shift via importance weighting, and Barber et al. (2023) further generalized the framework beyond exchangeability.
The connection between Borsboom’s causal validity and structural causal models was developed in Chapter 2. The evaluation DAG in this chapter makes that connection explicit, showing how each validity threat from Section 6.4 corresponds to a specific causal pathway. Data attribution methods—influence functions, TracIn, and TRAK—provide complementary tools for tracing the causal effect of training data on model predictions, but are beyond the scope of this chapter; see Koh and Liang (2017) for the foundational work.
7.9 Exercises
Theoretical
Show that under covariate shift (\(P^{(s)}(Y \mid X) = P^{(t)}(Y \mid X)\), \(P^{(s)}(X) \neq P^{(t)}(X)\)), the importance-weighted estimator \(\hat{\mu}_{\text{IW}} = \frac{1}{n}\sum_{i=1}^n w(x_i) Y_i\) with \(w(x_i) = P^{(t)}(x_i)/P^{(s)}(x_i)\) is unbiased for \(\mathbb{E}_{P^{(t)}}[Y]\). Under what conditions on the weight distribution does the variance of \(\hat{\mu}_{\text{IW}}\) diverge?
Prove the double robustness property: show that \(\mathbb{E}[\hat{V}_{\text{DR}}] = V(\pi)\) (Equation 7.4) if either (a) the reward model satisfies \(\hat{r}(x, a) = \mathbb{E}[r \mid x, a]\) for all \((x, a)\), or (b) the propensity \(\pi_0(a \mid x)\) is correctly specified. Where does the proof rely on the assumption that the logged data is collected under \(\pi_0\)?
In a CAT procedure that selects the item maximizing Fisher information at the current ability estimate \(\hat{\theta}\), derive the selection propensity \(\pi_0(x_t \mid \hat{\theta}_t)\) for a Rasch model. Show that this propensity is maximized for items with \(\beta_i \approx \hat{\theta}_t\) and decays for items far from the current estimate. What is the effective sample size \(n_{\text{eff}}\) of the resulting IPW estimator as a function of the item pool’s difficulty distribution?
Using the evaluation DAG from Section 7.1.3, show that benchmark contamination (a direct path \(D \to Y\) bypassing \(\theta\)) is not identifiable from observational data alone without additional assumptions. What minimal intervention (e.g., a chronological split or a canary-based test) would identify the contamination effect? State the identifying assumptions precisely.
Computational
Implement importance weighting for a benchmark with covariate shift. Simulate a Rasch model with \(P^{(s)}(\beta) = \mathcal{N}(-1, 1)\) (easy-biased source) and \(P^{(t)}(\beta) = \mathcal{N}(0, 1.5)\) (broader target). For 200 models, compare naive and IW-corrected accuracy estimates. Vary the degree of shift (by changing the source mean from \(-2\) to \(0\)) and plot the naive bias as a function of shift magnitude.
Simulate a full CAT procedure for 200 models on a 500-item Rasch pool. For each model, run a 30-item adaptive test using the maximum-information selection rule from Section 4.2.2. Compute naive accuracy, IPW-corrected accuracy, and DR-corrected accuracy. Introduce IRT misspecification (e.g., the true model is 2PL but the reward model assumes Rasch) and show that DR remains robust while DM degrades.
Implement weighted conformal prediction for AI evaluation under covariate shift. Calibrate on items drawn from \(\mathcal{N}(-0.5, 0.8)\) and test on items from \(\mathcal{N}(1.0, 1.0)\). Plot empirical coverage vs. nominal level for both standard and weighted conformal across the range \(\alpha \in [0.05, 0.50]\). How does the effective sample size of the weighted calibration set affect the coverage guarantee?
Discussion
Angelopoulos et al. (2023)’s PPI uses cheap model predictions to augment expensive human labels. In AI evaluation, the “expensive” labels are actual model runs and the “cheap” predictions are the PPE cold-start predictions from Section 2.6. Design a practical PPI pipeline for evaluating a new model on 10,000 items when you can only afford to run 500 actual evaluations. What are the key design choices (which 500 items to evaluate? how to estimate the bias correction? how to construct confidence intervals)? How does the accuracy of the PPE predictions affect the width of the resulting intervals?