8  Information and Mechanism Design

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Explain why Goodhart’s Law is not merely a cautionary aphorism but a formal game-theoretic phenomenon, and classify its four variants in AI evaluation contexts.
  2. Formalize AI evaluation as a Stackelberg game between an evaluation designer and a model builder, and prove that deterministic benchmarks fail while randomized mechanisms achieve one-shot incentive alignment.
  3. Analyze the information-variance tradeoff in repeated evaluation: why reducing noise requires revealing information, and how this degrades alignment over time.
  4. Derive distribution correction as the primary mechanism for restoring alignment, and compute the optimal evaluation size \(k^*\) as a function of correction rate \(\rho\).
  5. Model metric design as a principal-agent problem and characterize when developers prefer to reveal, conceal, or garble evaluation information.
  6. Apply positional representation and positional proportionality criteria to select representative benchmark subsets using social choice theory.
  7. Synthesize design principles spanning strategic robustness, information control, and representativeness for AI benchmarks.

This chapter can be covered in 3 lectures (75-90 minutes each):

Lecture 1: The Evaluation Game

  • Goodhart’s Law and strategic manipulation (15 min)
  • Stackelberg benchmark game: setup and failure of deterministic mechanisms (20 min)
  • One-shot incentive alignment via randomization (25 min)
  • Hands-on: deterministic vs. randomized evaluation simulation (15 min)

Lecture 2: Information Leakage and Restoring Alignment

  • Repeated evaluation and the information-variance tradeoff (20 min)
  • Distribution correction and alignment recovery (25 min)
  • Optimal evaluation size and the holdout mechanism (20 min)
  • Hands-on: Pareto frontier and correction simulations (10 min)

Lecture 3: Metric Design and Representative Selection

  • Metric design as a principal-agent problem (25 min)
  • Information elicitation: reveal, conceal, garble (20 min)
  • Representative benchmark selection via social choice theory (20 min)
  • Design principles synthesis (10 min)

Notation

This chapter introduces game-theoretic notation: \(F\) (task universe), \(F_E/F_M\) (evaluator/builder task sets), \(\pi_E/\pi_M\) (sampling distributions), \(f(\theta)\) (task performance), \(u_E\) (evaluator utility), \(\Delta_t\) (misalignment), and \(\gamma\) (gaming penalty). See ?sec-notation for the complete notation reference.

8.1 When Measurement Becomes a Target

“When a measure becomes a target, it ceases to be a good measure.” — Charles Goodhart (1975)

In previous chapters, we treated evaluation as a statistical problem: how to estimate latent abilities from noisy observations (Chapter 2), how to quantify measurement precision (Chapter 5), and how to assess whether a benchmark measures what it claims to measure (Chapter 6, Chapter 7). Throughout, we implicitly assumed that the data-generating process is fixed — that the act of measurement does not change the phenomenon being measured.

This assumption breaks down once benchmarks become influential. When a developer’s reputation, funding, or regulatory standing depends on benchmark scores, the developer has every incentive to optimize specifically for the benchmark — not for the broader capability it claims to measure. The data-generating process is no longer fixed: it shifts in response to the measurement itself. This is Goodhart’s Law, and in AI evaluation it is not a metaphor but a demonstrable, recurring phenomenon.

Concrete examples. The Foundation Model Transparency Index (FMTI) saw scores jump dramatically after its first release, not because companies became more transparent, but because they learned which specific indicators were measured and optimized their disclosures accordingly. Chatbot Arena has faced concerns about strategic submission: developers can selectively deploy models optimized for the types of queries that appear on the platform. And benchmark contamination — where training data includes evaluation items — is rational behavior for a developer whose market value depends on leaderboard position (Section 6.4.3).

Manheim and Garrabrant (2018) identify four variants of the Goodhart effect, each with a distinct causal mechanism:

  1. Regressional Goodhart: The proxy \(\hat{u}\) and the true objective \(u\) are correlated but not identical. Optimizing \(\hat{u}\) overshoots because extreme values of \(\hat{u}\) tend to arise from noise, not from genuinely extreme \(u\). This is the measurement error story from Chapter 5: a benchmark with \(\alpha = 0.7\) captures 70% true variance and 30% noise, so selecting the top-scoring model partially selects for lucky noise.

  2. Extremal Goodhart: The relationship between \(\hat{u}\) and \(u\) that holds in the bulk of the distribution breaks down in the tails. A benchmark that reliably ranks typical models may fail catastrophically when applied to a model specifically engineered to maximize the benchmark score.

  3. Causal Goodhart: The proxy and the objective share a common cause. Optimizing the proxy can break this causal link. If “passes safety benchmark” and “is actually safe” are both caused by “was trained carefully,” a developer can find shortcuts to the benchmark that bypass the common cause.

  4. Adversarial Goodhart: An agent actively exploits the gap between proxy and objective. This is the regime that dominates in competitive AI evaluation, and the focus of this chapter.

The shift from Chapters 3–6 to this chapter is a shift in the source of the threat. Measurement error (Chapter 5) is stochastic. Validity threats (Section 6.4) are systematic but unintentional. Strategic manipulation is deliberate and adaptive: the DGP itself changes in response to the measurement. This connects to the literature on performative prediction (Perdomo et al. 2020), where the act of deploying a model changes the distribution it operates on, and to strategic classification (Hardt et al. 2016), where agents manipulate their features to achieve favorable outcomes.

The remainder of this chapter builds three layers of strategic analysis: (i) benchmark disclosure and information design — how much should evaluators reveal about the evaluation mechanism? (ii) metric selection and reporting granularity — which metrics should evaluators report, and at what level of detail? (iii) mechanism design for repeated evaluation — how can evaluators maintain alignment when the game repeats?

8.2 The Evaluation Game

We now formalize the strategic interaction between an evaluation designer and a model builder as a Stackelberg game. The designer moves first by committing to an evaluation mechanism; the builder responds by training a model. The key question: under what conditions can a benchmark mechanism incentivize builders to improve performance on the full task distribution, rather than merely optimizing for the specific tasks being measured?

8.2.1 Setup: Evaluator and Builder

Let \(\Theta\) denote the space of all possible models. For a given model \(\theta \in \Theta\), let \(f(\theta) \in [0, 1]\) denote the model’s performance on task \(f\). The universe of all possible tasks is the finite set \(F\) with \(|F| = N\).

Definition: Evaluation Designer’s Utility

The evaluation designer’s utility for a model \(\theta\) is the aggregate performance across all tasks: \[ u_E(\theta) = \sum_{f \in F} f(\theta) \] The designer acts as a social planner seeking models that perform broadly rather than narrowly.

Each party acquires tasks by sampling from \(F\): the designer draws tasks according to \(\pi_E\) over \(F\), forming the evaluation set \(F_E\), while the builder draws according to \(\pi_M\), forming the training set \(F_M\). In practice, \(\pi_M\) is approximately uniform (builders have broad access to training data), while \(\pi_E\) is typically biased toward tasks that are easy to construct, grade, or that reflect particular evaluation priorities.

Definition: Stackelberg Benchmark Game

The game proceeds in three stages, parameterized by a mechanism \((M, r)\) where \(M\) is a sampling function and \(r: \Theta \times \mathcal{P}(F) \to \mathbb{R}\) is a reward function:

  1. Designer’s move (ex-ante): The evaluation designer publishes the mechanism \((M, r)\). If \(M\) is randomized, the designer privately draws randomness \(\omega\) and computes \(S^* = M(F_E, \omega)\), withholding the realization. If \(M\) is deterministic, \(S^* \subseteq F_E\) is fixed.

  2. Builder’s move: The builder observes the mechanism \((M, r)\) but not the realized set \(S^*\), and selects a model \(\theta^* \in \Theta\).

  3. Evaluation stage (ex-post): The designer publishes the score \(r(\theta^*, S^*)\). The builder’s payoff is \(r(\theta^*, S^*)\); the designer’s payoff is \(u_E(\theta^*)\).

The misalignment between the builder’s incentive (performance on \(S^*\)) and the designer’s objective (performance on all of \(F\)) creates the Goodhart problem.

8.2.2 Failure of Deterministic Mechanisms

Consider first a deterministic mechanism where the designer publishes a fixed subset \(S^* \subseteq F_E\).

Proposition: Failure of Deterministic Mechanisms

If the designer publishes a deterministic set \(S^*\) with additive reward \(r(\theta, S) = \sum_{f \in S} f(\theta)\), the builder’s best response is: \[ \theta^* \in \arg\max_{\theta \in \Theta} \sum_{f \in S^*} f(\theta) \] This provides no incentive for performance on tasks outside \(S^*\).

Proof. The builder observes \(S^*\) directly and maximizes the known objective. Tasks in \(F \setminus S^*\) do not affect the score. \(\square\)

This is exactly Goodhart’s adversarial variant: the builder exploits the known structure of the evaluation to specialize. It is also the construct underrepresentation problem from Section 6.4.2 viewed through a strategic lens: a fixed benchmark systematically ignores capabilities outside \(S^*\), and a rational builder responds accordingly.

8.2.3 One-Shot Alignment via Randomization

While deterministic mechanisms fail, randomization offers a path forward. When the builder faces genuine uncertainty about which tasks will be evaluated, the strategic landscape changes fundamentally.

Theorem 1: One-Shot Incentive Alignment (Omniscient Builder)

Consider a randomized mechanism where (i) \(M\) draws a single task uniformly: \(s \sim \text{Unif}(F_E)\), (ii) the reward is \(r(\theta, \{s\}) = s(\theta)\), and (iii) the builder has a uniform prior \(\pi\) over which tasks comprise \(F_E\). Then the builder’s expected reward is proportional to the designer’s utility: \[ \mathbb{E}_{F_E \sim \pi}\!\left[\mathbb{E}_{s \sim \text{Unif}(F_E)}[s(\theta)]\right] = \frac{1}{|F|}\, u_E(\theta) \] and the builder’s best response maximizes \(u_E(\theta)\).

Proof sketch. Under a symmetric (uniform) prior over which tasks compose \(F_E\), the builder has no information distinguishing any task \(f_i\) from \(f_j\). By symmetry, the marginal probability that any specific task \(f\) is the sampled task is constant: \(P(s = f) = 1/|F|\). Therefore the builder’s expected reward is \((1/|F|) \sum_{f \in F} f(\theta) = (1/|F|) \, u_E(\theta)\), and maximizing expected reward is equivalent to maximizing \(u_E(\theta)\). \(\square\)
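The symmetry argument can be checked numerically. The sketch below (with toy performance values of our own choosing, not from the text) enumerates every equally likely size-3 evaluation set from a universe of 6 tasks and confirms that the builder's expected single-sample reward equals \(u_E(\theta)/|F|\):

```python
import itertools
import statistics

# Toy check of Theorem 1: under a symmetric prior over which tasks form F_E,
# the marginal distribution of the sampled task s is uniform over F, so the
# expected single-sample reward is u_E(theta) / |F|.
N, n_E = 6, 3                           # |F| = 6 tasks, evaluation sets of size 3
perf = [0.9, 0.1, 0.5, 0.3, 0.7, 0.4]   # illustrative f(theta) for one fixed model
u_E = sum(perf)

subsets = list(itertools.combinations(range(N), n_E))  # uniform prior over F_E
# E_{F_E}[ E_{s ~ Unif(F_E)}[ s(theta) ] ]
expected_reward = statistics.mean(
    statistics.mean(perf[i] for i in S) for S in subsets
)
print(expected_reward, u_E / N)   # both print the same value
```

By symmetry each task appears in the same number of subsets, so the two-stage draw is marginally uniform over \(F\), exactly as the proof sketch claims.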

The omniscient version assumes the builder knows all of \(F\). A more realistic version considers a builder with limited information:

Theorem 2: One-Shot Alignment (Limited Information)

Suppose the builder draws tasks i.i.d. from distribution \(p_M\) and believes the evaluator also draws from \(p_M\). Under a single-sample mechanism, the builder’s best response is: \[ \theta^* \in \arg\max_{\theta \in \Theta} \mathbb{E}_{f \sim p_M}[f(\theta)] \] That is, the builder optimizes over their own best approximation of the task universe.

Proof sketch. The builder’s expected reward, taken over their prior on \(F_E\) and the randomness in \(M\), equals \(\mathbb{E}_{f \sim p_M}[f(\theta)]\). Since \(F_M\) is drawn from \(p_M\), the builder’s best response is empirical risk minimization over \(F_M\), which is the best they can do given their information. \(\square\)

This result is powerful: the single-sample mechanism is “the best you can do” as an evaluation designer. The builder’s best response is to optimize against their own best approximation of \(F\), which in practice means training broadly over all available data. Under incentive alignment, the builder performs empirical risk minimization (ERM) over their task set \(F_M\). By classical uniform convergence (Vapnik 1998), restricting to any subset \(S \subset F_M\) can only increase regret; the builder has no incentive to “game” by focusing on a narrow task set.

8.2.4 Discussion: What Randomization Buys

The one-shot alignment result reveals a fundamental tension with the statistical efficiency goals of earlier chapters. In ?sec-fisher-information and Section 4.2.2, we showed that targeted evaluation — choosing items to maximize Fisher information about \(\theta\) — is statistically optimal. But targeted evaluation requires revealing information about which tasks the evaluator considers informative, which creates exploitable structure.

Randomized evaluation is not optimal for estimation precision but is optimal for incentive alignment. This tension between statistical efficiency and strategic robustness is a recurring theme:

  • Computerized adaptive testing (Section 4.2.2) reveals the evaluator’s information about ability through the item selection strategy, making the evaluator’s priorities transparent.
  • Bayesian persuasion (Kamenica and Gentzkow 2011) provides the information design framework: the evaluator is a sender choosing an information structure, and the builder is a receiver who updates and best-responds. The evaluator’s problem is to design a signal that induces the builder to take the action the evaluator prefers.

The resolution, as we develop in the next two sections, is that the evaluator must invest in correction to make the inevitable information leakage harmless.

Figure. Under a deterministic mechanism (left), the builder concentrates effort on the \(k=10\) known evaluation tasks and neglects the rest, achieving high benchmark scores but low true utility. Under a randomized mechanism (center), the builder spreads effort broadly. The right panel compares average true utility across all tasks: the randomized mechanism produces a substantially higher utility despite lower peak performance on any single task.

8.3 The Information-Variance Tradeoff

Theorem 1 established that randomized evaluation achieves incentive alignment in one-shot settings. But benchmarks operate as repeated games: weekly leaderboards, monthly submissions, quarterly benchmark releases. In repeated evaluation, randomization alone fails because information leaks.

8.3.1 Repeated Evaluation and Information Leakage

Definition: Repeated Evaluation Game

At each round \(t = 1, 2, \ldots\):

  1. The evaluator draws a fresh evaluation set \(F_E^{(t)}\) by sampling from \(F\) according to \(\pi_E\).
  2. The evaluator samples \(k\) tasks from \(F_E^{(t)}\) via mechanism \(M\) and publishes the score.
  3. The builder observes the \(k\) evaluated tasks, updates their estimate \(\hat{\pi}_{E,t}\) of the evaluation distribution, and selects model \(\theta_t\).

The variance of a \(k\)-task average reward scales as \(\text{Var}(\hat{r}_k(\theta)) \approx \sigma^2/k\). To reliably distinguish two models with performance gap \(\Delta\) at 95% confidence requires \(k \geq 4\sigma^2/\Delta^2\). As models improve and converge (\(\Delta \to 0\)), the required sample size grows quadratically. This creates inexorable market pressure to increase \(k\).
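The quadratic growth in the required sample size is easy to see concretely. The sketch below applies the \(k \geq 4\sigma^2/\Delta^2\) rule from the text; the per-task score standard deviation \(\sigma = 0.3\) is an illustrative assumption:

```python
import math

def required_k(sigma: float, delta: float) -> int:
    """Tasks needed to distinguish a gap `delta` at ~95% confidence,
    via the rule k >= 4 * sigma**2 / delta**2 from the text."""
    return math.ceil(4.0 * sigma**2 / delta**2)

sigma = 0.3   # assumed per-task score std, for illustration only
for delta in (0.10, 0.05, 0.02):
    print(delta, required_k(sigma, delta))
# Halving the detectable gap quadruples the required evaluation size.
```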

But increasing \(k\) to reduce variance simultaneously increases the rate at which the builder accumulates information about \(\pi_E\). The builder’s information set at time \(t\) is \(\mathcal{I}_t = \bigcup_{i=1}^{t-1} S_i\), where \(S_i\) is the set of tasks sampled in round \(i\). The leakage is \(L_t = |\mathcal{I}_t|/|F|\).

8.3.2 Posterior Concentration and Incentive Misalignment

As the builder observes sampled tasks across rounds, their estimate \(\hat{\pi}_{E,t}\) of the evaluation distribution concentrates around the true \(\pi_E\). By standard results in Bayesian nonparametrics, the posterior concentration rate is \(O(d_{\text{eff}} \log m / m)\) where \(m\) is the number of observations and \(d_{\text{eff}}\) is the effective dimension of the distribution class.

Proposition: Incentive Misalignment Under Distribution Learning

Let \(\hat{\pi}_{E,t}\) denote the builder’s posterior mean estimate of \(\pi_E\) at time \(t\). The builder’s optimal strategy is \(\theta^*_t = \arg\max_\theta \mathbb{E}_{f \sim \hat{\pi}_{E,t}}[f(\theta)]\). Then:

  1. Initial alignment: When the prior over \(\pi_E\) is diffuse, the builder optimizes broadly, approximating performance on the full universe \(F\).

  2. Posterior concentration leads to specialization: As \(\hat{\pi}_{E,t}\) concentrates around \(\pi_E\): \[ \lim_{t \to \infty} \theta^*_t = \arg\max_{\theta \in \Theta} \mathbb{E}_{f \sim \pi_E}[f(\theta)] \] If \(\pi_E\) is non-uniform, the builder specializes to high-density regions at the expense of broad capability.

This is the core tension: privatization works in one-shot settings, but repeated evaluation reveals information that enables strategic specialization. The builder cannot identify a fixed evaluation set (since \(F_E^{(t)}\) is resampled each round), but from repeated observations the builder learns \(\pi_E\) itself — the distribution from which evaluation tasks are drawn.
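A minimal simulation of this learning dynamic, assuming a toy non-uniform \(\pi_E\) over 20 tasks and a Dirichlet\((1,\ldots,1)\) prior for the builder (all values illustrative):

```python
import random

random.seed(0)

# Sketch of the builder's distribution learning: the evaluator samples tasks
# from a biased pi_E; the builder's Dirichlet posterior mean concentrates on it.
N = 20
weights = [2.0 if i < 5 else 0.5 for i in range(N)]   # bias toward tasks 0-4
total = sum(weights)
pi_E = [w / total for w in weights]

counts = [0] * N
tv_trace = []
for t in range(1, 501):                                # 500 rounds, one task per round
    task = random.choices(range(N), weights=pi_E)[0]
    counts[task] += 1
    post = [(counts[i] + 1) / (t + N) for i in range(N)]   # Dirichlet posterior mean
    tv = 0.5 * sum(abs(post[i] - pi_E[i]) for i in range(N))
    tv_trace.append(tv)

print(tv_trace[0], tv_trace[-1])   # TV distance to the true pi_E shrinks
```

Once the total-variation distance is small, the builder can profitably tilt \(\theta_t\) toward the high-density tasks, which is exactly the specialization the proposition describes.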

8.3.3 The Pareto Frontier

Define the residual misalignment at time \(t\) as: \[ \Delta_t = \mathbb{E}_{f \sim \pi_E^{(t)}}[f(\theta^*_t)] - \frac{1}{|F|} u_E(\theta^*_t) \]

This measures the gap between what the builder optimizes for (expected performance under \(\pi_E\)) and what society wants (aggregate performance over all of \(F\)). When \(\pi_E\) is uniform, \(\Delta_t = 0\).

No evaluation strategy can simultaneously achieve low variance and low leakage. Variance decreases in \(k\); leakage increases in \(k\). Varying \(k\) traces a Pareto frontier where the evaluator must choose their preferred operating point.

Figure. The information-variance Pareto frontier across \(k \in \{5, 10, 20, 50, 100\}\) over 200 rounds with \(|F| = 200\) tasks. Left: evaluation noise (score variance) is lower for larger \(k\). Center: the builder learns \(\pi_E\) faster with larger \(k\), as measured by KL divergence between the builder’s posterior and the true \(\pi_E\). Right: as the builder’s posterior concentrates, incentive misalignment grows. No single \(k\) achieves both low noise and low misalignment — a fundamental tradeoff.

8.4 Restoring Alignment

The Pareto frontier of the previous section seems like an impasse: any choice of \(k\) sacrifices either measurement precision or incentive alignment. The resolution comes from recognizing that the evaluator is also a learner. If the evaluator corrects their distribution over time, the information the builder accumulates becomes obsolete.

8.4.1 Distribution Correction

Assumption: Distribution Correction

The evaluator updates \(\pi_E^{(t)}\) via linear interpolation toward uniform at rate \(\rho \in (0, 1]\): \[ \pi_E^{(t)} = (1 - \rho)\,\pi_E^{(t-1)} + \rho \cdot \text{Uniform}(F) \] so that the deviation from uniform shrinks by a factor of \(1 - \rho\) each round and \(\text{KL}(\pi_E^{(t)} \| \text{Uniform}(F)) \to 0\) geometrically.

In practice, evaluators correct their biases through several channels: incident reports from users who discover model failures that benchmarks missed; systematic audits of coverage across capability dimensions (languages, modalities, reasoning types); improving benchmark acquisition capacity over time. The parameter \(\rho\) captures the fraction of bias removed per evaluation round.
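The decay of the divergence can be traced directly. The sketch below assumes an initial bias toward the first 10 of \(|F| = 100\) tasks (an illustrative choice) and applies the interpolation update from the assumption above:

```python
import math

# Tracing KL(pi_E^(t) || Uniform) under linear interpolation toward uniform.
N, rho = 100, 0.2
pi = [0.05 if i < 10 else 0.5 / 90.0 for i in range(N)]   # biased pi_E^(0), sums to 1
uniform = 1.0 / N

def kl_to_uniform(p):
    return sum(q * math.log(q / uniform) for q in p if q > 0)

kl_trace = [kl_to_uniform(pi)]
for _ in range(30):
    pi = [(1.0 - rho) * q + rho * uniform for q in pi]    # the correction update
    kl_trace.append(kl_to_uniform(pi))

print(kl_trace[0], kl_trace[-1])   # divergence decays geometrically toward 0
```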

Proposition: Alignment Recovery Under Distribution Correction

Under the repeated evaluation game with distribution correction, the builder’s optimal strategy converges to maximizing the designer’s utility: \[ \theta^*_t \to \arg\max_{\theta \in \Theta} \frac{1}{|F|} \sum_{f \in F} f(\theta) = \arg\max_{\theta \in \Theta} u_E(\theta) \] as \(\pi_E^{(t)} \to \text{Uniform}(F)\).

Proof sketch. The builder’s expected reward under mechanism \(M\) is \(\mathbb{E}_{f \sim \pi_E^{(t)}}[f(\theta)]\). As \(\pi_E^{(t)} \to \text{Uniform}(F)\), this converges to \((1/|F|) u_E(\theta)\). In the limit, the builder knows \(\pi_E = \text{Uniform}(F)\) perfectly, but this knowledge provides no advantage: optimizing for the uniform distribution is the designer’s objective. \(\square\)

The key insight is a race between two learners: the builder learns \(\pi_E^{(t)}\) from observations, while the evaluator corrects \(\pi_E^{(t)}\) toward uniform. The residual misalignment depends on the relative rates of these two processes.

Proposition: Misalignment Bound

Let \(D_0 = \text{KL}(\pi_E^{(0)} \| \text{Uniform}(F))\) and \(m_t = k \cdot \min(t, \rho^{-1})\) be the builder’s effective sample size. The residual misalignment satisfies: \[ \Delta_t \leq \min\!\left(\underbrace{\frac{m_t}{m_t + |F|}}_{\text{estimation-limited}},\;\; \underbrace{(1-\rho)^t \sqrt{\frac{D_0}{2}}}_{\text{correction-limited}}\right) \]

The two terms capture complementary regimes:

  • Estimation-limited (\(t\) small): \(m_t \approx kt\), giving \(\Delta_t \lesssim kt/|F|\). The builder has too few observations to identify the evaluator’s biases.
  • Correction-limited (\(t\) large): \(\Delta_t \lesssim (1-\rho)^t \sqrt{D_0/2}\). The builder may know \(\pi_E^{(t)}\) well, but the evaluator’s correction has driven the bias toward zero, leaving nothing to exploit.

Proof sketch. The correction-limited term follows from Pinsker’s inequality: \(\Delta_t \leq \text{TV}(\pi_E^{(t)}, \text{Uniform}(F)) \leq \sqrt{D_0/2}\,(1-\rho)^t\), since the total variation decays geometrically under linear interpolation. The estimation-limited term follows from Bayesian shrinkage: a builder with \(m_t\) observations and a Dirichlet\((1,\ldots,1)\) prior has posterior mean that is an \(m_t/(m_t + |F|)\)-fraction of the way from uniform to the empirical distribution. \(\square\)
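Evaluating the bound under illustrative parameters makes the crossover between regimes visible:

```python
import math

# Two regimes of the misalignment bound, under assumed parameters:
# |F| = 200 tasks, k = 20 per round, rho = 0.1, D0 = 0.5.
N, k, rho, D0 = 200, 20, 0.1, 0.5

def bound(t: int) -> float:
    m_t = k * min(t, 1.0 / rho)                    # builder's effective sample size
    est = m_t / (m_t + N)                          # estimation-limited term
    corr = (1.0 - rho) ** t * math.sqrt(D0 / 2.0)  # correction-limited term
    return min(est, corr)

for t in (1, 5, 20, 50):
    print(t, round(bound(t), 4))
# Early rounds are estimation-limited (few observations); late rounds are
# correction-limited (the bias has largely been corrected away).
```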

Figure. Alignment recovery under distribution correction with \(|F| = 200\) and \(k = 20\). Left: residual misalignment converges to zero for all \(\rho > 0\), with faster correction rates producing faster convergence. The \(\rho = 0\) baseline (gray) maintains persistent misalignment. Right: KL divergence decays exponentially on a log scale, with rate proportional to \(\rho\).

8.4.2 Optimal Evaluation Size

With distribution correction, the Pareto frontier from Section 8.3.3 collapses. Each revealed task provides only transient advantage to the builder, because the bias it reveals gets corrected within \(O(1/\rho)\) rounds. The evaluator’s per-round loss decomposes as: \[ \mathcal{L}(k) = \underbrace{\frac{\sigma^2}{k}}_{\text{evaluation variance}} + \underbrace{\frac{\gamma k}{\rho}}_{\text{steady-state exploitation}} \]

The first term decreases in \(k\) (more tasks, less noise); the second increases in \(k\) (more tasks, faster builder learning) and decreases in \(\rho\) (faster correction, shorter exploitation window).

Proposition: Optimal Sample Size

Given correction rate \(\rho > 0\) and gaming penalty \(\gamma > 0\), the optimal number of evaluation tasks per round is: \[ k^* = \sigma \sqrt{\frac{\rho}{\gamma}} \] with minimum loss \(\mathcal{L}(k^*) = 2\sigma\sqrt{\gamma/\rho}\).

Proof. Setting \(\partial \mathcal{L}/\partial k = -\sigma^2/k^2 + \gamma/\rho = 0\) gives \(k^2 = \sigma^2 \rho / \gamma\), hence \(k^* = \sigma\sqrt{\rho/\gamma}\). The second-order condition \(\partial^2 \mathcal{L}/\partial k^2 = 2\sigma^2/k^3 > 0\) confirms this is a minimum. \(\square\)
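The closed form can be checked against a brute-force grid search; the parameter values below (\(\sigma = 5\), \(\gamma = 0.05\), \(\rho = 0.5\)) are illustrative:

```python
import math

# Verifying k* = sigma * sqrt(rho / gamma) and L(k*) = 2 * sigma * sqrt(gamma / rho).
sigma, gamma, rho = 5.0, 0.05, 0.5

def loss(k: float) -> float:
    return sigma**2 / k + gamma * k / rho          # variance + exploitation terms

k_star = sigma * math.sqrt(rho / gamma)            # closed-form optimum
min_loss = 2.0 * sigma * math.sqrt(gamma / rho)    # claimed minimum loss

# Grid search over k in [0.1, 100] confirms the analytic optimum
k_grid = min((k / 10.0 for k in range(1, 1001)), key=loss)
print(k_star, loss(k_star), k_grid)
```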

Several key implications follow:

  • \(k^*\) increases with \(\sqrt{\rho}\): Faster correction allows larger evaluation sets. An evaluator who corrects twice as fast can evaluate \(\sqrt{2} \approx 1.4\) times as many tasks.
  • \(k^*\) increases with \(\sigma\): Higher task variance demands more samples for a reliable signal, just as in the static case.
  • Limiting behavior: As \(\rho \to 0\) (no correction), \(k^* \to 0\) — the evaluator should reveal as few tasks as possible. As correction becomes near-instant (\(\rho \to 1\)), \(k^*\) approaches its maximum of \(\sigma/\sqrt{\gamma}\): leaked information becomes outdated almost immediately, so large evaluation sets are affordable.
  • The key lever is \(\rho\): Since \(\mathcal{L}(k^*) = 2\sigma\sqrt{\gamma/\rho}\), the evaluator’s optimal loss decreases as \(1/\sqrt{\rho}\). Investing in distribution correction (coverage audits, benchmark changelogs, incident-driven task additions) is the most effective lever for benchmark design.

Figure. Evaluator loss \(\mathcal{L}(k) = \sigma^2/k + \gamma k / \rho\) for different correction rates, with stars marking the optimal \(k^*\). Faster correction (larger \(\rho\)) shifts \(k^*\) rightward and lowers the minimum loss. The static case (\(\rho \to 0\), dashed) is monotonically decreasing, reflecting the irresolvable tradeoff without correction. Parameters: \(\sigma = 5\), \(\gamma = 0.05\).

8.4.3 Noise-Gated Holdout

Distribution correction makes leaked information obsolete by moving the target. A complementary mechanism prevents leakage in the first place by gating the information flow.

Definition: Holdout Evaluation Mechanism

The evaluator publishes a reference set \(S_0 \subset F\) of size \(n_0\), sampled from \(\pi_E\). At each round \(t\):

  1. Draw a holdout set \(S_t\) of size \(k\) from \(\pi_E^{(t)}\) and noise \(\xi_t \sim \text{Laplace}(0, 1/\varepsilon)\).
  2. Compute the reference score \(r_T(\theta_t) = \frac{1}{n_0}\sum_{f \in S_0} f(\theta_t)\) and the holdout score \(r_H(\theta_t) = \frac{1}{k}\sum_{f \in S_t} f(\theta_t)\).
  3. Threshold test: If \(|r_H(\theta_t) + \xi_t - r_T(\theta_t)| < \eta\), publish \(r_T(\theta_t)\) (“silent” round). Otherwise, publish \(r_H(\theta_t) + \xi_t\) (“alarm” round).

The mechanism is self-correcting: it reveals less information precisely when the builder is behaving well (the gaming-induced score discrepancy \(\Delta_{\text{game}} \approx 0\), so the alarm fires only due to noise, with probability \(p_{\text{alarm}} = e^{-\varepsilon\eta}\)), and reveals more when the builder is gaming (the score discrepancy reliably triggers the alarm). The effective leakage per round for a non-gaming builder is: \[ \lambda_{\text{out}}^{\text{holdout}} \approx e^{-\varepsilon\eta} \cdot k \]

Setting \(\varepsilon\eta = 3\) yields a \(\sim 20\times\) slowdown in the builder’s learning rate, buying the evaluator substantially more time for distribution correction.
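The alarm rate for a well-behaved builder can be verified by Monte Carlo. This sketch assumes the score discrepancy is exactly zero, so only the Laplace noise (drawn via the standard inverse-CDF construction) can trip the threshold:

```python
import math
import random

random.seed(1)

# Monte Carlo check of the silent/alarm gate for a non-gaming builder:
# the alarm fires only when |Laplace(0, 1/eps)| >= eta, w.p. e^{-eps*eta}.
eps, eta, trials = 1.0, 3.0, 200_000
alarms = 0
for _ in range(trials):
    u = random.random() - 0.5
    xi = -math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)) / eps  # Laplace draw
    if abs(xi) >= eta:
        alarms += 1

p_hat = alarms / trials
print(p_hat, math.exp(-eps * eta))   # empirical alarm rate vs. e^{-3} ≈ 0.0498
```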

Why Noise Alone Cannot Replace Distribution Correction

The holdout mechanism slows the builder’s learning but does not change their incentives conditional on what they have learned. Regardless of how much Laplace noise is added, the builder’s optimal strategy given their posterior \(\hat{\pi}_{E,t}\) remains \(\theta_t^* = \arg\max_\theta \mathbb{E}_{f \sim \hat{\pi}_{E,t}}[f(\theta)]\). If \(\pi_E\) is biased and fixed, the builder will eventually learn this bias and specialize accordingly. Only distribution correction, which drives \(\pi_E^{(t)} \to \text{Uniform}(F)\), ensures that the builder’s eventual knowledge provides no exploitable advantage. This is the key difference from pure differential privacy: DP slows learning; distribution correction eliminates the incentive to exploit what is learned.

This connects to the differential privacy literature (Dwork et al. 2015; Dwork and Roth 2014). The holdout mechanism inherits DP guarantees from the Sparse Vector Algorithm, bounding the max-information between the builder’s observations and the holdout set. However, DP addresses a different threat: it prevents reconstruction of which specific tasks are in the holdout (overfitting to the test set), while our concern is that the builder learns the evaluation distribution and specializes accordingly.

8.5 Metric Design as Principal-Agent Problem

We now shift from “which tasks to show” to “which metrics to report.” Even with a perfectly designed evaluation mechanism, the choice of metric can create perverse incentives.

8.5.1 When Metrics Create Perverse Incentives

A vivid illustration comes from healthcare. In 2001, the New York State Department of Health began publishing hospital mortality rates for cardiac surgery. Dranove et al. (2003) documented the consequences: hospitals began avoiding severely ill patients rather than improving care. By publishing average treated outcome (ATO) as the quality metric, the system rewarded patient selection rather than treatment effectiveness.

The AI evaluation analogy is direct. A leaderboard that rewards average benchmark score incentivizes developers to specialize on easy benchmarks or to select favorable evaluation conditions, rather than improving broadly. The metric itself creates the misalignment.

Formalizing this as a principal-agent problem (Laffont and Tirole 1986): the principal (evaluator) chooses a reward function \(w\), and the agent (developer) best-responds with policy \(\pi^w\). The principal’s regret is \(R(\pi^w) = \max_{\tilde{\pi} \in \Pi} V(\tilde{\pi}) - V(\pi^w)\), where \(V(\pi) = \mathbb{E}[Y(\pi) - Y(0)]\) is the total treatment effect.

Proposition: ATO Has Unbounded Regret

The average treated outcome \(w_{\text{ATO}}(x, t, y) = y \cdot \mathbf{1}[t = 1]\) can have unbounded regret. An agent maximizing \(\mathbb{E}[w_{\text{ATO}}]\) may achieve \(R(\pi^{w_{\text{ATO}}}) = \max_\pi V(\pi)\) — the worst possible policy.

The intuition is stark: a hospital maximizing average treated outcome selects only the healthiest patients for treatment, achieving excellent ATO but zero (or negative) total treatment effect. In the AI setting, a developer maximizing average benchmark accuracy across their selected tasks may simply drop the hardest benchmarks from their evaluation suite.

Proposition: Total Treatment Effect Achieves Zero Regret

The total treatment effect metric \(w_{\text{TT}}(x, t, y) = y - \hat{\mu}_0(x)\) achieves zero regret when the principal has unbiased counterfactual estimates \(\hat{\mu}_0(x) = \mathbb{E}[Y(0) \mid X = x]\).

The key insight: by subtracting the counterfactual baseline (what would have happened without treatment), the metric removes the incentive for patient selection. In AI evaluation, this translates to scoring the “capability uplift” — performance relative to a baseline — rather than raw performance. This connects to the doubly robust estimation from Chapter 7: counterfactual metrics require causal reasoning about what performance would have been absent the developer’s effort.

8.5.2 Information Asymmetry and Metric Elicitation

Beyond metric choice, there is a prior question: how much should the developer reveal about their model’s capabilities? Consider a developer (agent) with a private cost-correlated variable \(X\) — for instance, knowledge about which capability dimensions their model excels at. The evaluator (principal) can design contracts conditioned on \(X\) if it is revealed.

Drawing on the information elicitation framework of Wang et al. (2024), the key question is: when does the developer prefer to reveal, conceal, or garble information about their capabilities?

  • Reveal: The developer discloses \(X\) fully. The principal designs a targeted contract, which benefits both parties when the information enables efficient allocation.
  • Conceal: The developer hides \(X\). The principal must use a one-size-fits-all contract, which is robust but potentially inefficient.
  • Garble: The developer discloses a noisy version of \(X\). This intermediate option can be Pareto-improving.

Proposition: Principal Always Benefits from Revelation

The principal’s expected utility is weakly higher under revelation than concealment: \(V_P^{\text{reveal}} \geq V_P^{\text{conceal}}\).

Proof sketch. Under revelation, the principal’s optimization problem has a strictly larger feasible set (contracts conditioned on \(X\)), so the optimum can only improve. \(\square\)

But the developer’s incentives are more nuanced. Revelation helps when conditioning on \(X\) sufficiently differentiates high-cost and low-cost types, allowing the principal to offer more efficient contracts. Concealment is preferred when revelation would lead the principal to extract all of the developer’s surplus through perfectly targeted contracts.

8.5.3 Garbling as Differential Privacy

The most interesting case is garbling: the developer reveals \(Y = X\) with probability \(\varepsilon\) and \(Y = \xi\) (noise) with probability \(1 - \varepsilon\). This is precisely the randomized response mechanism from differential privacy.
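A minimal sketch of this channel (the function names and parameter values are my own, not from the cited framework): reveal the true value with probability \(\varepsilon\), otherwise emit independent noise.

```python
import random

def garble(x, eps, noise_draw, rng):
    """Randomized response: reveal the true value x with probability eps,
    otherwise emit an independent draw from the noise distribution."""
    return x if rng.random() < eps else noise_draw(rng)

# Binary capability signal X = 1; the noise is a fair coin.
rng = random.Random(0)
reports = [garble(1, eps=0.3, noise_draw=lambda r: r.randint(0, 1), rng=rng)
           for _ in range(10_000)]

# P(report = 1) = eps + (1 - eps)/2 = 0.65 when X = 1, so the principal
# can de-bias the aggregate while any single report stays deniable.
print(sum(reports) / len(reports))
```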

Proposition: Garbling Can Pareto-Dominate Both Extremes

Under fairly wide conditions (e.g., two exponential cost types with different means), the agent may prefer garbled disclosure to both full concealment and full revelation. Moreover, garbling can increase total welfare compared to concealment.

The intuition is that garbling provides the principal with just enough information to design better contracts without enabling full surplus extraction. In the AI evaluation context, this maps to:

  • Model cards with calibrated noise: Developers disclose approximate capability profiles rather than exact performance vectors.
  • Differential privacy in metric reporting: Adding Laplace noise to reported metrics creates a garbling mechanism.
  • Benchmark transparency tiers: Publish aggregate scores publicly, but release fine-grained breakdowns only through a privacy-preserving mechanism.

This connects directly to the holdout mechanism of Section 8.4.3: both use calibrated noise to create better incentive equilibria. The parallel is deep — the evaluator’s decision about how much to reveal about the evaluation distribution and the developer’s decision about how much to reveal about model capabilities are dual problems in information design.

Figure. Left: heatmap of the agent’s net benefit from revelation vs. concealment across pairs of exponential cost types \((\lambda_0, \lambda_1)\). Red regions indicate the agent prefers revelation; blue regions favor concealment. The black contour marks indifference. Right: agent utility under garbling (noisy disclosure with probability \(\varepsilon\)) for fixed cost types. An intermediate garbling level can outperform both full concealment and full revelation.

8.6 Representative Benchmark Selection

A distinct but related design question: given a large suite of evaluation metrics, how should one select a representative subset? This is the “lite benchmark” problem faced by BIG-bench (200+ metrics \(\to\) 24 in BIG-bench Lite), HELM (\(\to\) HELM Lite), and Cal Hospital Compare (hundreds of quality measures \(\to\) 12).

8.6.1 The Subset Selection Problem

Let there be \(n\) metrics and \(m\) alternatives (models). Each metric \(i\) produces a ranking \(\sigma_i\) over the alternatives. We seek a subset \(K \subseteq N = [n]\) of metrics that is “representative” of the full set \(N\). But what does “representative” mean formally? The work of Procaccia et al. (2025) provides two precise definitions grounded in social choice theory.

8.6.2 Positional Representation

The first notion prevents under-representation at every rank cutoff.

Definition: Positional Representation

A subset \(K\) satisfies positional representation for group size \(g\) if for every rank cutoff \(r \in [m]\) and every alternative \(a\): \[ C(K, r, a) \geq \left\lfloor \frac{C(N, r, a)}{g} \right\rfloor \] where \(C(S, r, a) = |\{i \in S : \sigma_i(a) \leq r\}|\) counts how many metrics in \(S\) rank alternative \(a\) in the top \(r\).

In words: if alternative \(a\) is ranked in the top \(r\) by at least \(\ell \cdot g\) metrics in \(N\), then \(a\) must be ranked in the top \(r\) by at least \(\ell\) metrics in \(K\). The parameter \(g\) controls the granularity of representation — smaller \(g\) requires finer representation but demands larger \(|K|\).
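The definition is easy to check directly. This sketch (toy data and helper names of my own choosing) computes the counts \(C(S, r, a)\) and verifies the floor inequality at every alternative and cutoff:

```python
def C(rankings, S, r, a):
    """Number of metrics in S ranking alternative a in the top r;
    rankings[i][a] is the rank sigma_i(a) (1-indexed)."""
    return sum(1 for i in S if rankings[i][a] <= r)

def satisfies_positional_representation(rankings, K, g, m):
    """Check C(K, r, a) >= floor(C(N, r, a) / g) for all a and r in [m]."""
    N = range(len(rankings))
    return all(
        C(rankings, K, r, a) >= C(rankings, N, r, a) // g
        for a in rankings[0] for r in range(1, m + 1)
    )

# Toy profile: 4 metrics ranking 3 models.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"B": 1, "A": 2, "C": 3},
    {"C": 1, "B": 2, "A": 3},
]
print(satisfies_positional_representation(rankings, K=[0, 2, 3], g=2, m=3))  # True
print(satisfies_positional_representation(rankings, K=[3], g=2, m=3))        # False
```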

Theorem: Positional Representation Bounds

The minimum subset size needed to guarantee positional representation satisfies: \[ \Omega\!\left(\frac{n}{g} \cdot \frac{\log m}{\log(n \log m / g)}\right) \leq |K| \leq O\!\left(\frac{n}{g} \log m\right) \] The upper bound is achieved by a polynomial-time greedy algorithm (Algorithm 1 in Procaccia et al. (2025)) based on set cover.

The greedy algorithm works as follows: iterate through the preference profile row by row, coloring entries when an alternative accumulates \(g\) appearances. Then greedily select metrics that cover the most remaining colors. The connection to set cover yields the logarithmic factor.
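A simplified rendering of the greedy idea (a sketch, not the exact Algorithm 1 of Procaccia et al. (2025)): precompute the demand \(\lfloor C(N, r, a)/g \rfloor\) for each (alternative, cutoff) pair, then repeatedly add the metric that covers the most unmet demands:

```python
def greedy_positional_representation(rankings, g, m):
    """Greedy set-cover sketch: meet the demand floor(C(N, r, a) / g)
    for every (alternative, cutoff) pair by repeatedly adding the
    metric that covers the most still-unmet demands."""
    n = len(rankings)
    demands = {}
    for a in rankings[0]:
        for r in range(1, m + 1):
            need = sum(1 for i in range(n) if rankings[i][a] <= r) // g
            if need:
                demands[(a, r)] = need
    K, covered = [], {d: 0 for d in demands}

    def gain(i):
        return sum(1 for (a, r) in demands
                   if covered[(a, r)] < demands[(a, r)] and rankings[i][a] <= r)

    while any(covered[d] < demands[d] for d in demands):
        best = max((i for i in range(n) if i not in K), key=gain)
        K.append(best)
        for (a, r) in demands:
            if rankings[best][a] <= r and covered[(a, r)] < demands[(a, r)]:
                covered[(a, r)] += 1
    return K

rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"B": 1, "A": 2, "C": 3},
    {"C": 1, "B": 2, "A": 3},
]
K = greedy_positional_representation(rankings, g=2, m=3)
print(K)  # a subset meeting every demand on this toy profile
```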

8.6.3 Positional Proportionality

Positional representation prevents under-representation but not over-representation. A stronger notion prevents both:

Definition: Positional Proportionality

A subset \(K\) satisfies \(\epsilon\)-positional proportionality if for every alternative \(a\) and every rank cutoff \(r\): \[ \left| \frac{C(N, r, a)}{|N|} - \frac{C(K, r, a)}{|K|} \right| \leq \epsilon \]

This preserves the fraction of metrics ranking each alternative at each position, up to additive error \(\epsilon\).

Theorem: Positional Proportionality Bounds

The minimum subset size for \(\epsilon\)-positional proportionality satisfies: \[ \Omega\!\left(\frac{1}{\epsilon^2} \log m\right) \leq |K| \leq O\!\left(\frac{1}{\epsilon^2} \log m\right) \] These bounds are tight up to constant factors.

The tight bounds follow from a connection to uniform convergence: a random subset of size \(O(\epsilon^{-2} \log m)\) satisfies positional proportionality with high probability, by a Chernoff-type argument applied simultaneously to all \(O(m)\) alternatives and rank cutoffs.
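The uniform-convergence argument is easy to check empirically. In this sketch (synthetic rankings; the sizes are arbitrary choices), a uniformly random subset of 50 out of 200 metrics already keeps the worst-case positional deviation modest:

```python
import random

def prop_error(rankings, K, m):
    """Worst case of | C(N, r, a)/|N| - C(K, r, a)/|K| | over all a and r."""
    N = list(range(len(rankings)))

    def frac(S, r, a):
        return sum(1 for i in S if rankings[i][a] <= r) / len(S)

    return max(abs(frac(N, r, a) - frac(K, r, a))
               for a in rankings[0] for r in range(1, m + 1))

# Synthetic profile: 200 metrics, each a uniformly random ranking of 20 models.
rng = random.Random(0)
m = 20
models = [f"M{j}" for j in range(m)]
rankings = []
for _ in range(200):
    perm = models[:]
    rng.shuffle(perm)
    rankings.append({a: perm.index(a) + 1 for a in models})

K = rng.sample(range(200), 50)
print(prop_error(rankings, K, m))  # typically small for |K| = 50
```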

A particularly useful consequence connects proportionality to scoring rules:

Theorem: Scoring Rule Approximation

If \(K\) satisfies \(\epsilon\)-positional proportionality, then for any scoring rule with score vector \(s\) and every alternative \(a\): \[ |f_s(a, \sigma_N) - f_s(a, \sigma_K)| \leq \epsilon \] where \(f_s(a, \sigma_S) = \frac{1}{|S|} \sum_{i \in S} s_{\sigma_i(a)}\) is the average score of alternative \(a\) under scoring rule \(s\) in metric set \(S\).

This is the key practical guarantee: a subset satisfying positional proportionality approximates any scoring rule on the original set of metrics. The connection to Section 6.5.2 is direct: positional proportionality provides a formal criterion for when a “lite” benchmark preserves the same information as the full suite, complementing the dimensionality analysis of factor models.

Figure. Left: subset size \(|K|\) required by the greedy algorithm for positional representation at various group sizes \(g\), compared to random selection and the theoretical \(O(n/g \cdot \log m)\) bound. The greedy algorithm consistently outperforms random selection. Right: for \(g=5\), comparison of the fraction of metrics ranking each model in the top 5, between the full set (\(n=50\)) and the greedy-selected subset. The subset closely preserves the ranking structure.

8.7 Synthesis: Design Principles for Strategic Benchmarks

The three threads of this chapter — information design, metric design, and representative selection — converge on six actionable design principles for AI evaluation.

Table 8.1: Six design principles for strategic AI evaluation
  1. Randomize and Refresh. Formal basis: Theorem 1 (one-shot alignment), Prop. (deterministic failure). Recommendation: use randomized evaluation with task renewal; static benchmarks are Goodhart-vulnerable by construction.
  2. Correct and Grow. Formal basis: Prop. (alignment recovery), Prop. (optimal \(k^*\)). Recommendation: invest in distribution correction (\(\rho\)) as the primary lever; the optimal \(k^*\) scales with \(\sqrt{\rho}\).
  3. Gate Information Release. Formal basis: holdout mechanism, Prop. (learning slowdown). Recommendation: use threshold tests to condition information flow on builder behavior, combined with distribution correction.
  4. Align Metrics with Welfare. Formal basis: Prop. (ATO regret), Prop. (TT zero regret). Recommendation: score total treatment effect, not averages, and account for counterfactual baselines.
  5. Allow Partial Transparency. Formal basis: Prop. (garbling dominance). Recommendation: calibrated noise in metric reporting (garbling) can create Pareto improvements over both full transparency and full opacity.
  6. Ensure Representative Subsets. Formal basis: Thms. (positional representation/proportionality bounds). Recommendation: use formal representation criteria rather than ad hoc selection when creating lite benchmarks.

These principles interact in important ways. Principles 1–3 address the information channel (how much the evaluator reveals about the evaluation mechanism). Principle 4 addresses the metric channel (what the evaluator measures). Principle 5 addresses the developer channel (what the developer reveals about their model). Principle 6 addresses the scope channel (which metrics to include).

A well-designed evaluation system should operate on all four channels simultaneously: randomize task selection (Principle 1), invest in distribution correction (Principle 2), gate information via holdout mechanisms (Principle 3), use welfare-aligned metrics (Principle 4), allow developers to disclose through privacy-preserving mechanisms (Principle 5), and ensure the metric suite is formally representative (Principle 6).

8.8 Discussion Questions

  1. The evaluation game framework assumes a benevolent evaluator (social planner). What changes if the evaluator also has strategic incentives — for example, a company running its own benchmark to favor its own models?

  2. How does the information-variance tradeoff relate to the reliability-validity tradeoff from Chapter 5 and Chapter 6? Is there a formal connection between \(\sigma^2/k\) (evaluation variance) and Cronbach’s \(\alpha\)?

  3. Can you design a mechanism where the builder’s incentive is to improve on the hardest tasks rather than the average? How would you modify \(u_E(\theta)\) and \(r(\theta, S)\)?

  4. In what sense is the Chatbot Arena a randomized evaluation mechanism? Does it satisfy the conditions of Theorem 1? What information leaks through the adaptive matching process?

  5. How should a government regulator set \(\gamma\) (the gaming penalty) for a safety benchmark? What factors should influence this choice?

  6. If two “lite” benchmarks both satisfy \(\epsilon\)-positional proportionality with the same \(\epsilon\) but select different subsets, which should be preferred? What additional criteria might break ties?

  7. The holdout mechanism draws on differential privacy. What is the relationship between the privacy budget \(\varepsilon\) in the holdout mechanism and the notion of \(\varepsilon\)-differential privacy from the data privacy literature?

  8. Distribution correction requires the evaluator to “know their bias.” How can an evaluator identify which regions of the task universe they under-represent? What role do meta-evaluations and coverage audits play?

8.9 Bibliographic Notes

The Stackelberg evaluation game and the information-variance tradeoff are developed in Truong et al. (2025). The distribution correction mechanism and optimal evaluation size results are from the same work. The analysis of metric design as a principal-agent problem draws on Wang et al. (2024), which develops counterfactual quality metrics (Chapter 5) and information elicitation in agency games (Chapter 6). The positional representation and proportionality framework is from Procaccia et al. (2025).

Goodhart’s Law was articulated by Goodhart (1984) in the context of monetary policy. Manheim and Garrabrant (2018) formalize four variants. The connection to reward overoptimization in RLHF is explored by Gao, Schulman, and Hilton (2023).

Strategic classification — the study of agents who manipulate their features to achieve favorable classification outcomes — was formalized by Hardt et al. (2016). Perdomo et al. (2020) introduce performative prediction, where the model itself changes the data distribution. Braverman and Garg (2020) show that randomness is necessary for efficient classification under strategic behavior.

Bayesian persuasion (Kamenica and Gentzkow 2011) provides the information design framework connecting evaluator disclosure to builder behavior. Bergemann and Morris (2019) survey the broader information design literature. Contract theory (Laffont and Tirole 1986; Holmstrom and Milgrom 1991) provides the principal-agent foundations for metric design.

Differential privacy and adaptive data analysis (Dwork et al. 2015; Dwork and Roth 2014) underpin the holdout mechanism. Blum and Hardt (2015) apply reusable holdout techniques to machine learning competitions. The connection between DP and strategic robustness is developed in the holdout mechanism analysis.

Social choice and benchmarking connect to Zhang and Hardt (2024) (Arrow’s impossibility for benchmarks), Colombo et al. (2022) (Borda count for benchmark aggregation), and Rofin and Mikhailov (2023) (scoring rules for benchmark ranking). The committee selection literature, particularly justified representation (Aziz et al. 2017), provides the social choice foundations for positional representation.

8.10 Exercises

  1. (Easy) Show that if \(\pi_E = \text{Uniform}(F)\), the one-shot alignment theorem holds for any sample size \(k\), not just \(k = 1\). Hint: Show that for any \(k\)-subset mechanism, the expected reward is proportional to \(u_E(\theta)\).

  2. (Easy) Verify that \(k^* = \sigma\sqrt{\rho/\gamma}\) minimizes \(\mathcal{L}(k) = \sigma^2/k + \gamma k/\rho\). What is \(\mathcal{L}(k^*)\)? What happens to \(k^*\) when \(\gamma\) doubles?

  3. (Medium) Derive the estimation-limited term in the misalignment bound using a Dirichlet\((1, \ldots, 1)\) prior over distributions on \(F\) with \(m\) effective observations. Show that the posterior predictive is the shrinkage estimator \(\hat{\pi}_E = \frac{m}{m+N}\hat{p} + \frac{N}{m+N}\text{Uniform}(F)\) and that \(\text{TV}(\hat{\pi}_E, \text{Uniform}(F)) \leq m/(m+N)\).

  4. (Medium) In the agency game with binary \(X\), suppose \(C \mid X=0 \sim \text{Exp}(\lambda_0)\) and \(C \mid X=1\) is zero-cost. Derive the condition on \(\lambda_0\) and the principal’s value \(b\) under which the agent prefers concealment to revelation.

  5. (Medium) Show that any subset \(K\) satisfying \(\epsilon\)-positional proportionality approximates any scoring rule within \(\epsilon\). Hint: Write \(f_s(a, \sigma_S)\) as a weighted sum of cumulative counts \(C(S, r, a)/|S|\) and apply Abel summation.

  6. (Hard) Extend the one-shot alignment theorem to the case where the builder has a non-uniform prior \(p_M\) over \(F\). Show that the builder’s best response maximizes \(\mathbb{E}_{f \sim p_M}[f(\theta)]\). Under what conditions does this coincide with maximizing \(u_E(\theta)\)?

  7. (Hard) Prove that the holdout mechanism slows the builder’s posterior concentration by a factor of \(e^{\varepsilon\eta}\). Hint: Compute the effective sample size \(m_t^{\text{holdout}} = n_0 + p_{\text{alarm}} \cdot k \cdot \min(t, \rho^{-1})\) and compare to \(m_t^{\text{baseline}}\).

  8. (Hard, computational) Implement the greedy algorithm for positional representation on the HELM Lite data (or a synthetic dataset with \(n = 100\) metrics, \(m = 50\) models). Compare \(|K|\) to the existing lite benchmark subset for different values of the group size \(g\). At what \(g\) does the greedy algorithm match the size of the existing subset?

References

Aziz, Haris, Markus Brill, Vincent Conitzer, Edith Elkind, Rupert Freeman, and Toby Walsh. 2017. “Justified Representation in Approval-Based Committee Voting.” Social Choice and Welfare 48 (2): 461–85.
Bergemann, Dirk, and Stephen Morris. 2019. “Information Design: A Unified Perspective.” Journal of Economic Literature 57 (1): 44–95.
Blum, Avrim, and Moritz Hardt. 2015. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions.” In Proceedings of the 32nd International Conference on Machine Learning, 1006–14.
Braverman, Mark, and Sumegha Garg. 2020. “The Role of Randomness and Noise in Strategic Classification.” Foundations of Responsible Computing (FORC).
Colombo, Pierre, Benjamin Clavié, Nathan Nogué, and Pablo Piantanida. 2022. “What Are the Best Systems? New Perspectives on NLP Benchmarking.” arXiv Preprint arXiv:2202.03799.
Dranove, David, Daniel Kessler, Mark McClellan, and Mark Satterthwaite. 2003. “Is More Information Better? The Effects of ‘Report Cards’ on Health Care Providers.” Journal of Political Economy 111 (3): 555–88.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. “Generalization in Adaptive Data Analysis and Holdout Reuse.” In Advances in Neural Information Processing Systems. Vol. 28.
Dwork, Cynthia, and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4). Now Publishers.
Gao, Leo, John Schulman, and Jacob Hilton. 2023. “Scaling Laws for Reward Model Overoptimization.” Proceedings of the 40th International Conference on Machine Learning, 10835–66.
Goodhart, Charles A. E. 1984. “Problems of Monetary Management: The U.K. Experience.” Monetary Theory and Practice, 91–121.
Hardt, Moritz, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. 2016. “Strategic Classification.” In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 111–22. ACM.
Holmstrom, Bengt, and Paul Milgrom. 1991. “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design.” Journal of Law, Economics, and Organization 7: 24–52.
Kamenica, Emir, and Matthew Gentzkow. 2011. “Bayesian Persuasion.” American Economic Review 101 (6): 2590–2615.
Laffont, Jean-Jacques, and Jean Tirole. 1986. “Using Cost Observation to Regulate Firms.” Journal of Political Economy 94 (3): 614–41.
Manheim, David, and Scott Garrabrant. 2018. “Categorizing Variants of Goodhart’s Law.” arXiv Preprint arXiv:1803.04585.
Perdomo, Juan, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. 2020. “Performative Prediction.” Proceedings of the 37th International Conference on Machine Learning, 7599–609.
Procaccia, Ariel D., Benjamin Schiffer, Serena Wang, and Shirley Zhang. 2025. “Metritocracy: Representative Metrics for Lite Benchmarks.” arXiv Preprint arXiv:2506.09813.
Rofin, Mark, and Danila Mikhailov. 2023. “VOTE’N’RANK: Revision of Benchmarking with Social Choice Theory.” Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 670–86.
Truong, Son, Serena Wang, Hoda Heidari, and Rishi Bommasani. 2025. “Incentive-Aligned Evaluation via Private Benchmark.” arXiv Preprint arXiv:2506.00000.
Vapnik, Vladimir N. 1998. Statistical Learning Theory. New York: Wiley.
Wang, Serena, Michael I. Jordan, Katrina Ligett, and R. Preston McAfee. 2024. “Relying on the Metrics of Evaluated Agents.” arXiv Preprint arXiv:2402.14005.
Zhang, Shirley, and Moritz Hardt. 2024. “The Inherent Tradeoffs in LLM Benchmarks.” arXiv Preprint.