10 Conclusion
This chapter synthesizes the book’s main themes, identifies open challenges in AI measurement science, describes capstone projects suitable for a one-quarter course, and offers a practitioner’s checklist for designing evaluations.
10.1 Our Approach
10.1.1 Foundations from Measurement Science
The central premise of this book is that AI evaluation is a measurement problem, not merely an engineering one. When we assign a number to a model — accuracy, Elo rating, pass rate — we are making a claim about a latent construct (ability, safety, reasoning) that we cannot directly observe. The science of such claims has been developed over a century in psychometrics, educational testing, and the philosophy of measurement. We have drawn on this tradition throughout.
Chapter 2 introduced the mathematical models that formalize the relationship between latent ability and observed responses: Item Response Theory (Rasch, 2PL, 3PL), factor models, and paired-comparison systems (Bradley-Terry, Elo). A recurring theme is that these models are not interchangeable — each encodes assumptions about the structure of ability and the nature of items. The Rasch model’s sufficiency and specific objectivity properties justify using sum scores as measurements; the 2PL and 3PL models trade these properties for descriptive flexibility. The choice of model is a choice about what we are willing to assume.
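The three families differ only in which item parameters are freed. A minimal sketch (function and parameter names are ours, not the chapter's notation):

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """Item response function: P(correct | ability theta).

    Rasch: irf(theta, b)        -- discrimination fixed at 1, no guessing
    2PL:   irf(theta, b, a)     -- item-specific discrimination a
    3PL:   irf(theta, b, a, c)  -- adds a guessing floor c
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An easy item (b = -1) vs. a hard item (b = 2) for a mid-ability model:
p_easy = irf(0.0, -1.0)  # above 0.5
p_hard = irf(0.0, 2.0)   # below 0.5
```

Freeing `a` and `c` improves fit but, as the chapter notes, sacrifices the sufficiency of sum scores that the Rasch special case guarantees.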
10.1.2 From Models to Methods
Chapter 3 showed how to estimate latent parameters from evaluation data via maximum likelihood, EM algorithms, and Bayesian inference. Chapter 4 developed the design principles — Fisher information, computerized adaptive testing, D-optimal design — that make evaluation efficient. Together, these chapters establish that we can measure AI systems with far fewer queries than brute-force evaluation requires: calibrated item banks and adaptive algorithms yield precise ability estimates using a fraction of the items.
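As a concrete instance of maximum-likelihood estimation, the Rasch ability for a single model can be found by Newton-Raphson when item difficulties are already calibrated. A toy sketch (assumes a mixed response pattern; with all-correct or all-wrong responses the MLE diverges):

```python
import math

def estimate_theta(responses, difficulties, iters=50):
    """Rasch maximum-likelihood ability estimate via Newton-Raphson.

    responses:    list of 0/1 outcomes, one per item
    difficulties: calibrated item difficulties b_i (assumed known)
    """
    theta = 0.0
    for _ in range(iters):
        grad, info = 0.0, 0.0
        for y, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            grad += y - p          # score function: observed minus expected
            info += p * (1.0 - p)  # Fisher information contribution
        theta += grad / info       # Newton step on the concave log-likelihood
    return theta
```

The same gradient and information quantities reappear in Chapter 4, where information drives item selection rather than estimation.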
10.1.3 Reliability, Validity, and Causality
The transition from Part I to Part II marked a shift from how to measure to how well we measure. Chapter 5 decomposed the noise in AI evaluation into its constituent sources — sampling stochasticity, prompt sensitivity, annotator variability, item sampling — and provided tools (Generalizability Theory, IRT-based reliability) for quantifying and reducing each. Chapter 6 asked the harder question: even if measurements are precise, do they measure what we intend? Content validity, criterion validity, construct validity, differential item functioning, and benchmark contamination diagnostics all address different facets of this question. Chapter 7 connected these ideas to formal causal reasoning, showing when benchmark results generalize across contexts and how to correct for distribution shift.
10.1.4 Design, Strategy, and Adversarial Evaluation
Part III moved from the science of measurement to its strategic and practical dimensions. Chapter 8 formalized evaluation as a game between evaluators and model builders, showing how strategic behavior (Goodhart’s Law, benchmark gaming) distorts measurement and how mechanism design can restore alignment. Chapter 9 applied the full measurement framework to adversarial evaluation, showing that red-teaming is a measurement problem — with all the attendant concerns about validity, reliability, and construct definition.
10.1.5 The Critical Turn
A thread running through every chapter is that technical choices in evaluation are never value-neutral. Which items we include defines the construct (Chapter 6). Which models we test and how we sample them shapes the ability scale (Chapter 2). How we aggregate across annotators determines whose judgments count (Chapter 5). How we design incentives determines what model builders optimize for (Chapter 8). AI measurement science provides tools for making these choices explicit, principled, and auditable — but the choices themselves remain human judgments.
10.2 Lessons from the Field
10.2.1 What Worked
IRT models are surprisingly effective for AI evaluation. Despite being developed for human testing, Rasch and 2PL models fit AI benchmark data well. The key insight from Chapter 2 is that the same mathematical structure — a latent ability interacting with item difficulty through a logistic function — describes both human and AI response patterns. This is not because AI systems are “like humans” but because the statistical structure of evaluation data (binary responses to items of varying difficulty) is the same regardless of the test-taker.
Adaptive testing dramatically reduces evaluation cost. The efficiency gains from CAT (Chapter 4) are not incremental. S. Truong et al. (2025) demonstrate that 50 adaptively selected questions can match the decision accuracy of thousands of uniformly administered questions — a 99.9% reduction in the query budget. This suggests that most of the compute currently spent on AI evaluation is wasted on uninformative items.
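The core of a CAT loop is greedy information maximization. A minimal sketch for a Rasch item bank (illustrative only; a real implementation would also re-estimate \(\theta\) after each response and stop at a target standard error):

```python
import math

def fisher_info(theta, b):
    """Rasch item information at ability theta: p(1-p)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def select_next_item(theta_hat, bank, administered):
    """Greedy CAT step: pick the unadministered item with maximal
    Fisher information at the current ability estimate."""
    remaining = [i for i in range(len(bank)) if i not in administered]
    return max(remaining, key=lambda i: fisher_info(theta_hat, bank[i]))

bank = [-2.0, -1.0, 0.0, 1.0, 2.0]  # calibrated difficulties
first = select_next_item(0.8, bank, administered=set())  # picks b = 1.0
```

For the Rasch model this reduces to choosing the item whose difficulty is closest to the current ability estimate, which is why uninformative (far-too-easy or far-too-hard) items contribute almost nothing to the query budget.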
Reliability diagnostics catch real problems. Item-level statistics derived from IRT — item-total correlations, tetrachoric correlations, Mokken scalability — are not merely theoretical diagnostics. S. T. Truong et al. (2025) showed that flagging items with anomalous statistics achieves up to 84% precision in detecting genuine benchmark errors: incorrect answer keys, ambiguous wording, and grading bugs.
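The simplest of these diagnostics is easy to state exactly. A toy sketch of corrected item-total correlations over a binary response matrix (our own implementation; assumes every item column and rest-score has nonzero variance):

```python
def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def item_total_correlations(matrix):
    """Corrected item-total correlation for each item (column).

    matrix: rows = models, columns = items, entries 0/1.
    An item whose column anti-correlates with the rest-score
    (stronger models failing while weaker models pass) is a
    candidate miskeyed or ambiguous item.
    """
    n_items = len(matrix[0])
    out = []
    for j in range(n_items):
        item = [row[j] for row in matrix]
        rest = [sum(row) - row[j] for row in matrix]  # total excluding item j
        out.append(pearson(item, rest))
    return out
```

On a toy matrix with one reversed item, that item's correlation comes out strongly negative while well-behaved items stay positive, which is the signature the flagging procedure exploits.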
10.2.2 What Surprised Us
Benchmarks are noisier than they appear. The Generalizability Theory analysis in Chapter 5 reveals that what looks like stable model ranking is often an artifact of large item counts averaging over substantial per-item noise. When decomposed, prompt sensitivity and annotator variability are often larger than the differences between models we claim to distinguish.
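The decomposition can be carried out directly for the simplest crossed design. A sketch of a one-facet models-by-items G-study via expected mean squares (a toy implementation; real analyses would add judge and prompt facets, and with one observation per cell the residual confounds interaction with error):

```python
def g_study(scores):
    """One-facet crossed G-study (models x items).

    scores: rows = models, columns = items, one observation per cell.
    Returns (var_model, var_item, residual mean square).
    """
    n_m, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_m * n_i)
    m_means = [sum(row) / n_i for row in scores]
    i_means = [sum(scores[m][i] for m in range(n_m)) / n_m for i in range(n_i)]
    ss_m = n_i * sum((mm - grand) ** 2 for mm in m_means)
    ss_i = n_m * sum((im - grand) ** 2 for im in i_means)
    ss_tot = sum((scores[m][i] - grand) ** 2
                 for m in range(n_m) for i in range(n_i))
    ss_res = ss_tot - ss_m - ss_i
    ms_res = ss_res / ((n_m - 1) * (n_i - 1))
    # Solve the expected-mean-square equations for the variance components:
    var_m = max((ss_m / (n_m - 1) - ms_res) / n_i, 0.0)
    var_i = max((ss_i / (n_i - 1) - ms_res) / n_m, 0.0)
    return var_m, var_i, ms_res
```

Averaging over many items shrinks the residual's contribution to a mean score by a factor of the item count, which is exactly how a noisy benchmark can still produce superficially stable rankings.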
Validity is the hard part. Reliability is a solved problem in the sense that we have excellent tools for quantifying and improving it. Validity is not. The question “Does this benchmark measure what it claims to measure?” requires domain expertise, theoretical commitments, and empirical evidence that no purely statistical procedure can provide. The diagnostics in Chapter 6 help, but they are necessary conditions, not sufficient ones.
Scaling laws and measurement theory are deeply connected. The observation by Schaeffer et al. (2025) that per-problem exponential scaling aggregates to power-law scaling — because item difficulties follow a heavy-tailed distribution — is a statement about item response theory in disguise. The distribution of item difficulties determines the aggregate scaling behavior. S. Truong et al. (2025) make this connection explicit, showing that IRT ability \(\theta\) scales linearly with \(\log(\text{FLOP})\) and that item parameters transfer across benchmarks. Measurement theory does not just evaluate models — it characterizes how they improve.
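The aggregation mechanism is easy to reproduce in a toy simulation (the functional forms below are our illustrative assumptions, not the cited papers' exact setup): give each item a failure rate that decays exponentially in compute at an item-specific scale, draw those scales from a heavy-tailed Pareto distribution, and the aggregate failure rate falls as a power law whose exponent is the tail index.

```python
import math

ALPHA = 0.5        # Pareto tail index of the assumed difficulty distribution
N_ITEMS = 200_000

# Deterministic "item bank": difficulties at evenly spaced Pareto quantiles,
# d(q) = (1 - q)^(-1/alpha) with minimum 1.
difficulties = [(1.0 - (i + 0.5) / N_ITEMS) ** (-1.0 / ALPHA)
                for i in range(N_ITEMS)]

def aggregate_failure(compute):
    """Mean failure rate when item i fails with prob exp(-compute / d_i):
    exponential per-item improvement in compute."""
    return sum(math.exp(-compute / d) for d in difficulties) / N_ITEMS

# Exponentially improving items aggregate to a power law: the log-log
# slope of the failure rate approaches -ALPHA.
c1, c2 = 100.0, 1000.0
slope = (math.log(aggregate_failure(c2) / aggregate_failure(c1))
         / math.log(c2 / c1))
```

Changing `ALPHA` changes the measured exponent, which is the sense in which the difficulty distribution, not the per-item dynamics, determines the aggregate scaling law.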
10.2.3 What Remains Difficult
Construct definition for AI is unresolved. In human testing, constructs like “verbal reasoning” or “mathematical ability” have decades of theoretical development and empirical validation. For AI, we are still debating what “reasoning” means, whether “understanding” is coherent, and how to distinguish genuine capability from surface-level pattern matching. Without clear construct definitions, validity analysis has no target.
Benchmark contamination is an arms race. The diagnostic tools in Chapter 6 can detect some forms of contamination after the fact, but preventing contamination requires ongoing effort: holdout sets, dynamic benchmarks, and adversarial construction. The fundamental tension is that useful benchmarks must be public enough to be widely adopted but private enough to avoid being gamed.
Evaluation of generative output lacks ground truth. Much of this book assumes binary (correct/incorrect) or probability-valued responses. But the most important AI capabilities — open-ended generation, creative problem-solving, multi-turn dialogue — produce outputs where “correctness” is ill-defined. Extending measurement theory to these settings is an open frontier.
10.3 Open Challenges
10.3.1 Beyond Binary Responses
Standard IRT models assume binary or bounded-probability responses. But modern AI evaluation increasingly involves rubric-based scoring (1–5 scales), preference judgments (pairwise comparisons), and continuous quality metrics (BLEU, ROUGE, reward model scores). Extending the measurement framework to these response types — while preserving the desirable properties of IRT (separability, adaptive testing, cross-benchmark transfer) — is an active research area. Beta-IRT (Chapter 3; S. Truong et al., 2025) is one step in this direction, but a general theory of measurement for mixed response types remains elusive.
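To make the direction concrete, here is one common Beta-IRT-style parameterization sketched in code (an illustrative form that may differ in details from the model in Chapter 3): the expected score follows a logistic IRT curve, and a precision parameter controls dispersion around it.

```python
import math

def beta_irt_loglik(x, theta, b, a=1.0, phi=10.0):
    """Log-likelihood of a continuous response x in (0, 1).

    The Beta mean mu tracks a logistic IRT curve in (theta - b);
    phi is a precision parameter (larger phi = responses concentrate
    more tightly around the IRT curve). Parameter names are ours.
    """
    mu = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    alpha, beta = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(alpha) - math.lgamma(beta)
            + (alpha - 1.0) * math.log(x)
            + (beta - 1.0) * math.log1p(-x))
```

Because the mean structure is still logistic in \(\theta - b\), the machinery of calibration and adaptive selection carries over, which is what makes this family attractive for continuous scores.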
10.3.2 Multidimensional Ability and Benchmark Portfolios
The unidimensional models emphasized in this book assume a single latent ability. In practice, AI systems have heterogeneous capabilities: a model may excel at coding but struggle with medical reasoning. Multidimensional IRT and factor models (Chapter 2) provide the mathematical framework, but practical questions remain. How many dimensions are needed? How should benchmark portfolios be designed to efficiently measure multiple abilities simultaneously? How should multidimensional ability profiles be communicated to non-expert stakeholders?
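The compensatory form of multidimensional IRT makes the modeling question concrete. A sketch (names and numbers are ours, for illustration):

```python
import math

def mirt_prob(theta, loadings, intercept):
    """Compensatory multidimensional 2PL: success probability depends on
    a weighted sum of abilities, so strength on one dimension (say,
    coding) can partially offset weakness on another (say, medical
    reasoning) to the extent the item loads on both."""
    z = sum(a * t for a, t in zip(loadings, theta)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

# A coding-heavy item barely consults the medical dimension:
p = mirt_prob(theta=[1.5, -1.0], loadings=[1.2, 0.1], intercept=0.0)
```

The open questions in the paragraph above amount to choosing the length of `theta`, designing item pools whose loading patterns identify each dimension, and summarizing the resulting ability vectors for non-expert audiences.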
10.3.3 Temporal Dynamics and Evaluation Drift
AI evaluation is not static. Models improve, benchmarks saturate, and the relationship between benchmark performance and real-world utility shifts over time. Chapter 7 addressed distribution shift in a cross-sectional setting, but longitudinal evaluation — tracking how models and benchmarks co-evolve — requires new tools. Item parameter drift (when calibrated difficulties become stale), concept drift (when the construct itself changes), and benchmark half-life (when a benchmark loses discriminative power) are all practical problems that the field has barely begun to address.
10.3.4 Evaluation of Agentic Systems
The measurement models in this book were developed for systems that produce a single response to a single prompt. Agentic AI systems — those that take multi-step actions in environments, use tools, and interact with humans over extended time horizons — pose fundamentally different measurement challenges. What is the “item” in an agentic evaluation? What is the “response”? How do we define and measure reliability when the evaluation environment itself is stochastic and path-dependent?
10.3.5 Scalable Oversight and Recursive Evaluation
As AI systems approach or exceed human performance on specific tasks, the evaluation bottleneck shifts from model capability to evaluator capability. If human judges cannot reliably assess the quality of an AI system’s outputs, how do we maintain measurement validity? Approaches including AI-assisted evaluation, debate, and recursive reward modeling all attempt to extend the reach of human judgment, but their measurement properties — reliability, validity, potential for systematic bias — are largely uncharacterized.
10.3.6 Fairness in Evaluation
Evaluation systems can be unfair in multiple ways: items may exhibit differential functioning across model families or deployment contexts (Chapter 6), adaptive testing algorithms may underquery certain regions of the ability space, and benchmark selection may systematically favor certain architectures. The fairness of evaluation has received far less attention than the fairness of the systems being evaluated. Developing measurement-theoretic notions of evaluation fairness — analogous to test fairness in educational testing — is an important open direction.
10.4 Capstone Projects
The following projects are designed for the CS321M (AI Measurement Science) course at Stanford. Each integrates concepts from multiple chapters of this book. Projects are suitable for individuals or pairs and should result in a pre-analysis plan (3–4 pages, NeurIPS format) and a final manuscript (up to 8 pages, NeurIPS format) with reproducible code.
Difficulty ratings: (\(\star\)) single-person, 4 weeks; (\(\star\star\)) 1–2 persons, 6 weeks; (\(\star\star\star\)) 1–2 persons, 8+ weeks, potential research contribution.
Project 1: IRT Model Comparison on Real Benchmarks (\(\star\))
Fit Rasch, 2PL, and multidimensional IRT models to a large-scale evaluation dataset (e.g., the Open LLM Leaderboard response matrix or HELM data). Compare models using information criteria, cross-validated log-likelihood, and out-of-sample prediction. Test whether the Rasch model’s sufficiency property holds empirically by comparing sum-score-based rankings to IRT ability rankings. Investigate whether items that misfit the Rasch model correspond to known benchmark quality issues.
Key concepts: Chapter 2 (IRT models), Chapter 3 (estimation), Chapter 6 (item fit).
Project 2: Adaptive Testing for Efficient AI Evaluation (\(\star\))
Using a pre-calibrated item bank from an existing benchmark, implement CAT for AI model evaluation. Compare the efficiency (number of items needed to reach a target standard error) and accuracy (rank correlation with full-benchmark scores) of Fisher-information item selection versus random selection. Investigate how performance degrades when calibration data come from a different generation of models (calibration drift).
Key concepts: Chapter 4 (CAT, Fisher information), Chapter 2 (IRT), Chapter 7 (distribution shift).
Project 3: Reliability Audit of an Evaluation Pipeline (\(\star\))
Select an evaluation pipeline that involves human or LLM judges (e.g., MT-Bench, AlpacaEval, or Chatbot Arena). Design and conduct a G-study decomposing variance into model, item, judge, and interaction components. Run a D-study to determine the optimal allocation of judges and items under a fixed budget. Report the generalizability coefficient and compare it to the reliability implicitly assumed by published leaderboards.
Key concepts: Chapter 5 (G-theory, D-studies, LLM-as-judge reliability).
Project 4: Benchmark Bug Detection at Scale (\(\star\star\))
Apply the diagnostic framework of S. T. Truong et al. (2025) to a benchmark not studied in their paper. Compute item-total correlations, tetrachoric correlations, and Mokken scalability coefficients across a diverse set of LLMs. Flag the top 50 most suspicious items and conduct a manual review. Report precision, analyze the types of errors found, and propose corrections. Investigate how the number and diversity of LLMs affect detection power.
Key concepts: Chapter 2 (sufficiency, Rasch), Chapter 5 (item-total correlation), Chapter 6 (content validity).
Project 5: Validity Analysis of a Domain-Specific Benchmark (\(\star\star\))
Choose a domain-specific benchmark (medical, legal, coding, mathematical reasoning). Conduct a comprehensive validity analysis: (a) content validity — does the item pool representatively sample the stated construct? (b) construct validity — does dimensionality analysis support a unidimensional interpretation? (c) criterion validity — do scores correlate with external measures of the construct? (d) DIF analysis — do items function differently across model families?
Key concepts: Chapter 6 (all validity types, DIF), Chapter 2 (factor models, dimensionality).
Project 6: Scaling Laws through the IRT Lens (\(\star\star\))
Replicate and extend the Item Response Scaling Laws framework of S. Truong et al. (2025). Using publicly available checkpoint evaluation data, fit Beta-IRT models and estimate the relationship between \(\theta\) and pre-training compute. Test cross-benchmark transfer: estimate \(\theta\) on one benchmark and predict performance on another. Compare the efficiency of IRT-based scaling estimation to traditional per-benchmark curve fitting.
Key concepts: Chapter 2 (IRT), Chapter 3 (Beta-IRT), Chapter 4 (adaptive testing), Chapter 7 (transfer).
Project 7: Red-Teaming as Measurement (\(\star\star\))
Design and calibrate an adversarial item bank for a specific safety domain (e.g., medical misinformation, code injection, social engineering). Fit a multidimensional IRT model with standard and adversarial ability dimensions. Implement adaptive adversarial testing and compare its efficiency to uniform item selection. Analyze the validity of the adversarial construct: does adversarial robustness form a coherent dimension, or does it fragment into domain-specific factors?
Key concepts: Chapter 9 (adversarial IRT, item banks), Chapter 4 (CAT), Chapter 6 (construct validity).
Project 8: ASR Comparisons Under Aggregation and Judge Error (\(\star\star\))
Conduct an empirical study of the aggregation and judge-error biases documented by Chouldechova et al. (2026). Using a red-teaming dataset with multiple samples per prompt, estimate: (a) the distribution of per-prompt success probabilities, (b) the power-law exponent under Top-1 aggregation (Schaeffer et al., 2025), (c) differential judge error rates across target models. Quantify how much of the published variation in ASR across models is attributable to estimand differences versus genuine safety differences.
Key concepts: Chapter 9 (ASR as estimand, judge validity), Chapter 5 (measurement error).
Project 9: Prediction-Powered Evaluation (\(\star\star\))
Implement and evaluate the prediction-powered inference (PPI) framework for AI evaluation. Use a large LLM judge as the prediction source and a smaller human-labeled set as ground truth. Compare PPI estimates to synthetic-only and human-only baselines across multiple benchmarks. Investigate: how does PPI performance depend on the quality of the LLM judge? What is the minimum human annotation budget needed for PPI to outperform both baselines?
Key concepts: Chapter 9 (PPI, synthetic data), Chapter 5 (judge reliability), Chapter 7 (doubly robust estimation).
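A useful starting point for this project: the basic PPI mean estimator is only a few lines (a sketch of the unadjusted form, without the power-tuning coefficient; variable names are ours):

```python
def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Prediction-powered estimate of a mean score.

    judge_unlabeled: judge scores on a large set with no human labels
    judge_labeled:   judge scores on a small set that also has human labels
    human_labeled:   the human labels for that small set (same order)

    The rectifier term corrects any systematic judge bias using the
    paired subset, so the estimate stays centered on the human target.
    """
    judge_term = sum(judge_unlabeled) / len(judge_unlabeled)
    rectifier = sum(h - j for h, j in
                    zip(human_labeled, judge_labeled)) / len(human_labeled)
    return judge_term + rectifier
```

The project's budget question then becomes how the variance of the rectifier (driven by the small human-labeled set) trades off against the bias of the judge-only term.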
Project 10: Evaluation Design for Agentic Systems (\(\star\star\star\))
Propose and evaluate a measurement framework for agentic AI systems (e.g., coding agents, web-browsing agents, tool-using assistants). Key challenges include: defining the unit of measurement (what is an “item”?), handling variable-length interactions, and decomposing performance into sub-capabilities. Implement a prototype evaluation using a real agent benchmark and analyze its reliability (via G-theory over tasks, seeds, and environments) and construct validity (via factor analysis of sub-task scores).
Key concepts: Chapter 2 (factor models), Chapter 5 (G-theory), Chapter 6 (construct validity), Chapter 4 (design).
10.5 A Practitioner’s Checklist
The following checklist distills the book’s recommendations into actionable steps for designing and auditing AI evaluations. It is organized by the evaluation lifecycle.
10.5.1 Construct Definition
10.5.2 Item Development and Calibration
10.5.3 Evaluation Design
10.5.4 Analysis and Reporting
10.5.5 Validity and Generalization
10.5.6 Maintenance and Monitoring
10.6 Scope and Limitations
This book has focused on the statistical and psychometric foundations of AI evaluation. Several important topics are largely out of scope:
Cognitive science of AI systems. We treat AI models as black boxes that produce responses to items. Understanding why a model succeeds or fails on a particular item — through mechanistic interpretability, probing, or causal analysis of internal representations — is a complementary but distinct line of inquiry.
Large-scale systems engineering. Evaluation at the scale of modern AI development involves distributed computing, data pipelines, versioning, and infrastructure that we do not cover. The measurement principles in this book apply regardless of scale, but their implementation at scale introduces engineering challenges.
Legal and regulatory frameworks. AI evaluation increasingly intersects with regulation (EU AI Act, NIST AI RMF). We do not cover the legal dimensions of evaluation, though the validity and reliability frameworks developed here provide the scientific basis for regulatory requirements.
Domain-specific evaluation. While we discuss domain-specific examples throughout, we do not provide deep coverage of evaluation in any single domain (healthcare, education, law, finance). Each domain brings its own construct definitions, validity concerns, and stakeholder requirements that warrant dedicated treatment.
Multi-modal evaluation. The models in this book are developed primarily for text-based evaluation. Extending measurement theory to vision, audio, multi-modal, and embodied AI evaluation raises new questions about construct definition and response modeling.
10.7 Final Thought
The gap between what we claim about AI systems and what we can rigorously demonstrate is wide. Benchmarks proliferate, but the science behind them lags. We have the tools — developed over a century of measurement science — to close this gap. Item response theory, factor analysis, reliability theory, validity analysis, causal reasoning, and mechanism design are not merely theoretical curiosities; they are the foundation for making AI evaluation trustworthy.
The challenge is not technical but cultural. The AI community has optimized for speed — new benchmarks, new models, new leaderboards — at the expense of rigor. Adopting measurement science requires slowing down: defining constructs before collecting data, calibrating items before reporting scores, quantifying uncertainty before claiming progress. This is the discipline that separates measurement from mere scoring.
We invite you to contribute — to develop new measurement methods, to apply existing ones rigorously, to critique evaluations that fall short, and to insist that claims about AI systems rest on solid scientific ground. The stakes are high: the decisions we make about AI deployment, safety, and regulation depend on the quality of our measurements. Let us make them worthy of the task.