8  Information and Mechanism Design

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Explain why Goodhart’s Law is not merely a cautionary aphorism but a formal game-theoretic phenomenon, and classify its four variants in AI evaluation contexts.
  2. Formalize AI evaluation as a Stackelberg game between an evaluation designer and a model builder, and prove that deterministic benchmarks fail while randomized mechanisms achieve one-shot incentive alignment.
  3. Analyze the information-variance tradeoff in repeated evaluation: why reducing noise requires revealing information, and how this degrades alignment over time.
  4. Derive distribution correction as the primary mechanism for restoring alignment, and compute the optimal evaluation size \(k^*\) as a function of correction rate \(\rho\).
  5. Model metric design as a principal-agent problem and characterize when developers prefer to reveal, conceal, or garble evaluation information.
  6. Apply positional representation and positional proportionality criteria to select representative benchmark subsets using social choice theory.
  7. Synthesize design principles spanning strategic robustness, information control, and representativeness for AI benchmarks.

This chapter can be covered in 3 lectures (75-90 minutes each):

Lecture 1: The Evaluation Game

  • Goodhart’s Law and strategic manipulation (15 min)
  • Stackelberg benchmark game: setup and failure of deterministic mechanisms (20 min)
  • One-shot incentive alignment via randomization (25 min)
  • Hands-on: deterministic vs. randomized evaluation simulation (15 min)

Lecture 2: Information Leakage and Restoring Alignment

  • Repeated evaluation and the information-variance tradeoff (20 min)
  • Distribution correction and alignment recovery (25 min)
  • Optimal evaluation size and the holdout mechanism (20 min)
  • Hands-on: Pareto frontier and correction simulations (10 min)

Lecture 3: Metric Design and Representative Selection

  • Metric design as a principal-agent problem (25 min)
  • Information elicitation: reveal, conceal, garble (20 min)
  • Representative benchmark selection via social choice theory (20 min)
  • Design principles synthesis (10 min)

Notation

This chapter introduces game-theoretic notation: \(F\) (task universe), \(F_E/F_M\) (evaluator/builder task sets), \(\pi_E/\pi_M\) (sampling distributions), \(f(\theta)\) (task performance), \(u_E\) (evaluator utility), \(\Delta_t\) (misalignment), and \(\gamma\) (gaming penalty). See ?sec-notation for the complete notation reference.

8.1 When Measurement Becomes a Target

“When a measure becomes a target, it ceases to be a good measure.” — Charles Goodhart (1975)

In previous chapters, we treated evaluation as a statistical problem: how to estimate latent abilities from noisy observations (Chapter 2), how to quantify measurement precision (Chapter 5), and how to assess whether a benchmark measures what it claims to measure (Chapter 6, Chapter 7). Throughout, we implicitly assumed that the data-generating process is fixed — that the act of measurement does not change the phenomenon being measured.

This assumption breaks down once benchmarks become influential. When a developer’s reputation, funding, or regulatory standing depends on benchmark scores, the developer has every incentive to optimize specifically for the benchmark — not for the broader capability it claims to measure. The data-generating process is no longer fixed: it shifts in response to the measurement itself. This is Goodhart’s Law, and in AI evaluation it is not a metaphor but a demonstrable, recurring phenomenon.

Concrete examples. The Foundation Model Transparency Index (FMTI) saw scores jump dramatically after its first release, not because companies became more transparent, but because they learned which specific indicators were measured and optimized their disclosures accordingly. Chatbot Arena has faced concerns about strategic submission: developers can selectively deploy models optimized for the types of queries that appear on the platform. And benchmark contamination — where training data includes evaluation items — is rational behavior for a developer whose market value depends on leaderboard position (Section 6.4.3).

Manheim and Garrabrant (2018) identify four variants of the Goodhart effect, each with a distinct causal mechanism:

  1. Regressional Goodhart: The proxy \(\hat{u}\) and the true objective \(u\) are correlated but not identical. Optimizing \(\hat{u}\) overshoots because extreme values of \(\hat{u}\) tend to arise from noise, not from genuinely extreme \(u\). This is the measurement error story from Chapter 5: a benchmark with \(\alpha = 0.7\) captures 70% true variance and 30% noise, so selecting the top-scoring model partially selects for lucky noise.

  2. Extremal Goodhart: The relationship between \(\hat{u}\) and \(u\) that holds in the bulk of the distribution breaks down in the tails. A benchmark that reliably ranks typical models may fail catastrophically when applied to a model specifically engineered to maximize the benchmark score.

  3. Causal Goodhart: The proxy and the objective share a common cause. Optimizing the proxy can break this causal link. If “passes safety benchmark” and “is actually safe” are both caused by “was trained carefully,” a developer can find shortcuts to the benchmark that bypass the common cause.

  4. Adversarial Goodhart: An agent actively exploits the gap between proxy and objective. This is the regime that dominates in competitive AI evaluation, and the focus of this chapter.

The shift from Chapters 3–6 to this chapter is a shift in the source of the threat. Measurement error (Chapter 5) is stochastic. Validity threats (Section 6.4) are systematic but unintentional. Strategic manipulation is deliberate and adaptive: the DGP itself changes in response to the measurement. This connects to the literature on performative prediction (Perdomo et al. 2020), where the act of deploying a model changes the distribution it operates on, and to strategic classification (Hardt et al. 2016), where agents manipulate their features to achieve favorable outcomes.

The remainder of this chapter builds three layers of strategic analysis: (i) benchmark disclosure and information design — how much should evaluators reveal about the evaluation mechanism? (ii) metric selection and reporting granularity — which metrics should evaluators report, and at what level of detail? (iii) mechanism design for repeated evaluation — how can evaluators maintain alignment when the game repeats?

8.2 The Evaluation Game

We now formalize the strategic interaction between an evaluation designer and a model builder as a Stackelberg game. The designer moves first by committing to an evaluation mechanism; the builder responds by training a model. The key question: under what conditions can a benchmark mechanism incentivize builders to improve performance on the full task distribution, rather than merely optimizing for the specific tasks being measured?

8.2.1 Setup: Evaluator and Builder

Let \(\Theta\) denote the space of all possible models. For a given model \(\theta \in \Theta\), let \(f(\theta) \in [0, 1]\) denote the model’s performance on task \(f\). The universe of all possible tasks is the finite set \(F\) with \(|F| = N\).

Definition: Evaluation Designer’s Utility

The evaluation designer’s utility for a model \(\theta\) is the aggregate performance across all tasks: \[ u_E(\theta) = \sum_{f \in F} f(\theta) \] The designer acts as a social planner seeking models that perform broadly rather than narrowly.

Each party acquires tasks by sampling from \(F\): the designer draws tasks according to \(\pi_E\) over \(F\), forming the evaluation set \(F_E\), while the builder draws according to \(\pi_M\), forming the training set \(F_M\). In practice, \(\pi_M\) is approximately uniform (builders have broad access to training data), while \(\pi_E\) is typically biased toward tasks that are easy to construct, grade, or that reflect particular evaluation priorities.

Definition: Stackelberg Benchmark Game

The game proceeds in three stages, parameterized by a mechanism \((M, r)\) where \(M\) is a sampling function and \(r: \Theta \times \mathcal{P}(F) \to \mathbb{R}\) is a reward function:

  1. Designer’s move (ex-ante): The evaluation designer publishes the mechanism \((M, r)\). If \(M\) is randomized, the designer privately draws randomness \(\omega\) and computes \(S^* = M(F_E, \omega)\), withholding the realization. If \(M\) is deterministic, \(S^* \subseteq F_E\) is fixed.

  2. Builder’s move: The builder observes the mechanism \((M, r)\) but not the realized set \(S^*\), and selects a model \(\theta^* \in \Theta\).

  3. Evaluation stage (ex-post): The designer publishes the score \(r(\theta^*, S^*)\). The builder’s payoff is \(r(\theta^*, S^*)\); the designer’s payoff is \(u_E(\theta^*)\).

The misalignment between the builder’s incentive (performance on \(S^*\)) and the designer’s objective (performance on all of \(F\)) creates the Goodhart problem.

8.2.2 Failure of Deterministic Mechanisms

Consider first a deterministic mechanism where the designer publishes a fixed subset \(S^* \subseteq F_E\).

Proposition: Failure of Deterministic Mechanisms

If the designer publishes a deterministic set \(S^*\) with additive reward \(r(\theta, S) = \sum_{f \in S} f(\theta)\), the builder’s best response is: \[ \theta^* \in \arg\max_{\theta \in \Theta} \sum_{f \in S^*} f(\theta) \] This provides no incentive for performance on tasks outside \(S^*\).

Proof. The builder observes \(S^*\) directly and maximizes the known objective. Tasks in \(F \setminus S^*\) do not affect the score. \(\square\)

This is exactly Goodhart’s adversarial variant: the builder exploits the known structure of the evaluation to specialize. It is also the construct underrepresentation problem from Section 6.4.2 viewed through a strategic lens: a fixed benchmark systematically ignores capabilities outside \(S^*\), and a rational builder responds accordingly.

8.2.3 One-Shot Alignment via Randomization

While deterministic mechanisms fail, randomization offers a path forward. When the builder faces genuine uncertainty about which tasks will be evaluated, the strategic landscape changes fundamentally.

Theorem 1: One-Shot Incentive Alignment (Omniscient Builder)

Consider a randomized mechanism where (i) \(M\) draws a single task uniformly: \(s \sim \text{Unif}(F_E)\), (ii) the reward is \(r(\theta, \{s\}) = s(\theta)\), and (iii) the builder has a uniform prior \(\pi\) over which tasks comprise \(F_E\). Then the builder’s expected reward is proportional to the designer’s utility: \[ \mathbb{E}_{F_E \sim \pi}\!\left[\mathbb{E}_{s \sim \text{Unif}(F_E)}[s(\theta)]\right] = \frac{1}{|F|}\, u_E(\theta) \] and the builder’s best response maximizes \(u_E(\theta)\).

Proof sketch. Under a symmetric (uniform) prior over which tasks compose \(F_E\), the builder has no information distinguishing any task \(f_i\) from \(f_j\). By symmetry, the marginal probability that any specific task \(f\) is the sampled task is constant: \(P(s = f) = 1/|F|\). Therefore the builder’s expected reward is \((1/|F|) \sum_{f \in F} f(\theta) = (1/|F|) \, u_E(\theta)\), and maximizing expected reward is equivalent to maximizing \(u_E(\theta)\). \(\square\)
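The symmetry argument can be checked numerically. The sketch below (with toy performance values of our own choosing, not from the text) enumerates every equally likely size-3 evaluation set from a universe of 6 tasks and confirms that the builder's expected single-sample reward equals \(u_E(\theta)/|F|\):

```python
import itertools
import statistics

# Toy check of Theorem 1: under a symmetric prior over which tasks form F_E,
# the marginal distribution of the sampled task s is uniform over F, so the
# expected single-sample reward is u_E(theta) / |F|.
N, n_E = 6, 3                           # |F| = 6 tasks, evaluation sets of size 3
perf = [0.9, 0.1, 0.5, 0.3, 0.7, 0.4]   # illustrative f(theta) for one fixed model
u_E = sum(perf)

subsets = list(itertools.combinations(range(N), n_E))  # uniform prior over F_E
# E_{F_E}[ E_{s ~ Unif(F_E)}[ s(theta) ] ]
expected_reward = statistics.mean(
    statistics.mean(perf[i] for i in S) for S in subsets
)
print(expected_reward, u_E / N)   # both print the same value
```

By symmetry each task appears in the same number of subsets, so the two-stage draw is marginally uniform over \(F\), exactly as the proof sketch claims.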

The omniscient version assumes the builder knows all of \(F\). A more realistic version considers a builder with limited information:

Theorem 2: One-Shot Alignment (Limited Information)

Suppose the builder draws tasks i.i.d. from distribution \(p_M\) and believes the evaluator also draws from \(p_M\). Under a single-sample mechanism, the builder’s best response is: \[ \theta^* \in \arg\max_{\theta \in \Theta} \mathbb{E}_{f \sim p_M}[f(\theta)] \] That is, the builder optimizes over their own best approximation of the task universe.

Proof sketch. The builder’s expected reward, taken over their prior on \(F_E\) and the randomness in \(M\), equals \(\mathbb{E}_{f \sim p_M}[f(\theta)]\). Since \(F_M\) is drawn from \(p_M\), the builder’s best response is empirical risk minimization over \(F_M\), which is the best they can do given their information. \(\square\)

This result is powerful: the single-sample mechanism is “the best you can do” as an evaluation designer. The builder’s best response is to optimize against their own best approximation of \(F\), which in practice means training broadly over all available data. Under incentive alignment, the builder performs empirical risk minimization (ERM) over their task set \(F_M\). By classical uniform convergence (Vapnik 1998), restricting to any subset \(S \subset F_M\) can only increase regret; the builder has no incentive to “game” by focusing on a narrow task set.

8.2.4 Discussion: What Randomization Buys

The one-shot alignment result reveals a fundamental tension with the statistical efficiency goals of earlier chapters. In ?sec-fisher-information and Section 4.2.2, we showed that targeted evaluation — choosing items to maximize Fisher information about \(\theta\) — is statistically optimal. But targeted evaluation requires revealing information about which tasks the evaluator considers informative, which creates exploitable structure.

Randomized evaluation is not optimal for estimation precision but is optimal for incentive alignment. This tension between statistical efficiency and strategic robustness is a recurring theme:

  • Computerized adaptive testing (Section 4.2.2) reveals the evaluator’s information about ability through the item selection strategy, making the evaluator’s priorities transparent.
  • Bayesian persuasion (Kamenica and Gentzkow 2011) provides the information design framework: the evaluator is a sender choosing an information structure, and the builder is a receiver who updates and best-responds. The evaluator’s problem is to design a signal that induces the builder to take the action the evaluator prefers.

The resolution, as we develop in the next two sections, is that the evaluator must invest in correction to make the inevitable information leakage harmless.

Figure. Under a deterministic mechanism (left), the builder concentrates effort on the \(k=10\) known evaluation tasks and neglects the rest, achieving high benchmark scores but low true utility. Under a randomized mechanism (center), the builder spreads effort broadly. The right panel compares average true utility across all tasks: the randomized mechanism produces a substantially higher utility despite lower peak performance on any single task.

8.3 The Information-Variance Tradeoff

Theorem 1 established that randomized evaluation achieves incentive alignment in one-shot settings. But benchmarks operate as repeated games: weekly leaderboards, monthly submissions, quarterly benchmark releases. In repeated evaluation, randomization alone fails because information leaks.

8.3.1 Repeated Evaluation and Information Leakage

Definition: Repeated Evaluation Game

At each round \(t = 1, 2, \ldots\):

  1. The evaluator draws a fresh evaluation set \(F_E^{(t)}\) by sampling from \(F\) according to \(\pi_E\).
  2. The evaluator samples \(k\) tasks from \(F_E^{(t)}\) via mechanism \(M\) and publishes the score.
  3. The builder observes the \(k\) evaluated tasks, updates their estimate \(\hat{\pi}_{E,t}\) of the evaluation distribution, and selects model \(\theta_t\).

The variance of a \(k\)-task average reward scales as \(\text{Var}(\hat{r}_k(\theta)) \approx \sigma^2/k\). To reliably distinguish two models with performance gap \(\Delta\) at 95% confidence requires \(k \geq 4\sigma^2/\Delta^2\). As models improve and converge (\(\Delta \to 0\)), the required sample size grows quadratically. This creates inexorable market pressure to increase \(k\).
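The quadratic growth in the required sample size is easy to see concretely. The sketch below applies the \(k \geq 4\sigma^2/\Delta^2\) rule from the text; the per-task score standard deviation \(\sigma = 0.3\) is an illustrative assumption:

```python
import math

def required_k(sigma: float, delta: float) -> int:
    """Tasks needed to distinguish a gap `delta` at ~95% confidence,
    via the rule k >= 4 * sigma**2 / delta**2 from the text."""
    return math.ceil(4.0 * sigma**2 / delta**2)

sigma = 0.3   # assumed per-task score std, for illustration only
for delta in (0.10, 0.05, 0.02):
    print(delta, required_k(sigma, delta))
# Halving the detectable gap quadruples the required evaluation size.
```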

But increasing \(k\) to reduce variance simultaneously increases the rate at which the builder accumulates information about \(\pi_E\). The builder’s information set at time \(t\) is \(\mathcal{I}_t = \bigcup_{i=1}^{t-1} S_i\), where \(S_i\) is the set of tasks sampled in round \(i\). The leakage is \(L_t = |\mathcal{I}_t|/|F|\).

8.3.2 Posterior Concentration and Incentive Misalignment

As the builder observes sampled tasks across rounds, their estimate \(\hat{\pi}_{E,t}\) of the evaluation distribution concentrates around the true \(\pi_E\). By standard results in Bayesian nonparametrics, the posterior concentration rate is \(O(d_{\text{eff}} \log m / m)\) where \(m\) is the number of observations and \(d_{\text{eff}}\) is the effective dimension of the distribution class.

Proposition: Incentive Misalignment Under Distribution Learning

Let \(\hat{\pi}_{E,t}\) denote the builder’s posterior mean estimate of \(\pi_E\) at time \(t\). The builder’s optimal strategy is \(\theta^*_t = \arg\max_\theta \mathbb{E}_{f \sim \hat{\pi}_{E,t}}[f(\theta)]\). Then:

  1. Initial alignment: When the prior over \(\pi_E\) is diffuse, the builder optimizes broadly, approximating performance on the full universe \(F\).

  2. Posterior concentration leads to specialization: As \(\hat{\pi}_{E,t}\) concentrates around \(\pi_E\): \[ \lim_{t \to \infty} \theta^*_t = \arg\max_{\theta \in \Theta} \mathbb{E}_{f \sim \pi_E}[f(\theta)] \] If \(\pi_E\) is non-uniform, the builder specializes to high-density regions at the expense of broad capability.

This is the core tension: privatization works in one-shot settings, but repeated evaluation reveals information that enables strategic specialization. The builder cannot identify a fixed evaluation set (since \(F_E^{(t)}\) is resampled each round), but from repeated observations the builder learns \(\pi_E\) itself — the distribution from which evaluation tasks are drawn.
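A minimal simulation of this learning dynamic, assuming a toy non-uniform \(\pi_E\) over 20 tasks and a Dirichlet\((1,\ldots,1)\) prior for the builder (all values illustrative):

```python
import random

random.seed(0)

# Sketch of the builder's distribution learning: the evaluator samples tasks
# from a biased pi_E; the builder's Dirichlet posterior mean concentrates on it.
N = 20
weights = [2.0 if i < 5 else 0.5 for i in range(N)]   # bias toward tasks 0-4
total = sum(weights)
pi_E = [w / total for w in weights]

counts = [0] * N
tv_trace = []
for t in range(1, 501):                                # 500 rounds, one task per round
    task = random.choices(range(N), weights=pi_E)[0]
    counts[task] += 1
    post = [(counts[i] + 1) / (t + N) for i in range(N)]   # Dirichlet posterior mean
    tv = 0.5 * sum(abs(post[i] - pi_E[i]) for i in range(N))
    tv_trace.append(tv)

print(tv_trace[0], tv_trace[-1])   # TV distance to the true pi_E shrinks
```

Once the total-variation distance is small, the builder can profitably tilt \(\theta_t\) toward the high-density tasks, which is exactly the specialization the proposition describes.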

8.3.3 The Pareto Frontier

Define the residual misalignment at time \(t\) as: \[ \Delta_t = \mathbb{E}_{f \sim \pi_E^{(t)}}[f(\theta^*_t)] - \frac{1}{|F|} u_E(\theta^*_t) \]

This measures the gap between what the builder optimizes for (expected performance under \(\pi_E\)) and what society wants (aggregate performance over all of \(F\)). When \(\pi_E\) is uniform, \(\Delta_t = 0\).

No evaluation strategy can simultaneously achieve low variance and low leakage. Variance decreases in \(k\); leakage increases in \(k\). Varying \(k\) traces a Pareto frontier where the evaluator must choose their preferred operating point.

Figure. The information-variance Pareto frontier across \(k \in \{5, 10, 20, 50, 100\}\) over 200 rounds with \(|F| = 200\) tasks. Left: evaluation noise (score variance) is lower for larger \(k\). Center: the builder learns \(\pi_E\) faster with larger \(k\), as measured by KL divergence between the builder’s posterior and the true \(\pi_E\). Right: as the builder’s posterior concentrates, incentive misalignment grows. No single \(k\) achieves both low noise and low misalignment — a fundamental tradeoff.

8.4 Restoring Alignment

The Pareto frontier of the previous section seems like an impasse: any choice of \(k\) sacrifices either measurement precision or incentive alignment. The resolution comes from recognizing that the evaluator is also a learner. If the evaluator corrects their distribution over time, the information the builder accumulates becomes obsolete.

8.4.1 Distribution Correction

Assumption: Distribution Correction

The evaluator updates \(\pi_E^{(t)}\) via linear interpolation toward uniform at rate \(\rho \in (0, 1]\): \[ \pi_E^{(t)} = (1 - \rho)\,\pi_E^{(t-1)} + \rho \cdot \text{Uniform}(F) \] so that the deviation from uniform shrinks by a factor of \(1 - \rho\) each round and \(\text{KL}(\pi_E^{(t)} \| \text{Uniform}(F)) \to 0\) geometrically.

In practice, evaluators correct their biases through several channels: incident reports from users who discover model failures that benchmarks missed; systematic audits of coverage across capability dimensions (languages, modalities, reasoning types); improving benchmark acquisition capacity over time. The parameter \(\rho\) captures the fraction of bias removed per evaluation round.
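The decay of the divergence can be traced directly. The sketch below assumes an initial bias toward the first 10 of \(|F| = 100\) tasks (an illustrative choice) and applies the interpolation update from the assumption above:

```python
import math

# Tracing KL(pi_E^(t) || Uniform) under linear interpolation toward uniform.
N, rho = 100, 0.2
pi = [0.05 if i < 10 else 0.5 / 90.0 for i in range(N)]   # biased pi_E^(0), sums to 1
uniform = 1.0 / N

def kl_to_uniform(p):
    return sum(q * math.log(q / uniform) for q in p if q > 0)

kl_trace = [kl_to_uniform(pi)]
for _ in range(30):
    pi = [(1.0 - rho) * q + rho * uniform for q in pi]    # the correction update
    kl_trace.append(kl_to_uniform(pi))

print(kl_trace[0], kl_trace[-1])   # divergence decays geometrically toward 0
```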

Proposition: Alignment Recovery Under Distribution Correction

Under the repeated evaluation game with distribution correction, the builder’s optimal strategy converges to maximizing the designer’s utility: \[ \theta^*_t \to \arg\max_{\theta \in \Theta} \frac{1}{|F|} \sum_{f \in F} f(\theta) = \arg\max_{\theta \in \Theta} u_E(\theta) \] as \(\pi_E^{(t)} \to \text{Uniform}(F)\).

Proof sketch. The builder’s expected reward under mechanism \(M\) is \(\mathbb{E}_{f \sim \pi_E^{(t)}}[f(\theta)]\). As \(\pi_E^{(t)} \to \text{Uniform}(F)\), this converges to \((1/|F|) u_E(\theta)\). In the limit, the builder knows \(\pi_E = \text{Uniform}(F)\) perfectly, but this knowledge provides no advantage: optimizing for the uniform distribution is the designer’s objective. \(\square\)

The key insight is a race between two learners: the builder learns \(\pi_E^{(t)}\) from observations, while the evaluator corrects \(\pi_E^{(t)}\) toward uniform. The residual misalignment depends on the relative rates of these two processes.

Proposition: Misalignment Bound

Let \(D_0 = \text{KL}(\pi_E^{(0)} \| \text{Uniform}(F))\) and \(m_t = k \cdot \min(t, \rho^{-1})\) be the builder’s effective sample size. The residual misalignment satisfies: \[ \Delta_t \leq \min\!\left(\underbrace{\frac{m_t}{m_t + |F|}}_{\text{estimation-limited}},\;\; \underbrace{(1-\rho)^t \sqrt{\frac{D_0}{2}}}_{\text{correction-limited}}\right) \]

The two terms capture complementary regimes:

  • Estimation-limited (\(t\) small): \(m_t \approx kt\), giving \(\Delta_t \lesssim kt/|F|\). The builder has too few observations to identify the evaluator’s biases.
  • Correction-limited (\(t\) large): \(\Delta_t \lesssim (1-\rho)^t \sqrt{D_0/2}\). The builder may know \(\pi_E^{(t)}\) well, but the evaluator’s correction has driven the bias toward zero, leaving nothing to exploit.

Proof sketch. The correction-limited term follows from Pinsker’s inequality: \(\Delta_t \leq \text{TV}(\pi_E^{(t)}, \text{Uniform}(F)) \leq \sqrt{D_0/2}\,(1-\rho)^t\), since the total variation decays geometrically under linear interpolation. The estimation-limited term follows from Bayesian shrinkage: a builder with \(m_t\) observations and a Dirichlet\((1,\ldots,1)\) prior has posterior mean that is an \(m_t/(m_t + |F|)\)-fraction of the way from uniform to the empirical distribution. \(\square\)
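Evaluating the bound under illustrative parameters makes the crossover between regimes visible:

```python
import math

# Two regimes of the misalignment bound, under assumed parameters:
# |F| = 200 tasks, k = 20 per round, rho = 0.1, D0 = 0.5.
N, k, rho, D0 = 200, 20, 0.1, 0.5

def bound(t: int) -> float:
    m_t = k * min(t, 1.0 / rho)                    # builder's effective sample size
    est = m_t / (m_t + N)                          # estimation-limited term
    corr = (1.0 - rho) ** t * math.sqrt(D0 / 2.0)  # correction-limited term
    return min(est, corr)

for t in (1, 5, 20, 50):
    print(t, round(bound(t), 4))
# Early rounds are estimation-limited (few observations); late rounds are
# correction-limited (the bias has largely been corrected away).
```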

Figure. Alignment recovery under distribution correction with \(|F| = 200\) and \(k = 20\). Left: residual misalignment converges to zero for all \(\rho > 0\), with faster correction rates producing faster convergence. The \(\rho = 0\) baseline (gray) maintains persistent misalignment. Right: KL divergence decays exponentially on a log scale, with rate proportional to \(\rho\).

8.4.2 Optimal Evaluation Size

With distribution correction, the Pareto frontier from Section 8.3.3 collapses. Each revealed task provides only transient advantage to the builder, because the bias it reveals gets corrected within \(O(1/\rho)\) rounds. The evaluator’s per-round loss decomposes as: \[ \mathcal{L}(k) = \underbrace{\frac{\sigma^2}{k}}_{\text{evaluation variance}} + \underbrace{\frac{\gamma k}{\rho}}_{\text{steady-state exploitation}} \]

The first term decreases in \(k\) (more tasks, less noise); the second increases in \(k\) (more tasks, faster builder learning) and decreases in \(\rho\) (faster correction, shorter exploitation window).

Proposition: Optimal Sample Size

Given correction rate \(\rho > 0\) and gaming penalty \(\gamma > 0\), the optimal number of evaluation tasks per round is: \[ k^* = \sigma \sqrt{\frac{\rho}{\gamma}} \] with minimum loss \(\mathcal{L}(k^*) = 2\sigma\sqrt{\gamma/\rho}\).

Proof. Setting \(\partial \mathcal{L}/\partial k = -\sigma^2/k^2 + \gamma/\rho = 0\) gives \(k^2 = \sigma^2 \rho / \gamma\), hence \(k^* = \sigma\sqrt{\rho/\gamma}\). The second-order condition \(\partial^2 \mathcal{L}/\partial k^2 = 2\sigma^2/k^3 > 0\) confirms this is a minimum. \(\square\)
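The closed form can be checked against a brute-force grid search; the parameter values below (\(\sigma = 5\), \(\gamma = 0.05\), \(\rho = 0.5\)) are illustrative:

```python
import math

# Verifying k* = sigma * sqrt(rho / gamma) and L(k*) = 2 * sigma * sqrt(gamma / rho).
sigma, gamma, rho = 5.0, 0.05, 0.5

def loss(k: float) -> float:
    return sigma**2 / k + gamma * k / rho          # variance + exploitation terms

k_star = sigma * math.sqrt(rho / gamma)            # closed-form optimum
min_loss = 2.0 * sigma * math.sqrt(gamma / rho)    # claimed minimum loss

# Grid search over k in [0.1, 100] confirms the analytic optimum
k_grid = min((k / 10.0 for k in range(1, 1001)), key=loss)
print(k_star, loss(k_star), k_grid)
```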

Several key implications follow:

  • \(k^*\) increases with \(\sqrt{\rho}\): Faster correction allows larger evaluation sets. An evaluator who corrects twice as fast can evaluate \(\sqrt{2} \approx 1.4\) times as many tasks.
  • \(k^*\) increases with \(\sigma\): Higher task variance demands more samples for a reliable signal, just as in the static case.
  • Limiting behavior: As \(\rho \to 0\) (no correction), \(k^* \to 0\) — the evaluator should reveal as few tasks as possible. As correction becomes near-instant (\(\rho \to 1\)), \(k^*\) approaches its maximum of \(\sigma/\sqrt{\gamma}\): leaked information becomes outdated almost immediately, so large evaluation sets are affordable.
  • The key lever is \(\rho\): Since \(\mathcal{L}(k^*) = 2\sigma\sqrt{\gamma/\rho}\), the evaluator’s optimal loss decreases as \(1/\sqrt{\rho}\). Investing in distribution correction (coverage audits, benchmark changelogs, incident-driven task additions) is the most effective lever for benchmark design.

Figure. Evaluator loss \(\mathcal{L}(k) = \sigma^2/k + \gamma k / \rho\) for different correction rates, with stars marking the optimal \(k^*\). Faster correction (larger \(\rho\)) shifts \(k^*\) rightward and lowers the minimum loss. The static case (\(\rho \to 0\), dashed) is monotonically decreasing, reflecting the irresolvable tradeoff without correction. Parameters: \(\sigma = 5\), \(\gamma = 0.05\).

8.4.3 Noise-Gated Holdout

Distribution correction makes leaked information obsolete by moving the target. A complementary mechanism prevents leakage in the first place by gating the information flow.

Definition: Holdout Evaluation Mechanism

The evaluator publishes a reference set \(S_0 \subset F\) of size \(n_0\), sampled from \(\pi_E\). At each round \(t\):

  1. Draw a holdout set \(S_t\) of size \(k\) from \(\pi_E^{(t)}\) and noise \(\xi_t \sim \text{Laplace}(0, 1/\varepsilon)\).
  2. Compute the reference score \(r_T(\theta_t) = \frac{1}{n_0}\sum_{f \in S_0} f(\theta_t)\) and the holdout score \(r_H(\theta_t) = \frac{1}{k}\sum_{f \in S_t} f(\theta_t)\).
  3. Threshold test: If \(|r_H(\theta_t) + \xi_t - r_T(\theta_t)| < \eta\), publish \(r_T(\theta_t)\) (“silent” round). Otherwise, publish \(r_H(\theta_t) + \xi_t\) (“alarm” round).

The mechanism is self-correcting: it reveals less information precisely when the builder is behaving well (the gaming-induced score discrepancy \(\Delta_{\text{game}} \approx 0\), so the alarm fires only due to noise, with probability \(p_{\text{alarm}} = e^{-\varepsilon\eta}\)), and reveals more when the builder is gaming (the score discrepancy reliably triggers the alarm). The effective leakage per round for a non-gaming builder is: \[ \lambda_{\text{out}}^{\text{holdout}} \approx e^{-\varepsilon\eta} \cdot k \]

Setting \(\varepsilon\eta = 3\) yields a \(\sim 20\times\) slowdown in the builder’s learning rate, buying the evaluator substantially more time for distribution correction.
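The alarm rate for a well-behaved builder can be verified by Monte Carlo. This sketch assumes the score discrepancy is exactly zero, so only the Laplace noise (drawn via the standard inverse-CDF construction) can trip the threshold:

```python
import math
import random

random.seed(1)

# Monte Carlo check of the silent/alarm gate for a non-gaming builder:
# the alarm fires only when |Laplace(0, 1/eps)| >= eta, w.p. e^{-eps*eta}.
eps, eta, trials = 1.0, 3.0, 200_000
alarms = 0
for _ in range(trials):
    u = random.random() - 0.5
    xi = -math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)) / eps  # Laplace draw
    if abs(xi) >= eta:
        alarms += 1

p_hat = alarms / trials
print(p_hat, math.exp(-eps * eta))   # empirical alarm rate vs. e^{-3} ≈ 0.0498
```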

Why Noise Alone Cannot Replace Distribution Correction

The holdout mechanism slows the builder’s learning but does not change their incentives conditional on what they have learned. Regardless of how much Laplace noise is added, the builder’s optimal strategy given their posterior \(\hat{\pi}_{E,t}\) remains \(\theta_t^* = \arg\max_\theta \mathbb{E}_{f \sim \hat{\pi}_{E,t}}[f(\theta)]\). If \(\pi_E\) is biased and fixed, the builder will eventually learn this bias and specialize accordingly. Only distribution correction, which drives \(\pi_E^{(t)} \to \text{Uniform}(F)\), ensures that the builder’s eventual knowledge provides no exploitable advantage. This is the key difference from pure differential privacy: DP slows learning; distribution correction eliminates the incentive to exploit what is learned.

This connects to the differential privacy literature (Dwork et al. 2015; Dwork and Roth 2014). The holdout mechanism inherits DP guarantees from the Sparse Vector Algorithm, bounding the max-information between the builder’s observations and the holdout set. However, DP addresses a different threat: it prevents reconstruction of which specific tasks are in the holdout (overfitting to the test set), while our concern is that the builder learns the evaluation distribution and specializes accordingly.

8.5 Metric Design as Principal-Agent Problem

We now shift from “which tasks to show” to “which metrics to report.” Even with a perfectly designed evaluation mechanism, the choice of metric can create perverse incentives.

8.5.1 When Metrics Create Perverse Incentives

A vivid illustration comes from healthcare. In 2001, the New York State Department of Health began publishing hospital mortality rates for cardiac surgery. Dranove et al. (2003) documented the consequences: hospitals began avoiding severely ill patients rather than improving care. By publishing average treated outcome (ATO) as the quality metric, the system rewarded patient selection rather than treatment effectiveness.

The AI evaluation analogy is direct. A leaderboard that rewards average benchmark score incentivizes developers to specialize on easy benchmarks or to select favorable evaluation conditions, rather than improving broadly. The metric itself creates the misalignment.

Formalizing this as a principal-agent problem (Laffont and Tirole 1986): the principal (evaluator) chooses a reward function \(w\), and the agent (developer) best-responds with policy \(\pi^w\). The principal’s regret is \(R(\pi^w) = \max_{\tilde{\pi} \in \Pi} V(\tilde{\pi}) - V(\pi^w)\), where \(V(\pi) = \mathbb{E}[Y(\pi) - Y(0)]\) is the total treatment effect.

Proposition: ATO Has Unbounded Regret

The average treated outcome \(w_{\text{ATO}}(x, t, y) = y \cdot \mathbf{1}[t = 1]\) can have unbounded regret. An agent maximizing \(\mathbb{E}[w_{\text{ATO}}]\) may achieve \(R(\pi^{w_{\text{ATO}}}) = \max_\pi V(\pi)\) — the worst possible policy.

The intuition is stark: a hospital maximizing average treated outcome selects only the healthiest patients for treatment, achieving excellent ATO but zero (or negative) total treatment effect. In the AI setting, a developer maximizing average benchmark accuracy across their selected tasks may simply drop the hardest benchmarks from their evaluation suite.

Proposition: Total Treatment Effect Achieves Zero Regret

The total treatment effect metric \(w_{\text{TT}}(x, t, y) = y - \hat{\mu}_0(x)\) achieves zero regret when the principal has unbiased counterfactual estimates \(\hat{\mu}_0(x) = \mathbb{E}[Y(0) \mid X = x]\).

The key insight: by subtracting the counterfactual baseline (what would have happened without treatment), the metric removes the incentive for patient selection. In AI evaluation, this translates to scoring the “capability uplift” — performance relative to a baseline — rather than raw performance. This connects to the doubly robust estimation from Chapter 7: counterfactual metrics require causal reasoning about what performance would have been absent the developer’s effort.

8.5.2 Information Asymmetry and Metric Elicitation

Beyond metric choice, there is a prior question: how much should the developer reveal about their model’s capabilities? Consider a developer (agent) with a private cost-correlated variable \(X\) — for instance, knowledge about which capability dimensions their model excels at. The evaluator (principal) can design contracts conditioned on \(X\) if it is revealed.

Drawing on the information elicitation framework of Wang et al. (2024), the key question is: when does the developer prefer to reveal, conceal, or garble information about their capabilities?

  • Reveal: The developer discloses \(X\) fully. The principal designs a targeted contract, which benefits both parties when the information enables efficient allocation.
  • Conceal: The developer hides \(X\). The principal must use a one-size-fits-all contract, which is robust but potentially inefficient.
  • Garble: The developer discloses a noisy version of \(X\). This intermediate option can be Pareto-improving.

Proposition: Principal Always Benefits from Revelation

The principal’s expected utility is weakly higher under revelation than concealment: \(V_P^{\text{reveal}} \geq V_P^{\text{conceal}}\).

Proof sketch. Under revelation, the principal’s optimization problem has a strictly larger feasible set (contracts conditioned on \(X\)), so the optimum can only improve. \(\square\)

But the developer’s incentives are more nuanced. Revelation helps when conditioning on \(X\) sufficiently differentiates high-cost and low-cost types, allowing the principal to offer more efficient contracts. Concealment is preferred when revelation would lead the principal to extract all of the developer’s surplus through perfectly targeted contracts.

8.5.3 Garbling as Differential Privacy

The most interesting case is garbling: the developer reveals \(Y = X\) with probability \(\varepsilon\) and \(Y = \xi\) (noise) with probability \(1 - \varepsilon\). This is precisely the randomized response mechanism from differential privacy.
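A minimal sketch of this channel (the function names and parameter values are my own, not from the cited framework): reveal the true value with probability \(\varepsilon\), otherwise emit independent noise.

```python
import random

def garble(x, eps, noise_draw, rng):
    """Randomized response: reveal the true value x with probability eps,
    otherwise emit an independent draw from the noise distribution."""
    return x if rng.random() < eps else noise_draw(rng)

# Binary capability signal X = 1; the noise is a fair coin.
rng = random.Random(0)
reports = [garble(1, eps=0.3, noise_draw=lambda r: r.randint(0, 1), rng=rng)
           for _ in range(10_000)]

# P(report = 1) = eps + (1 - eps)/2 = 0.65 when X = 1, so the principal
# can de-bias the aggregate while any single report stays deniable.
print(sum(reports) / len(reports))
```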

Proposition: Garbling Can Pareto-Dominate Both Extremes

Under fairly wide conditions (e.g., two exponential cost types with different means), the agent may prefer garbled disclosure to both full concealment and full revelation. Moreover, garbling can increase total welfare compared to concealment.

The intuition is that garbling provides the principal with just enough information to design better contracts without enabling full surplus extraction. In the AI evaluation context, this maps to:

  • Model cards with calibrated noise: Developers disclose approximate capability profiles rather than exact performance vectors.
  • Differential privacy in metric reporting: Adding Laplace noise to reported metrics creates a garbling mechanism.
  • Benchmark transparency tiers: Publish aggregate scores publicly, but release fine-grained breakdowns only through a privacy-preserving mechanism.

This connects directly to the holdout mechanism of Section 8.4.3: both use calibrated noise to create better incentive equilibria. The parallel is deep — the evaluator’s decision about how much to reveal about the evaluation distribution and the developer’s decision about how much to reveal about model capabilities are dual problems in information design.

Figure. Left: heatmap of the agent’s net benefit from revelation vs. concealment across pairs of exponential cost types \((\lambda_0, \lambda_1)\). Red regions indicate the agent prefers revelation; blue regions favor concealment. The black contour marks indifference. Right: agent utility under garbling (noisy disclosure with probability \(\varepsilon\)) for fixed cost types. An intermediate garbling level can outperform both full concealment and full revelation.

8.6 Representative Benchmark Selection

A distinct but related design question: given a large suite of evaluation metrics, how should one select a representative subset? This is the “lite benchmark” problem faced by BIG-bench (200+ metrics \(\to\) 24 in BIG-bench Lite), HELM (\(\to\) HELM Lite), and Cal Hospital Compare (hundreds of quality measures \(\to\) 12).

8.6.1 The Subset Selection Problem

Let there be \(n\) metrics and \(m\) alternatives (models). Each metric \(i\) produces a ranking \(\sigma_i\) over the alternatives. We seek a subset \(K \subseteq N = [n]\) of metrics that is “representative” of the full set \(N\). But what does “representative” mean formally? The work of Procaccia et al. (2025) provides two precise definitions grounded in social choice theory.

8.6.2 Positional Representation

The first notion prevents under-representation at every rank cutoff.

Definition: Positional Representation

A subset \(K\) satisfies positional representation for group size \(g\) if for every rank cutoff \(r \in [m]\) and every alternative \(a\): \[ C(K, r, a) \geq \left\lfloor \frac{C(N, r, a)}{g} \right\rfloor \] where \(C(S, r, a) = |\{i \in S : \sigma_i(a) \leq r\}|\) counts how many metrics in \(S\) rank alternative \(a\) in the top \(r\).

In words: if alternative \(a\) is ranked in the top \(r\) by at least \(\ell \cdot g\) metrics in \(N\), then \(a\) must be ranked in the top \(r\) by at least \(\ell\) metrics in \(K\). The parameter \(g\) controls the granularity of representation — smaller \(g\) requires finer representation but demands larger \(|K|\).
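The definition is easy to check directly. This sketch (toy data and helper names of my own choosing) computes the counts \(C(S, r, a)\) and verifies the floor inequality at every alternative and cutoff:

```python
def C(rankings, S, r, a):
    """Number of metrics in S ranking alternative a in the top r;
    rankings[i][a] is the rank sigma_i(a) (1-indexed)."""
    return sum(1 for i in S if rankings[i][a] <= r)

def satisfies_positional_representation(rankings, K, g, m):
    """Check C(K, r, a) >= floor(C(N, r, a) / g) for all a and r in [m]."""
    N = range(len(rankings))
    return all(
        C(rankings, K, r, a) >= C(rankings, N, r, a) // g
        for a in rankings[0] for r in range(1, m + 1)
    )

# Toy profile: 4 metrics ranking 3 models.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"B": 1, "A": 2, "C": 3},
    {"C": 1, "B": 2, "A": 3},
]
print(satisfies_positional_representation(rankings, K=[0, 2, 3], g=2, m=3))  # True
print(satisfies_positional_representation(rankings, K=[3], g=2, m=3))        # False
```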

Theorem: Positional Representation Bounds

The minimum subset size needed to guarantee positional representation satisfies: \[ \Omega\!\left(\frac{n}{g} \cdot \frac{\log m}{\log(n \log m / g)}\right) \leq |K| \leq O\!\left(\frac{n}{g} \log m\right) \] The upper bound is achieved by a polynomial-time greedy algorithm (Algorithm 1 in Procaccia et al. (2025)) based on set cover.

The greedy algorithm works as follows: iterate through the preference profile row by row, coloring entries when an alternative accumulates \(g\) appearances. Then greedily select metrics that cover the most remaining colors. The connection to set cover yields the logarithmic factor.
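A simplified rendering of the greedy idea (a sketch, not the exact Algorithm 1 of Procaccia et al. (2025)): precompute the demand \(\lfloor C(N, r, a)/g \rfloor\) for each (alternative, cutoff) pair, then repeatedly add the metric that covers the most unmet demands:

```python
def greedy_positional_representation(rankings, g, m):
    """Greedy set-cover sketch: meet the demand floor(C(N, r, a) / g)
    for every (alternative, cutoff) pair by repeatedly adding the
    metric that covers the most still-unmet demands."""
    n = len(rankings)
    demands = {}
    for a in rankings[0]:
        for r in range(1, m + 1):
            need = sum(1 for i in range(n) if rankings[i][a] <= r) // g
            if need:
                demands[(a, r)] = need
    K, covered = [], {d: 0 for d in demands}

    def gain(i):
        return sum(1 for (a, r) in demands
                   if covered[(a, r)] < demands[(a, r)] and rankings[i][a] <= r)

    while any(covered[d] < demands[d] for d in demands):
        best = max((i for i in range(n) if i not in K), key=gain)
        K.append(best)
        for (a, r) in demands:
            if rankings[best][a] <= r and covered[(a, r)] < demands[(a, r)]:
                covered[(a, r)] += 1
    return K

rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"B": 1, "A": 2, "C": 3},
    {"C": 1, "B": 2, "A": 3},
]
K = greedy_positional_representation(rankings, g=2, m=3)
print(K)  # a subset meeting every demand on this toy profile
```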

8.6.3 Positional Proportionality

Positional representation prevents under-representation but not over-representation. A stronger notion prevents both:

Definition: Positional Proportionality

A subset \(K\) satisfies \(\epsilon\)-positional proportionality if for every alternative \(a\) and every rank cutoff \(r\): \[ \left| \frac{C(N, r, a)}{|N|} - \frac{C(K, r, a)}{|K|} \right| \leq \epsilon \]

This preserves the fraction of metrics ranking each alternative at each position, up to additive error \(\epsilon\).

Theorem: Positional Proportionality Bounds

The minimum subset size for \(\epsilon\)-positional proportionality satisfies: \[ \Omega\!\left(\frac{1}{\epsilon^2} \log m\right) \leq |K| \leq O\!\left(\frac{1}{\epsilon^2} \log m\right) \] These bounds are tight up to constant factors.

The tight bounds follow from a connection to uniform convergence: a random subset of size \(O(\epsilon^{-2} \log m)\) satisfies positional proportionality with high probability, by a Chernoff-type argument applied simultaneously to all \(O(m)\) alternatives and rank cutoffs.
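The uniform-convergence argument is easy to check empirically. In this sketch (synthetic rankings; the sizes are arbitrary choices), a uniformly random subset of 50 out of 200 metrics already keeps the worst-case positional deviation modest:

```python
import random

def prop_error(rankings, K, m):
    """Worst case of | C(N, r, a)/|N| - C(K, r, a)/|K| | over all a and r."""
    N = list(range(len(rankings)))

    def frac(S, r, a):
        return sum(1 for i in S if rankings[i][a] <= r) / len(S)

    return max(abs(frac(N, r, a) - frac(K, r, a))
               for a in rankings[0] for r in range(1, m + 1))

# Synthetic profile: 200 metrics, each a uniformly random ranking of 20 models.
rng = random.Random(0)
m = 20
models = [f"M{j}" for j in range(m)]
rankings = []
for _ in range(200):
    perm = models[:]
    rng.shuffle(perm)
    rankings.append({a: perm.index(a) + 1 for a in models})

K = rng.sample(range(200), 50)
print(prop_error(rankings, K, m))  # typically small for |K| = 50
```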

A particularly useful consequence connects proportionality to scoring rules:

Theorem: Scoring Rule Approximation

If \(K\) satisfies \(\epsilon\)-positional proportionality, then for any scoring rule with score vector \(s\) and every alternative \(a\): \[ |f_s(a, \sigma_N) - f_s(a, \sigma_K)| \leq \epsilon \] where \(f_s(a, \sigma_S) = \frac{1}{|S|} \sum_{i \in S} s_{\sigma_i(a)}\) is the average score of alternative \(a\) under scoring rule \(s\) in metric set \(S\).

This is the key practical guarantee: a subset satisfying positional proportionality approximates any scoring rule on the original set of metrics. The connection to Section 6.5.2 is direct: positional proportionality provides a formal criterion for when a “lite” benchmark preserves the same information as the full suite, complementing the dimensionality analysis of factor models.

Figure. Left: subset size \(|K|\) required by the greedy algorithm for positional representation at various group sizes \(g\), compared to random selection and the theoretical \(O(n/g \cdot \log m)\) bound. The greedy algorithm consistently outperforms random selection. Right: for \(g=5\), comparison of the fraction of metrics ranking each model in the top 5, between the full set (\(n=50\)) and the greedy-selected subset. The subset closely preserves the ranking structure.

8.7 Synthesis: Design Principles for Strategic Benchmarks

The three threads of this chapter — information design, metric design, and representative selection — converge on six actionable design principles for AI evaluation.

Table 8.1: Six design principles for strategic AI evaluation
  1. Randomize and Refresh. Formal basis: Theorem 1 (one-shot alignment), Prop. (deterministic failure). Recommendation: use randomized evaluation with task renewal; static benchmarks are Goodhart-vulnerable by construction.
  2. Correct and Grow. Formal basis: Prop. (alignment recovery), Prop. (optimal \(k^*\)). Recommendation: invest in distribution correction (\(\rho\)) as the primary lever; the optimal \(k^*\) scales with \(\sqrt{\rho}\).
  3. Gate Information Release. Formal basis: holdout mechanism, Prop. (learning slowdown). Recommendation: use threshold tests to condition information flow on builder behavior, combined with distribution correction.
  4. Align Metrics with Welfare. Formal basis: Prop. (ATO regret), Prop. (TT zero regret). Recommendation: score total treatment effect, not averages, and account for counterfactual baselines.
  5. Allow Partial Transparency. Formal basis: Prop. (garbling dominance). Recommendation: calibrated noise in metric reporting (garbling) can create Pareto improvements over both full transparency and full opacity.
  6. Ensure Representative Subsets. Formal basis: Thms. (positional representation/proportionality bounds). Recommendation: use formal representation criteria rather than ad hoc selection when creating lite benchmarks.

These principles interact in important ways. Principles 1–3 address the information channel (how much the evaluator reveals about the evaluation mechanism). Principle 4 addresses the metric channel (what the evaluator measures). Principle 5 addresses the developer channel (what the developer reveals about their model). Principle 6 addresses the scope channel (which metrics to include).

A well-designed evaluation system should operate on all four channels simultaneously: randomize task selection (Principle 1), invest in distribution correction (Principle 2), gate information via holdout mechanisms (Principle 3), use welfare-aligned metrics (Principle 4), allow developers to disclose through privacy-preserving mechanisms (Principle 5), and ensure the metric suite is formally representative (Principle 6).

8.8 Discussion Questions

  1. The evaluation game framework assumes a benevolent evaluator (social planner). What changes if the evaluator also has strategic incentives — for example, a company running its own benchmark to favor its own models?

  2. How does the information-variance tradeoff relate to the reliability-validity tradeoff from Chapter 5 and Chapter 6? Is there a formal connection between \(\sigma^2/k\) (evaluation variance) and Cronbach’s \(\alpha\)?

  3. Can you design a mechanism where the builder’s incentive is to improve on the hardest tasks rather than the average? How would you modify \(u_E(\theta)\) and \(r(\theta, S)\)?

  4. In what sense is the Chatbot Arena a randomized evaluation mechanism? Does it satisfy the conditions of Theorem 1? What information leaks through the adaptive matching process?

  5. How should a government regulator set \(\gamma\) (the gaming penalty) for a safety benchmark? What factors should influence this choice?

  6. If two “lite” benchmarks both satisfy \(\epsilon\)-positional proportionality with the same \(\epsilon\) but select different subsets, which should be preferred? What additional criteria might break ties?

  7. The holdout mechanism draws on differential privacy. What is the relationship between the privacy budget \(\varepsilon\) in the holdout mechanism and the notion of \(\varepsilon\)-differential privacy from the data privacy literature?

  8. Distribution correction requires the evaluator to “know their bias.” How can an evaluator identify which regions of the task universe they under-represent? What role do meta-evaluations and coverage audits play?

8.9 Bibliographic Notes

The Stackelberg evaluation game and the information-variance tradeoff are developed in Truong et al. (2025). The distribution correction mechanism and optimal evaluation size results are from the same work. The analysis of metric design as a principal-agent problem draws on Wang et al. (2024), which develops counterfactual quality metrics (Chapter 5) and information elicitation in agency games (Chapter 6). The positional representation and proportionality framework is from Procaccia et al. (2025).

Goodhart’s Law was articulated by Goodhart (1984) in the context of monetary policy. Manheim and Garrabrant (2018) formalize four variants. The connection to reward overoptimization in RLHF is explored by Gao, Schulman, and Hilton (2023).

Strategic classification — the study of agents who manipulate their features to achieve favorable classification outcomes — was formalized by Hardt et al. (2016). Perdomo et al. (2020) introduce performative prediction, where the model itself changes the data distribution. Braverman and Garg (2020) show that randomness is necessary for efficient classification under strategic behavior.

Bayesian persuasion (Kamenica and Gentzkow 2011) provides the information design framework connecting evaluator disclosure to builder behavior. Bergemann and Morris (2019) survey the broader information design literature. Contract theory (Laffont and Tirole 1986; Holmstrom and Milgrom 1991) provides the principal-agent foundations for metric design.

Differential privacy and adaptive data analysis (Dwork et al. 2015; Dwork and Roth 2014) underpin the holdout mechanism. Blum and Hardt (2015) apply reusable holdout techniques to machine learning competitions. The connection between DP and strategic robustness is developed in the holdout mechanism analysis.

Social choice and benchmarking connect to Zhang and Hardt (2024) (Arrow’s impossibility for benchmarks), Colombo et al. (2022) (Borda count for benchmark aggregation), and Rofin and Mikhailov (2023) (scoring rules for benchmark ranking). The committee selection literature, particularly justified representation (Aziz et al. 2017), provides the social choice foundations for positional representation.

8.10 Exercises

  1. (Easy) Show that if \(\pi_E = \text{Uniform}(F)\), the one-shot alignment theorem holds for any sample size \(k\), not just \(k = 1\). Hint: Show that for any \(k\)-subset mechanism, the expected reward is proportional to \(u_E(\theta)\).

  2. (Easy) Verify that \(k^* = \sigma\sqrt{\rho/\gamma}\) minimizes \(\mathcal{L}(k) = \sigma^2/k + \gamma k/\rho\). What is \(\mathcal{L}(k^*)\)? What happens to \(k^*\) when \(\gamma\) doubles?

  3. (Medium) Derive the estimation-limited term in the misalignment bound using a Dirichlet\((1, \ldots, 1)\) prior over distributions on \(F\) with \(m\) effective observations. Show that the posterior predictive is the shrinkage estimator \(\hat{\pi}_E = \frac{m}{m+N}\hat{p} + \frac{N}{m+N}\text{Uniform}(F)\) and that \(\text{TV}(\hat{\pi}_E, \text{Uniform}(F)) \leq m/(m+N)\).

  4. (Medium) In the agency game with binary \(X\), suppose \(C \mid X=0 \sim \text{Exp}(\lambda_0)\) and \(C \mid X=1\) is zero-cost. Derive the condition on \(\lambda_0\) and the principal’s value \(b\) under which the agent prefers concealment to revelation.

  5. (Medium) Show that any subset \(K\) satisfying \(\epsilon\)-positional proportionality approximates any scoring rule within \(\epsilon\). Hint: Write \(f_s(a, \sigma_S)\) as a weighted sum of cumulative counts \(C(S, r, a)/|S|\) and apply Abel summation.

  6. (Hard) Extend the one-shot alignment theorem to the case where the builder has a non-uniform prior \(p_M\) over \(F\). Show that the builder’s best response maximizes \(\mathbb{E}_{f \sim p_M}[f(\theta)]\). Under what conditions does this coincide with maximizing \(u_E(\theta)\)?

  7. (Hard) Prove that the holdout mechanism slows the builder’s posterior concentration by a factor of \(e^{\varepsilon\eta}\). Hint: Compute the effective sample size \(m_t^{\text{holdout}} = n_0 + p_{\text{alarm}} \cdot k \cdot \min(t, \rho^{-1})\) and compare to \(m_t^{\text{baseline}}\).

  8. (Hard, computational) Implement the greedy algorithm for positional representation on the HELM Lite data (or a synthetic dataset with \(n = 100\) metrics, \(m = 50\) models). Compare \(|K|\) to the existing lite benchmark subset for different values of the group size \(g\). At what \(g\) does the greedy algorithm match the size of the existing subset?

References

Aziz, Haris, Markus Brill, Vincent Conitzer, Edith Elkind, Rupert Freeman, and Toby Walsh. 2017. “Justified Representation in Approval-Based Committee Voting.” Social Choice and Welfare 48 (2): 461–85.
Bergemann, Dirk, and Stephen Morris. 2019. “Information Design: A Unified Perspective.” Journal of Economic Literature 57 (1): 44–95.
Blum, Avrim, and Moritz Hardt. 2015. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions.” In Proceedings of the 32nd International Conference on Machine Learning, 1006–14.
Braverman, Mark, and Sumegha Garg. 2020. “The Role of Randomness and Noise in Strategic Classification.” Foundations of Responsible Computing (FORC).
Colombo, Pierre, Benjamin Clavié, Nathan Nogué, and Pablo Piantanida. 2022. “What Are the Best Systems? New Perspectives on NLP Benchmarking.” arXiv Preprint arXiv:2202.03799.
Dranove, David, Daniel Kessler, Mark McClellan, and Mark Satterthwaite. 2003. “Is More Information Better? The Effects of ‘Report Cards’ on Health Care Providers.” Journal of Political Economy 111 (3): 555–88.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. “Generalization in Adaptive Data Analysis and Holdout Reuse.” In Advances in Neural Information Processing Systems. Vol. 28.
Dwork, Cynthia, and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4). Now Publishers.
Gao, Leo, John Schulman, and Jacob Hilton. 2023. “Scaling Laws for Reward Model Overoptimization.” Proceedings of the 40th International Conference on Machine Learning, 10835–66.
Goodhart, Charles A. E. 1984. “Problems of Monetary Management: The U.K. Experience.” Monetary Theory and Practice, 91–121.
Hardt, Moritz, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. 2016. “Strategic Classification.” In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 111–22. ACM.
Holmstrom, Bengt, and Paul Milgrom. 1991. “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design.” Journal of Law, Economics, and Organization 7: 24–52.
Kamenica, Emir, and Matthew Gentzkow. 2011. “Bayesian Persuasion.” American Economic Review 101 (6): 2590–2615.
Laffont, Jean-Jacques, and Jean Tirole. 1986. “Using Cost Observation to Regulate Firms.” Journal of Political Economy 94 (3): 614–41.
Manheim, David, and Scott Garrabrant. 2018. “Categorizing Variants of Goodhart’s Law.” arXiv Preprint arXiv:1803.04585.
Perdomo, Juan, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. 2020. “Performative Prediction.” Proceedings of the 37th International Conference on Machine Learning, 7599–609.
Procaccia, Ariel D., Benjamin Schiffer, Serena Wang, and Shirley Zhang. 2025. “Metritocracy: Representative Metrics for Lite Benchmarks.” arXiv Preprint arXiv:2506.09813.
Rofin, Mark, and Danila Mikhailov. 2023. “VOTE’N’RANK: Revision of Benchmarking with Social Choice Theory.” Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 670–86.
Truong, Son, Serena Wang, Hoda Heidari, and Rishi Bommasani. 2025. “Incentive-Aligned Evaluation via Private Benchmark.” arXiv Preprint arXiv:2506.00000.
Vapnik, Vladimir N. 1998. Statistical Learning Theory. New York: Wiley.
Wang, Serena, Michael I. Jordan, Katrina Ligett, and R. Preston McAfee. 2024. “Relying on the Metrics of Evaluated Agents.” arXiv Preprint arXiv:2402.14005.
Zhang, Shirley, and Moritz Hardt. 2024. “The Inherent Tradeoffs in LLM Benchmarks.” arXiv Preprint.