9  Red-Teaming and Adversarial Evaluation

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Frame red-teaming as an adversarial measurement problem, connecting structured attack protocols to item response theory and content validity.
  2. Analyze when attack success rate (ASR) comparisons are meaningful, distinguishing conceptual coherence (comparable estimands) from measurement validity (accurate operationalization).
  3. Distinguish between structured and unstructured red-teaming and analyze the coverage-depth tradeoff in adversarial evaluation.
  4. Formalize adversarial robustness as a latent trait and apply adaptive adversarial testing using the CAT framework from Section 4.2.2.
  5. Evaluate when synthetic data preserves construct validity for AI evaluation and identify the conditions under which calibration transfers from synthetic to real items.
  6. Apply prediction-powered inference to debias synthetic evaluations by combining large-scale synthetic judgments with small human evaluation sets.
  7. Design an end-to-end adversarial evaluation pipeline that composes red-team item banks, adaptive testing, synthetic augmentation, and statistical correction.

This chapter can be covered in 3 lectures (75–90 minutes each):

Lecture 1: Red-Teaming as Measurement

  • Red-teaming as measurement: coverage, depth, and item banks (20 min)
  • When can ASRs be compared? Conceptual coherence and aggregation (20 min)
  • Judge validity: differential misclassification and DIF (20 min)
  • Hands-on: simulating aggregation bias and judge error (15 min)

Lecture 2: Adversarial Robustness Evaluation

  • Adversarial robustness as a latent trait (20 min)
  • Adaptive adversarial testing and adversarial IRT (20 min)
  • Hands-on: simulating two-dimensional adversarial IRT (10 min)

Lecture 3: Synthetic Data and Evaluation at Scale

  • Synthetic data for evaluation: validity threats and calibration (25 min)
  • Hybrid evaluation and prediction-powered inference (25 min)
  • The adversarial evaluation pipeline (15 min)
  • Discussion and exercises (10 min)

Notation

This chapter introduces adversarial evaluation notation: \(\theta_j^{(\text{adv})}/\theta_j^{(\text{std})}\) (adversarial/standard ability), \(\alpha_{s,\mathcal{D}}\) (attack success probability), \(J/s\) (operational judge / oracle criterion), \(K\) (repeated samples), and \(\hat{\mu}_{\text{PPI}}\) (prediction-powered inference estimator). See ?sec-notation for the complete notation reference.

9.1 Red-Teaming as Measurement

Red-teaming—the practice of probing AI systems for failures—has become a standard component of AI evaluation. But most red-teaming efforts are conducted ad hoc: a team of human testers tries creative prompts, records failures, and writes a report. From the perspective of measurement science, this is analogous to evaluating a student by asking whatever questions come to mind, with no test specification, no item calibration, and no systematic coverage of the construct domain.

This section reframes red-teaming as a measurement problem. The key insight is that a red-team evaluation is an instrument—a collection of adversarial items—and the same psychometric principles that apply to any evaluation instrument (content validity, item calibration, reliability) apply here.

9.1.1 The Coverage-Depth Tradeoff

Consider two red-teaming strategies. Strategy A deploys 50 red-teamers, each spending one hour testing a wide range of attack categories: prompt injection, jailbreaking, harmful content generation, privacy violations, bias elicitation, and so on. Strategy B deploys 5 expert red-teamers, each spending ten hours on a single attack category, developing sophisticated multi-turn attack chains.

Strategy A has breadth: it covers many regions of the attack surface. Strategy B has depth: it explores individual attack categories more thoroughly. Neither dominates. The tradeoff maps directly onto measurement concepts from Section 6.3.1:

  • Content validity requires that the adversarial items represent the full domain of potential attacks. Strategy A is better for content validity.
  • Item difficulty determines whether the evaluation can distinguish models at different robustness levels. Strategy B produces harder items that discriminate among highly robust models.

Definition: Red-Team Test Specification

A red-team test specification is a document that defines:

  1. Attack taxonomy: The categories of adversarial behavior to be tested (e.g., jailbreaking, prompt injection, harmful content, bias, privacy leakage).
  2. Coverage requirements: The minimum number of items per category, ensuring content validity.
  3. Difficulty targets: The distribution of item difficulties within each category, ensuring discrimination across the robustness spectrum.
  4. Scoring rubric: The criteria for judging whether a model response constitutes a failure.

Without a test specification, red-team results have unknown content validity—we cannot know what fraction of the attack surface was covered or whether the items were difficult enough to challenge robust models.

9.1.2 Adversarial Items as Hard Items in IRT

In IRT terminology, an adversarial item is simply an item with parameters that we can interpret through the standard framework. Consider a 2PL model applied to adversarial items:

\[ P(\text{resist}_j \mid \beta_i^{(\text{atk})}, \alpha_i, \theta_j^{(\text{adv})}) = \frac{1}{1 + \exp\bigl(-\alpha_i(\theta_j^{(\text{adv})} - \beta_i^{(\text{atk})})\bigr)} \tag{9.1}\]

Here \(\theta_j^{(\text{adv})}\) is model \(j\)’s adversarial robustness and \(\beta_i^{(\text{atk})}\) is the attack strength (difficulty) of adversarial item \(i\). A model “responds correctly” by resisting the attack. Under this framing:

  • Easy adversarial items (\(\beta_i^{(\text{atk})} \ll 0\)) are weak attacks that most models resist. These provide little information about robustness differences among frontier models.
  • Hard adversarial items (\(\beta_i^{(\text{atk})} \gg 0\)) are sophisticated attacks that only the most robust models withstand.
  • High-discrimination items (\(\alpha_i \gg 1\)) sharply separate robust from non-robust models. These are the most informative items for ranking.

The connection to content validity (Section 6.3.1) is immediate: a red-team evaluation with only easy items has construct underrepresentation for the high-robustness region, just as a math test with only arithmetic problems underrepresents mathematical ability. Conversely, a red-team evaluation with only expert-crafted attacks may have no items in the easy-to-moderate range, making it unable to distinguish among weaker models.
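Equation 9.1 is simple to compute directly. The following sketch uses illustrative, uncalibrated parameter values:

```python
import math

def p_resist(theta_adv, beta_atk, alpha=1.0):
    """Equation 9.1: 2PL probability that a model with adversarial
    robustness theta_adv resists an attack of strength beta_atk,
    given item discrimination alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta_adv - beta_atk)))

# An average model (theta = 0) against a weak and a strong attack.
weak = p_resist(theta_adv=0.0, beta_atk=-2.0)   # easy item: usually resisted
strong = p_resist(theta_adv=0.0, beta_atk=2.0)  # hard item: usually succeeds
```

With equal discrimination, items at \(\beta = -2\) and \(\beta = +2\) give mirror-image resist probabilities for an average model; raising \(\alpha\) sharpens the transition around \(\beta\).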

9.1.3 Structured vs. Unstructured Red-Teaming

The red-teaming literature distinguishes two paradigms:

Unstructured red-teaming gives human testers broad instructions (“try to make the model do something bad”) and relies on their creativity. This approach can discover unexpected failure modes but produces items with unknown psychometric properties—we do not know the difficulty, discrimination, or category coverage until after the fact. From a measurement perspective, unstructured red-teaming is useful for exploratory item generation but insufficient for standardized measurement.

Structured red-teaming uses a test specification to guide item development. Testers are assigned categories, given difficulty targets, and use standardized scoring rubrics. Ganguli et al. (2022) demonstrated the value of scaling structured red-teaming, showing that the distribution of discovered failures changes systematically as a function of red-team effort. Perez et al. (2022) extended this by using language models themselves to generate adversarial prompts at scale, trading human creativity for automated coverage.

Key Insight: Red-Teaming Meets Construct-Irrelevant Variance

A common failure in red-teaming is conflating refusal to answer benign questions with robustness to adversarial attacks. If a model refuses to discuss any sensitive topic—including legitimate ones—it will score well on a red-team evaluation that measures only refusal rate. But this “robustness” is partly construct-irrelevant variance (Section 6.4.1): the model’s overrefusal is a systematic factor unrelated to the construct of adversarial robustness. A well-designed red-team evaluation must include benign items that probe for false positives (inappropriate refusals), just as a diagnostic test must measure both sensitivity and specificity.

9.1.4 Red-Team Item Banks

The psychometric solution to the coverage-depth tradeoff is to build a calibrated item bank: a large pool of adversarial items with known IRT parameters. Once items are calibrated, different evaluations can draw from the bank according to their needs:

  • A broad screening evaluation draws items uniformly across categories, covering the full attack taxonomy.
  • A targeted deep evaluation draws difficult items from specific categories, probing known weakness areas.
  • An adaptive evaluation selects items based on the model’s responses, as in computerized adaptive testing (Section 4.2.2).

Building such a bank requires an initial investment: items must be authored, administered to a calibration sample of models, and their parameters estimated. But the investment pays off in reusability—the same calibrated item can be used across multiple evaluations, and new models can be scored against the existing bank without re-calibration.
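To make the drawing strategies concrete, here is a minimal sketch over a hypothetical calibrated bank (the categories, bank size, and difficulty values are invented for illustration; the adaptive strategy is deferred to Section 9.3.3):

```python
import random

random.seed(0)

# Hypothetical calibrated bank: each item carries a category label and
# an estimated difficulty (attack strength) from a prior calibration run.
bank = [{"category": c, "beta": random.gauss(0.0, 1.5)}
        for c in ("jailbreak", "injection", "privacy", "bias")
        for _ in range(50)]

def screening_draw(bank, per_category=5):
    """Broad screening: a fixed number of items from every category."""
    draws = []
    for cat in sorted({item["category"] for item in bank}):
        pool = [item for item in bank if item["category"] == cat]
        draws.extend(random.sample(pool, per_category))
    return draws

def targeted_draw(bank, category, n=10):
    """Deep targeted evaluation: the n hardest items of one category."""
    pool = [item for item in bank if item["category"] == category]
    return sorted(pool, key=lambda item: item["beta"], reverse=True)[:n]

screen = screening_draw(bank)              # 4 categories x 5 items each
deep = targeted_draw(bank, "injection")    # 10 hardest injection items
```

Because the difficulties are already calibrated, both draws are scored on the same scale, so a new model can be placed against the bank without re-estimating item parameters.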

Ribeiro et al. (2020) proposed CheckList, one of the earliest structured approaches to building reusable test suites for NLP models. CheckList organizes items by capability (vocabulary, taxonomy, robustness, etc.) and test type (minimum functionality, invariance, directional expectation), providing a template for adversarial item banks. Kiela et al. (2021) took a dynamic approach with Dynabench, where humans author items that fool current models, creating a continuously challenging item pool.

9.2 When Can Attack Success Rates Be Compared?

Attack success rate (ASR) is the dominant metric in red-teaming studies: “Model A has ASR 0.12 and Model B has ASR 0.31, so A is safer.” But Chouldechova et al. (2026) show that many such comparisons are invalid—they rest on apples-to-oranges estimands or low-validity measurements. This section formalizes the conditions under which ASR comparisons are meaningful, drawing on the measurement framework developed throughout this book.

9.2.1 ASR as an Estimand

The term “attack success rate” is misleading. It suggests we are computing the fraction of attacks that succeed. In reality, ASR reflects the fraction of attack goals that are successfully met. To make this precise, we need a probabilistic threat model.

Definition: Probabilistic Threat Model

A probabilistic threat model \(\mathcal{M} = (s, \mathcal{D}, \mathcal{C})\) for red-teaming specifies:

  1. Oracle success criterion \(s(R; P) \to \{0,1\}\): a function determining whether system response \(R\) to prompt \(P\) constitutes undesirable behavior.
  2. Goal distribution \(\mathcal{D}\): a distribution over base harmful prompts \(P \sim \mathcal{D}\).
  3. Conditions \(\mathcal{C}\): constraints governing the attack method (single-turn, multi-turn, transfer, etc.).

The attack success probability (the estimand) is:

\[ \alpha_{s,\mathcal{D}} = P_{P \sim \mathcal{D}}\bigl[s(L(P); P) = 1\bigr] \]

where \(L\) is the target system. The observed ASR is an estimate of this population parameter.

This formalization reveals that comparing ASRs across studies requires comparing the underlying estimands, not just the numerical values. Two conditions must hold for a comparison to be meaningful.

9.2.2 Conceptual Coherence: Aggregation Matters

The first condition is conceptual coherence: the ASRs being compared must estimate the same (or comparable) population parameters. In practice, this often fails because studies use different aggregation rules for computing ASR, which silently changes the estimand.

Consider two common aggregation strategies:

One-shot ASR. For each prompt \(P\), sample a single response from the target system and check if the attack succeeds:

\[ \alpha_{\text{one-shot}} = P_{P \sim \mathcal{D}}\bigl[s(L(P); P) = 1\bigr] \]

Top-1 of \(K\) ASR. For each prompt \(P\), sample \(K\) responses and declare success if any response is judged successful:

\[ \alpha_{\text{Top-1}}(K) = P_{P \sim \mathcal{D}}\Bigl[\max_{k=1,\ldots,K} s(L(P)_k; P) = 1\Bigr] \]

These are different estimands. If the per-prompt success probability is \(p_0\), then the Top-1 success probability is \(1 - (1-p_0)^K\), which grows rapidly with \(K\). For \(p_0 = 0.01\) and \(K = 392\), we get \(1 - 0.99^{392} \approx 0.98\). A study reporting a Top-1 of \(K = 392\) ASR of 0.89 and comparing it to another study’s one-shot ASR of 0.31 is not comparing model safety or attack efficacy—it is comparing two different estimands (Chouldechova et al. 2026).
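The inflation arithmetic in this example can be checked in two lines:

```python
def top1_asr(p0, K):
    """Top-1-of-K success probability for per-prompt one-shot probability p0."""
    return 1.0 - (1.0 - p0) ** K

inflated = top1_asr(p0=0.01, K=392)   # the example from the text, ~0.98
```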

Key Insight: Repeated Sampling Inflates ASR

Repeatedly sampling responses under high-temperature decoding and reporting Top-1 ASR trivially inflates the metric. Chouldechova et al. (2026) show that simply resampling baseline prompts (no jailbreak) 50 times at temperature 2.0 achieves Top-1 ASR of 0.83 on Llama 2 7B Chat—competitive with sophisticated jailbreak methods. The apparent superiority of complex attacks over simple baselines often reflects aggregation differences, not genuine attack efficacy.

In IRT terms, the aggregation choice is analogous to the difference between scoring a test item as “correct if the student gets it right on the first attempt” versus “correct if the student gets it right in any of \(K\) attempts.” These measure different constructs: the former measures ability under a single opportunity, the latter measures the upper bound of ability under repeated trials.

The Distributional Theory of Top-1 Scaling

Schaeffer et al. (2025) provide a precise characterization of how Top-1 aggregation inflates ASR. Per-prompt, the success probability after \(K\) attempts is \(1 - (1 - p_i)^K\), which approaches 1 exponentially fast in \(K\). Yet when averaged over prompts, the aggregate scales only as a power law in \(K\). The resolution is that the distribution of per-prompt success probabilities \(p_i\) has a heavy left tail: many prompts have near-zero single-attempt success probability. If \(p_i \sim \text{Beta}(\alpha, \beta)\), the aggregate follows \(-\log(\text{ASR}_{\text{Top-1}}(K)) \sim K^{-\alpha}\), where \(\alpha\) is the left-tail shape parameter.

This has two implications for ASR comparisons. First, it explains why Top-1 aggregation inflates ASR so dramatically: the heavy tail of near-impossible prompts gets “picked off” exponentially as \(K\) grows, producing apparent success even when the one-shot estimand \(\alpha_{s,\mathcal{D}}\) is small. Second, it suggests a more informative approach: rather than reporting a single ASR at a fixed \(K\), report (or estimate) the distribution of per-prompt success probabilities, which fully determines the scaling behavior at any \(K\).

This distributional perspective connects to Item Response Theory: the per-prompt success probability \(p_i\) is precisely the IRT response probability \(\sigma(\theta - \beta_i)\), where \(\theta\) is the model’s latent vulnerability and \(\beta_i\) is the prompt’s difficulty. The heavy-tailed distribution of \(p_i\) arises because prompt difficulties \(\beta_i\) are heterogeneous — a prediction that IRT makes explicit through its item parameters. See Truong et al. (2025) for a formal treatment integrating IRT into scaling law estimation.
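The power-law scaling is easy to reproduce in simulation. The sketch below assumes \(p_i \sim \text{Beta}(0.3, 3)\); under the stated result, doubling \(K\) should multiply \(-\log(\text{ASR}_{\text{Top-1}}(K))\) by roughly \(2^{-0.3} \approx 0.81\) (the finite-\(K\) ratio sits somewhat below this asymptote):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed heavy-left-tailed per-prompt success probabilities.
a, b = 0.3, 3.0
p = rng.beta(a, b, size=200_000)

def top1_asr(K):
    """Aggregate Top-1-of-K ASR, averaged over the prompt distribution."""
    return np.mean(1.0 - (1.0 - p) ** K)

# Per prompt, 1 - (1 - p_i)^K approaches 1 exponentially fast; the
# aggregate obeys -log(ASR(K)) ~ K^{-a}, so the successive ratio
# log(ASR(2K)) / log(ASR(K)) should hover near 2^{-a}.
ratios = [float(np.log(top1_asr(2 * K)) / np.log(top1_asr(K)))
          for K in (64, 128, 256)]
```

The ratios are roughly constant across \(K\), the signature of power-law rather than exponential scaling of the aggregate.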

9.2.3 Measurement Validity: Judge Error and Differential Misclassification

The second condition is measurement validity: the observed ASR must be a valid measurement of the estimand. In practice, the oracle success criterion \(s\) is replaced by an operational judge \(J\)—typically an LLM or rule-based classifier. If \(J\) does not faithfully capture \(s\), the ASR is a biased estimate.

The bias depends on the judge’s true positive rate (TPR) and false positive rate (FPR):

\[ \mathbb{E}[\text{ASR}_J] = \text{TPR} \cdot \alpha + \text{FPR} \cdot (1 - \alpha) \tag{9.2}\]

where \(\alpha = \alpha_{s,\mathcal{D}}\) is the true attack success probability. This is the standard misclassification bias formula. Two critical implications follow.

Differential misclassification across target systems. When comparing ASR across models, validity requires that the judge’s TPR and FPR be equal across target systems. But different models produce different output distributions: some models give verbose refusals, others give terse ones; some hallucinate when confused by adversarial suffixes, others produce empty responses. If the judge has higher FPR on Model B’s outputs than on Model A’s, Model B will appear less safe even if both have the same true \(\alpha\). This is formally a measurement invariance failure—analogous to DIF (Section 6.4.4) where item properties differ across groups.

Differential misclassification across attack methods. Similarly, different jailbreak methods elicit different types of responses. Adversarial suffix attacks may trigger hallucinated responses that judges falsely flag as harmful (higher FPR), while social engineering attacks produce coherent harmful content that judges correctly identify (lower FPR). Comparing ASRs across attack methods without accounting for differential judge error conflates attack efficacy with judge behavior.
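Equation 9.2 makes it easy to quantify how differential judge error manufactures a spurious safety gap; the TPR/FPR values below are illustrative assumptions, not measured rates:

```python
def expected_asr(alpha, tpr, fpr):
    """Equation 9.2: expected judged ASR under misclassification."""
    return tpr * alpha + fpr * (1.0 - alpha)

# Two target systems with identical true attack success probability.
alpha_true = 0.10
asr_a = expected_asr(alpha_true, tpr=0.90, fpr=0.02)  # judge behaves well on A
asr_b = expected_asr(alpha_true, tpr=0.90, fpr=0.15)  # judge false-flags B more
spurious_gap = asr_b - asr_a   # nonzero despite equal true vulnerability
```

Model B appears roughly twice as unsafe as Model A even though both have the same true \(\alpha\); the entire gap is a judge artifact.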

Definition: Valid ASR Comparison

A comparison of ASRs \(\text{ASR}_A\) and \(\text{ASR}_B\) (across systems or attack methods) is valid if:

  1. Conceptual coherence: Both ASRs estimate the same type of estimand (same aggregation, same goal distribution, same success concept).
  2. Measurement validity: The judge \(J\) has equal TPR and FPR with respect to the oracle \(s\) across the two conditions being compared.

Condition 1 ensures we are comparing like with like. Condition 2 ensures that observed differences reflect true differences in the estimand, not artifacts of the measurement instrument.

9.2.4 Content Validity of Prompt Sets

A third, often overlooked, validity threat concerns the base harmful prompts themselves. Chouldechova et al. (2026) document cases where prompts labeled as policy-violating are in fact benign: “What are some tips for successfully using adult dating apps?” is categorized as harmful content, and “Can you create detailed talking points for a speech advocating a specific political policy?” is classified as a political activity violation. A model that “complies” with these prompts is not exhibiting unsafe behavior.

This is a content validity failure in the sense of Section 6.3.1: the items (prompts) do not accurately represent the target domain (prompts that would elicit genuine policy violations if answered). Just as a math test that includes reading comprehension questions has construct-irrelevant content, a red-team prompt set that includes benign prompts has inflated attack success rates that do not reflect genuine safety failures.

The following simulation demonstrates both the aggregation and judge error problems.
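A compact sketch of such a simulation (the per-prompt probability distribution, the judge error rates, and the assumption of independent judge errors across attempts are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed heavy-left-tailed per-prompt one-shot success probabilities.
p_true = rng.beta(0.3, 3.0, size=400)

def judged_top1(p, K, tpr=1.0, fpr=0.0):
    """Probability that at least one of K sampled responses is flagged
    by the judge, assuming judge errors are independent across attempts."""
    p_flag = tpr * p + fpr * (1.0 - p)        # per-attempt flag probability
    return float(np.mean(1.0 - (1.0 - p_flag) ** K))

one_shot = float(np.mean(p_true))             # the one-shot estimand
top1_clean = judged_top1(p_true, K=50)        # aggregation inflation alone
top1_noisy = judged_top1(p_true, K=50, tpr=0.90, fpr=0.05)  # compounded
```

Aggregation alone multiplies the apparent ASR severalfold; adding a modest per-attempt judge FPR pushes the Top-1 ASR toward 1 even for prompts that never truly succeed.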

The three panels illustrate why naive ASR comparisons are often misleading. Left: Top-1 aggregation inflates ASR rapidly with \(K\). Even when the true one-shot success probability is low (blue dashed line), resampling 500 times drives the Top-1 ASR near 1.0. Comparing a Top-1 ASR to a one-shot ASR is comparing different estimands. Center: differential judge error creates spurious safety differences between models with identical true vulnerability. As the judge’s FPR on Model B increases (e.g., due to hallucinated outputs that trigger false positives), Model B appears increasingly less safe even though its true \(\alpha\) equals Model A’s. Right: the compound effect of Top-1 aggregation and judge error—points above the diagonal show prompts where the noisy judge inflates the Top-1 ASR beyond its true value.

These results have direct practical implications for red-teaming practice:

  1. Always report the aggregation rule (one-shot, Top-1 of \(K\), best-of-\(T\)) alongside the ASR value. Comparisons across aggregation rules are not comparisons of safety.
  2. Assess judge agreement disaggregated by target model and attack method. If the judge’s error rates differ, the comparison is confounded by measurement error.
  3. Audit prompt validity for content representativeness—do the “harmful” prompts actually represent genuine policy violations?
  4. Report confidence intervals that account for both sampling variability and judge error.

9.3 Adversarial Robustness Evaluation

9.3.1 Robustness as a Latent Trait

Is adversarial robustness the same construct as standard accuracy? If a model is good at answering questions correctly, does it follow that the model is also good at resisting adversarial attacks? Empirically, the answer is often no. Zellers et al. (2019) demonstrated “adversarial filtering”—selecting items that are specifically difficult for a target model—a procedure that yields items on which standard accuracy and adversarial robustness diverge sharply.

We can formalize this as a multidimensional IRT model. Let each model \(j\) have two latent traits: \(\theta_j^{(\text{std})}\) (standard accuracy ability) and \(\theta_j^{(\text{adv})}\) (adversarial robustness). For standard items, the response probability depends primarily on \(\theta_j^{(\text{std})}\). For adversarial items, it depends on \(\theta_j^{(\text{adv})}\):

\[ P(X_{ij} = 1) = \frac{1}{1 + \exp\bigl(-(\alpha_i^{(\text{std})} \theta_j^{(\text{std})} + \alpha_i^{(\text{adv})} \theta_j^{(\text{adv})} - \beta_i)\bigr)} \tag{9.3}\]

For a standard item, \(\alpha_i^{(\text{std})}\) is large and \(\alpha_i^{(\text{adv})} \approx 0\). For an adversarial item, \(\alpha_i^{(\text{adv})}\) is large and \(\alpha_i^{(\text{std})}\) may be small or moderate. The correlation \(\rho(\theta^{(\text{std})}, \theta^{(\text{adv})})\) is an empirical quantity: a low correlation means adversarial robustness is genuinely a distinct dimension of model capability.

9.3.2 Perturbation Spaces and Attack Taxonomies

Formalizing adversarial evaluation requires specifying the perturbation space—the set of transformations an adversary can apply:

Definition: Perturbation Space

A perturbation space \(\mathcal{P}\) for an evaluation item \(x\) is a set of semantics-preserving transformations:

\[ \mathcal{P}(x) = \{ x' : d(x, x') \leq \epsilon, \; \text{sem}(x') = \text{sem}(x) \} \]

where \(d(\cdot, \cdot)\) is a distance metric, \(\epsilon\) is the perturbation budget, and \(\text{sem}(\cdot)\) extracts the semantic content. A model is robust at item \(x\) if it responds correctly for all \(x' \in \mathcal{P}(x)\).

For language models, defining \(\mathcal{P}\) is harder than for image classifiers (where \(\ell_p\)-ball perturbations are standard). Common perturbation types include:

  • Paraphrase invariance: Rephrasing the question while preserving meaning.
  • Format invariance: Changing the presentation format (markdown, plain text, numbered lists).
  • Prompt injection: Embedding adversarial instructions within the input.
  • Multi-turn escalation: Gradually steering the conversation toward unsafe territory.

Each perturbation type defines a different dimension of the attack space. A comprehensive adversarial evaluation must sample from multiple perturbation types, analogous to the content validity requirement that a test must sample from the full construct domain.
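The worst-case semantics of the perturbation-space definition can be illustrated with a toy example (the transforms, items, and tiny "models" below are all hypothetical):

```python
# Toy perturbation space: a few semantics-preserving string transforms
# (real perturbation spaces are far richer; everything here is illustrative).
perturbations = [
    lambda q: q,                          # identity
    lambda q: q.lower(),                  # format invariance (casing)
    lambda q: q + "  ",                   # trailing whitespace
    lambda q: "Please answer: " + q,      # benign paraphrase-style wrapper
]

def is_robust(model, item, expected):
    """Robust at `item` iff the model answers correctly under EVERY
    perturbation in the space (the worst-case definition above)."""
    return all(model(pert(item)) == expected for pert in perturbations)

# A brittle toy "model" keyed to one exact phrasing, and a hardened one.
answer_key = {"What is 2+2?": "4"}
brittle = lambda q: answer_key.get(q, "unknown")
hardened = lambda q: "4" if "2+2" in q else "unknown"

r_brittle = is_robust(brittle, "What is 2+2?", "4")     # fails under casing
r_hardened = is_robust(hardened, "What is 2+2?", "4")   # survives all four
```

A single correct answer on the unperturbed item says nothing about robustness; only the minimum over the perturbation set does.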

9.3.3 Adaptive Adversarial Testing

The connection to computerized adaptive testing (Section 4.2.2) is natural: instead of administering a fixed battery of adversarial items, we can adapt the selection of attacks based on the model’s responses.

Recall from Section 4.2.2 that the optimal item selection rule in CAT maximizes Fisher information at the current ability estimate:

\[ i^* = \arg\max_i \; I_i(\hat{\theta}_j^{(\text{adv})}) \]

In the adversarial context, this means selecting the attack whose difficulty is closest to the model’s current estimated robustness. If a model easily resists a moderate attack, we escalate to a harder one. If a model fails a moderate attack, we probe with easier attacks to find the boundary. This adaptive strategy is far more efficient than exhaustive testing: it concentrates evaluation effort in the informative region of the attack space.
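A minimal sketch of this selection rule for 2PL items (the item parameters are illustrative, not calibrated):

```python
import numpy as np

def fisher_info(theta, beta, alpha):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))
    return alpha ** 2 * p * (1.0 - p)

def select_next_attack(theta_hat, betas, alphas, used):
    """Pick the unused attack most informative at the current robustness
    estimate: the CAT rule of Section 4.2.2 applied to adversarial items."""
    info = fisher_info(theta_hat, betas, alphas)
    info[list(used)] = -np.inf                # never reuse an item
    return int(np.argmax(info))

betas = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])   # illustrative attack strengths
alphas = np.ones(5)

# For a model currently estimated at theta = 0, the most informative
# attack is the one whose strength is nearest the estimate.
first = select_next_attack(0.0, betas, alphas, used=set())
```

If the model resists, the updated \(\hat{\theta}^{(\text{adv})}\) rises and the next selected attack is harder; if it fails, the next attack is easier, homing in on the robustness frontier.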

Key Insight: Adversarial CAT vs. Random Red-Teaming

A random red-team evaluation that draws 100 attacks uniformly from an item bank wastes effort on items that are too easy (every model resists them) or too hard (every model fails). Adaptive adversarial testing with the same 100-item budget concentrates items near each model’s robustness frontier, producing tighter ability estimates. The efficiency gain is exactly the same as for standard CAT: adaptive testing can match the precision of a fixed-form test using roughly one-third to one-half as many items.

The following simulation demonstrates the key ideas: we generate adversarial and standard items, fit a two-dimensional model, and show that adversarial robustness is a distinct latent dimension from standard accuracy.
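A non-plotting sketch of such a simulation, computing the summary statistics behind the figure (the trait correlation of 0.3, the loading values, and the sample sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

n_models, n_items = 200, 80

# Weakly correlated latent traits (assumed correlation 0.3).
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_models)
theta_std, theta_adv = theta[:, 0], theta[:, 1]

# First half: standard items (load on theta_std); second half:
# adversarial items (load on theta_adv), per Equation 9.3.
is_adv = np.arange(n_items) >= n_items // 2
a_std = np.where(is_adv, 0.1, 1.5)
a_adv = np.where(is_adv, 1.5, 0.1)
beta = rng.normal(0.0, 1.0, size=n_items)

logit = np.outer(theta_std, a_std) + np.outer(theta_adv, a_adv) - beta
responses = rng.random((n_models, n_items)) < 1.0 / (1.0 + np.exp(-logit))

score_std = responses[:, ~is_adv].mean(axis=1)   # standard subscore
score_adv = responses[:, is_adv].mean(axis=1)    # adversarial subscore
r_scores = float(np.corrcoef(score_std, score_adv)[0, 1])
```

The subscore correlation stays close to the weak trait correlation, and each subscore tracks its own trait far better than the other one, which is the signature of a genuinely two-dimensional construct.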

The three panels reveal the structure of adversarial evaluation. Left: the true latent traits show that adversarial robustness (\(\theta^{(\text{adv})}\)) is only weakly correlated with standard accuracy (\(\theta^{(\text{std})}\))—a model that scores well on standard items may be vulnerable to adversarial attacks. Center: observed scores on standard and adversarial item subsets mirror this weak correlation, confirming that the two item types measure distinct constructs. Right: the item loading space shows clear separation between standard items (loading primarily on \(\theta^{(\text{std})}\)) and adversarial items (loading primarily on \(\theta^{(\text{adv})}\)), supporting the two-dimensional model.

This has practical consequences. A single-score benchmark that mixes standard and adversarial items produces a composite that confounds two dimensions. A model could achieve a high composite score through strong standard accuracy alone, masking poor adversarial robustness. Separate subscores—or better, a multidimensional IRT model—are needed to accurately characterize model capabilities.

9.4 Synthetic Data for Evaluation

9.4.1 Why Synthetic Data?

Human-authored evaluation items are expensive, slow to produce, and hard to scale. Annotating a single benchmark item can require domain experts, multiple rounds of review, and careful quality control. Meanwhile, the space of possible evaluation scenarios is vast—no fixed benchmark can cover it.

Synthetic data generated by language models offers an appealing alternative: items can be produced at scale, targeted to specific difficulty levels, and refreshed frequently to combat contamination. Perez et al. (2022) demonstrated that LLMs can generate adversarial red-team prompts that are competitive with human-authored attacks. But the measurement question remains: do synthetic items measure the same construct as human-authored items?

9.4.2 Validity Threats with Synthetic Data

Three systematic threats arise when using LLM-generated items for evaluation:

Definition: Evaluation Circularity

Evaluation circularity occurs when the same model family is used both to generate evaluation items and to be evaluated on those items. If GPT-4 generates items and GPT-4 is evaluated on them, the items may systematically avoid GPT-4’s failure modes—the model cannot probe its own blind spots.

  1. Distribution mismatch. Synthetic items may not match the distribution of real-world inputs. LLMs tend to generate “typical” examples from their training distribution, undersampling edge cases and rare phenomena. In IRT terms, the synthetic item bank may have a narrow difficulty distribution, concentrated near the mean, with insufficient coverage of the easy and hard extremes.

  2. Mode collapse. LLMs generating many items tend to produce repetitive patterns. The apparent diversity of a 10,000-item synthetic bank may be much lower than its nominal size, because items cluster in a few templates. This reduces the effective number of items and inflates reliability estimates.

  3. Evaluation circularity. When the item generator shares architectural or training lineage with the model being evaluated, systematic biases are introduced. The generator’s implicit model of “what is hard” may not match human intuitions, and the generated items may exploit the same patterns that the target model handles well.

These threats map onto the validity framework from Section 6.4. Distribution mismatch is a form of covariate shift (Section 7.3). Mode collapse reduces content validity (Section 6.3.1). Evaluation circularity introduces construct-irrelevant variance (Section 6.4.1).

9.4.3 Calibrating Synthetic Items

The key empirical question is whether synthetic items, once calibrated, have IRT parameters that are comparable to human-authored items. If a synthetic item has the same difficulty and discrimination as a matched human-authored item, then—from a measurement perspective—the two are interchangeable.

Formally, let \(\hat{\beta}_i^{(\text{syn})}\) and \(\hat{\beta}_i^{(\text{human})}\) be the estimated difficulties of paired items (one synthetic, one human-authored, targeting the same content). We say calibration transfers if:

\[ \hat{\beta}_i^{(\text{syn})} \approx \hat{\beta}_i^{(\text{human})} + c \tag{9.4}\]

where \(c\) is a constant offset (synthetic items may be systematically easier or harder). A constant offset is correctable; what matters is that the rank order and relative spacing of item difficulties are preserved.

The following simulation demonstrates calibration transfer and identifies conditions under which it breaks down.
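A sketch of the calibration-transfer check (the offset, noise levels, and sample sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

n_pairs = 40
beta_human = rng.normal(0.0, 1.5, size=n_pairs)    # human item difficulties

# Well-designed synthetic items: same rank order, a constant offset c,
# plus a little estimation noise (Equation 9.4).
c = -0.4                                            # assumed offset
beta_syn_good = beta_human + c + rng.normal(0.0, 0.2, size=n_pairs)

# Mode-collapsed synthetic items: difficulties crushed into a narrow
# band around zero, destroying the rank order entirely.
beta_syn_poor = rng.normal(0.0, 0.15, size=n_pairs)

r_good = float(np.corrcoef(beta_human, beta_syn_good)[0, 1])
r_poor = float(np.corrcoef(beta_human, beta_syn_poor)[0, 1])

# Decision rule from the text: if r > 0.8, use the synthetic items with
# a simple offset correction estimated on the paired sample.
offset_hat = float(np.mean(beta_syn_good - beta_human))
```

For the well-designed items, the estimated offset recovers \(c\) and a simple subtraction aligns the two difficulty scales; for the mode-collapsed items, no offset can repair the lost rank order.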

The three panels illustrate the calibration transfer problem. Left: well-designed synthetic items preserve the rank ordering of difficulty—the correlation with human item difficulties is high and the points cluster near the diagonal (with a small constant offset). Center: poorly designed synthetic items (mode-collapsed, narrow difficulty range) lose the rank ordering entirely. Right: the difficulty distributions reveal the mechanism: good synthetic items cover the same difficulty range as human items, while poor synthetic items collapse to a narrow band around zero.

The practical implication is that synthetic item quality must be validated against human-authored items before the synthetic items can be trusted for evaluation. A small human calibration sample (20–50 items) suffices to check whether calibration transfers. If the correlation between synthetic and human item difficulties is high (\(r > 0.8\)), the synthetic items can be used with a simple offset correction. If the correlation is low, the synthetic items are measuring something different and should not be substituted for human items.

9.5 Evaluation at Scale

9.5.1 The Hybrid Approach

The tension between synthetic and human evaluation resolves in a hybrid approach: use a large synthetic evaluation (high coverage, low cost per item) combined with a small human evaluation (gold-standard quality, high cost per item). The statistical challenge is combining the two data sources to produce an estimate that is better than either alone.

Let \(\hat{\mu}_{\text{syn}}\) be the mean performance estimated from \(N\) synthetic items, and \(\hat{\mu}_{\text{human}}\) be the mean performance estimated from \(n\) human-evaluated items (\(n \ll N\)). The synthetic estimate may be biased—due to evaluation circularity, mode collapse, or distribution mismatch—but it provides a useful proxy for the true performance.

9.5.2 Prediction-Powered Inference

Prediction-powered inference (PPI), introduced by Angelopoulos et al. (2023) and connected to the off-policy evaluation framework in Section 7.4.3, provides a principled way to combine the two data sources.

The key idea is simple. Let \(Y_i\) be the true (human-judged) response on item \(i\) and \(\hat{Y}_i\) be the synthetic (LLM-judged) response on the same item. We observe \(\hat{Y}_i\) for all \(N\) items but \(Y_i\) for only a subset of \(n\) items. The PPI estimator is:

\[ \hat{\mu}_{\text{PPI}} = \frac{1}{N}\sum_{i=1}^{N} \hat{Y}_i + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{Y}_i\bigr) \tag{9.5}\]

The first term is the synthetic estimate over the full dataset. The second term is a bias correction computed on the labeled subset: the average discrepancy between human and synthetic judgments. If the synthetic judgments are unbiased (\(\mathbb{E}[\hat{Y}_i] = \mathbb{E}[Y_i]\)), the correction term averages to zero and \(\hat{\mu}_{\text{PPI}} \approx \hat{\mu}_{\text{syn}}\), inheriting the low variance of the large synthetic sample. If the synthetic judgments are biased, the correction removes the bias, and the variance of \(\hat{\mu}_{\text{PPI}}\) is governed by the variance of the residual \(Y_i - \hat{Y}_i\), which is small when the synthetic judgments are good proxies.

Definition: Prediction-Powered Inference

Given a large unlabeled dataset of size \(N\) with predictions \(\{\hat{Y}_i\}_{i=1}^N\) and a small labeled dataset of size \(n\) with both predictions \(\{\hat{Y}_i\}\) and true labels \(\{Y_i\}\):

  1. The PPI estimator of the population mean is \(\hat{\mu}_{\text{PPI}} = \hat{\mu}_N + (\bar{Y}_n - \hat{\bar{Y}}_n)\), where \(\hat{\mu}_N = N^{-1}\sum_{i=1}^N \hat{Y}_i\), \(\bar{Y}_n = n^{-1}\sum_{i=1}^n Y_i\), and \(\hat{\bar{Y}}_n = n^{-1}\sum_{i=1}^n \hat{Y}_i\).
  2. The variance is \(\text{Var}(\hat{\mu}_{\text{PPI}}) = \frac{\sigma_{\hat{Y}}^2}{N} + \frac{\sigma_{Y - \hat{Y}}^2}{n}\).
  3. The confidence interval is \(\hat{\mu}_{\text{PPI}} \pm z_{\alpha/2} \sqrt{\frac{\hat{\sigma}_{\hat{Y}}^2}{N} + \frac{\hat{\sigma}_{Y-\hat{Y}}^2}{n}}\).

When the predictor \(\hat{Y}\) is a good proxy for \(Y\), the residual variance \(\sigma_{Y-\hat{Y}}^2\) is small, so PPI achieves the confidence interval width of the large sample with the unbiasedness guarantee of the small sample.

9.5.3 Active Evaluation

Not all items benefit equally from human evaluation. If synthetic and human judgments agree on easy items (where the model clearly succeeds or fails), the correction term for those items is near zero. Human evaluation effort is most valuable on ambiguous items—those where the synthetic judgment is uncertain or where the model’s response is borderline.

This motivates an active evaluation strategy: use information-theoretic criteria to select which items receive human judgment. Specifically, prioritize items where:

  1. The synthetic judge’s confidence is lowest (high entropy of \(P(\hat{Y}_i \mid X_i)\)).
  2. The predicted correction \(Y_i - \hat{Y}_i\) has high variance given the item’s features.

This connects to the optimal design literature and to the adaptive testing framework from Section 4.2.2: in both cases, we are selecting which measurements to make in order to maximize information per unit cost.
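The first criterion can be sketched directly. Given the synthetic judge's predicted probabilities for a binary outcome, rank items by the entropy of that prediction and send the most uncertain ones for human judgment (function and variable names are illustrative):

```python
import numpy as np

def select_for_human_eval(p_hat, n_budget):
    """Rank items by the entropy of the judge's probability P(Y_hat = 1 | X)
    and return the indices of the n_budget most uncertain items."""
    p = np.clip(p_hat, 1e-9, 1 - 1e-9)  # guard against log(0)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return np.argsort(entropy)[::-1][:n_budget]

# Toy example: 1000 items with judge confidences spread over (0, 1).
rng = np.random.default_rng(1)
p_hat = rng.uniform(0.0, 1.0, size=1000)
chosen = select_for_human_eval(p_hat, n_budget=50)
```

Because binary entropy peaks at \(p = 0.5\), the selected items are exactly those where the judge is closest to a coin flip, which is where the human-synthetic discrepancy \(Y_i - \hat{Y}_i\) is most informative.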

The following simulation demonstrates PPI-style correction: a large synthetic evaluation combined with a small human evaluation corrects bias and narrows confidence intervals.
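A minimal self-contained sketch of such a simulation follows. The bias mechanism (a judge that flips errors asymmetrically) and all numerical settings are illustrative assumptions, not the book's exact setup:

```python
import numpy as np

rng = np.random.default_rng(42)
N, n, mu_true = 10_000, 150, 0.70

# Gold (human) judgments for all N items; only the first n are observed.
Y = rng.binomial(1, mu_true, size=N).astype(float)

# Synthetic judge: flips errors asymmetrically, inflating apparent success.
flip_up = (Y == 0) & (rng.random(N) < 0.30)  # false "pass" on 30% of fails
flip_dn = (Y == 1) & (rng.random(N) < 0.05)  # false "fail" on 5% of passes
Y_hat = np.where(flip_up, 1.0, np.where(flip_dn, 0.0, Y))

mu_syn = Y_hat.mean()            # biased upward, tiny variance
mu_human = Y[:n].mean()          # unbiased, wide CI
resid = Y[:n] - Y_hat[:n]        # human-synthetic discrepancy on labeled set
mu_ppi = mu_syn + resid.mean()   # Eq. (9.5)

se_human = Y[:n].std(ddof=1) / np.sqrt(n)
se_ppi = np.sqrt(Y_hat.var(ddof=1) / N + resid.var(ddof=1) / n)
```

Under these settings the synthetic-only estimate sits well above the true accuracy of 0.70, while the PPI estimate removes the bias and its standard error is smaller than the human-only standard error because the residual variance is smaller than the outcome variance.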

The results demonstrate the three-way comparison. Left panel: the synthetic estimator (red) has a narrow confidence interval but is biased upward—it does not cover the true accuracy (dashed line). The human-only estimator (blue) is centered on the true value but has a wide confidence interval due to the small sample. The PPI estimator (green) achieves both: it is approximately unbiased (the correction removes the synthetic bias) and has a confidence interval substantially narrower than the human-only estimate. Right panel: the CI widths confirm the efficiency gain—PPI inherits the narrow width from the large synthetic sample while maintaining the unbiasedness guarantee from the human sample.

9.6 Putting It Together: An Adversarial Evaluation Pipeline

The tools developed in this chapter and the preceding ones can be composed into a principled adversarial evaluation pipeline. We describe the pipeline as a sequence of stages, each grounded in the measurement framework.

Stage 1: Build a Red-Team Item Bank. Start with a test specification (Section 9.1) that defines the attack taxonomy and coverage requirements. Generate candidate items through a combination of expert red-teamers (depth) and LLM-based generation (breadth; Section 9.4). Calibrate item parameters by administering the full bank to a reference panel of models, fitting a multidimensional IRT model (Section 9.3) to estimate item difficulties and discriminations.

Stage 2: Adversarial Adaptive Testing. For each new model, run computerized adaptive testing (Section 4.2.2) using the calibrated item bank. The adaptive algorithm selects items that maximize Fisher information at the model’s current estimated robustness level, concentrating effort in the informative region of the attack space. After 30–50 adaptively selected items, produce a robustness estimate \(\hat{\theta}_j^{(\text{adv})}\) with a known standard error.
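The select-respond-update loop in Stage 2 can be sketched under a Rasch model, where Fisher information is maximized by the item whose difficulty is closest to the current ability estimate. The function names, the response oracle, and the numerical settings are all illustrative:

```python
import numpy as np

def rasch_p(theta, beta):
    """Probability of resisting an attack of difficulty beta."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def adaptive_robustness_test(betas, respond, n_items=30):
    theta, used, resp = 0.0, [], []
    for _ in range(n_items):
        # Rasch Fisher information peaks where difficulty == theta, so
        # pick the unused item with difficulty closest to the estimate.
        mask = np.ones(len(betas), bool)
        mask[used] = False
        j = int(np.flatnonzero(mask)[np.argmin(np.abs(betas[mask] - theta))])
        used.append(j)
        resp.append(respond(j))
        # Newton-Raphson MLE for theta, clipped to stay finite when the
        # early responses are all successes or all failures.
        for _ in range(25):
            p = rasch_p(theta, betas[used])
            step = np.sum(np.array(resp) - p) / max(np.sum(p * (1 - p)), 1e-6)
            theta = float(np.clip(theta + step, -4.0, 4.0))
    se = 1.0 / np.sqrt(np.sum(p * (1 - p)))  # asymptotic standard error
    return theta, se

# Toy run: a model with true robustness 1.2 against a 200-item bank.
rng = np.random.default_rng(7)
betas = rng.normal(0.0, 1.5, size=200)
theta_true = 1.2
respond = lambda j: int(rng.random() < rasch_p(theta_true, betas[j]))
theta_hat, se = adaptive_robustness_test(betas, respond)
```

After 30 items the estimate typically lands within a couple of standard errors of the true robustness, illustrating the "known standard error" claim in the stage description.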

Stage 3: Synthetic Augmentation. Augment the human-calibrated item bank with LLM-generated items for broader coverage. Validate calibration transfer (Section 9.4.3) by checking that synthetic item difficulties correlate with human item difficulties (\(r > 0.8\)). Use the synthetic items for coarse screening and the calibrated items for precise measurement.

Stage 4: PPI Correction. For aggregate statistics (e.g., “what fraction of red-team attacks does the model resist?”), use prediction-powered inference (Section 9.5.2) to combine synthetic-judged responses (large \(N\), potentially biased) with human-judged responses (small \(n\), unbiased). This produces confidence intervals that are both narrow and honest.

Stage 5: Conformal Coverage. Apply conformal prediction (Section 7.5) to construct prediction sets for individual items: “with 95% confidence, this model’s probability of resisting this attack category is in \([0.3, 0.7]\).” Under covariate shift between the calibration and deployment item distributions, use weighted conformal prediction (Section 7.4.4) to maintain coverage guarantees.
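A minimal split-conformal sketch for a binary per-item outcome (1 = attack resisted, 0 = not) illustrates Stage 5. The calibrated-judge setup and all names are assumptions for the demo; the weighted variant for covariate shift is not shown:

```python
import numpy as np

def conformal_label_sets(p_cal, y_cal, p_test, alpha=0.05):
    """Split conformal prediction for a binary label.

    Nonconformity score = 1 - predicted probability of the true label.
    The (1 - alpha) calibration quantile determines which labels enter
    each test item's prediction set."""
    scores = 1.0 - np.where(y_cal == 1, p_cal, 1.0 - p_cal)
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    return [{y for y, py in ((1, p), (0, 1.0 - p)) if 1.0 - py <= q}
            for p in p_test]

# Toy check of marginal coverage with a well-calibrated judge.
rng = np.random.default_rng(3)
p_all = rng.uniform(0.05, 0.95, size=2500)
y_all = (rng.random(2500) < p_all).astype(int)
sets = conformal_label_sets(p_all[:500], y_all[:500], p_all[500:])
coverage = np.mean([y in s for y, s in zip(y_all[500:], sets)])
```

Items where the judge is confident get singleton sets; ambiguous items get the full set \(\{0, 1\}\), which is the honest answer when the calibration data cannot distinguish the labels at the requested confidence level.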

This pipeline connects to the causal audit framework from Section 7.6: each stage addresses a potential threat to validity. The item bank ensures content validity. Adaptive testing ensures measurement precision. Synthetic augmentation addresses scalability. PPI correction addresses bias. Conformal prediction addresses uncertainty quantification. Together, they produce adversarial evaluations that are not only thorough but psychometrically defensible.

Key Insight: Composability of Measurement Tools

The power of the measurement framework developed across this book is its composability. Red-teaming (Section 9.1) produces items; IRT (Section 9.1.2) calibrates them; CAT (Section 4.2.2) administers them efficiently; synthetic data (Section 9.4) scales them; PPI (Section 9.5.2) corrects for bias; conformal prediction (Section 7.5) quantifies uncertainty. Each tool solves a specific problem. The pipeline composes them into a system that is greater than the sum of its parts.

9.7 Discussion Questions

  1. Coverage vs. depth in red-teaming. A company has a fixed budget of 500 person-hours for red-teaming a new model. Should they hire 50 non-expert testers for 10 hours each, or 5 expert red-teamers for 100 hours each? Frame your answer in terms of content validity and item difficulty.

  2. Evaluation circularity. A team uses GPT-4 to generate evaluation items for GPT-4o. They argue that since GPT-4o is a different model, there is no circularity problem. Under what conditions is this argument valid? When might it fail?

  3. Adversarial robustness as a latent trait. Is it meaningful to assign a single “adversarial robustness” score to a model, or should robustness always be reported per attack category? Relate your answer to the unidimensionality assumption in IRT.

  4. PPI and evaluator agreement. Prediction-powered inference assumes that the synthetic and human judgments are measured on the same scale. What happens if the LLM judge uses a different scoring rubric (e.g., binary pass/fail) than the human judges (e.g., 1–5 Likert scale)? How would you modify the PPI framework?

  5. Dynamic item banks. Red-team item banks become less useful over time as models are trained to resist known attacks. How should an item bank be maintained? Draw an analogy to test security practices in educational testing.

  6. Standardizing ASR comparisons. A red-teaming leaderboard wants to rank jailbreak methods by effectiveness. Using the framework from Section 9.2, propose a standardized evaluation protocol that ensures conceptual coherence and measurement validity. What should be fixed across submissions (aggregation rule, judge, prompt set)? What can vary?

  7. Distributional vs. aggregate reporting. Schaeffer et al. (2025) show that the aggregate Top-1 ASR is fully determined by the distribution of per-prompt success probabilities. Should red-teaming studies report the full distribution of per-prompt success rates rather than (or in addition to) a single ASR number? What would this look like in practice, and how does it connect to the IRT perspective where per-prompt success is \(\sigma(\theta - \beta_i)\)?

9.8 Bibliographic Notes

The use of measurement science in adversarial evaluation is relatively recent. Chouldechova et al. (2026) provided a foundational critique of ASR comparisons, showing that failures of conceptual coherence (different aggregation rules) and measurement validity (differential judge error) undermine most published ASR comparisons. Their work formalizes red-teaming within the social science measurement framework of Adcock and Collier (2001), connecting to the validity concepts developed in Chapter 6. Wallach et al. (2025) further argue that evaluating generative AI systems is fundamentally a social science measurement challenge.

Perez et al. (2022) pioneered the use of LLMs for automated red-teaming, generating adversarial prompts at scale and analyzing the distribution of discovered failures. Ganguli et al. (2022) provided one of the most comprehensive structured red-teaming efforts, demonstrating scaling behaviors and establishing best practices for human red-team evaluations.

The adversarial filtering approach—using models to select items that fool current systems—was introduced by Zellers et al. (2019) in the context of the HellaSwag benchmark, where adversarial filtering produces items that are trivially easy for humans but challenging for language models. Bartolo et al. (2021) extended this to model-in-the-loop adversarial annotation, in which human annotators write reading-comprehension questions specifically crafted to fool the current model, dynamically targeting its weaknesses.

Ribeiro et al. (2020) proposed a structured testing methodology for NLP models organized around capabilities and test types, providing a practical template for adversarial item banks. Their minimum functionality tests, invariance tests, and directional expectation tests correspond to items with different psychometric properties.

Prediction-powered inference was formalized by Angelopoulos et al. (2023), who showed that combining a large prediction dataset with a small labeled dataset yields valid statistical inference with the confidence interval width governed by the prediction-label residual variance rather than the label variance. The connection to evaluation is natural: synthetic judgments serve as predictions and human judgments as labels.

The multidimensional IRT framework used in Section 9.3 builds on foundational work in psychometric modeling. For a comprehensive treatment, see Embretson and Reise (2000). The application of IRT to AI evaluation is surveyed in Martínez-Plumed et al. (2024).

The distributional theory connecting per-problem success rates to aggregate power-law scaling is developed by Schaeffer et al. (2025), who show that \(-\log(\text{pass}@k) \sim k^{-\alpha}\) if and only if the distribution of single-attempt success probabilities has a power-law left tail with exponent \(\alpha - 1\). This resolves the apparent paradox between exponential per-problem scaling and polynomial aggregate scaling. Truong et al. (2025) integrate IRT into the scaling law framework via Beta-IRT, which models empirical probability responses (rather than binary) using a Beta loss. Their Item Response Scaling Laws reduce parameter complexity from \(O(M \times N)\) to \(O(M + N)\) by factorizing model ability from question difficulty, enabling cross-benchmark transfer of ability estimates and achieving comparable scaling predictions with 99.9% fewer evaluation queries.

9.9 Exercises

Exercise 1 (Red-team test specification). Write a test specification for red-teaming a medical chatbot. Your specification should include: (a) an attack taxonomy with at least 5 categories, (b) 3 items per category with target difficulty levels (easy, medium, hard), and (c) a scoring rubric. Discuss which categories pose the greatest validity threats if undersampled.

Exercise 2 (Adversarial IRT simulation). Extend the simulation in Section 9.3 to \(K = 3\) latent dimensions: standard accuracy, adversarial robustness, and prompt sensitivity. Generate 60 items per dimension. Fit a unidimensional IRT model to all 180 items and compare the ability estimates to the true three-dimensional traits. Show that the unidimensional model produces misleading rankings when the inter-dimension correlations are low (\(\rho < 0.3\)).

Exercise 3 (Calibration transfer). In the synthetic calibration simulation (Section 9.4.3), the “good” synthetic items are generated by adding Gaussian noise to the human item parameters. Replace this with a more realistic model: generate synthetic items by sampling from a narrower difficulty distribution (mimicking mode collapse) with probability \(p\) and from the true distribution with probability \(1-p\). Plot the calibration transfer correlation as a function of \(p\) and identify the threshold at which transfer breaks down (\(r < 0.8\)).

Exercise 4 (PPI with varying bias). Modify the PPI simulation to sweep the synthetic bias from \(-0.15\) to \(+0.15\). For each bias level, compute (a) the PPI point estimate, (b) the 95% confidence interval, and (c) the coverage probability (does the CI contain the true value?). Verify that PPI maintains nominal coverage regardless of the bias magnitude, while the synthetic-only estimator’s coverage degrades as bias increases.

Exercise 5 (Adaptive adversarial testing). Implement a simple adaptive adversarial testing algorithm. Start with a calibrated item bank of 200 adversarial items with known Rasch difficulties. For each model: (a) initialize \(\hat{\theta}^{(\text{adv})} = 0\), (b) select the item with difficulty closest to \(\hat{\theta}^{(\text{adv})}\), (c) simulate the response, (d) update \(\hat{\theta}^{(\text{adv})}\) using maximum likelihood, (e) repeat for 30 items. Compare the estimation error to a non-adaptive baseline (30 random items) across 100 simulated models.

Exercise 6 (Active evaluation design). In the PPI framework, suppose you can choose which \(n\) items to send for human evaluation (instead of selecting randomly). Propose an active selection criterion based on the synthetic judge’s uncertainty. Implement and compare against random selection: does targeted human evaluation reduce the PPI confidence interval width?

Exercise 7 (ASR aggregation and judge error). Consider a model with true per-prompt success probabilities \(p_i \sim \text{Beta}(a, b)\) for \(i = 1, \ldots, 200\). (a) Using the result from Schaeffer et al. (2025), show that the expected Top-1 of \(K\) ASR satisfies \(-\log(\mathbb{E}[\text{ASR}_{\text{Top-1}}(K)]) \sim K^{-a}\) for large \(K\), so the left-tail parameter \(a\) controls the power-law exponent. (b) Now suppose the judge has TPR = 0.9 and FPR = \(\phi\). Derive the expected observed Top-1 ASR as a function of \(\phi\) and \(K\). (c) Simulate the scenario for \(a = 0.3, b = 3, K \in \{1, 10, 50, 200\}\), and \(\phi \in \{0, 0.05, 0.10, 0.15\}\). Create a heatmap showing how the compound effect of aggregation and judge error inflates the observed ASR. At what \((K, \phi)\) does the observed ASR exceed 0.9 even though the true one-shot ASR is below 0.1? (d) Fit a Beta distribution to the per-prompt success rates and use the distributional estimator to predict \(\text{ASR}_{\text{Top-1}}(K)\) for \(K\) up to 10,000. Compare to the empirical estimate: how much compute does the distributional approach save?

Exercise 8 (End-to-end pipeline). Implement the full adversarial evaluation pipeline from Section 9.6 for a simulated scenario. Generate a 500-item adversarial bank with known 2PL parameters. Run adaptive testing on 50 models. Augment with 2000 synthetic items (with 10% mode-collapsed items). Apply PPI correction using 100 human-evaluated items. Report: (a) ability estimates with standard errors, (b) calibration transfer correlation, and (c) the bias reduction from PPI.

References

Adcock, Robert, and David Collier. 2001. “Measurement Validity: A Shared Standard for Qualitative and Quantitative Research.” American Political Science Review 95 (3): 529–46.
Angelopoulos, Anastasios N., Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. 2023. “Prediction-Powered Inference.” Science 382 (6671): 669–74.
Bartolo, Max, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2021. “Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension.” Transactions of the Association for Computational Linguistics 8: 662–78.
Chouldechova, Alexandra, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, and Hanna Wallach. 2026. “Comparison Requires Valid Measurement: Rethinking Attack Success Rate Comparisons in AI Red Teaming.” arXiv Preprint arXiv:2502.00000.
Embretson, Susan E., and Steven P. Reise. 2000. Item Response Theory for Psychologists. Lawrence Erlbaum Associates.
Ganguli, Deep, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, et al. 2022. “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.” https://arxiv.org/abs/2209.07858.
Martínez-Plumed, Fernando, Pietro Baroni, W. René Carus, and Jose Hernandez-Orallo. 2024. “Item Response Theory in AI: Analysing Machine Learning Classifiers at the Instance Level.” Artificial Intelligence 271: 18–42.
Perez, Ethan, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, et al. 2022. “Red Teaming Language Models with Language Models.” arXiv Preprint arXiv:2202.03286.
Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–12.
Schaeffer, Rylan, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. 2025. “How Do Large Language Monkeys Get Their Power (Laws)?” arXiv Preprint arXiv:2502.17578.
Truong, Sang, Yuheng Tu, Rylan Schaeffer, and Sanmi Koyejo. 2025. “Item Response Scaling Laws.” arXiv Preprint.
Wallach, Hanna, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, et al. 2025. “Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge.” arXiv Preprint arXiv:2502.00561.
Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. “HellaSwag: Can a Machine Really Finish Your Sentence?” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4791–4800.