8 Ecosystem
By the end of this chapter, the reader will be able to:
- Map the stakeholders of a modern AI benchmark—builders, evaluators/platforms, adopters, investors, and regulators—and identify the decision each one delegates to the metric and the incentive each has to distort it.
- Explain why an expanding audience turns Goodhart’s Law from an academic curiosity into a structural force, using the economics of signaling and the market for lemons.
- Distinguish the two directions of gaming: builders who game the metric and evaluators/platforms who game the audience, and classify real episodes into a mechanism-based taxonomy.
- Connect each gaming channel to a design lever from Chapter 9, and identify which channels remain open research problems.
- Articulate the governance frontier: multi-principal evaluation, the role of open-source models, and how to organize a trustworthy third-party evaluation ecosystem.
This chapter can be covered in 2 lectures (75–90 minutes each):
Lecture 1: The Ecosystem and Its Failures
- From scoreboard to target: the expanding audience (15 min)
- Benchmarks as signaling and the market for lemons (20 min)
- A taxonomy of gaming: builder-side and evaluator-side (35 min)
Lecture 2: Governance
- The gaming-channel / design-lever map (20 min)
- Multi-principal evaluation and the strategic evaluator (25 min)
- The role of open-source; organizing the community (25 min)
This chapter is the diagnosis that motivates the design of Chapter 9. We ask three questions in sequence: who are the actors (Section 8.2), what game do they play (Section 8.3), and how should the system be governed (Section 8.5). The technical machinery for re-engineering the game—randomization, distribution correction, holdout mechanisms, welfare-aligned metrics—is developed in Chapter 9; here we establish why that machinery is needed and which failures it must close.
8.1 From Scoreboard to Target
“When a measure becomes a target, it ceases to be a good measure.” Goodhart’s Law (Goodhart 1975), in the wording popularized by Strathern (1997)
A benchmark begins life as an internal instrument: a model developer’s private scoreboard for tracking research progress. In that regime Goodhart’s Law is mostly latent—the only party who can optimize against the metric is the same party who benefits from an honest reading of it. The structural shift of the past few years is that the audience of a benchmark has expanded far beyond the developer who builds the model.
Today a single leaderboard number is consumed by:
- Builders, who use it to allocate research effort and to claim state-of-the-art.
- Adopters (enterprises, agencies, application developers), who use it as a procurement and due-diligence signal in build-vs-buy and vendor-selection decisions.
- Investors, who read benchmark leadership as a proxy for a technical moat when pricing funding rounds.
- Governments and regulators, who increasingly reference capability and safety evaluations in policy. (TODO: cite the specific provisions to lean on—e.g., the EU AI Act’s general-purpose-AI obligations and codes of practice, and the relevant US framework. Verify current status as of the writing date; the policy landscape moves quickly.)
Once any of these parties can move resources on the basis of a score, the developer’s reputation, funding, or regulatory standing depends on that score—and the data-generating process is no longer fixed. This is exactly the setting in which Goodhart’s adversarial variant dominates (Section 9.1). The remainder of this chapter takes the audience expansion as the root cause and traces its consequences.
A public benchmark is a signaling instrument. As its audience expands beyond builders, both sides of the market acquire an incentive to game it: builders game the metric, and benchmark providers game the audience. Restoring trust is therefore not only a statistical problem (Chapter 5, Chapter 1) but an institutional design problem—the subject of Chapter 9 and of this chapter’s governance discussion.
8.2 The Actors and Their Incentives
We organize the ecosystem as a set of principals, each delegating a decision to the metric, each with a characteristic way of distorting it.
| Actor | Decision delegated to the metric | Incentive to distort | Primary failure mode |
|---|---|---|---|
| Builder / developer | Where to allocate training effort; what to claim | Reputation, funding, market share | Optimize to the test (Section 8.3.1) |
| Evaluator / platform | What to publish and how to rank | Relevance, traffic, revenue, own-model favoritism | Game the audience (Section 8.3.2) |
| Adopter | Build vs. buy; which vendor | Lower procurement risk cheaply | Over-trust a decoupled proxy |
| Investor | How to price a round | Back the apparent leader | Reward narrative over capability |
| Regulator | Whether a system is safe to deploy | Public safety, accountability | Incentivize sandbagging (Section 8.3.1) |
The classical evaluation model—and the formal game of Section 9.2—contains only the first two rows, and assumes the evaluator is a benevolent social planner. The central move of this chapter is to relax that assumption: the lower three rows add principals on the demand side, and the second row, once the evaluator has its own commercial incentives, becomes strategic. This is the multi-principal setting we return to in Section 8.5.1.
8.2.1 Benchmarks as Signaling and the Market for Lemons
Why should adopters and investors, parties who never train a model, rely on benchmarks at all? Because a benchmark is a certification technology that reduces an information asymmetry. The buyer cannot directly observe a model’s quality before deploying it; the seller can. This is the setting of Akerlof’s (1970) market for lemons: absent a credible quality signal, buyers rationally price every seller at the average quality, and high-quality producers, who cannot recover their costs at that price, are driven out. A trustworthy benchmark is the device that lets a high-quality model separate itself, the AI analogue of Spence’s (1973) job-market signaling.
This framing does real work in the chapter:
- It explains the adopter and investor rows of Table 8.1 without hand-waving: both are receivers in a signaling game, differing only in the decision they take.
- It makes the cost of gaming precise. Gaming works by making the cheap signal (a tuned benchmark score) imitate the expensive one (genuine capability). When gaming is cheap relative to real improvement, low-quality builders can afford to imitate, the separating equilibrium collapses back to pooling, and the benchmark stops carrying information for any downstream audience, even though it may still be locally useful for guiding a builder’s own research.
- It connects directly to Chapter 9: the design levers there (randomization, correction, holdout, welfare-aligned metrics) are all attempts to raise the cost of the cheap signal until separation is restored.
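The collapse from separating to pooling can be stated in one line. The following is a minimal worked condition under assumptions that are ours, not the chapter’s: two builder types, a receiver who pays a premium $V > 0$ for a model it believes to be high quality, a cost $c_H$ for a high-quality builder to earn the high score through genuine capability, and a cost $c_L$ for a low-quality builder to reach the same score by gaming. All four symbols are illustrative.

$$
\underbrace{V - c_H \ge 0}_{\text{high type is willing to signal}}
\qquad\text{and}\qquad
\underbrace{V - c_L \le 0}_{\text{low type prefers not to imitate}}
\quad\Longleftrightarrow\quad
c_H \le V \le c_L .
$$

Under these assumptions, separation survives only while gaming costs the low type at least as much as the premium is worth. Once $c_L < V$, imitation pays, both types post the high score, and the score no longer distinguishes them. Read this way, each design lever of Chapter 9 is an attempt to push $c_L$ back above $V$.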
Figure. A deliberately simple illustration of the thesis: each model’s benchmark score is its true utility plus a gaming term plus noise. As gaming becomes cheaper (x-axis), the rank correlation between the benchmark and true utility decays—the metric decouples from what downstream audiences care about. (TODO: replace with a proper separating-vs-pooling signaling figure to make the Spence connection explicit.)
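The illustration described in the caption can be reproduced with a short simulation. The sketch below is a minimal version under assumptions that are not in the original: standard-normal true utility, a non-negative per-model gaming propensity, and a `gaming_scale` parameter standing in for how cheap gaming is (larger means cheaper). All function and parameter names are hypothetical.

```python
import random


def ranks(values):
    """Rank of each value (0 = smallest). Ties are not handled; continuous draws make them negligible."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for rx and ry when there are no ties
    return cov / var


def decoupling_curve(gaming_scales, n_models=200, noise_sd=0.2, seed=0):
    """Return (gaming_scale, rank correlation between benchmark score and true utility)."""
    rng = random.Random(seed)
    utility = [rng.gauss(0, 1) for _ in range(n_models)]          # what audiences care about
    propensity = [abs(rng.gauss(0, 1)) for _ in range(n_models)]  # assumed: unrelated to utility
    noise = [rng.gauss(0, noise_sd) for _ in range(n_models)]     # measurement noise
    curve = []
    for scale in gaming_scales:
        # benchmark score = true utility + gaming term + noise
        score = [u + scale * g + e for u, g, e in zip(utility, propensity, noise)]
        curve.append((scale, spearman(score, utility)))
    return curve


if __name__ == "__main__":
    for scale, rho in decoupling_curve([0, 0.5, 1, 2, 5]):
        print(f"gaming_scale={scale:>4}: rank correlation {rho:.2f}")
```

Holding the utility, propensity, and noise draws fixed across scales means that only the size of the gaming term changes along the curve, which isolates the effect the figure is meant to show.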
8.3 A Taxonomy of Gaming
The single most important organizing decision for this chapter is to classify gaming by mechanism, not by anecdote. Two episodes that look different in the press (a contaminated math benchmark, a leaderboard scandal) may exploit the same channel, while episodes that look alike may exploit different channels that need different defenses. We split the taxonomy by who is gaming whom.
8.3.1 Builders Game the Metric
This is the direction modeled formally in Section 9.2. Each sub-type names a distinct exploitable channel:
- Contamination / memorization. Training data includes (or overlaps) evaluation items, so the benchmark measures recall rather than capability. Concrete: the OpenLLM Leaderboard’s eventual archival amid contamination and overfitting concerns (Aidar Myrzakhan 2024); clean evidence from holdout re-tests such as the GSM1k re-examination of grade-school arithmetic (Zhang et al. 2024); and detection methods such as black-box contamination proofs (Oren et al. 2023). Channel: the test items are public text that leaks into pretraining.
- Adaptive overfitting via repeated querying. Even without seeing the test items, repeated submissions against a fixed holdout let a builder overfit the leaderboard. This is the failure the Ladder mechanism and reusable-holdout techniques were designed to bound (Blum and Hardt 2015). Channel: information leaks through the score itself across rounds (Section 9.4).
- Strategic submission and selection. A builder deploys models tuned for the queries a platform tends to surface, or selectively reports favorable runs. (TODO: cite the Llama-4 Chatbot Arena episode (Singh et al. 2025) and the Terminal-Bench case once the references are settled.) Channel: the builder learns and targets the evaluation distribution.
- Strategic post-training. Ranking-aware fine-tuning shifts a model up a leaderboard without broad capability gains. (TODO: cite “Model Rankings under Strategic Post-Training,” strategicposttraining2026; verify authors/venue.) Channel: optimization directed at the rank statistic rather than the construct.
- Sandbagging (the inverse). When the audience is a regulator, the incentive flips: a developer (or a model) may underperform on a dangerous-capability evaluation to appear safe (Weij et al. 2024). This is why government-facing evaluations need different safeguards than capability leaderboards; see the frontier dangerous-capability evaluation program of Phuong et al. (2024) and the red-teaming methodology of Chapter 10. Channel: strategic concealment rather than strategic exaggeration.
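To make the adaptive-overfitting channel concrete, here is a minimal sketch in the spirit of the Ladder mechanism of Blum and Hardt (2015), restated for accuracy rather than loss. The leaderboard releases a new number only when a submission beats the best released score by more than a step size, and it rounds what it releases to that step, so small fluctuations on the holdout leak nothing. The class name, the `step` parameter, and the toy labels are assumptions for illustration, not the authors’ reference implementation.

```python
class Ladder:
    """Leaderboard that releases a new score only on a clear improvement."""

    def __init__(self, holdout_labels, step=0.01):
        self.labels = list(holdout_labels)
        self.step = step
        self.best = float("-inf")  # best released accuracy so far

    def submit(self, predictions):
        """Return the score shown to the builder for this submission."""
        correct = sum(p == y for p, y in zip(predictions, self.labels))
        accuracy = correct / len(self.labels)
        if accuracy > self.best + self.step:
            # Release the new score, rounded to the nearest multiple of step.
            self.best = round(round(accuracy / self.step) * self.step, 10)
        # Otherwise the builder sees the old number and learns nothing new.
        return self.best


if __name__ == "__main__":
    board = Ladder([1] * 20, step=0.1)
    print(board.submit([1] * 10 + [0] * 10))  # first submission is always released
    print(board.submit([1] * 11 + [0] * 9))   # small gain: released score unchanged
    print(board.submit([1] * 14 + [0] * 6))   # clear gain: released score moves
```

The design choice worth noticing is that the mechanism limits information rather than access: a builder may submit as often as it likes, but each submission that fails to clear the step returns a number it already knew.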
8.3.2 Evaluators Game the Audience
This is the direction the formal game of Chapter 9 does not yet model, and it is the chapter’s novel thread. Once a benchmark provider has commercial or reputational incentives, the provider is itself a strategic player:
- Selective disclosure / pay-to-play. Privileged model providers obtain more matches, private pre-testing, or the ability to retract unfavorable results, asymmetries documented in the Chatbot Arena analysis of Singh et al. (2025; see also Chiang et al. 2024 for the platform). Channel: the evaluator controls what the audience sees.
- Conflicts of interest. A platform that ranks models while also building or investing in one of them has an incentive to tilt the playing field.
- Market structure. Why is trustworthy third-party evaluation hard to sustain as a business? The economics of the eval market—who pays, and whether the payer wants an honest signal—explain the recurring failure of independent eval startups (Liao 2025). This is a governance observation: the party best positioned to certify quality is often the party with the least incentive to do so honestly.
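Why private pre-testing matters can be seen with a small simulation. The sketch below assumes a provider may privately test several variants of identical true skill and publish only the best measured one; the skill value, noise level, and function names are hypothetical and are not taken from Singh et al. (2025).

```python
import random
import statistics


def published_score(true_skill, n_variants, noise_sd, rng):
    """Score published when n variants are tested privately and only the best is kept."""
    measured = [true_skill + rng.gauss(0, noise_sd) for _ in range(n_variants)]
    return max(measured)


def mean_published(n_variants, trials=2000, true_skill=1000.0, noise_sd=15.0, seed=0):
    """Average published score over many repetitions of the private-testing procedure."""
    rng = random.Random(seed)
    return statistics.fmean(
        published_score(true_skill, n_variants, noise_sd, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(f"{n:>2} private variants -> mean published score {mean_published(n):.1f}")
```

Every variant here has the same true skill, so any gap between the one-variant and ten-variant averages is pure selection on noise. A provider without pre-testing access cannot reproduce that gap, which is what makes the asymmetry an evaluator-side failure rather than a builder-side one.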
Every defense in Chapter 9 assumes a benevolent evaluator and targets builder-side gaming. The evaluator-side failures of Section 8.3.2 are therefore largely unaddressed by current mechanism design—they are an open frontier (Section 8.5). Naming them as a separate direction is what keeps this chapter an argument rather than a list.
8.4 The Gaming-Channel / Design-Lever Map
This table is the spine that ties diagnosis (this chapter) to design (Chapter 9). Each gaming channel from Section 8.3 is closed—or left open—by a specific lever. The unclosed cells are the chapter’s research agenda.
| Gaming channel | Section | Design lever | Where developed |
|---|---|---|---|
| Contamination / memorization | Section 8.3.1 | Randomize & refresh tasks; live benchmarks | Section 9.2.3; (White et al. 2025; Jain et al. 2024) |
| Adaptive overfitting (repeated querying) | Section 8.3.1 | Noise-gated holdout; differential privacy | Section 9.4.3 |
| Learning the evaluation distribution | Section 8.3.1 | Distribution correction toward uniform | Section 9.4.1 |
| Rank-targeted / metric gaming | Section 8.3.1 | Welfare-aligned (counterfactual) metrics | Section 9.5 |
| Sandbagging on safety evals | Section 8.3.1 | Adversarial / red-team elicitation | Chapter 10 |
| Selective disclosure, pay-to-play | Section 8.3.2 | (open) audit, transparency, conflict rules | Section 8.5 |
| Evaluator conflicts of interest | Section 8.3.2 | (open) institutional separation | Section 8.5.1 |
A practical reading of Table 8.2: the builder-side rows are well-served by the mechanisms of Section 9.7, while the evaluator-side rows require governance rather than mechanism design—rules about who may run a benchmark, what they must disclose, and who audits them.
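As one illustration of a builder-side lever, the “distribution correction toward uniform” row can be read as importance reweighting: each query category counts equally, however often the platform happens to surface it. The sketch below is one plausible reading under that assumption, not the construction of Section 9.4.1; the category labels and the function name are hypothetical.

```python
from collections import defaultdict


def raw_and_uniform_score(results):
    """results: iterable of (category, score) pairs as surfaced by the platform.

    Returns (raw mean over items, mean of per-category means).
    """
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    n_items = sum(len(scores) for scores in by_category.values())
    raw = sum(sum(scores) for scores in by_category.values()) / n_items
    # Uniform target: every category gets weight 1/K regardless of its frequency.
    uniform = sum(sum(s) / len(s) for s in by_category.values()) / len(by_category)
    return raw, uniform


if __name__ == "__main__":
    # A model tuned for the frequently surfaced category looks strong on the raw mean.
    results = [("chat", 0.9)] * 8 + [("code", 0.3)] * 2
    raw, uniform = raw_and_uniform_score(results)
    print(f"raw={raw:.2f} uniform={uniform:.2f}")
```

Under the uniform target, tuning for whichever category the platform over-samples no longer moves the headline number more than improving a rare category does, which is the sense in which the lever closes the “learning the evaluation distribution” channel.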
8.5 The Governance Frontier
The remaining sections promote the chapter’s hardest questions to first-class topics. These are genuinely open; we mark them as such.
8.5.1 Multi-Principal Evaluation and the Strategic Evaluator
The formal game of Section 9.2 has one evaluator and one builder, with the evaluator acting as a benevolent social planner. The ecosystem of Table 8.1 has many principals with conflicting objectives, and an evaluator who may itself be strategic. Open questions:
- What is the right model of an evaluator with its own utility? If the evaluator maximizes traffic or favors an affiliated model, what mechanism keeps its published ranking informative? (This is Discussion Question 1 of Chapter 9, promoted here to a research direction.)
- What is a model of adopter and government behavior? Both are receivers in the signaling game of Section 8.2.1, but with different loss functions and different abilities to audit. A formal account would let us ask which guarantees a benchmark must offer each.
- Who watches the watchers? Independent audit, reproducible evaluation infrastructure (Biderman et al. 2024), and conflict-of-interest rules are the institutional analogues of the technical levers in Chapter 9.
8.5.2 The Role of Open-Source Models
Open weights change the governance calculus in both directions, and it is worth stating the tension plainly:
- For trust: open weights enable independent third-party auditing, reproducible evaluation, and inspection that closed APIs preclude.
- Against trust: open weights make contamination and targeted gaming easier, since the model can be fine-tuned directly against any public benchmark.
(TODO: take a position here, and connect to the public/private benchmark work—the public version supports open auditing while a private holdout preserves an incentive-compatible signal. Add the citation for that paper once it has a key; truong2025guard covers the private-benchmark design.)
8.5.3 Organizing for Trustworthy Evaluation
A closing, deliberately normative section: how should the research community organize a future in which capable AI is evaluated in ways that serve all of its audiences, not just its builders? (TODO: this is the “intelligence serves all” theme—write the vision to leave the reader with, and connect it back to the predictive-evaluation half of the book.)
8.6 Discussion Questions
- The market-for-lemons framing (Section 8.2.1) predicts that high-quality models exit when no credible signal exists. Has the opposite been observed—a credible benchmark causing a flight to quality? What made it credible?
- Table 8.2 leaves the evaluator-side rows open. Sketch a mechanism (technical or institutional) that would make selective disclosure costly for a platform.
- Sandbagging (Section 8.3.1) inverts the usual gaming direction. Does any of the design machinery in Chapter 9 help against strategic under-performance, or is it fundamentally a red-teaming problem (Chapter 10)?
- Adopters and investors consume the same number but take different decisions. Should a benchmark report different statistics to different audiences, and does that reintroduce the garbling problem of Section 9.5?
- Argue both sides of Section 8.5.2: does open-sourcing models make the evaluation ecosystem more or less trustworthy on net?
- The Kaggle era had a benevolent evaluator who owned the test set and controlled submission. List the structural features of the LLM era that break this, and rank them by how much damage each does.
8.7 Bibliographic Notes
Goodhart’s Law and its variants are treated formally in Section 9.1; the strategic-classification and performative-prediction foundations are Hardt et al. (2016) and Perdomo et al. (2020).
Signaling and information economics. The market-for-lemons and signaling framing is Akerlof (1970) and Spence (1973). The information-design view of disclosure is developed in Chapter 9 via Kamenica and Gentzkow (2011).
Gaming in the wild. Leaderboard-platform asymmetries are analyzed by Singh et al. (2025; platform background: Chiang et al. 2024); contamination evidence and detection by Zhang et al. (2024) and Oren et al. (2023); the OpenLLM Leaderboard’s archival by Aidar Myrzakhan (2024). Contamination-resistant “live” benchmarks include White et al. (2025) and Jain et al. (2024). Sandbagging and dangerous-capability evaluation are Weij et al. (2024) and Phuong et al. (2024). The construct-validity critique of general benchmarks is Raji et al. (2021). The economics of the eval market is discussed by Liao (2025).
Design levers for every channel in Table 8.2 are developed in Chapter 9. Reproducible evaluation infrastructure is Biderman et al. (2024).
8.8 Exercises
- (Easy) Fill in Table 8.1 for a familiar benchmark: for each actor, name the concrete decision and the concrete distortion.
- (Easy) Modify the decoupling simulation (Section 8.2.1) so that gaming effort is correlated with low true utility (the worst models game hardest). How does the decoupling curve change?
- (Medium) Formalize a one-shot signaling game between a builder (high- or low-quality) and an adopter, where a benchmark score is a costly signal. Derive the condition on gaming cost under which a separating equilibrium exists, and relate it to the design levers of Chapter 9.
- (Medium) Pick one open (evaluator-side) row of Table 8.2 and write a half-page design proposal that would move it from open to closed.
- (Hard) Extend the Stackelberg game of Section 9.2 to a strategic evaluator whose utility is a convex combination of social welfare and own-traffic. Characterize when the evaluator’s optimal mechanism still incentivizes broad builder capability.