4  Efficient Measurement

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Formulate the benchmark design problem as an optimization over item selection and scoring rules.
  2. Apply Fisher information to select maximally informative items, and implement a Computerized Adaptive Testing procedure.
  3. Construct D-optimal item pools that maximize the precision of ability estimates.
  4. Design efficient paired-comparison tournaments using information-theoretic principles.
  5. Explain stopping rules and practical constraints (cost, time, contamination) for adaptive AI evaluation.

This chapter can be covered in 2 lectures (75-90 minutes each):

Lecture 1: Fisher Information and Adaptive Testing

  • The benchmark design problem (15 min)
  • Fisher information for item selection (20 min)
  • Computerized Adaptive Testing (30 min)
  • Hands-on: CAT simulation (10 min)

Lecture 2: Optimal Design and Paired Comparisons

  • D-optimal design and item pool construction (30 min)
  • Design for paired comparisons (20 min)
  • Stopping rules and practical considerations (15 min)
  • Hands-on: D-optimal design simulation (10 min)

Notation

This chapter introduces \(I_j(\theta)\) (Fisher information for item \(j\)) and \(\mathcal{I}(\theta)\) (Fisher information matrix). See ?sec-notation for the complete notation reference.

4.1 The Design Problem in AI Evaluation

Chapter 1 introduced the measurement models—Rasch, 2PL, factor models, Bradley-Terry—that formalize how latent abilities generate observed responses. Chapter 2 showed how to estimate the parameters of these models from data. But both chapters took the response matrix \(Y\) as given. In practice, someone must design the evaluation: choosing which items to include and how to score responses. These design decisions profoundly affect the quality and efficiency of the resulting measurements.

The benchmark designer faces two fundamental efficiency questions:

  1. Item selection: Which questions or tasks should the benchmark include? Given a pool of candidate items, how do we select a subset that maximizes the precision of our measurements?

  2. Scoring rules: How do we aggregate responses into scores? Sum scores, weighted scores, latent factor scores? The choice interacts with the measurement model (recall from Chapter 1 that sum scores are sufficient for the Rasch model but not for 2PL).

This chapter develops the statistical foundations for efficient evaluation design, drawing on classical experimental design and information theory. We apply these frameworks to the practical problem of designing AI evaluation benchmarks that achieve high measurement precision with minimal cost.

4.2 Optimal Experimental Design

We begin with the simplest setting: the evaluator wants to measure model abilities as precisely as possible, and the models respond honestly. This is the classical problem of optimal experimental design applied to measurement.

4.2.1 Fisher Information for Item Selection

Recall from Chapter 2 that the Fisher information for item \(j\) at ability \(\theta\) in the Rasch model is:

\[ I_j(\theta) = P_j(\theta) \cdot (1 - P_j(\theta)) \tag{4.1}\]

where \(P_j(\theta) = \sigma(\theta - \beta_j)\) is the probability of a correct response. This quantity measures how much observing a response to item \(j\) tells us about \(\theta\).

Fisher information is maximized when \(P_j(\theta) = 0.5\), which occurs when \(\theta = \beta_j\)—the item difficulty matches the model’s ability. Intuitively, a question that a model gets right 99% of the time or wrong 99% of the time reveals almost nothing about the model’s ability. The most informative questions are those where the outcome is uncertain.

For a test consisting of items \(\{j_1, \ldots, j_K\}\), the total information is additive under local independence:

\[ I_{\text{total}}(\theta) = \sum_{k=1}^K I_{j_k}(\theta) \]

This additivity is the foundation of optimal item selection: we want to choose items that collectively maximize the total information across the range of abilities we care about.
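As a quick illustration of the two formulas above, here is a minimal sketch (the function name `rasch_info` is ours) that computes item information and the additive total information for a small test:

```python
import numpy as np

def rasch_info(theta, beta):
    """Fisher information I_j(theta) = P_j(theta) * (1 - P_j(theta)) for a Rasch item."""
    p = 1.0 / (1.0 + np.exp(-(theta - beta)))  # P_j(theta) = sigma(theta - beta_j)
    return p * (1.0 - p)

# A 3-item test; under local independence, total information at theta = 0
# is the sum of the item informations.
betas = np.array([-1.0, 0.0, 1.0])
total = rasch_info(0.0, betas).sum()  # ≈ 0.643
```

Note that the item with difficulty matched to the ability (\(\beta = 0\) at \(\theta = 0\)) contributes the maximum value of \(1/4\), while the off-target items contribute less.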

4.2.2 Computerized Adaptive Testing

Computerized Adaptive Testing (CAT) is the sequential application of optimal item selection. Rather than administering a fixed test to all models, CAT adapts the test in real time: after each response, it selects the next item that would be most informative given what we have learned so far.

The CAT procedure iterates four steps:

  1. Select the most informative item given the current ability estimate
  2. Administer the item and observe the response
  3. Update the ability estimate using the new data
  4. Check a stopping criterion; if not met, return to step 1

Why Fisher Information for Item Selection?

Fisher information measures how much a response to item \(j\) tells us about \(\theta\):

  • High information: The item difficulty is well-matched to the ability level
  • Low information: The item is too easy or too hard

Asking a frontier model to answer \(1 + 1\) or a small model to prove the Riemann hypothesis provides almost no information. The most informative items are those where the model has roughly a 50% chance of success.

4.2.3 CAT Implementation
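The four-step procedure above can be sketched as a simulation. This is a minimal illustration, not a production implementation: it assumes a pre-calibrated Rasch item pool, a standard-normal prior, and an EAP (posterior mean) ability update on a grid; the function name `cat_simulation` and the stopping parameters are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cat_simulation(theta_true, betas, max_items=30, se_stop=0.3):
    """Simulate a Rasch CAT run against a calibrated item pool."""
    grid = np.linspace(-4, 4, 161)
    log_post = -0.5 * grid**2          # N(0, 1) prior (up to a constant)
    available = list(range(len(betas)))
    theta_hat, se, n_used = 0.0, np.inf, 0
    for _ in range(max_items):
        # 1. Select the most informative item at the current estimate
        info = [sigmoid(theta_hat - betas[j]) * (1 - sigmoid(theta_hat - betas[j]))
                for j in available]
        j = available.pop(int(np.argmax(info)))
        # 2. Administer the item (simulated Rasch response)
        y = rng.random() < sigmoid(theta_true - betas[j])
        n_used += 1
        # 3. Update the posterior over theta
        p = sigmoid(grid - betas[j])
        log_post += np.log(p) if y else np.log(1 - p)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        theta_hat = float(np.sum(grid * post))
        se = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * post)))
        # 4. Stop once the standard error is small enough
        if se <= se_stop:
            break
    return theta_hat, se, n_used

betas = np.linspace(-3, 3, 200)
theta_hat, se, n_items = cat_simulation(theta_true=1.0, betas=betas)
```

In a typical run the standard-error criterion triggers well before the item budget is exhausted, which is exactly the efficiency gain CAT promises over a fixed-length test.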

4.2.4 Stopping Rules and Practical Considerations

CAT requires a stopping criterion. Common choices include:

  1. Reliability threshold: Stop when \(R = 1 - \sigma^2_{\text{post}} / \sigma^2_{\text{prior}} \geq 0.95\)
  2. Standard error threshold: Stop when \(\text{SE}(\hat{\theta}) \leq 0.3\)
  3. Fixed length: Administer exactly \(K\) items
  4. Information threshold: Stop when additional items provide negligible information
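The first two criteria can be checked directly from the prior and posterior variances; a hedged sketch (the function name and default thresholds are illustrative, matching the values quoted above):

```python
def should_stop(prior_var, post_var, se_threshold=0.3, rel_threshold=0.95):
    """Stop when reliability R = 1 - post_var / prior_var reaches rel_threshold,
    or when SE(theta_hat) = sqrt(post_var) falls to se_threshold."""
    reliability = 1.0 - post_var / prior_var
    se = post_var ** 0.5
    return reliability >= rel_threshold or se <= se_threshold

# With a N(0, 1) prior: posterior variance 0.04 gives R = 0.96 and SE = 0.2 -> stop
```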

For AI evaluation, practical constraints interact with statistical criteria:

  • Cost: Each API call has a monetary cost; the stopping rule should account for evaluation budgets.
  • Time: Evaluations must complete within deadlines.
  • Contamination: Administering too many items from the same pool risks benchmark leakage into training data.

CAT for AI Evaluation

Traditional CAT assumes deterministic responses: a human test-taker gives the same answer if asked the same question twice. AI models may or may not satisfy this depending on temperature and sampling settings.

For deterministic evaluation (temperature = 0), CAT applies directly. For stochastic evaluation, we may need multiple samples per item, or methods that account for response variability.

CAT also requires pre-calibrated item parameters. In a cold-start scenario (new benchmark), we must first collect data on a pilot sample of models before CAT can be deployed. This connects to the cold-start prediction problem addressed in Section 2.6.

4.2.5 D-Optimal Design

CAT is sequential optimal design—it selects one item at a time. But sometimes we need to design a fixed test: selecting \(K\) items from a pool of \(M\) candidates to administer to all models at once. This is the classical problem of optimal experimental design.

For a \(K\)-dimensional factor model with ability vector \(\theta \in \mathbb{R}^K\), the Fisher information matrix from administering a set of items \(\mathcal{J}\) is:

\[ \mathcal{I}(\theta; \mathcal{J}) = \sum_{j \in \mathcal{J}} P_j(\theta)(1 - P_j(\theta)) \, V_j V_j^\top \]

where \(V_j \in \mathbb{R}^K\) is the factor loading vector for item \(j\). Different optimality criteria lead to different item selection strategies:

  • D-optimal: Maximize \(\det(\mathcal{I})\)—the volume of the confidence ellipsoid. This minimizes the generalized variance of the ability estimates.
  • A-optimal: Minimize \(\text{tr}(\mathcal{I}^{-1})\)—the average variance of individual ability components.
  • E-optimal: Maximize \(\lambda_{\min}(\mathcal{I})\)—the smallest eigenvalue. This ensures no ability dimension is poorly estimated.

D-optimal design produces item pools with difficulties spread across the target ability range, ensuring high information everywhere. Random selection tends to cluster items near the center of the pool’s difficulty distribution, leaving the tails poorly covered.
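Exact D-optimal subset selection is combinatorial, so in practice a greedy approximation is common: at each step, add the item that most increases \(\det(\mathcal{I})\). The sketch below follows the multidimensional model above with \(P_j(\theta) = \sigma(V_j^\top \theta - \beta_j)\); the function name and the small ridge term (which keeps the determinant nonzero before the matrix is full rank) are our own choices.

```python
import numpy as np

def greedy_d_optimal(V, betas, theta, n_select, ridge=1e-6):
    """Greedily select n_select items maximizing det of the information matrix.

    V: (M, d) factor loadings; betas: (M,) difficulties;
    theta: (d,) ability point at which the design is evaluated."""
    M, d = V.shape
    p = 1.0 / (1.0 + np.exp(-(V @ theta - betas)))
    w = p * (1 - p)                      # item information weights P_j (1 - P_j)
    info = ridge * np.eye(d)
    chosen, remaining = [], set(range(M))
    for _ in range(n_select):
        # j* = argmax_j det(I + w_j V_j V_j^T)
        best_j, best_det = None, -np.inf
        for j in remaining:
            det = np.linalg.det(info + w[j] * np.outer(V[j], V[j]))
            if det > best_det:
                best_j, best_det = j, det
        chosen.append(best_j)
        remaining.remove(best_j)
        info += w[best_j] * np.outer(V[best_j], V[best_j])
    return chosen, info

rng = np.random.default_rng(1)
V = rng.normal(size=(50, 2))
betas = rng.normal(size=50)
chosen, info = greedy_d_optimal(V, betas, theta=np.zeros(2), n_select=10)
```

Swapping `np.linalg.det` for `np.trace(np.linalg.inv(...))` or the smallest eigenvalue gives greedy A-optimal and E-optimal variants of the same loop.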

4.2.6 Design for Paired Comparisons

When evaluation is based on paired comparisons (as in the Chatbot Arena from Chapter 1), the design problem takes a different form. Instead of selecting items, we must decide which pairs of models to compare. Under the Bradley-Terry model, the information from comparing models \(i\) and \(k\) about their strength difference \(\theta_i - \theta_k\) is:

\[ I_{ik}(\theta) = P_{ik}(1 - P_{ik}), \quad P_{ik} = \sigma(\theta_i - \theta_k) \]

This is maximized when the models are evenly matched (\(P_{ik} = 0.5\), i.e., \(\theta_i = \theta_k\)). The design problem is to choose a tournament schedule—which pairs to compare, and how often—that maximizes the precision of the estimated ratings.

Classical solutions include balanced incomplete block designs (BIBDs), where each pair of models is compared equally often. For AI evaluation arenas, adaptive matchmaking algorithms serve the same role as CAT: they select matchups that are most informative given current rating estimates. This is precisely what the Chatbot Arena does when it pairs models with similar Elo ratings.
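Adaptive matchmaking can be sketched as choosing the pair with maximal information under the current rating estimates; this is a simplified illustration (function name ours) that ignores exploration bonuses and repeat-pair bookkeeping that a real arena would need.

```python
import numpy as np

def most_informative_pair(theta_hat):
    """Return the pair (i, k) maximizing I_ik = P_ik (1 - P_ik) under
    Bradley-Terry, i.e. the most evenly matched pair of models."""
    n = len(theta_hat)
    best, best_info = None, -1.0
    for i in range(n):
        for k in range(i + 1, n):
            p = 1.0 / (1.0 + np.exp(-(theta_hat[i] - theta_hat[k])))
            info = p * (1 - p)
            if info > best_info:
                best, best_info = (i, k), info
    return best

ratings = np.array([0.0, 2.5, 0.1, -1.0])
pair = most_informative_pair(ratings)   # models 0 and 2 are closest in rating
```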

4.3 Discussion Questions

  1. Adaptive testing for AI. What are the practical challenges in deploying CAT for AI evaluation? Consider: determinism of model responses, cost of API calls, benchmark contamination, and the need for pre-calibrated items.

  2. Optimal vs. fixed design. When is it better to use adaptive testing (CAT) versus a fixed D-optimal test? What factors determine this choice in practice?

  3. Paired comparison design. The Chatbot Arena uses adaptive matchmaking to pair models with similar Elo ratings. What are the advantages and disadvantages of this approach compared to a balanced tournament design?

  4. Evaluation budgets. A team has a fixed budget of 10,000 API calls to evaluate 50 models on a benchmark with 500 items. Design an evaluation strategy that maximizes measurement precision under this constraint.

4.4 Bibliographic Notes

4.4.1 Optimal Experimental Design

The theory of optimal experimental design originates with Kiefer (1959). D-optimal design is covered in Atkinson, Donev, and Tobias (2007). For the connection between Fisher information and test design in IRT, see Linden (2006) and Chang and Ying (2009). Wright and Stone (1979) provides an accessible introduction to test design under the Rasch model.

4.4.2 Computerized Adaptive Testing

CAT has a rich history beginning with Lord (1970). The Fisher information criterion for item selection was developed by Birnbaum (1968). For multidimensional CAT, see Segall (1996). Applications to AI evaluation are emerging; see Polo et al. (2024) for recent work on efficient benchmark design. Truong et al. (2025) demonstrate the practical impact of adaptive testing in the scaling-law setting: by combining IRT-calibrated item parameters with Elo-based adaptive item selection, they achieve decision accuracy comparable to full-scale evaluation using only 50 questions per benchmark, a 99.9% reduction in the query budget.

4.5 Exercises

4.5.1 Theoretical Exercises

Exercise 4.1 (\(\star\)): Show that Fisher information \(I_j(\theta) = P_j(\theta)(1 - P_j(\theta))\) is maximized when \(\theta = \beta_j\), and that the maximum value is \(1/4\).

Exercise 4.2 (\(\star\star\)): For the 2PL model \(P_j(\theta) = \sigma(a_j(\theta - \beta_j))\), derive the Fisher information and show that it equals \(a_j^2 P_j(1 - P_j)\). How does discrimination \(a_j\) affect optimal item selection?

Exercise 4.3 (\(\star\star\)): Design a stopping rule for CAT that balances measurement precision with evaluation cost. Assume each API call costs $0.01 and the value of reducing standard error by one unit is $1. Find the cost-optimal stopping point.

4.5.2 Computational Exercises

Exercise 4.4 (\(\star\star\)): Extend the D-optimal design simulation to a 2-dimensional factor model with \(K = 2\). Implement item selection using \(j^* = \arg\max_j \det(\mathcal{I} + I_j)\) where \(I_j = P_j(1 - P_j) V_j V_j^\top\). Compare the selected item loading vectors to random selection.

Exercise 4.5 (\(\star\star\)): Compare the convergence of CAT across models with different ability levels. Does CAT require more items for extreme abilities (very high or very low)? Why?

Exercise 4.6 (\(\star\star\)): Investigate the sensitivity of CAT to misspecification of item parameters. If the calibration sample differs systematically from the test population, how does CAT performance degrade? Simulate this scenario.

References

Atkinson, Anthony C., Alexander N. Donev, and Randall D. Tobias. 2007. Optimum Experimental Designs, with SAS. Oxford University Press.
Birnbaum, Allan. 1968. "Some Latent Trait Models and Their Use in Inferring an Examinee's Ability." In Statistical Theories of Mental Test Scores, edited by Frederic M. Lord and Melvin R. Novick. Reading, MA: Addison-Wesley.
Chang, Hua-Hua, and Zhiliang Ying. 2009. “Nonlinear Sequential Designs for Logistic Item Response Theory Models with Applications to Computerized Adaptive Tests.” Annals of Statistics 37 (3): 1466–88.
Kiefer, Jack. 1959. “Optimum Experimental Designs.” Journal of the Royal Statistical Society: Series B 21 (2): 272–319.
Linden, Wim J. van der. 2006. “Optimal Test Design.” Handbook of Statistics 26: 575–95.
Lord, Frederic M. 1970. "Some Test Theory for Tailored Testing." In Computer-Assisted Instruction, Testing, and Guidance, edited by Wayne H. Holtzman. New York: Harper & Row.
Polo, Felipe Maia, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. 2024. "tinyBenchmarks: Evaluating LLMs with Fewer Examples." arXiv preprint arXiv:2402.14992.
Segall, Daniel O. 1996. "Multidimensional Adaptive Testing." Psychometrika 61 (2): 331–54.
Truong, Sang, Yuheng Tu, Rylan Schaeffer, and Sanmi Koyejo. 2025. “Item Response Scaling Laws.” arXiv Preprint.
Wright, Benjamin D., and Mark H. Stone. 1979. Best Test Design. Chicago: MESA Press.