7 Causality and Distribution Shift
By the end of this chapter, you will be able to:
- Formulate causal models for AI evaluation and distinguish causal from associational claims about benchmark performance.
- Apply structural causal models (SCMs) to represent the data-generating process behind evaluation data, including the roles of training data, model architecture, and benchmark design.
- Identify when distribution shift—between training, calibration, and deployment populations—threatens the validity of evaluation conclusions, and characterize shift types (covariate, label, concept).
- Use interventional and counterfactual reasoning to diagnose construct-irrelevant variance: does the benchmark score change because ability changed, or because something else did?
- Explain the connection between Borsboom’s causal theory of validity and modern causal inference: a benchmark is valid if and only if the target construct causally produces variation in scores.
- Apply doubly robust estimation to correct for selection bias in adaptive and non-representative evaluations, connecting importance weighting, model-based imputation, and prediction-powered inference.
- Evaluate when benchmark results generalize across deployment contexts and when they do not, using transportability criteria and conformal inference.
This chapter can be covered in 3 lectures (75-90 minutes each):
Lecture 1: Causal Models for Evaluation
- From correlation to causation in benchmark scores (20 min)
- Structural causal models and the evaluation DAG (25 min)
- Interventional reasoning: diagnosing contamination and CIV (20 min)
- Borsboom’s causal validity revisited (10 min)
Lecture 2: Distribution Shift and Transportability
- Types of distribution shift: covariate, label, concept (20 min)
- When do benchmark results generalize? Transportability (25 min)
- The evaluation-as-bandit framing (20 min)
- Hands-on: shift visualization (10 min)
Lecture 3: Robust Estimation and Conformal Prediction
- Three estimators: DM, IPW, doubly robust (25 min)
- Connection to prediction-powered inference (15 min)
- Conformal prediction under distribution shift (20 min)
- Hands-on: DR estimation and weighted conformal (15 min)
This chapter introduces causal and distributional notation: \(\text{do}(X=x)\) (intervention), \(P^{(s)}/P^{(t)}\) (source/target distributions), \(\pi_0/\pi\) (logging/target policies), \(w(x)\) (importance weights), \(\hat{V}_{\text{DR}}\) (doubly robust estimator), and \(C_\alpha(x)\) (conformal prediction sets). See ?sec-notation for the complete notation reference.
7.1 From Association to Causation in Benchmark Scores
Model A outperforms Model B on a popular reasoning benchmark. What can we conclude? The association is clear: \(\mathbb{E}[Y \mid \text{Model} = A] > \mathbb{E}[Y \mid \text{Model} = B]\). But does this mean Model A has greater reasoning ability? Or did Model A’s training data happen to include problems similar to the benchmark items, giving it a memorization advantage that has nothing to do with reasoning?
This distinction—between associational and causal claims—is the central concern of this chapter. In Chapter 2, we introduced Borsboom’s realist definition of validity: a test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variation in the attribute causally produces variation in test scores (Borsboom 2005). In Chapter 6, we developed the practical implications of this definition through content, criterion, construct, external, and consequential validity. This chapter provides the formal causal machinery to make these ideas precise.
7.1.1 The Causal Hierarchy
Pearl (2009) distinguishes three levels of causal reasoning, each requiring progressively stronger assumptions:
Association \(P(Y \mid X)\): What does observing \(X\) tell us about \(Y\)? Example: “Models with more parameters tend to score higher on MMLU.” This is a correlation that could arise from many causal structures.
Intervention \(P(Y \mid \text{do}(X = x))\): What happens to \(Y\) if we set \(X\) to \(x\), regardless of what would have occurred naturally? Example: “If we increase the context window from 4K to 128K tokens, does the reasoning score improve?” This requires knowing the causal structure, not just the joint distribution.
Counterfactual \(P(Y_x \mid X = x')\): What would have happened to \(Y\) if \(X\) had been \(x\), given that we actually observed \(X = x'\)? Example: “Would this model have scored lower on the benchmark if its training data had not included the benchmark’s source documents?” Counterfactuals reason about individual cases, not populations.
Most benchmark analyses operate at Level 1. A leaderboard reports \(P(Y \mid \text{Model})\)—the association between model identity and score. But the claims we want to make are at Level 2 or 3: “This model has better reasoning ability” is a causal claim about what produces the score.
7.1.2 Structural Causal Models
A Structural Causal Model (SCM) provides the formal language for causal reasoning (Pearl 2009).
An SCM is a tuple \(\mathcal{M} = (U, V, F, P(U))\) where:
- \(U\) is a set of exogenous (background) variables
- \(V = \{V_1, \ldots, V_n\}\) is a set of endogenous variables
- \(F = \{f_1, \ldots, f_n\}\) is a set of structural equations, \(V_i = f_i(\text{pa}(V_i), U_i)\), where \(\text{pa}(V_i) \subseteq V \setminus \{V_i\}\)
- \(P(U)\) is a distribution over exogenous variables
Each SCM induces a directed acyclic graph (DAG) where edges point from \(\text{pa}(V_i)\) to \(V_i\). The DAG encodes conditional independence structure: \(V_i \perp\!\!\!\perp V_j \mid S\) if \(S\) d-separates \(V_i\) and \(V_j\) in the DAG.
7.1.3 The Evaluation DAG
We now construct an SCM for the AI evaluation process. The key endogenous variables are:
- \(D\): Training data (corpus composition, size, quality)
- \(A\): Architecture (transformer variant, parameter count, training recipe)
- \(\theta\): Latent ability (the construct the benchmark claims to measure)
- \(B\): Benchmark design (item pool, selection criteria, format)
- \(\beta\): Item properties (difficulty, discrimination, content)
- \(F\): Formatting (prompt template, few-shot examples, system prompt)
- \(Y\): Observed score (the benchmark outcome)
The causal structure is:
\[ D \to \theta, \quad A \to \theta, \quad \theta \to Y, \quad \beta \to Y, \quad B \to \beta, \quad F \to Y \]
When the benchmark is valid in Borsboom’s sense, the primary path from \(D\) and \(A\) to \(Y\) passes through \(\theta\): training data and architecture determine ability, and ability determines the score. The item properties \(\beta\) moderate this relationship (harder items produce lower scores for the same ability), but do not confound it.
Validity threats as DAG pathologies. The threats identified in Section 6.4 correspond to specific DAG structures:
- Contamination (Section 6.4.3): A direct path \(D \to Y\) that bypasses \(\theta\). The model’s training data includes benchmark items, so the score reflects memorization rather than ability. This is a confounding path that inflates the observed association between \(\theta\) and \(Y\).
- Construct-irrelevant variance (Section 6.4.1): A path \(F \to Y\) that does not pass through \(\theta\). Prompt formatting affects the score independent of ability. This is an additional cause of \(Y\) that is not part of the target construct.
- Differential item functioning (Section 6.4.4): A direct effect \(A \to Y\) that is not mediated by \(\theta\). Certain architectures have advantages on certain items independent of their ability level.
Under Borsboom’s causal framework, a benchmark is valid if and only if the only systematic path from the data-generating process to the observed score passes through the target construct \(\theta\). Every unblocked non-\(\theta\) path is a validity threat. The diagnostic tools from Section 6.5 — DIF analysis, dimensionality assessment, item-fit statistics — are empirical tests for whether such paths exist.
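These pathologies can be simulated directly. The sketch below uses hypothetical linear structural equations for the evaluation DAG and an assumed contamination effect; it shows how a direct \(D \to Y\) path inflates contaminated models' scores beyond what their ability difference explains:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 2000

# Hypothetical linear structural equations for the evaluation DAG
D = rng.normal(size=n_models)                      # training data quality
A = rng.normal(size=n_models)                      # architecture quality
theta = 0.6 * D + 0.4 * A + rng.normal(scale=0.3, size=n_models)  # latent ability

# Half the models are contaminated: a direct D -> Y path bypasses theta
contaminated = rng.random(n_models) < 0.5
Y = theta + rng.normal(scale=0.5, size=n_models)   # valid path: theta -> Y
Y = Y + np.where(contaminated, 0.8 * np.maximum(D, 0.0), 0.0)

score_gap = Y[contaminated].mean() - Y[~contaminated].mean()
ability_gap = theta[contaminated].mean() - theta[~contaminated].mean()
print(f"score gap {score_gap:.2f} vs ability gap {ability_gap:.2f}")
```

The two groups have essentially the same average ability, yet the contaminated group's average score is noticeably higher; that excess is exactly the contribution of the non-\(\theta\) path.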
7.2 Interventional and Counterfactual Diagnostics
The evaluation DAG from Section 7.1.3 allows us to reason about interventions—what happens when we deliberately change one variable while holding others fixed.
7.2.1 The Back-Door Adjustment
The back-door criterion (Pearl 2009) identifies when observational data suffices to estimate causal effects. A set of variables \(Z\) satisfies the back-door criterion relative to an ordered pair \((X, Y)\) if: (i) no node in \(Z\) is a descendant of \(X\), and (ii) \(Z\) blocks every path between \(X\) and \(Y\) that contains an arrow into \(X\).
When the back-door criterion is satisfied:
\[ P(Y \mid \text{do}(X = x)) = \sum_z P(Y \mid X = x, Z = z) P(Z = z) \tag{7.1}\]
This adjustment formula removes confounding by averaging over the confounder distribution. For AI evaluation, this means: to estimate the causal effect of ability \(\theta\) on score \(Y\), we must adjust for all confounders—variables that cause both \(\theta\) and \(Y\) through non-\(\theta\) paths.
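A minimal simulation, with an assumed binary confounder \(Z\) and made-up probabilities, illustrates how the adjustment formula in Equation 7.1 recovers a causal effect that the naive conditional contrast overstates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Confounder Z (e.g., training-data overlap) causes both X and Y
Z = rng.random(n) < 0.5
X = rng.random(n) < np.where(Z, 0.8, 0.2)    # P(X=1 | Z)
# True causal effect of X on Y is +0.1; Z adds +0.3 through a non-X path
pY = 0.2 + 0.1 * X + 0.3 * Z
Y = rng.random(n) < pY

# Naive (associational) contrast is confounded
naive = Y[X].mean() - Y[~X].mean()

# Back-door adjustment, Eq. 7.1: sum_z P(Y | X=x, Z=z) P(Z=z)
def adjusted(x):
    return sum(Y[(X == x) & (Z == z)].mean() * (Z == z).mean()
               for z in (False, True))

effect = adjusted(True) - adjusted(False)
print(f"naive: {naive:.3f}, adjusted: {effect:.3f}  (truth: 0.100)")
```

The naive contrast is roughly triple the true effect because \(X = 1\) models disproportionately have \(Z = 1\); averaging the within-stratum contrasts over \(P(Z)\) removes the bias.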
7.2.2 Diagnosing Contamination
The contamination diagnostic from Section 6.4.3 can be formalized as an interventional question. Let \(D_{\text{overlap}}\) denote the subset of training data that overlaps with benchmark items. The causal question is:
\[ P(Y \mid \text{do}(D_{\text{overlap}} = \emptyset)) \stackrel{?}{=} P(Y \mid D_{\text{overlap}} = \emptyset) \]
If these are equal, removing the overlapping data does not change the score beyond what we would expect from the reduced training set. If they differ, the overlap was providing a direct \(D \to Y\) path (memorization) rather than contributing to \(\theta\) (genuine learning).
In practice, we cannot perform this intervention on already-trained models. But we can approximate it through:
- Chronological splits: Items created after the training data cutoff cannot be memorized. Comparing scores on pre- vs. post-cutoff items estimates the contamination effect.
- Canary detection: Embedding unique strings in benchmark items and testing whether models reproduce them provides evidence for the \(D \to Y\) path (Jacovi et al. 2023).
- Item-fit analysis: The infit/outfit statistics from Section 6.5.3 detect items where some models perform unexpectedly well—a signature of the memorization path.
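The chronological-split diagnostic can be sketched under a Rasch model with an assumed memorization boost on pre-cutoff items; the pre- vs. post-cutoff accuracy gap is the contamination signal:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_correct(theta, beta):
    return 1 / (1 + np.exp(-(theta - beta)))   # Rasch model

n_items = 4000
beta = rng.normal(size=n_items)                # item difficulties
pre_cutoff = np.arange(n_items) < n_items // 2 # items that existed at training time

theta = 0.0
p = p_correct(theta, beta)
# Assumed memorization boost on items the model could have seen
p = np.where(pre_cutoff, np.minimum(p + 0.25, 1.0), p)
resp = rng.random(n_items) < p

acc_pre = resp[pre_cutoff].mean()
acc_post = resp[~pre_cutoff].mean()
print(f"pre-cutoff: {acc_pre:.2f}, post-cutoff: {acc_post:.2f}")
```

Since item difficulties are drawn from the same distribution in both halves, a large pre/post gap cannot be explained by \(\theta\) or \(\beta\) and points to the direct \(D \to Y\) path.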
7.2.3 Counterfactual Reasoning for DIF
Differential Item Functioning (Section 6.4.4) has a natural counterfactual interpretation. An item \(i\) exhibits DIF between groups \(g = 0\) and \(g = 1\) if:
\[ P(Y_{i,g=1} = 1 \mid \theta) \neq P(Y_{i,g=0} = 1 \mid \theta) \]
This is a counterfactual statement: “For a model with ability \(\theta\), would the response to item \(i\) have been different if the model belonged to the other group, holding ability fixed?” The “group” variable \(g\) (e.g., model architecture family) has a direct effect on the item response that is not mediated by \(\theta\).
Under the evaluation DAG, DIF corresponds to a path \(A \to Y\) that does not pass through \(\theta\)—an architectural advantage on specific items that reflects something other than the target construct. The Mantel-Haenszel procedure from Section 6.5.1 estimates this direct effect by stratifying on \(\theta\) (using total score as a proxy), which blocks the \(A \to \theta \to Y\) path and isolates the \(A \to Y\) direct effect.
A score difference \(Y_A - Y_B\) between two models reflects a genuine ability difference only if all paths from the data-generating process to \(Y\) that do not pass through \(\theta\) are blocked. Common unblocked paths in AI evaluation:
- Training data overlap (\(D \to Y\) bypassing \(\theta\)): memorization inflates scores
- Prompt formatting (\(F \to Y\)): format preferences affect scores independent of ability
- Evaluation order (position bias in LLM-as-judge): systematic bias from presentation order
The causal diagnostics in this section and the statistical diagnostics in Section 6.5 are complementary approaches to detecting these unblocked paths.
7.3 Distribution Shift and Transportability
In Section 6.3.4, we raised the question of whether benchmark results generalize beyond the specific conditions of the evaluation. This section provides the formal framework.
7.3.1 Setup
Let \(P^{(s)}(X, Y)\) denote the source (benchmark) joint distribution over items \(X\) and responses \(Y\), and \(P^{(t)}(X, Y)\) the target (deployment) distribution. The fundamental question is: when does a performance estimate computed under \(P^{(s)}\) remain valid under \(P^{(t)}\)?
The joint distribution factorizes as \(P(X, Y) = P(Y \mid X) P(X)\), giving rise to a taxonomy of shift types based on which component changes.
Covariate shift: \(P^{(s)}(X) \neq P^{(t)}(X)\) but \(P^{(s)}(Y \mid X) = P^{(t)}(Y \mid X)\). The item distribution changes but the model’s conditional behavior is stable. Example: a benchmark oversamples easy items; deployment sees a harder mix.
Label shift (prior probability shift): \(P^{(s)}(Y) \neq P^{(t)}(Y)\) but \(P^{(s)}(X \mid Y) = P^{(t)}(X \mid Y)\). The base rate of outcomes changes. Example: a benchmark has 50% correct base rate; deployment tasks have 10%.
Concept drift: \(P^{(s)}(Y \mid X) \neq P^{(t)}(Y \mid X)\). The relationship between items and responses changes. Example: a coding benchmark becomes outdated as programming conventions evolve, so the correct answers shift.
7.3.2 Covariate Shift in AI Evaluation
Covariate shift is the most common and most correctable form of shift. In AI evaluation, it arises whenever the benchmark item pool is not representative of the deployment item distribution.
Consider a benchmark designed to measure mathematical reasoning. If the item pool is drawn disproportionately from algebra (easy) rather than from combinatorics, number theory, and analysis (hard), then the benchmark’s item distribution \(P^{(s)}(X)\) is shifted toward easy items relative to the deployment distribution \(P^{(t)}(X)\). The model’s behavior on any given item—its conditional response \(P(Y \mid X)\)—is the same in both settings. But the overall accuracy computed under \(P^{(s)}\) overestimates deployment accuracy because easy items are overweighted.
This connects directly to the content validity discussion in Section 6.3.1: a benchmark with poor content validity (non-representative item sampling) is a benchmark with covariate shift relative to the construct domain.
7.3.3 Concept Drift in AI Evaluation
Concept drift is more dangerous because the model’s conditional behavior itself changes. In AI evaluation, concept drift occurs when:
- Model updates: A model is updated via RLHF, fine-tuning, or system prompt changes. The IRT parameters calibrated on the old model no longer apply.
- Prompt template changes: Switching from a zero-shot to a few-shot template changes \(P(Y \mid X)\) even for the same items.
- Temporal evolution: Programming languages evolve, scientific knowledge advances, and social norms shift. A benchmark’s “correct” answers may become outdated.
Under concept drift, no reweighting of the source distribution can recover the target performance. The model must be re-evaluated on data drawn from the target distribution, or the concept drift must be modeled explicitly.
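A small simulation makes this point concrete. Under assumed Gaussian item-difficulty distributions and a hypothetical drift in the item characteristic curve (a difficulty shift caused by, say, a template change), importance weighting corrects the item mix but still misses the target accuracy:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50_000

def icc(theta, beta, shift=0.0):
    # `shift` models concept drift: the same item is harder under the new template
    return 1 / (1 + np.exp(-(theta - beta - shift)))

theta = 0.5
beta_src = rng.normal(-1.0, 1.0, n)     # benchmark items (easier)
beta_tgt = rng.normal(0.0, 1.0, n)      # deployment items

y_src = rng.random(n) < icc(theta, beta_src)
acc_tgt_true = icc(theta, beta_tgt, shift=0.8).mean()   # drifted conditional

# Importance weights fix P(X) (N(-1,1) -> N(0,1)) but not the changed P(Y|X)
w = np.exp(-0.5 * beta_src**2) / np.exp(-0.5 * (beta_src + 1.0) ** 2)
acc_iw = np.sum(w * y_src) / np.sum(w)
print(f"reweighted source {acc_iw:.3f} vs true target {acc_tgt_true:.3f}")
```

The reweighted estimate converges to target-distribution accuracy under the *old* conditional, which is not the quantity of interest once \(P(Y \mid X)\) has changed.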
7.3.4 Transportability
When can we transport causal conclusions from one setting to another? Bareinboim and Pearl (2016) formalize this through selection diagrams: DAGs augmented with special “selection” nodes \(S\) that indicate which mechanisms differ between source and target.
The key result (stated informally): a causal effect is transportable from source to target if the differences between environments, as encoded by \(S\), can be “adjusted away” using the causal structure. Specifically, the target-domain causal effect is identifiable from the source data plus knowledge of which mechanisms differ.
For AI evaluation, this means: a benchmark result is transportable to a new deployment context if we can identify what changed between the benchmark setting and deployment, and the evaluation DAG provides a path to adjust for those changes. If the only change is the item distribution (covariate shift), importance weighting suffices. If the model’s behavior changes (concept drift), we need additional assumptions or new data from the target domain.
The three panels illustrate the shift taxonomy. Covariate shift (left): the benchmark item pool is concentrated at easy difficulties (blue), while deployment items span a broader range (red), but the item characteristic curve is the same in both settings. Label shift (center): the base rate of correct responses differs between benchmark (50%) and deployment (10%). Concept drift (right): the relationship between ability and response probability itself changes—the target ICC (dashed red) has different discrimination and difficulty than the source (blue), so no reweighting of items can correct the estimate.
7.4 Off-Policy Evaluation and Doubly Robust Estimation
The distribution shift framework tells us when benchmark results are biased. This section develops the tools to correct that bias, drawing on the off-policy evaluation literature from contextual bandits.
7.4.1 The Evaluation-as-Bandit Framing
We reframe AI evaluation as an off-policy estimation problem. The key mapping is:
| Bandit concept | AI evaluation translation |
|---|---|
| Context \(x\) | Evaluation item (content, format, difficulty) |
| Action \(a\) | Model’s response |
| Reward \(r\) | Correctness score (\(0\) or \(1\), or a continuous quality rating) |
| Logging policy \(\pi_0(a \mid x)\) | The benchmark design process—how items were selected and administered |
| Target policy \(\pi(a \mid x)\) | The model’s behavior under deployment conditions |
Given logged data \(\{(x_t, a_t, r_t)\}_{t=1}^n\) collected under logging policy \(\pi_0\), the off-policy evaluation problem is to estimate the value of a target policy \(\pi\):
\[ V(\pi) = \mathbb{E}_{x \sim P, a \sim \pi(\cdot \mid x)}[r(x, a)] \]
without deploying \(\pi\) to collect new data.
Why does this framing matter? Whenever items are selected non-uniformly—by adaptive testing (Section 4.2.2), by convenience sampling, or by any design process that makes some items more likely to appear than others—we are observing rewards under \(\pi_0\) but want to estimate performance under \(\pi\). Naive accuracy (the simple average of rewards) is biased because \(\pi_0 \neq \pi\).
7.4.2 Three Estimators
Three classical estimators address this problem, each with different bias-variance tradeoffs.
The Direct Method (DM). Build a reward model \(\hat{r}(x, a)\) and use it to predict rewards under the target policy:
\[ \hat{V}_{\text{DM}} = \frac{1}{n} \sum_{t=1}^{n} \hat{r}(x_t, \pi) \tag{7.2}\]
where \(\hat{r}(x_t, \pi) = \sum_a \pi(a \mid x_t) \hat{r}(x_t, a)\) integrates the reward model over the target policy’s action distribution. In the AIMS context, \(\hat{r}\) is precisely the IRT model from Chapter 2: \(\hat{r}(x, a) = P(Y = 1 \mid \theta, \beta_x)\). The direct method is what we have been doing throughout Chapters 1–3: using the fitted IRT model to predict performance.
The direct method is biased when the reward model is misspecified. If the IRT model is wrong—perhaps the true data-generating process has interactions the model does not capture—the predictions will be systematically off.
Inverse Propensity Weighting (IPW). Instead of modeling the reward, model the selection process and reweight observations:
\[ \hat{V}_{\text{IPW}} = \frac{1}{n} \sum_{t=1}^{n} \frac{\pi(a_t \mid x_t)}{\pi_0(a_t \mid x_t)} r_t \tag{7.3}\]
Each observation is weighted by the ratio of how likely the target policy was to take that action versus how likely the logging policy was. IPW is unbiased when the propensities are known, but can have very high variance when \(\pi_0\) and \(\pi\) differ substantially (the weights become large).
In computerized adaptive testing (Section 4.2.2), items are selected to maximize Fisher information at the current ability estimate. This means the item selection probability \(\pi_0(x_t \mid \hat{\theta}_t)\) is non-uniform: items near the model’s estimated ability are heavily oversampled, while very easy and very hard items are undersampled.
Naive accuracy on CAT-selected items is biased because the item pool is deliberately non-representative. IPW corrects this by upweighting items that were unlikely to be selected by the adaptive algorithm. The propensities \(\pi_0\) are known from the CAT algorithm’s selection rule: a greedy CAT selects the maximum-information item with probability 1, while a stochastic variant selects items with probability proportional to information. Note that the greedy rule assigns zero propensity to unselected items, violating the overlap condition IPW requires; a stochastic selection rule keeps all propensities positive.
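The correction can be sketched for a stochastic CAT variant. The selection rule, ability values, and item pool below are illustrative assumptions, and a single fixed (noisy) ability estimate stands in for the adaptive update:

```python
import numpy as np

rng = np.random.default_rng(8)

def icc(theta, beta):
    return 1 / (1 + np.exp(-(theta - beta)))   # Rasch item characteristic curve

theta_true = 0.5
beta = np.linspace(-3, 3, 61)                  # hypothetical item pool

# Stochastic CAT stand-in: select items with probability proportional to
# Fisher information at a fixed noisy ability estimate (no adaptive update)
theta_hat = theta_true + 0.3
p_hat = icc(theta_hat, beta)
info = p_hat * (1 - p_hat)
pi0 = info / info.sum()                        # known selection propensities

n = 20_000
items = rng.choice(len(beta), size=n, p=pi0)
y = rng.random(n) < icc(theta_true, beta[items])

pi_tgt = np.full(len(beta), 1 / len(beta))     # target: uniform over the pool
w = pi_tgt[items] / pi0[items]

acc_naive = y.mean()
acc_ipw = np.sum(w * y) / np.sum(w)            # self-normalized IPW
acc_true = icc(theta_true, beta).mean()        # true uniform-pool accuracy
print(f"naive {acc_naive:.3f} | IPW {acc_ipw:.3f} | true {acc_true:.3f}")
```

Naive accuracy is pulled toward 50% because the selector oversamples items near the ability estimate; IPW upweights the rarely selected easy and hard items and recovers the uniform-pool accuracy.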
The Doubly Robust Estimator (DR). The doubly robust estimator (Dudík, Langford, and Li 2011; Robins, Rotnitzky, and Zhao 1994) combines both approaches:
\[ \hat{V}_{\text{DR}} = \frac{1}{n} \sum_{t=1}^{n} \left[ \hat{r}(x_t, \pi) + \frac{\pi(a_t \mid x_t)}{\pi_0(a_t \mid x_t)} \bigl(r_t - \hat{r}(x_t, a_t)\bigr) \right] \tag{7.4}\]
The DR estimator augments the direct method prediction \(\hat{r}(x_t, \pi)\) with a propensity-weighted correction term \(\frac{\pi}{\pi_0}(r_t - \hat{r})\). It is doubly robust: consistent if either the reward model \(\hat{r}\) or the propensity model \(\pi_0\) is correctly specified (but not necessarily both).
- When \(\hat{r}\) is correct: the residual \(r_t - \hat{r}(x_t, a_t)\) has zero expectation, so the correction vanishes and \(\hat{V}_{\text{DR}} \approx \hat{V}_{\text{DM}}\).
- When \(\hat{r}\) is wrong but \(\pi_0\) is correct: the IPW term corrects the model’s bias, yielding an unbiased estimate.
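The three estimators can be compared in a toy bandit with known policies and a deliberately misspecified reward model (it omits the \(x \cdot a\) interaction); all numbers are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.integers(0, 2, n)                    # context
pi0 = np.where(x == 1, 0.9, 0.3)             # logging policy: P(a=1 | x)
pi = np.full(n, 0.5)                         # target policy: uniform
a = (rng.random(n) < pi0).astype(int)

def r_true(x, a):                            # true expected reward
    return 0.2 + 0.5 * a + 0.2 * x * a

r = (rng.random(n) < r_true(x, a)).astype(int)

def r_hat(x, a):                             # misspecified model: misses x*a term
    return 0.2 + 0.5 * a

# DM: average model prediction under the target policy (Eq. 7.2)
V_dm = np.mean(0.5 * r_hat(x, 0) + 0.5 * r_hat(x, 1))
# IPW: reweight logged rewards by pi/pi0 for the observed action (Eq. 7.3)
prop = np.where(a == 1, pi0, 1 - pi0)
w = np.where(a == 1, pi, 1 - pi) / prop
V_ipw = np.mean(w * r)
# DR: DM prediction plus propensity-weighted residual (Eq. 7.4)
V_dr = V_dm + np.mean(w * (r - r_hat(x, a)))

V_true = np.mean(0.5 * r_true(x, 0) + 0.5 * r_true(x, 1))
print(f"truth {V_true:.3f} | DM {V_dm:.3f} | IPW {V_ipw:.3f} | DR {V_dr:.3f}")
```

DM is off by exactly the omitted interaction term, while IPW and DR both land near the truth; with a correct reward model the DR correction term would instead shrink toward zero.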
Proof of double robustness. We verify that \(\mathbb{E}[\hat{V}_{\text{DR}}] = V(\pi)\) when either model is correct. Taking expectations:
\[ \mathbb{E}[\hat{V}_{\text{DR}}] = \mathbb{E}\!\left[\hat{r}(x, \pi)\right] + \mathbb{E}\!\left[\frac{\pi(a \mid x)}{\pi_0(a \mid x)}(r - \hat{r}(x, a))\right] \]
For the second term, conditioning on \(x\):
\[ \mathbb{E}\!\left[\frac{\pi(a \mid x)}{\pi_0(a \mid x)}(r - \hat{r}(x, a)) \;\middle|\; x\right] = \sum_a \pi_0(a \mid x) \frac{\pi(a \mid x)}{\pi_0(a \mid x)} \bigl(\mathbb{E}[r \mid x, a] - \hat{r}(x, a)\bigr) \]
\[ = \sum_a \pi(a \mid x) \bigl(r^*(x, a) - \hat{r}(x, a)\bigr) \]
where \(r^*(x, a) = \mathbb{E}[r \mid x, a]\) is the true reward function. If \(\hat{r} = r^*\), this term vanishes and \(\mathbb{E}[\hat{V}_{\text{DR}}] = \mathbb{E}[\hat{r}(x, \pi)] = V(\pi)\). If instead the propensities \(\pi_0\) used in the weights match the true logging probabilities (so the conditioning step above is exact), then since \(\mathbb{E}[\hat{r}(x, \pi)] = \mathbb{E}_x[\sum_a \pi(a \mid x) \hat{r}(x, a)]\), the \(\hat{r}\) terms cancel:
\[ \mathbb{E}[\hat{V}_{\text{DR}}] = \mathbb{E}\!\left[\hat{r}(x, \pi)\right] + \mathbb{E}_x\!\left[\sum_a \pi(a \mid x)(r^*(x,a) - \hat{r}(x,a))\right] = \mathbb{E}_x\!\left[\sum_a \pi(a \mid x) r^*(x, a)\right] = V(\pi) \]
7.4.3 Connection to Prediction-Powered Inference
Prediction-Powered Inference (PPI) (Angelopoulos et al. 2023) is a recently developed framework that is structurally equivalent to doubly robust estimation. PPI uses a small labeled dataset \(\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n\) (expensive: actual model evaluations) and a large unlabeled dataset \(\mathcal{D}_N = \{(X_i, \hat{Y}_i)\}_{i=1}^N\) (cheap: model predictions or automated scores) to estimate a population quantity \(\mu = \mathbb{E}[Y]\):
\[ \hat{\mu}_{\text{PPI}} = \frac{1}{N} \sum_{i \in \mathcal{D}_N} \hat{Y}_i + \frac{1}{n} \sum_{i \in \mathcal{D}_n} (Y_i - \hat{Y}_i) \tag{7.5}\]
The first term is the direct method estimate using the cheap predictions. The second term is the bias correction: it estimates the systematic error in \(\hat{Y}\) using the labeled subsample. This is exactly the DR structure: imputation + correction.
The connection to the cold-start pipeline in Section 2.6 is direct. The PPE (Prediction-Powered Evaluation) approach from Chapter 2 learns a mapping from model embeddings and metadata to IRT parameters, producing predictions \(\hat{Y}_{ij}\) for unseen model-item pairs. PPI provides the debiasing step: use a small set of actual evaluations to correct the PPE predictions, yielding valid confidence intervals for model performance. The PPE predictions serve as the imputation model \(\hat{r}\); the actual evaluations provide the labeled correction \(Y - \hat{Y}\).
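A minimal sketch of Equation 7.5, with an assumed systematic bias of \(+0.05\) in the cheap predictions, shows the labeled rectifier removing the bias of naive imputation:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 50_000, 500                 # cheap predictions vs. expensive labels

# Hypothetical setup: true scores Y and a systematically biased predictor
Y_all = rng.normal(loc=0.70, scale=0.1, size=N)
Yhat_all = Y_all + 0.05 + rng.normal(scale=0.05, size=N)   # +0.05 bias

# Simplification: label a random subsample of the same pool
labeled = rng.choice(N, size=n, replace=False)

mu_naive = Yhat_all.mean()                                 # biased imputation
rect = (Y_all[labeled] - Yhat_all[labeled]).mean()         # rectifier
mu_ppi = Yhat_all.mean() + rect                            # Eq. 7.5
print(f"naive {mu_naive:.3f} | PPI {mu_ppi:.3f} | truth ~0.700")
```

Only 500 expensive labels are needed to debias 50,000 cheap predictions; the rectifier's standard error, not the predictor's bias, drives the width of the resulting confidence interval.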
7.4.4 Importance Weighting for Covariate Shift
When the shift between benchmark and deployment is purely covariate shift (\(P^{(s)}(Y \mid X) = P^{(t)}(Y \mid X)\), \(P^{(s)}(X) \neq P^{(t)}(X)\)), the importance-weighted estimator corrects the bias (Shimodaira 2000):
\[ \hat{\mu}_{\text{IW}} = \frac{\sum_{i=1}^n w(x_i) Y_i}{\sum_{i=1}^n w(x_i)}, \quad w(x_i) = \frac{P^{(t)}(x_i)}{P^{(s)}(x_i)} \tag{7.6}\]
This is the self-normalized variant, which is more stable than the unnormalized version. The weights \(w(x_i)\) upweight items that are underrepresented in the benchmark relative to deployment, and downweight overrepresented items.
Practical challenges:
- Density ratio estimation: The weights require knowing or estimating the density ratio \(P^{(t)}/P^{(s)}\). Methods include logistic regression on a domain classifier, kernel mean matching, and moment matching (Sugiyama, Suzuki, and Kanamori 2012).
- Weight clipping: Extreme weights inflate variance. Clipping weights at a maximum value \(M\) (e.g., \(w_i \leftarrow \min(w_i, M)\)) trades bias for variance reduction.
- Effective sample size: The effective sample size \(n_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2\) measures how much information the reweighted sample contains. When \(n_{\text{eff}} \ll n\), the importance-weighted estimate is unreliable.
AI example. A reasoning benchmark oversamples algebra items (60% algebra, 20% geometry, 20% combinatorics). The deployment distribution is uniform across topics. Importance weighting reweights: algebra items get weight \(w = 1/3 \div 3/5 = 5/9\), while geometry and combinatorics items get weight \(w = 1/3 \div 1/5 = 5/3\). The corrected accuracy better reflects deployment performance.
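This worked example can be checked numerically. The per-topic accuracies below are assumptions for illustration; the weights are exactly the \(5/9\) and \(5/3\) computed above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30_000

p_src = np.array([0.6, 0.2, 0.2])          # benchmark mix: algebra, geom, comb
p_tgt = np.array([1/3, 1/3, 1/3])          # deployment mix: uniform
acc_by_topic = np.array([0.9, 0.6, 0.5])   # assumed per-topic accuracies

t = rng.choice(3, size=n, p=p_src)          # topic index of each benchmark item
y = rng.random(n) < acc_by_topic[t]

w = p_tgt[t] / p_src[t]                     # 5/9 for algebra, 5/3 for the others
acc_naive = y.mean()
acc_iw = np.sum(w * y) / np.sum(w)          # Eq. 7.6, self-normalized
n_eff = w.sum() ** 2 / np.sum(w ** 2)       # effective sample size
print(f"naive {acc_naive:.3f} | reweighted {acc_iw:.3f} | n_eff {n_eff:.0f}/{n}")
```

Naive accuracy is inflated by the oversampled (easy) algebra items; the reweighted estimate recovers the uniform-topic average, at the cost of a modest reduction in effective sample size.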
The simulation demonstrates the off-policy estimation problem under CAT-like adaptive item selection. The naive estimator (simple average of responses on adaptively-selected items) is biased because the CAT algorithm oversamples items near the model’s ability level. The Direct Method (IRT prediction with noisy ability estimate) has low bias when the model is approximately correct but is sensitive to misspecification. IPW corrects the selection bias but has higher variance due to extreme weights. The Doubly Robust estimator combines the strengths of both, achieving the lowest RMSE by using the IRT model for imputation and IPW for bias correction.
The left panel shows the source (benchmark) and target (deployment) item difficulty distributions. The benchmark oversamples easy items (blue, centered at \(\beta = -1\)), while deployment items span a broader range (red, centered at \(\beta = 0\)). The right panel shows the consequence: naive accuracy computed on benchmark items (blue) systematically overestimates deployment accuracy (red). Importance weighting (green) corrects the bias by upweighting hard items that are underrepresented in the benchmark, bringing the estimates close to the true deployment accuracy.
7.5 Conformal Prediction Under Distribution Shift
The off-policy estimators in Section 7.4 correct point estimates of model performance. But practitioners also need uncertainty quantification: not just “the model’s deployment accuracy is 0.78” but “the model’s deployment accuracy is in \([0.72, 0.84]\) with 90% confidence.” Conformal prediction provides this guarantee with minimal distributional assumptions.
Given a calibration set \(\{(X_i, Y_i)\}_{i=1}^n\) and a new input \(X_{n+1}\), a conformal prediction set \(C_\alpha(X_{n+1})\) satisfies:
\[ P(Y_{n+1} \in C_\alpha(X_{n+1})) \geq 1 - \alpha \]
under the assumption that the calibration and test data are exchangeable (informally: drawn from the same distribution in no particular order). No parametric assumptions on the data distribution are required.
7.5.1 Split Conformal Prediction
The split conformal algorithm (Vovk, Gammerman, and Shafer 2005) is the simplest conformal method:
1. Fit a model \(\hat{f}\) on a training set.
2. On a held-out calibration set \(\{(X_i, Y_i)\}_{i=1}^n\), compute nonconformity scores \(s_i = |Y_i - \hat{f}(X_i)|\).
3. Let \(\hat{q}\) be the \(\lceil (1-\alpha)(n+1) \rceil / n\) empirical quantile of \(\{s_1, \ldots, s_n\}\).
4. For a new input \(X_{n+1}\), predict \(C_\alpha(X_{n+1}) = [\hat{f}(X_{n+1}) - \hat{q}, \hat{f}(X_{n+1}) + \hat{q}]\).
The coverage guarantee \(P(Y_{n+1} \in C_\alpha) \geq 1 - \alpha\) holds for any model \(\hat{f}\), any data distribution, and any sample size—as long as the calibration and test data are exchangeable.
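A minimal implementation on simulated data, using a cubic polynomial as a stand-in for \(\hat{f}\):

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate(n):
    x = rng.uniform(-2, 2, n)
    y = np.sin(x) + rng.normal(scale=0.2, size=n)
    return x, y

# Step 1: fit a model on a training split
x_tr, y_tr = simulate(2000)
coef = np.polyfit(x_tr, y_tr, 3)

def f(x):
    return np.polyval(coef, x)

# Steps 2-3: calibration scores and the (1-alpha)(n+1)/n quantile
alpha = 0.1
x_cal, y_cal = simulate(1000)
s = np.abs(y_cal - f(x_cal))
n = len(s)
q = np.quantile(s, min(1.0, np.ceil((1 - alpha) * (n + 1)) / n))

# Step 4: check marginal coverage on exchangeable test data
x_te, y_te = simulate(10_000)
covered = np.abs(y_te - f(x_te)) <= q
print(f"empirical coverage: {covered.mean():.3f} (nominal {1 - alpha})")
```

Empirical coverage lands near the nominal 90% even though the cubic model is only an approximation of the true regression function; that is the distribution-free guarantee at work.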
7.5.2 When Exchangeability Fails
Under distribution shift, exchangeability fails: the calibration data is drawn from \(P^{(s)}\) while the test data comes from \(P^{(t)}\). This causes the conformal prediction set to lose its coverage guarantee. Specifically, if the target distribution concentrates on regions where the model is less accurate (harder items), the calibration-based quantile \(\hat{q}\) underestimates the nonconformity scores at test time, leading to undercoverage.
7.5.3 Weighted Conformal Prediction
Tibshirani et al. (2019) show that coverage can be restored under covariate shift by replacing the uniform quantile with an importance-weighted quantile. The key insight is the same as for importance weighting in Section 7.4.4: reweight the calibration scores to match the target distribution.
The algorithm modifies step 3: instead of the uniform quantile, compute the weighted quantile of \(\{s_1, \ldots, s_n, +\infty\}\) with weights \(\{w_1, \ldots, w_n, w_{n+1}\}\) where \(w_i = P^{(t)}(X_i) / P^{(s)}(X_i)\) and \(w_{n+1} = P^{(t)}(X_{n+1}) / P^{(s)}(X_{n+1})\). This yields:
\[ P^{(t)}(Y_{n+1} \in C_\alpha(X_{n+1})) \geq 1 - \alpha \]
The coverage guarantee now holds under the target distribution, provided the importance weights are correct. Barber et al. (2023) extend this beyond covariate shift to more general forms of distribution shift.
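A sketch of the weighted procedure under covariate shift, with assumed Gaussian source/target covariates and heteroscedastic scores. For simplicity the test-point weight \(w_{n+1}\) is dropped, which is a large-\(n\) approximation rather than the exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(7)

def scores(x):
    # Nonconformity grows with item difficulty x (assumed heteroscedastic model)
    scale = 0.2 + 0.3 * np.maximum(x + 2.0, 0.0)
    return np.abs(rng.normal(size=len(x))) * scale

mu_s, mu_t = -1.0, 0.5                      # source vs. target covariate means
x_cal = rng.normal(mu_s, 1.0, 4000)
x_te = rng.normal(mu_t, 1.0, 10_000)
s_cal, s_te = scores(x_cal), scores(x_te)

alpha = 0.1
q_naive = np.quantile(s_cal, 1 - alpha)     # ignores the shift

def weighted_quantile(s, w, level):
    order = np.argsort(s)
    cum = np.cumsum(w[order]) / np.sum(w)
    return s[order][np.searchsorted(cum, level)]

# Covariate-shift weights: ratio of Gaussian densities (equal variances)
w_cal = np.exp(-0.5 * (x_cal - mu_t) ** 2 + 0.5 * (x_cal - mu_s) ** 2)
q_w = weighted_quantile(s_cal, w_cal, 1 - alpha)

cov_naive = (s_te <= q_naive).mean()
cov_w = (s_te <= q_w).mean()
print(f"naive coverage {cov_naive:.3f} | weighted coverage {cov_w:.3f}")
```

The naive quantile, calibrated on easier items, undercovers on the harder target distribution; reweighting the calibration scores restores coverage near the nominal level, at the price of a wider interval and a smaller effective calibration sample.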
The left panel shows coverage calibration: the diagonal represents perfect calibration (empirical coverage = nominal coverage). Naive conformal prediction (red) falls below the diagonal—it promises 90% coverage but delivers less, because the calibration items (easy) are not representative of the test items (hard). Weighted conformal prediction (green) restores coverage by reweighting the calibration scores to match the test distribution. The right panel shows the source of the problem: the calibration distribution (blue) is shifted left (easier items) relative to the test distribution (red).
7.6 Putting It All Together: A Causal Audit
We close with a worked example that composes all tools from this chapter. Consider “CodeReason,” a hypothetical benchmark claiming to measure coding reasoning ability in language models.
Step 1: Draw the evaluation DAG (Section 7.1.3). We identify: Training Data \(D\) (code repositories, documentation), Architecture \(A\) (transformer variant), Latent Coding Ability \(\theta\), Benchmark Items \(\beta\) (coding problems), Prompt Template \(F\) (instruction format), and Observed Score \(Y\). We suspect two validity-threatening paths: (i) \(D \to Y\) bypassing \(\theta\) (some models trained on CodeReason’s source repository), and (ii) \(F \to Y\) (the benchmark uses a specific instruction format that favors certain models).
Step 2: Check for distribution shift (Section 7.3). The benchmark items are predominantly Python function-completion tasks at medium difficulty. The deployment context requires debugging, multi-file reasoning, and code review across Python, JavaScript, and Rust. This is covariate shift: the item distribution \(P^{(s)}(\beta)\) is narrower than \(P^{(t)}(\beta)\). We estimate density ratios using a domain classifier trained to distinguish benchmark items from deployment items.
Step 3: Apply doubly robust estimation (Section 7.4.2). We combine two components: (a) An IRT model fitted on the benchmark data provides \(\hat{r}(x, a)\)—the direct method estimate of each model’s accuracy on any item. (b) The density ratio weights from Step 2 provide the importance weights. The DR estimator corrects the IRT predictions using the importance-weighted residuals. The result: an accuracy estimate that accounts for both the item selection bias and any IRT misspecification.
Step 4: Construct prediction intervals (Section 7.5.3). Using the importance weights from Step 2, we apply weighted conformal prediction to the DR-corrected estimates. The result: “Model X’s deployment accuracy is in [0.62, 0.71] with 90% coverage.” The interval is wider than it would be under no shift, reflecting the additional uncertainty from extrapolating to a different item distribution.
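The reweighting at the heart of weighted conformal prediction is a weighted quantile of the calibration nonconformity scores. The sketch below isolates that mechanism on a toy problem (a milder shift than the figure's, with known closed-form weights; the score function and all constants are illustrative). It omits the extra test-point weight term of the full Tibshirani et al. procedure, so it demonstrates the idea rather than the exact guarantee:

```python
import numpy as np

rng = np.random.default_rng(2)

def weighted_quantile(scores, weights, q):
    """Smallest score whose normalized cumulative weight reaches q."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cum = np.cumsum(w) / np.sum(w)
    return s[np.searchsorted(cum, q)]

# Toy nonconformity scores that grow with item difficulty x.
def scores(x, rng):
    return np.abs(x) + rng.normal(0, 0.3, len(x))

x_cal = rng.normal(-0.5, 0.8, 2000)   # easy calibration items
x_tst = rng.normal(0.5, 1.0, 2000)    # harder test items
s_cal, s_tst = scores(x_cal, rng), scores(x_tst, rng)

# Density-ratio weights P_t(x)/P_s(x), known in closed form here.
def log_norm(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)
w_cal = np.exp(log_norm(x_cal, 0.5, 1.0) - log_norm(x_cal, -0.5, 0.8))

alpha = 0.10
q_naive = weighted_quantile(s_cal, np.ones_like(s_cal), 1 - alpha)
q_weighted = weighted_quantile(s_cal, w_cal, 1 - alpha)

print(f"naive coverage:    {np.mean(s_tst <= q_naive):.3f}")
print(f"weighted coverage: {np.mean(s_tst <= q_weighted):.3f}")
```

The unweighted quantile under-covers because easy calibration items produce small scores; reweighting toward the test distribution raises the threshold and restores coverage, at the cost of a smaller effective calibration sample.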
Step 5: Diagnose contamination (Section 7.2.2). We compare model performance on items created before vs. after the training data cutoff (chronological split). Models with suspected contamination show a 15-point accuracy gap between pre- and post-cutoff items. We run item-fit analysis (Section 6.5.3) and find that the pre-cutoff items have inflated outfit statistics for these models. Conclusion: the \(D \to Y\) path is active for these models. We flag them and report both contamination-adjusted and unadjusted rankings.
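The chronological-split diagnostic reduces to one statistic: the accuracy gap between pre- and post-cutoff items. A minimal simulation sketch (Rasch responses with a hypothetical memorization boost on pre-cutoff items; item counts and effect sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy item pool: the first 1200 of 2000 items predate the training
# cutoff. A contaminated model gets a memorization boost on them.
n_items = 2000
pre_cutoff = np.arange(n_items) < 1200
beta = rng.normal(0.0, 1.0, n_items)

def simulate(theta, boost, rng):
    """Rasch responses, with an additive success-probability boost
    on pre-cutoff items (clipped to [0, 1])."""
    p = 1.0 / (1.0 + np.exp(beta - theta))
    p = np.clip(p + boost * pre_cutoff, 0, 1)
    return rng.binomial(1, p)

def cutoff_gap(responses):
    """Accuracy gap: pre-cutoff minus post-cutoff items."""
    return responses[pre_cutoff].mean() - responses[~pre_cutoff].mean()

clean = simulate(0.5, 0.0, rng)
contaminated = simulate(0.5, 0.15, rng)

print(f"clean gap:        {cutoff_gap(clean):+.3f}")
print(f"contaminated gap: {cutoff_gap(contaminated):+.3f}")
```

In practice the gap should be reported with a standard error (difference of two binomial proportions) and cross-checked against the item-fit statistics, since difficulty drift over time can also produce a nonzero gap.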
This five-step causal audit integrates the tools from this chapter with the diagnostic methods from Section 6.5, providing a systematic approach to evaluating whether benchmark results are trustworthy enough to inform deployment decisions.
7.7 Discussion Questions
A benchmark uses Computerized Adaptive Testing (Section 4.2.2) to select items. A critic argues that the resulting accuracy is “biased because models only answered items matched to their ability level.” Using the off-policy evaluation framework, explain whether this criticism is valid and how you would correct the estimate if needed.
Two benchmarks both claim to measure “reasoning.” On Benchmark A, Model X outperforms Model Y. On Benchmark B, the ranking reverses. Draw a DAG that explains this reversal (hint: consider Simpson’s paradox). What causal assumptions would make the Benchmark A ranking the correct one?
Prediction-powered inference assumes that the model predictions \(\hat{Y}\) are informative about the true labels \(Y\). When might this assumption fail in AI evaluation? What are the consequences for the DR estimator’s consistency, and how would you detect the failure?
A company evaluates its model on an internal benchmark (source) and wants to predict performance on a customer-facing deployment (target). What information is needed to determine whether the benchmark results are transportable? Which assumptions are untestable from the source data alone?
Conformal prediction guarantees marginal coverage: \(P(Y \in C_\alpha) \geq 1 - \alpha\) on average over the test distribution. Why is this weaker than conditional coverage (\(P(Y \in C_\alpha \mid X = x) \geq 1 - \alpha\) for each \(x\)), and when does the distinction matter for AI evaluation?
7.8 Bibliographic Notes
The mathematical framework for causal reasoning used in this chapter is due to Pearl (2009), who developed structural causal models, the do-calculus, and the graphical criteria for identifiability. An accessible introduction is Peters, Janzing, and Schölkopf (2017). The transportability theory—when causal conclusions can be transferred across settings—is developed by Bareinboim and Pearl (2016).
The taxonomy of distribution shift draws on a large literature. Covariate shift and importance weighting were formalized by Shimodaira (2000). Sugiyama, Suzuki, and Kanamori (2012) provide a comprehensive treatment of density ratio estimation methods. Quiñonero-Candela et al. (2009) survey the broader landscape of dataset shift in machine learning.
Doubly robust estimation originated in the biostatistics literature with Robins, Rotnitzky, and Zhao (1994), who introduced the augmented inverse probability weighted estimator. Dudík, Langford, and Li (2011) adapted the DR framework to contextual bandits and off-policy evaluation, which is the formulation we use here. The connection between DR estimation and semiparametric efficiency theory is developed in the biostatistics literature; for machine learning applications, see the survey by Dudík, Erhan, Langford, and Li (2014).
Prediction-Powered Inference was introduced by Angelopoulos et al. (2023), who showed how to construct valid confidence intervals by combining a small labeled dataset with a large set of model predictions. The structural connection between PPI and doubly robust estimation is noted in their paper and further developed in follow-up work on PPI++.
Conformal prediction was introduced by Vovk, Gammerman, and Shafer (2005). Tibshirani et al. (2019) extended conformal methods to handle covariate shift via importance weighting, and Barber et al. (2023) further generalized the framework beyond exchangeability.
The connection between Borsboom’s causal validity and structural causal models was developed in Chapter 2. The evaluation DAG in this chapter makes that connection explicit, showing how each validity threat from Section 6.4 corresponds to a specific causal pathway. Data attribution methods—influence functions, TracIn, and TRAK—provide complementary tools for tracing the causal effect of training data on model predictions, but are beyond the scope of this chapter; see Koh and Liang (2017) for the foundational work.
7.9 Exercises
Theoretical
Show that under covariate shift (\(P^{(s)}(Y \mid X) = P^{(t)}(Y \mid X)\), \(P^{(s)}(X) \neq P^{(t)}(X)\)), the importance-weighted estimator \(\hat{\mu}_{\text{IW}} = \frac{1}{n}\sum_{i=1}^n w(x_i) Y_i\) with \(w(x_i) = P^{(t)}(x_i)/P^{(s)}(x_i)\) is unbiased for \(\mathbb{E}_{P^{(t)}}[Y]\). Under what conditions on the weight distribution does the variance of \(\hat{\mu}_{\text{IW}}\) diverge?
Prove the double robustness property: show that \(\mathbb{E}[\hat{V}_{\text{DR}}] = V(\pi)\) (Equation 7.4) if either (a) the reward model satisfies \(\hat{r}(x, a) = \mathbb{E}[r \mid x, a]\) for all \((x, a)\), or (b) the propensity \(\pi_0(a \mid x)\) is correctly specified. Where does the proof rely on the assumption that the logged data is collected under \(\pi_0\)?
In a CAT procedure that selects the item maximizing Fisher information at the current ability estimate \(\hat{\theta}\), derive the selection propensity \(\pi_0(x_t \mid \hat{\theta}_t)\) for a Rasch model. Show that this propensity is maximized for items with \(\beta_i \approx \hat{\theta}_t\) and decays for items far from the current estimate. What is the effective sample size \(n_{\text{eff}}\) of the resulting IPW estimator as a function of the item pool’s difficulty distribution?
Using the evaluation DAG from Section 7.1.3, show that benchmark contamination (a direct path \(D \to Y\) bypassing \(\theta\)) is not identifiable from observational data alone without additional assumptions. What minimal intervention (e.g., a chronological split or a canary-based test) would identify the contamination effect? State the identifying assumptions precisely.
Computational
Implement importance weighting for a benchmark with covariate shift. Simulate a Rasch model with \(P^{(s)}(\beta) = \mathcal{N}(-1, 1)\) (easy-biased source) and \(P^{(t)}(\beta) = \mathcal{N}(0, 1.5)\) (broader target). For 200 models, compare naive and IW-corrected accuracy estimates. Vary the degree of shift (by changing the source mean from \(-2\) to \(0\)) and plot the naive bias as a function of shift magnitude.
Simulate a full CAT procedure for 200 models on a 500-item Rasch pool. For each model, run a 30-item adaptive test using the maximum-information selection rule from Section 4.2.2. Compute naive accuracy, IPW-corrected accuracy, and DR-corrected accuracy. Introduce IRT misspecification (e.g., the true model is 2PL but the reward model assumes Rasch) and show that DR remains robust while DM degrades.
Implement weighted conformal prediction for AI evaluation under covariate shift. Calibrate on items drawn from \(\mathcal{N}(-0.5, 0.8)\) and test on items from \(\mathcal{N}(1.0, 1.0)\). Plot empirical coverage vs. nominal level for both standard and weighted conformal across the range \(\alpha \in [0.05, 0.50]\). How does the effective sample size of the weighted calibration set affect the coverage guarantee?
Discussion
Angelopoulos et al. (2023)’s PPI uses cheap model predictions to augment expensive human labels. In AI evaluation, the “expensive” labels are actual model runs and the “cheap” predictions are the PPE cold-start predictions from Section 2.6. Design a practical PPI pipeline for evaluating a new model on 10,000 items when you can only afford to run 500 actual evaluations. What are the key design choices (which 500 items to evaluate? how to estimate the bias correction? how to construct confidence intervals)? How does the accuracy of the PPE predictions affect the width of the resulting intervals?