---
format:
  html:
    include-after-body:
      text: |
        <script>
        // Auto-execute all pyodide cells after initialization
        document.addEventListener('DOMContentLoaded', function() {
          function waitForPyodide() {
            if (typeof globalThis.mainPyodide !== 'undefined' && globalThis.mainPyodide) {
              if (typeof globalThis.qpyodideCellDetails !== 'undefined') {
                globalThis.qpyodideCellDetails.forEach((cell, index) => {
                  if (cell.options && cell.options.autorun === 'true') {
                    setTimeout(() => {
                      const runButton = document.querySelector(`#qpyodide-button-run-${cell.id}`);
                      if (runButton && !runButton.disabled) {
                        runButton.click();
                      }
                    }, index * 1000);
                  }
                });
              }
            } else {
              setTimeout(waitForPyodide, 500);
            }
          }
          setTimeout(waitForPyodide, 2000);
        });
        </script>
filters:
  - pyodide
pyodide:
  packages:
    - numpy
    - matplotlib
    - scipy
---
# Efficient Measurement {#sec-efficient}

::: {.callout-note title="Intended Learning Outcomes"}
By the end of this chapter, you will be able to:

1. **Formulate** the benchmark design problem as an optimization over item selection and scoring rules.
2. **Apply** Fisher information to select maximally informative items, and implement a Computerized Adaptive Testing procedure.
3. **Construct** D-optimal item pools that maximize the precision of ability estimates.
4. **Design** efficient paired-comparison tournaments using information-theoretic principles.
5. **Explain** stopping rules and practical constraints (cost, time, contamination) for adaptive AI evaluation.
:::

::: {.callout-tip title="Suggested Lecture Plan" collapse="true"}
This chapter can be covered in **2 lectures** (75-90 minutes each):

**Lecture 1: Fisher Information and Adaptive Testing**

- The benchmark design problem (15 min)
- Fisher information for item selection (20 min)
- Computerized Adaptive Testing (30 min)
- Hands-on: CAT simulation (10 min)

**Lecture 2: Optimal Design and Paired Comparisons**

- D-optimal design and item pool construction (30 min)
- Design for paired comparisons (20 min)
- Stopping rules and practical considerations (15 min)
- Hands-on: D-optimal design simulation (10 min)
:::

::: {.callout-note title="Notation"}
This chapter introduces $I_j(\theta)$ (Fisher information for item $j$) and $\mathcal{I}(\theta)$ (Fisher information matrix). See the [notation appendix](notation.qmd) for the complete notation reference.
:::
## The Design Problem in AI Evaluation {#sec-design-problem}

Chapter 1 introduced the measurement models---Rasch, 2PL, factor models, Bradley-Terry---that formalize how latent abilities generate observed responses. Chapter 2 showed how to estimate the parameters of these models from data. But both chapters took the response matrix $Y$ as given. In practice, someone must *design* the evaluation: choosing which items to include and how to score responses. These design decisions profoundly affect the quality and efficiency of the resulting measurements.

The benchmark designer faces two fundamental efficiency questions:

1. **Item selection:** Which questions or tasks should the benchmark include? Given a pool of candidate items, how do we select a subset that maximizes the precision of our measurements?
2. **Scoring rules:** How do we aggregate responses into scores? Sum scores, weighted scores, latent factor scores? The choice interacts with the measurement model (recall from Chapter 1 that sum scores are sufficient for the Rasch model but not for 2PL).

This chapter develops the statistical foundations for efficient evaluation design, drawing on classical experimental design and information theory. We apply these frameworks to the practical problem of designing AI evaluation benchmarks that achieve high measurement precision with minimal cost.
## Optimal Experimental Design {#sec-optimal-design}

We begin with the simplest setting: the evaluator wants to measure model abilities as precisely as possible, and the models respond honestly. This is the classical problem of *optimal experimental design* applied to measurement.

### Fisher Information for Item Selection {#sec-fisher-design}

Recall from Chapter 2 that the Fisher information for item $j$ at ability $\theta$ in the Rasch model is:

$$
I_j(\theta) = P_j(\theta) \cdot (1 - P_j(\theta))
$$ {#eq-fisher-info}

where $P_j(\theta) = \sigma(\theta - \beta_j)$ is the probability of a correct response. This quantity measures how much observing a response to item $j$ tells us about $\theta$.

Fisher information is maximized when $P_j(\theta) = 0.5$, which occurs when $\theta = \beta_j$---the item difficulty matches the model's ability. Intuitively, a question that a model gets right 99% of the time or wrong 99% of the time reveals almost nothing about the model's ability. The most informative questions are those where the outcome is uncertain.

For a test consisting of items $\{j_1, \ldots, j_K\}$, the total information is additive under local independence:

$$
I_{\text{total}}(\theta) = \sum_{k=1}^K I_{j_k}(\theta)
$$

This additivity is the foundation of optimal item selection: we want to choose items that collectively maximize the total information across the range of abilities we care about.

{{< include _plt_setup.qmd >}}
```{pyodide-python}
#| label: fisher-information-design
#| autorun: true
#| fig-cap: "Fisher information curves illustrate the key design principle: items are most informative when their difficulty matches the test-taker's ability."
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """Numerically stable sigmoid function."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

theta_range = np.linspace(-4, 4, 200)
fig, axes = plt.subplots(1, 2, figsize=(6, 2))

# Information curves for different item difficulties
difficulties = [-2, -1, 0, 1, 2]
colors = plt.cm.viridis(np.linspace(0, 1, len(difficulties)))
for beta_j, color in zip(difficulties, colors):
    P = sigmoid(theta_range - beta_j)
    info = P * (1 - P)
    axes[0].plot(theta_range, info, color=color, linewidth=2,
                 label=f'Item difficulty = {beta_j}')
axes[0].set_xlabel('Ability (θ)')
axes[0].set_ylabel('Fisher Information')
axes[0].set_title('Item Information Curves')
axes[0].legend(fontsize=7)
axes[0].grid(True, alpha=0.3)

# Cumulative information: adaptive vs random
np.random.seed(42)
N, M = 100, 50
theta_true = np.random.normal(0, 1, N)
beta_true = np.random.normal(0, 1.5, M)
theta_test = 0.5
n_items = 20

# Adaptive: choose items closest to the current ability estimate
# (idealized: the estimate jumps to the true ability after the first item)
beta_available = beta_true.copy()
adaptive_info = [0]
theta_estimate = 0
for t in range(n_items):
    distances = np.abs(beta_available - theta_estimate)
    best_idx = np.argmin(distances)
    beta_selected = beta_available[best_idx]
    P = sigmoid(theta_test - beta_selected)
    info = P * (1 - P)
    adaptive_info.append(adaptive_info[-1] + info)
    beta_available = np.delete(beta_available, best_idx)
    theta_estimate = theta_test

# Random selection
random_order = np.random.permutation(len(beta_true))[:n_items]
random_info = [0]
for j in random_order:
    P = sigmoid(theta_test - beta_true[j])
    info = P * (1 - P)
    random_info.append(random_info[-1] + info)

axes[1].plot(range(n_items + 1), adaptive_info, 'g-', linewidth=2, label='Adaptive')
axes[1].plot(range(n_items + 1), random_info, 'b-', linewidth=2, label='Random')
axes[1].set_xlabel('Number of Items')
axes[1].set_ylabel('Cumulative Fisher Information')
axes[1].set_title(f'Information Accumulation (θ = {theta_test})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
### Computerized Adaptive Testing {#sec-cat}

Computerized Adaptive Testing (CAT) is the sequential application of optimal item selection. Rather than administering a fixed test to all models, CAT adapts the test in real time: after each response, it selects the next item that would be most informative given what we have learned so far.

The CAT procedure iterates four steps:

1. **Select** the most informative item given the current ability estimate
2. **Administer** the item and observe the response
3. **Update** the ability estimate using the new data
4. **Check** a stopping criterion; if not met, return to step 1

::: {.callout-important title="Why Fisher Information for Item Selection?"}
Fisher information measures how much a response to item $j$ tells us about $\theta$:

- **High information**: The item difficulty is well-matched to the ability level
- **Low information**: The item is too easy or too hard

Asking a frontier model to answer $1 + 1$ or a small model to prove the Riemann hypothesis provides almost no information. The most informative items are those where the model has roughly a 50% chance of success.
:::

### CAT Implementation {#sec-cat-implementation}
```{pyodide-python}
#| label: cat-simulation
#| autorun: true
#| fig-cap: "CAT achieves the same measurement precision as random testing with substantially fewer items."
# Generate synthetic data for CAT simulation
np.random.seed(42)
N, M = 100, 50
theta_true = np.random.normal(0, 1, N)
beta_true = np.random.normal(0, 1.5, M)

def cat_simulation(theta_true_i, beta, n_items_max=30, reliability_threshold=0.95):
    """
    Simulate CAT for a single test-taker.

    Parameters
    ----------
    theta_true_i : float
        True ability of the test-taker
    beta : ndarray
        Item difficulties (pre-calibrated)
    n_items_max : int
        Maximum number of items to administer
    reliability_threshold : float
        Stop when reliability exceeds this threshold

    Returns
    -------
    dict with theta_hat, n_items, reliability_history, etc.
    """
    M = len(beta)
    administered = []
    responses = []
    # Prior: theta ~ N(0, 1)
    theta_hat = 0.0
    prior_var = 1.0
    theta_history = [theta_hat]
    reliability_history = [0.0]
    se_history = [1.0]
    available_items = list(range(M))
    for t in range(min(n_items_max, M)):
        # Select item with maximum Fisher information at current estimate
        best_item = None
        best_info = -np.inf
        for j in available_items:
            P_j = sigmoid(theta_hat - beta[j])
            info_j = P_j * (1 - P_j)
            if info_j > best_info:
                best_info = info_j
                best_item = j
        # Administer item (simulate response from true ability)
        P_true = sigmoid(theta_true_i - beta[best_item])
        response = int(np.random.random() < P_true)
        administered.append(best_item)
        responses.append(response)
        available_items.remove(best_item)
        # Update ability estimate via MAP (Newton-Raphson)
        for _ in range(10):
            P_vec = sigmoid(theta_hat - np.array([beta[j] for j in administered]))
            grad = np.sum(np.array(responses) - P_vec) - theta_hat / prior_var
            hess = -np.sum(P_vec * (1 - P_vec)) - 1 / prior_var
            if abs(hess) > 1e-10:
                theta_hat = theta_hat - grad / hess
        # Posterior variance and reliability
        total_info = np.sum([sigmoid(theta_hat - beta[j]) * (1 - sigmoid(theta_hat - beta[j]))
                             for j in administered])
        posterior_var = 1 / (1 / prior_var + total_info)
        reliability = 1 - posterior_var / prior_var
        theta_history.append(theta_hat)
        reliability_history.append(reliability)
        se_history.append(np.sqrt(posterior_var))
        if reliability >= reliability_threshold:
            break
    return {
        'theta_hat': theta_hat, 'n_items': len(administered),
        'reliability_history': reliability_history,
        'theta_history': theta_history, 'se_history': se_history,
        'final_reliability': reliability_history[-1]
    }

def random_selection_simulation(theta_true_i, beta, n_items_max=30, reliability_threshold=0.95):
    """Simulate random item selection for comparison."""
    M = len(beta)
    item_order = list(np.random.permutation(M)[:n_items_max])
    theta_hat = 0.0
    prior_var = 1.0
    administered = []
    responses = []
    reliability_history = [0.0]
    theta_history = [theta_hat]
    for j in item_order:
        P_true = sigmoid(theta_true_i - beta[j])
        response = int(np.random.random() < P_true)
        administered.append(j)
        responses.append(response)
        for _ in range(10):
            P_vec = sigmoid(theta_hat - np.array([beta[k] for k in administered]))
            grad = np.sum(np.array(responses) - P_vec) - theta_hat / prior_var
            hess = -np.sum(P_vec * (1 - P_vec)) - 1 / prior_var
            if abs(hess) > 1e-10:
                theta_hat = theta_hat - grad / hess
        total_info = np.sum([sigmoid(theta_hat - beta[k]) * (1 - sigmoid(theta_hat - beta[k]))
                             for k in administered])
        posterior_var = 1 / (1 / prior_var + total_info)
        reliability = 1 - posterior_var / prior_var
        reliability_history.append(reliability)
        theta_history.append(theta_hat)
        if reliability >= reliability_threshold:
            break
    return {'n_items': len(administered), 'reliability_history': reliability_history,
            'theta_history': theta_history, 'theta_hat': theta_hat,
            'final_reliability': reliability_history[-1]}

# Run simulations
np.random.seed(42)
n_test_takers = 100
theta_test_sample = np.random.normal(0, 1, n_test_takers)
cat_results = []
random_results = []
for theta_i in theta_test_sample:
    cat_results.append(cat_simulation(theta_i, beta_true))
    random_results.append(random_selection_simulation(theta_i, beta_true))
cat_items = [r['n_items'] for r in cat_results]
random_items = [r['n_items'] for r in random_results]

# Plot comparison
fig, axes = plt.subplots(1, 3, figsize=(6, 2))

# Bar chart
methods = ['Random', 'CAT']
means = [np.mean(random_items), np.mean(cat_items)]
stds = [np.std(random_items), np.std(cat_items)]
bars = axes[0].bar(methods, means, yerr=stds, capsize=5, alpha=0.7,
                   color=['#1f77b4', '#2ca02c'])
axes[0].set_ylabel('Items to reach 95% reliability')
axes[0].set_title('Efficiency: CAT vs Random')
axes[0].grid(True, alpha=0.3, axis='y')
for bar, mean, std in zip(bars, means, stds):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + std + 0.5,
                 f'{mean:.1f}', ha='center', va='bottom', fontsize=11)

# Reliability trajectory
example_idx = 50
axes[1].plot(random_results[example_idx]['reliability_history'], 'b-', linewidth=2, label='Random')
axes[1].plot(cat_results[example_idx]['reliability_history'], 'g-', linewidth=2, label='CAT')
axes[1].axhline(0.95, color='r', linestyle='--', linewidth=1.5, label='Threshold')
axes[1].set_xlabel('Items administered')
axes[1].set_ylabel('Reliability')
axes[1].set_title(f'Reliability Growth (θ = {theta_test_sample[example_idx]:.2f})')
axes[1].legend(fontsize=7)
axes[1].grid(True, alpha=0.3)

# Histogram
axes[2].hist(random_items, bins=15, alpha=0.6, label='Random', color='#1f77b4')
axes[2].hist(cat_items, bins=15, alpha=0.6, label='CAT', color='#2ca02c')
axes[2].set_xlabel('Number of items')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Distribution of Test Lengths')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

efficiency_gain = (np.mean(random_items) - np.mean(cat_items)) / np.mean(random_items) * 100
print(f"Random: {np.mean(random_items):.1f} ± {np.std(random_items):.1f} items")
print(f"CAT: {np.mean(cat_items):.1f} ± {np.std(cat_items):.1f} items")
print(f"Efficiency gain: {efficiency_gain:.1f}% fewer items with CAT")
```
### Stopping Rules and Practical Considerations {#sec-stopping}

CAT requires a stopping criterion. Common choices include:

1. **Reliability threshold:** Stop when $R = 1 - \sigma^2_{\text{post}} / \sigma^2_{\text{prior}} \geq 0.95$
2. **Standard error threshold:** Stop when $\text{SE}(\hat{\theta}) \leq 0.3$
3. **Fixed length:** Administer exactly $K$ items
4. **Information threshold:** Stop when additional items provide negligible information

For AI evaluation, practical constraints interact with statistical criteria:

- **Cost:** Each API call has a monetary cost; the stopping rule should account for evaluation budgets.
- **Time:** Evaluations must complete within deadlines.
- **Contamination:** Administering too many items from the same pool risks benchmark leakage into training data.
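The statistical stopping rules above combine naturally with a budget cap in a single check. A minimal sketch (the helper name and thresholds are illustrative, not part of the chapter's simulation code):

```python
import numpy as np

def should_stop(posterior_var, n_administered, prior_var=1.0,
                reliability_target=0.95, se_target=0.3, max_items=30):
    """Return True once any stopping criterion is met (illustrative thresholds)."""
    reliability = 1 - posterior_var / prior_var
    se = np.sqrt(posterior_var)
    return bool(reliability >= reliability_target
                or se <= se_target
                or n_administered >= max_items)

print(should_stop(posterior_var=0.04, n_administered=10))  # True: reliability = 0.96
print(should_stop(posterior_var=0.50, n_administered=5))   # False: no criterion met
```

In practice the `max_items` cap is the budget constraint; the other two thresholds are the statistical criteria, and whichever fires first ends the test.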
::: {.callout-note title="CAT for AI Evaluation"}
Traditional CAT assumes deterministic responses: a human test-taker gives the same answer if asked the same question twice. AI models may or may not satisfy this depending on temperature and sampling settings.

For deterministic evaluation (temperature = 0), CAT applies directly. For stochastic evaluation, we may need multiple samples per item, or methods that account for response variability.

CAT also requires pre-calibrated item parameters. In a cold-start scenario (new benchmark), we must first collect data on a pilot sample of models before CAT can be deployed. This connects to the cold-start prediction problem addressed in @sec-cold-start.
:::
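One way to quantify the multiple-samples point: under local independence, $m$ i.i.d. samples of the same item contribute $m$ times the single-response Fisher information. A sketch under that i.i.d.-sampling assumption (the function name is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def item_information(theta, beta, n_samples=1):
    """Fisher information about theta from n_samples i.i.d. Rasch responses."""
    P = sigmoid(theta - beta)
    return n_samples * P * (1 - P)

# One sample of a well-matched item beats five samples of a badly mismatched one
print(item_information(0.0, beta=0.0, n_samples=1))  # 0.25
print(item_information(0.0, beta=3.0, n_samples=5))  # ≈ 0.23
```

Resampling a mismatched item is therefore a poor substitute for selecting a better-matched one, which is why stochastic CAT still benefits from information-based item selection.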
### D-Optimal Design {#sec-d-optimal}

CAT is *sequential* optimal design---it selects one item at a time. But sometimes we need to design a fixed test: selecting $K$ items from a pool of $M$ candidates to administer to all models at once. This is the classical *fixed-design* version of the optimal experimental design problem.

For a $K$-dimensional factor model with ability vector $U_i \in \mathbb{R}^K$, the Fisher information matrix from administering a set of items $\mathcal{J}$ is:

$$
\mathcal{I}(\theta; \mathcal{J}) = \sum_{j \in \mathcal{J}} P_j(\theta)(1 - P_j(\theta)) \, V_j V_j^\top
$$

where $V_j \in \mathbb{R}^K$ is the factor loading vector for item $j$. Different optimality criteria lead to different item selection strategies:

- **D-optimal:** Maximize $\det(\mathcal{I})$---the volume of the confidence ellipsoid. This minimizes the generalized variance of the ability estimates.
- **A-optimal:** Minimize $\text{tr}(\mathcal{I}^{-1})$---the average variance of individual ability components.
- **E-optimal:** Maximize $\lambda_{\min}(\mathcal{I})$---the smallest eigenvalue. This ensures no ability dimension is poorly estimated.
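These criteria are easy to compare numerically. A small sketch with two hypothetical 2-D information matrices that have the same total information but allocate it differently across ability dimensions:

```python
import numpy as np

def design_criteria(I):
    """Evaluate the D-, A-, and E-optimality criteria for an information matrix."""
    return {'D': np.linalg.det(I),             # maximize: shrinks the confidence ellipsoid
            'A': np.trace(np.linalg.inv(I)),   # minimize: average variance per dimension
            'E': np.linalg.eigvalsh(I).min()}  # maximize: worst-estimated direction

balanced = np.array([[4.0, 0.0], [0.0, 4.0]])   # information split evenly
lopsided = np.array([[7.5, 0.0], [0.0, 0.5]])   # second dimension nearly ignored
print(design_criteria(balanced))  # D = 16.0, A = 0.5, E = 4.0
print(design_criteria(lopsided))  # D = 3.75, A ≈ 2.13, E = 0.5
```

All three criteria prefer the balanced design here, but they can disagree in general: D-optimality tolerates some imbalance if the determinant grows, while E-optimality penalizes any neglected dimension.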
```{pyodide-python}
#| label: d-optimal-design
#| autorun: true
#| fig-cap: "D-optimal item selection chooses items with diverse difficulties covering the target ability range, yielding substantially higher total information than random selection."
def compute_test_information(item_indices, beta_pool, theta_grid):
    """Compute total Fisher information of a test across an ability grid."""
    total_info = np.zeros_like(theta_grid)
    for j in item_indices:
        P = sigmoid(theta_grid - beta_pool[j])
        total_info += P * (1 - P)
    return total_info

def d_optimal_greedy(beta_pool, K, theta_grid):
    """Greedy D-optimal item selection for the Rasch model.

    Selects K items from the pool to maximize total information
    integrated over the theta grid (approximating D-optimality
    for the unidimensional case).
    """
    available = list(range(len(beta_pool)))
    selected = []
    for _ in range(K):
        best_item = None
        best_score = -np.inf
        for j in available:
            candidate = selected + [j]
            info = compute_test_information(candidate, beta_pool, theta_grid)
            # D-optimal criterion: product of information values
            # (log-determinant in 1D = log of information)
            score = np.sum(np.log(info + 1e-10))
            if score > best_score:
                best_score = score
                best_item = j
        selected.append(best_item)
        available.remove(best_item)
    return selected

# Large item pool
np.random.seed(42)
M_pool = 200
beta_pool = np.random.normal(0, 2, M_pool)
K = 20  # Select 20 items
theta_grid = np.linspace(-3, 3, 100)

# D-optimal selection
d_optimal_items = d_optimal_greedy(beta_pool, K, theta_grid)

# Random selection (multiple draws for comparison)
random_infos = []
for trial in range(50):
    random_items = np.random.choice(M_pool, K, replace=False)
    info = compute_test_information(random_items, beta_pool, theta_grid)
    random_infos.append(info)
d_opt_info = compute_test_information(d_optimal_items, beta_pool, theta_grid)
random_mean = np.mean(random_infos, axis=0)
random_std = np.std(random_infos, axis=0)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(6, 2))

# Test information functions
axes[0].plot(theta_grid, d_opt_info, 'g-', linewidth=2, label='D-optimal')
axes[0].plot(theta_grid, random_mean, 'b-', linewidth=2, label='Random (mean)')
axes[0].fill_between(theta_grid, random_mean - random_std, random_mean + random_std,
                     alpha=0.2, color='blue')
axes[0].set_xlabel('Ability (θ)')
axes[0].set_ylabel('Test Information')
axes[0].set_title('Test Information Function')
axes[0].legend(fontsize=7)
axes[0].grid(True, alpha=0.3)

# Selected item difficulties
axes[1].hist(beta_pool[d_optimal_items], bins=12, alpha=0.7, color='green',
             label='D-optimal', density=True)
axes[1].hist(beta_pool, bins=30, alpha=0.3, color='gray',
             label='Full pool', density=True)
axes[1].set_xlabel('Item Difficulty (β)')
axes[1].set_ylabel('Density')
axes[1].set_title('Selected Item Difficulties')
axes[1].legend(fontsize=7)
axes[1].grid(True, alpha=0.3)

# Integrated information comparison
d_opt_total = np.trapz(d_opt_info, theta_grid)
random_totals = [np.trapz(ri, theta_grid) for ri in random_infos]
axes[2].hist(random_totals, bins=15, alpha=0.6, color='blue', label='Random')
axes[2].axvline(d_opt_total, color='green', linewidth=2, linestyle='--',
                label=f'D-optimal ({d_opt_total:.1f})')
axes[2].set_xlabel('Integrated Information')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Total Information Comparison')
axes[2].legend(fontsize=7)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"D-optimal integrated info: {d_opt_total:.1f}")
print(f"Random integrated info: {np.mean(random_totals):.1f} ± {np.std(random_totals):.1f}")
print(f"D-optimal advantage: {(d_opt_total / np.mean(random_totals) - 1) * 100:.1f}% more information")
```
D-optimal design produces item pools with difficulties spread across the target ability range, ensuring high information everywhere. Random selection tends to cluster items near the center of the pool's difficulty distribution, leaving the tails poorly covered.

### Design for Paired Comparisons {#sec-paired-design}

When evaluation is based on paired comparisons (as in the Chatbot Arena from Chapter 1), the design problem takes a different form. Instead of selecting items, we must decide which pairs of models to compare. Under the Bradley-Terry model, the information from comparing models $i$ and $k$ about their strength difference $\theta_i - \theta_k$ is:

$$
I_{ik}(\theta) = P_{ik}(1 - P_{ik}), \quad P_{ik} = \sigma(\theta_i - \theta_k)
$$

This is maximized when the models are evenly matched ($P_{ik} = 0.5$, i.e., $\theta_i = \theta_k$). The design problem is to choose a tournament schedule---which pairs to compare, and how often---that maximizes the precision of the estimated ratings.

Classical solutions include **balanced incomplete block designs** (BIBDs), where each pair of models is compared equally often. For AI evaluation arenas, adaptive matchmaking algorithms serve the same role as CAT: they select matchups that are most informative given current rating estimates. This is precisely what the Chatbot Arena does when it pairs models with similar Elo ratings.
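The matchmaking logic can be sketched in a few lines: score every candidate pair by its Bradley-Terry information at the current rating estimates and play the most informative one. This is an illustrative greedy rule, not Chatbot Arena's actual algorithm:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def most_informative_pair(theta_hat):
    """Pick the pair (i, k) maximizing P_ik * (1 - P_ik) under Bradley-Terry."""
    n = len(theta_hat)
    best_pair, best_info = None, -np.inf
    for i in range(n):
        for k in range(i + 1, n):
            P = sigmoid(theta_hat[i] - theta_hat[k])
            info = P * (1 - P)
            if info > best_info:
                best_pair, best_info = (i, k), info
    return best_pair, best_info

theta_hat = np.array([-1.2, 0.1, 0.2, 1.5])
pair, info = most_informative_pair(theta_hat)
print(pair)  # (1, 2): the two closest-rated models
```

In a live arena the estimates $\hat{\theta}$ are refreshed after each outcome, so the schedule adapts over time exactly as CAT does for items; a production system would also add exploration so new or rarely compared models still get matches.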
## Discussion Questions {#sec-efficient-discussion}

1. **Adaptive testing for AI.** What are the practical challenges in deploying CAT for AI evaluation? Consider: determinism of model responses, cost of API calls, benchmark contamination, and the need for pre-calibrated items.
2. **Optimal vs. fixed design.** When is it better to use adaptive testing (CAT) versus a fixed D-optimal test? What factors determine this choice in practice?
3. **Paired comparison design.** The Chatbot Arena uses adaptive matchmaking to pair models with similar Elo ratings. What are the advantages and disadvantages of this approach compared to a balanced tournament design?
4. **Evaluation budgets.** A team has a fixed budget of 10,000 API calls to evaluate 50 models on a benchmark with 500 items. Design an evaluation strategy that maximizes measurement precision under this constraint.
## Bibliographic Notes {#sec-efficient-bib}

### Optimal Experimental Design

The theory of optimal experimental design originates with @kiefer1959optimum. D-optimal design is covered in @atkinson2007optimum. For the connection between Fisher information and test design in IRT, see @van2006optimal and @chang2009nonlinear. @wright1979best provides an accessible introduction to test design under the Rasch model.

### Computerized Adaptive Testing

CAT has a rich history beginning with @lord1970some. The Fisher information criterion for item selection was developed by @birnbaum1968some. For multidimensional CAT, see @segall1996multidimensional. Applications to AI evaluation are emerging; see @polo2024tinybenchmarks for recent work on efficient benchmark design. @truong2025irsl demonstrate the practical impact of adaptive testing in the scaling law setting: by combining IRT-calibrated item parameters with Elo-based adaptive item selection, they achieve comparable decision accuracy to full-scale evaluation using only 50 questions per benchmark, a 99.9% reduction in the query budget.
## Exercises {#sec-efficient-exercises}

### Theoretical Exercises

**Exercise 3.1** ($\star$): Show that Fisher information $I_j(\theta) = P_j(\theta)(1 - P_j(\theta))$ is maximized when $\theta = \beta_j$, and that the maximum value is $1/4$.

**Exercise 3.2** ($\star\star$): For the 2PL model $P_j(\theta) = \sigma(a_j(\theta - \beta_j))$, derive the Fisher information and show that it equals $a_j^2 P_j(1 - P_j)$. How does discrimination $a_j$ affect optimal item selection?

**Exercise 3.3** ($\star\star$): Design a stopping rule for CAT that balances measurement precision with evaluation cost. Assume each API call costs \$0.01 and the value of reducing standard error by one unit is \$1. Find the cost-optimal stopping point.

### Computational Exercises

**Exercise 3.4** ($\star\star$): Extend the D-optimal design simulation to a 2-dimensional factor model with $K = 2$. Implement item selection using $j^* = \arg\max_j \det(\mathcal{I} + I_j)$ where $I_j = P_j(1 - P_j) V_j V_j^\top$. Compare the selected item loading vectors to random selection.

**Exercise 3.5** ($\star\star$): Compare the convergence of CAT across models with different ability levels. Does CAT require more items for extreme abilities (very high or very low)? Why?

**Exercise 3.6** ($\star\star$): Investigate the sensitivity of CAT to misspecification of item parameters. If the calibration sample differs systematically from the test population, how does CAT performance degrade? Simulate this scenario.