3 Learning
By the end of this chapter, you will be able to:
- Derive the log-likelihood function for the Rasch model and explain the role of person and item parameters.
- Implement maximum likelihood estimation (MLE) for IRT models using gradient descent and L-BFGS optimization.
- Explain the identifiability problem in IRT and describe standard solutions (sum-to-zero, fixed anchor).
- Distinguish between joint MLE, conditional MLE, and marginal MLE, and articulate when each is appropriate.
- Implement the Expectation-Maximization (EM) algorithm for Rasch model estimation and explain the E-step and M-step.
- Describe Bayesian inference for IRT models and specify appropriate priors for ability and item parameters.
- Implement MAP estimation and MCMC sampling for IRT models.
- Explain regularization in IRT as a Bayesian prior and apply cross-validation for hyperparameter selection.
- Design a Computerized Adaptive Testing (CAT) procedure using Fisher information for item selection.
- Apply MLE, Bayesian, and CAT methods to real AI benchmark data and compare their efficiency.
This chapter can be covered in 3-4 lectures (75-90 minutes each):
Lecture 1: Foundations of Estimation
- Why learning matters for AI measurement (15 min)
- Likelihood and log-likelihood for Rasch model (20 min)
- Gradient derivation and interpretation (20 min)
- Hands-on: MLE with synthetic data (20 min)
Lecture 2: Advanced Estimation Methods
- Identifiability and conditional vs marginal MLE (20 min)
- EM algorithm for IRT (30 min)
- Hands-on: EM implementation (25 min)
Lecture 3: Bayesian Approaches
- Prior specification for IRT (15 min)
- MAP estimation (20 min)
- MCMC for IRT (30 min)
- Regularization as Bayesian prior (10 min)
Lecture 4: Active Learning
- CAT framework and Fisher information (25 min)
- D-optimality and item selection (20 min)
- Stopping rules and efficiency (15 min)
- Hands-on: CAT simulation (15 min)
Building on Chapter 1, we use the following additional notation:
| Symbol | Meaning | Domain |
|---|---|---|
| \(\ell(\theta, \beta)\) | Log-likelihood function | \(\mathbb{R}\) |
| \(\nabla_\theta \ell\) | Gradient w.r.t. ability parameters | \(\mathbb{R}^N\) |
| \(\mathcal{I}(\theta)\) | Fisher information matrix | \(\mathbb{R}^{N \times N}\) |
| \(I_j(\theta)\) | Fisher information for item \(j\) | \(\mathbb{R}^+\) |
| \(\pi(\theta)\) | Prior distribution over abilities | - |
| \(\pi(\beta)\) | Prior distribution over difficulties | - |
| \(\hat{\theta}_{\text{MLE}}\) | Maximum likelihood estimate | \(\mathbb{R}^N\) |
| \(\hat{\theta}_{\text{MAP}}\) | Maximum a posteriori estimate | \(\mathbb{R}^N\) |
| \(\eta\) | Learning rate | \(\mathbb{R}^+\) |
A visual tour of the key concepts in this chapter — from maximum likelihood estimation and the EM algorithm to Bayesian inference and computerized adaptive testing.
3.1 Why Learning Matters for AI Measurement
Chapter 1 introduced the measurement models—Rasch, 2PL, factor models—that describe how latent abilities generate observed responses. But knowing the form of a model is not enough. To actually use these models for AI evaluation, we must estimate their parameters from data.
Given a response matrix \(Y \in \{0,1\}^{N \times M}\) where \(Y_{ij} = 1\) indicates model \(i\) answered question \(j\) correctly:
\[ \text{Find } \hat{\theta}, \hat{\beta} = \arg\max_{\theta, \beta} P(Y \mid \theta, \beta) \]
This optimization problem underlies all psychometric estimation and forms the foundation for trustworthy AI evaluation.
Parameter estimation serves several critical purposes in AI measurement:
Fair comparison: Calibrated item difficulties allow us to compare models tested on different question subsets. If we know that question A is harder than question B, we can appropriately weight their contributions to the final score.
Uncertainty quantification: Estimation procedures provide not just point estimates but standard errors, telling us how confident we should be in our measurements.
Adaptive testing: Once we have calibrated item parameters, we can select the most informative questions for each model, dramatically reducing evaluation costs.
Prediction: With learned parameters, we can predict how a model will perform on questions it has never seen, enabling efficient evaluation of new benchmarks.
This chapter covers two complementary paradigms for learning these parameters:
Passive learning: Given a fixed dataset, estimate all parameters simultaneously. This includes maximum likelihood estimation (MLE), expectation-maximization (EM), and Bayesian inference.
Active learning: Sequentially select which questions to administer based on current estimates, updating parameters after each response. Computerized Adaptive Testing (CAT) is the primary example.
3.2 Maximum Likelihood Estimation
Maximum likelihood estimation is the foundation of parameter estimation in IRT. The principle is simple: find the parameter values that make the observed data most probable.
3.2.1 The Likelihood Function
Recall from Chapter 1 that the Rasch model specifies the probability of a correct response as:
\[ P(Y_{ij} = 1 \mid \theta_i, \beta_j) = \sigma(\theta_i - \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}} \tag{3.1}\]
where \(\theta_i\) is the ability of model \(i\) and \(\beta_j\) is the difficulty of item \(j\).
Under the assumption of local independence—that responses are conditionally independent given the latent parameters—the likelihood of the entire response matrix is:
\[ L(\theta, \beta \mid Y) = \prod_{i=1}^N \prod_{j=1}^M P(Y_{ij} \mid \theta_i, \beta_j)^{Y_{ij}} [1 - P(Y_{ij} \mid \theta_i, \beta_j)]^{1-Y_{ij}} \tag{3.2}\]
Taking the logarithm (for computational stability and mathematical convenience):
\[ \ell(\theta, \beta) = \sum_{i=1}^N \sum_{j=1}^M \left[ Y_{ij}(\theta_i - \beta_j) - \log(1 + e^{\theta_i - \beta_j}) \right] \tag{3.3}\]
This is the objective function we want to maximize.
3.2.2 Gradient Derivation
To optimize the log-likelihood, we need its gradients. Taking partial derivatives:
\[ \frac{\partial \ell}{\partial \theta_i} = \sum_{j=1}^M \left[ Y_{ij} - \sigma(\theta_i - \beta_j) \right] \tag{3.4}\]
\[ \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^N \left[ \sigma(\theta_i - \beta_j) - Y_{ij} \right] \tag{3.5}\]
The gradient \(\frac{\partial \ell}{\partial \theta_i} = \sum_j [Y_{ij} - P_{ij}]\) has a beautiful interpretation:
- \(Y_{ij}\) is the observed response (0 or 1)
- \(P_{ij} = \sigma(\theta_i - \beta_j)\) is the predicted probability
The gradient is simply the sum of residuals: observed minus predicted. If model \(i\) performs better than expected (more correct answers than predicted), the residuals are positive, and we increase \(\theta_i\). If it performs worse than expected, we decrease \(\theta_i\). This is the essence of gradient ascent.
3.2.3 Implementation with Gradient Descent
Let us implement MLE via gradient descent on synthetic data. First, we generate a response matrix from known parameters:
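```python
import torch

torch.manual_seed(0)  # reproducibility

# Illustrative dimensions (our choice): 50 models, 100 items
N, M = 50, 100
theta_true = torch.randn(N)  # true abilities
beta_true = torch.randn(M)   # true difficulties

# Rasch success probabilities P_ij = sigmoid(theta_i - beta_j) (Eq. 3.1)
P = torch.sigmoid(theta_true.unsqueeze(1) - beta_true.unsqueeze(0))

# Binary response matrix Y_ij ~ Bernoulli(P_ij)
Y = torch.bernoulli(P)
```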
Now we implement MLE via gradient descent:
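```python
# A minimal sketch of gradient ascent on Eq. 3.3, reusing Y, N, M from the
# snippet above; the learning rate and iteration count are illustrative.
theta = torch.zeros(N)  # initialize all abilities at zero
beta = torch.zeros(M)   # initialize all difficulties at zero
eta = 0.01              # learning rate

for step in range(500):
    P_hat = torch.sigmoid(theta.unsqueeze(1) - beta.unsqueeze(0))
    resid = Y - P_hat                       # observed minus predicted
    theta = theta + eta * resid.sum(dim=1)  # ascent step for theta (Eq. 3.4)
    beta = beta - eta * resid.sum(dim=0)    # ascent step for beta (Eq. 3.5)
    # Re-center so that sum_i theta_i = 0; subtracting the same constant
    # from both parameter sets leaves the likelihood unchanged
    c = theta.mean()
    theta, beta = theta - c, beta - c
```

The re-centering step anchors the scale of the estimates; the next section explains why it is necessary.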
3.2.4 The Identifiability Problem
The Rasch model has a fundamental identifiability issue: if we add a constant \(c\) to all abilities and all difficulties, the likelihood is unchanged:
\[ P(Y_{ij}=1 \mid \theta_i + c, \beta_j + c) = \sigma((\theta_i + c) - (\beta_j + c)) = \sigma(\theta_i - \beta_j) \]
The parameters are only identified up to an additive constant. This means infinitely many parameter values produce the same likelihood.
Common Solutions:
- Sum-to-zero constraint: Set \(\sum_i \theta_i = 0\) or \(\sum_j \beta_j = 0\)
- Fixed anchor: Set one parameter (e.g., \(\beta_1 = 0\)) as reference
- Prior constraint: Use Bayesian priors centered at zero
For AI benchmarks, we typically use sum-to-zero: a model with \(\theta = 0\) has “average” ability relative to the calibration sample.
Without addressing identifiability, gradient descent may drift indefinitely. The re-centering step in our implementation ensures parameters remain anchored.
3.2.5 L-BFGS Optimization
While gradient descent is intuitive, quasi-Newton methods like L-BFGS converge much faster by approximating second-order information:
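```python
# A sketch reusing Y, N, M from the synthetic-data example; the L-BFGS
# hyperparameter values below are illustrative, not prescriptive.
from torch.optim import LBFGS
import torch.nn.functional as F

theta = torch.zeros(N, requires_grad=True)
beta = torch.zeros(M, requires_grad=True)
opt = LBFGS([theta, beta], lr=0.5, max_iter=100, line_search_fn="strong_wolfe")

def closure():
    opt.zero_grad()
    logits = theta.unsqueeze(1) - beta.unsqueeze(0)
    # Negative log-likelihood (Eq. 3.3), computed in numerically stable form
    nll = F.binary_cross_entropy_with_logits(logits, Y, reduction="sum")
    nll.backward()
    return nll

opt.step(closure)

# Re-center afterward to resolve the additive indeterminacy
with torch.no_grad():
    c = theta.mean()
    theta -= c
    beta -= c
```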
3.3 Joint, Conditional, and Marginal MLE
The MLE approach we have discussed so far is called joint maximum likelihood estimation (JMLE). It treats both person parameters \(\theta\) and item parameters \(\beta\) as fixed unknowns to be estimated. However, JMLE has theoretical limitations that motivate alternative approaches.
3.3.1 Joint MLE (JMLE)
JMLE simultaneously estimates all parameters by maximizing Equation 3.3. While intuitive, JMLE suffers from the incidental parameter problem: as the number of items \(M\) remains fixed and the number of persons \(N\) grows, the item parameter estimates \(\hat{\beta}\) are inconsistent—they do not converge to the true values.
This happens because each person parameter \(\theta_i\) is estimated from only \(M\) observations (their responses to \(M\) items), and these “incidental” person parameters introduce bias into the item estimates.
For AI benchmarks with many items (typically \(M > 100\)), this bias is small in practice. But for smaller tests, JMLE can be problematic.
3.3.2 Conditional MLE (CMLE)
Georg Rasch discovered an elegant solution to the incidental parameter problem. For the Rasch model specifically, the sum score \(S_i = \sum_j Y_{ij}\) is a sufficient statistic for \(\theta_i\). This means all information about \(\theta_i\) in the data \(Y_i\) is captured by \(S_i\).
By conditioning on the sufficient statistics, we can eliminate the person parameters entirely:
\[ P(Y_i \mid S_i, \beta) = \frac{\exp(-\sum_j Y_{ij}\beta_j)}{\gamma_{S_i}(\beta)} \tag{3.6}\]
where \(\gamma_r(\beta) = \sum_{A: |A|=r} \exp(-\sum_{j \in A} \beta_j)\) is the elementary symmetric function of order \(r\), summing over all subsets \(A\) of items of size \(r\).
The conditional likelihood depends only on \(\beta\), so we can estimate item parameters without any person parameters. This produces consistent estimates of \(\beta\) regardless of how \(N\) grows.
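The elementary symmetric functions can be computed with a standard \(O(M^2)\) recursion rather than by enumerating all \(\binom{M}{r}\) subsets. A minimal sketch (the function name is ours):

```python
import torch

def elementary_symmetric(beta):
    """Return gamma_0(beta), ..., gamma_M(beta) for the CMLE denominator.

    Processes one item at a time, updating the coefficients of
    prod_j (1 + eps_j * t) with eps_j = exp(-beta_j).
    """
    eps = torch.exp(-beta)
    gamma = torch.zeros(len(beta) + 1)
    gamma[0] = 1.0
    for e in eps:
        # New gamma_r = gamma_r + eps_j * gamma_{r-1}; the right-hand side
        # is evaluated with the old values before assignment
        gamma[1:] = gamma[1:] + e * gamma[:-1]
    return gamma
```

The conditional likelihood (Equation 3.6) then follows by plugging \(\gamma_{S_i}(\beta)\) into the denominator.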
The sufficiency of sum scores is unique to the Rasch model. For the 2PL or 3PL models, sum scores are not sufficient, and CMLE cannot be applied. This mathematical property is one reason the Rasch model holds special status in measurement theory.
3.3.3 Marginal MLE (MMLE)
An alternative approach is to treat person parameters as random variables from a population distribution:
\[ \theta_i \sim \mathcal{N}(\mu_\theta, \sigma^2_\theta) \]
The marginal likelihood integrates out the person parameters:
\[ L(\beta) = \prod_{i=1}^N \int P(Y_i \mid \theta, \beta) \, p(\theta) \, d\theta \tag{3.7}\]
This approach:
- Treats item parameters as fixed and person parameters as random
- Produces consistent estimates of \(\beta\) as \(N \to \infty\)
- Naturally extends to more complex IRT models (2PL, 3PL)
- Forms the basis for the EM algorithm (next section)
| Method | Person Parameters | Item Parameters | Consistency | Applicability |
|---|---|---|---|---|
| JMLE | Estimated directly | Estimated directly | Inconsistent for fixed M | Any IRT model |
| CMLE | Conditioned out | Estimated directly | Consistent | Rasch only |
| MMLE | Integrated out | Estimated directly | Consistent | Any IRT model |
For AI benchmarks with many questions (\(M > 100\)), JMLE works well in practice. For smaller tests or when statistical properties are important, CMLE or MMLE is preferred.
3.4 The EM Algorithm
The Expectation-Maximization (EM) algorithm is a general method for maximum likelihood estimation with latent variables. In IRT, the latent variables are the person abilities \(\theta\).
3.4.1 The EM Framework
The EM algorithm iterates between two steps:
E-step (Expectation): Compute the expected value of the complete-data log-likelihood, given the observed data and current parameter estimates:
\[ Q(\beta \mid \beta^{(t)}) = \mathbb{E}_{\theta \mid Y, \beta^{(t)}} \left[ \log P(Y, \theta \mid \beta) \right] \]
M-step (Maximization): Find the parameter values that maximize the expected log-likelihood:
\[ \beta^{(t+1)} = \arg\max_\beta Q(\beta \mid \beta^{(t)}) \]
The EM algorithm guarantees that the marginal likelihood increases (or stays the same) at each iteration, converging to a local maximum.
3.4.2 EM for the Rasch Model
For the Rasch model with a standard normal prior on abilities, the EM algorithm takes a specific form:
E-step: For each person \(i\), compute the posterior distribution of \(\theta_i\) given their responses \(Y_i\) and current item parameters \(\beta^{(t)}\):
\[ p(\theta_i \mid Y_i, \beta^{(t)}) \propto p(Y_i \mid \theta_i, \beta^{(t)}) \cdot p(\theta_i) \]
This posterior is not available in closed form, so we use numerical integration (Gauss-Hermite quadrature).
M-step: Update each item parameter by solving:
\[ \sum_{i=1}^N \mathbb{E}_{\theta_i}[\sigma(\theta_i - \beta_j)] = \sum_{i=1}^N Y_{ij} \]
The left side is the expected number of correct responses to item \(j\); the right side is the observed number. We equate these.
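The sketch below implements this EM scheme with Gauss-Hermite quadrature adapted to the standard normal prior. The function name, node count, and iteration count are our illustrative choices:

```python
import numpy as np
import torch
import torch.nn.functional as F

def em_rasch(Y, n_quad=21, n_iters=50):
    """EM for the Rasch model with a N(0,1) ability prior (sketch)."""
    N, M = Y.shape
    Y = Y.double()
    # Gauss-Hermite nodes/weights transformed for the standard normal prior
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = torch.tensor(np.sqrt(2.0) * t)
    weights = torch.tensor(w / np.sqrt(np.pi))
    beta = torch.zeros(M, dtype=torch.float64)
    for _ in range(n_iters):
        logits = nodes.unsqueeze(1) - beta.unsqueeze(0)    # (Q, M)
        # E-step: posterior weight of each quadrature node for each person
        loglik = Y @ F.logsigmoid(logits).T + (1 - Y) @ F.logsigmoid(-logits).T
        post = torch.softmax(loglik + torch.log(weights), dim=1)  # (N, Q)
        # M-step: one Newton step on the expected score equation per item
        P = torch.sigmoid(logits)                   # (Q, M)
        expected = post @ P                         # expected P_ij, (N, M)
        grad = (Y - expected).sum(dim=0)            # observed minus expected
        hess = (post @ (P * (1 - P))).sum(dim=0)    # expected information
        beta = beta - grad / hess
    return beta
```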
3.4.3 Multidimensional Extension: The Logistic Factor Model
The methods above focused on the Rasch model, which assumes a single latent dimension. For AI benchmarks that measure multiple capabilities, we extend to the Logistic Factor Model:
\[ P(Y_{ij} = 1 \mid U_i, V_j, Z_j) = \sigma(U_i^\top V_j + Z_j) \]
where:
- \(U_i \in \mathbb{R}^K\) is the \(K\)-dimensional latent ability vector for model \(i\)
- \(V_j \in \mathbb{R}^K\) is the factor loading vector for item \(j\)
- \(Z_j \in \mathbb{R}\) is the item intercept (capturing overall difficulty)
When \(K=1\) and \(V_j = 1\) for all \(j\), this reduces to the Rasch model (with \(Z_j\) playing the role of \(-\beta_j\)).
3.4.3.1 Implementation
- \(U_i\): latent ability vector of model \(i\) (position in \(K\)-dimensional capability space)
- \(V_j\): latent property vector of item \(j\) (which capabilities the item measures)
- \(Z_j\): overall item difficulty (independent of capability dimensions)
- \(\sigma\): sigmoid function ensuring probabilities in \([0,1]\)
In PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import LBFGS

class LogisticFM(nn.Module):
    """Logistic Factor Model for binary response data."""

    def __init__(self, N, M, K):
        super().__init__()
        self.U = nn.Parameter(torch.randn(N, K))  # Model abilities
        self.V = nn.Parameter(torch.randn(M, K))  # Item loadings
        self.Z = nn.Parameter(torch.randn(M, 1))  # Item intercepts

    def forward(self):
        return torch.sigmoid(self.U @ self.V.T + self.Z.T)
```
3.4.3.2 Training with LBFGS
We train the model by minimizing binary cross-entropy loss:
```python
# Training setup: Y is the (N, M) binary response matrix and train_mask a
# boolean mask of training entries, both assumed defined earlier
N, M = Y.shape
K = 2  # Number of latent dimensions
model = LogisticFM(N, M, K)
opt = LBFGS(
    model.parameters(),
    lr=0.1,
    max_iter=20,
    history_size=10,
    line_search_fn="strong_wolfe",
)

def closure():
    opt.zero_grad()
    probs = model()
    loss = F.binary_cross_entropy(probs[train_mask], Y[train_mask].float())
    loss.backward()
    return loss

# Training loop
for iteration in range(20):
    loss = opt.step(closure)
```

The model learns to decompose the response matrix into latent factors that capture the underlying structure of model capabilities and item characteristics.
3.5 Bayesian Inference
Bayesian inference provides an alternative to maximum likelihood that naturally incorporates prior information and quantifies uncertainty. Instead of finding a single point estimate, we characterize the entire posterior distribution over parameters.
3.5.1 Prior Specification
The first step in Bayesian inference is specifying prior distributions that encode our beliefs before seeing the data:
For abilities (persons/models): \[\theta_i \sim \mathcal{N}(0, \sigma^2_\theta), \quad \sigma_\theta = 1 \text{ (standard choice)}\]
For difficulties (items/questions): \[\beta_j \sim \mathcal{N}(0, \sigma^2_\beta), \quad \sigma_\beta = 1\text{-}2 \text{ (depending on expected range)}\]
For discrimination (2PL model): \[a_j \sim \text{LogNormal}(0, 0.5) \text{ or } a_j \sim \text{Gamma}(2, 0.5)\]
These priors are weakly informative: they regularize estimates without dominating the data. They encode the belief that most abilities and difficulties are within a few units of zero, which is appropriate when the scale is defined by convention.
3.5.2 Posterior Computation
Bayes’ theorem gives us the posterior distribution:
\[ p(\theta, \beta \mid Y) \propto p(Y \mid \theta, \beta) \cdot p(\theta) \cdot p(\beta) \tag{3.8}\]
The posterior combines the likelihood (data) with the priors (beliefs). Unfortunately, this posterior is not available in closed form—we need computational methods.
3.5.3 MAP Estimation
The simplest Bayesian approach is maximum a posteriori (MAP) estimation, which finds the mode of the posterior:
\[ \hat{\theta}_{\text{MAP}}, \hat{\beta}_{\text{MAP}} = \arg\max_{\theta, \beta} \left[ \ell(\theta, \beta \mid Y) + \log p(\theta) + \log p(\beta) \right] \tag{3.9}\]
With Gaussian priors, this is equivalent to L2-regularized MLE:
\[ \hat{\theta}_{\text{MAP}}, \hat{\beta}_{\text{MAP}} = \arg\max_{\theta, \beta} \left[ \ell(\theta, \beta) - \frac{1}{2\sigma^2_\theta}\sum_i \theta_i^2 - \frac{1}{2\sigma^2_\beta}\sum_j \beta_j^2 \right] \]
Notice that MAP estimates have smaller variance than MLE estimates. This is shrinkage toward the prior mean (zero).
For extreme scores—models that answer all questions correctly or incorrectly—MLE gives infinite or very large estimates. MAP regularizes these to finite, sensible values. This is crucial for AI benchmarks where some models may achieve near-perfect scores on easy subsets.
The amount of shrinkage is controlled by the prior variance: smaller \(\sigma^2\) means stronger shrinkage toward zero.
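In code, MAP estimation only adds the Gaussian penalty terms of Equation 3.9 to the negative log-likelihood. A sketch, reusing \(Y\), \(N\), \(M\) from the synthetic-data example and assuming \(\sigma_\theta = \sigma_\beta = 1\):

```python
import torch
import torch.nn.functional as F
from torch.optim import LBFGS

theta = torch.zeros(N, requires_grad=True)
beta = torch.zeros(M, requires_grad=True)
sigma_theta, sigma_beta = 1.0, 1.0  # prior standard deviations (assumed)
opt = LBFGS([theta, beta], lr=0.5, max_iter=100, line_search_fn="strong_wolfe")

def closure():
    opt.zero_grad()
    logits = theta.unsqueeze(1) - beta.unsqueeze(0)
    nll = F.binary_cross_entropy_with_logits(logits, Y, reduction="sum")
    # Negative log posterior = NLL + Gaussian prior penalties (Eq. 3.9)
    loss = (nll
            + (theta ** 2).sum() / (2 * sigma_theta ** 2)
            + (beta ** 2).sum() / (2 * sigma_beta ** 2))
    loss.backward()
    return loss

opt.step(closure)
```

Unlike MLE, no re-centering step is needed: the priors anchor the scale.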
3.5.4 MCMC Sampling
To characterize the full posterior distribution (not just its mode), we use Markov Chain Monte Carlo (MCMC) sampling. The Metropolis-Hastings algorithm is a simple but effective approach:
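```python
# A sketch of random-walk Metropolis-Hastings with N(0,1) priors on both
# parameter vectors; the step size and sample count are illustrative.
import torch
import torch.nn.functional as F

def log_posterior(theta, beta, Y):
    """Rasch log-likelihood plus N(0,1) log-priors (up to a constant)."""
    logits = theta.unsqueeze(1) - beta.unsqueeze(0)
    ll = (Y * F.logsigmoid(logits) + (1 - Y) * F.logsigmoid(-logits)).sum()
    return ll - 0.5 * (theta ** 2).sum() - 0.5 * (beta ** 2).sum()

def metropolis_hastings(Y, n_samples=5000, step=0.05):
    N, M = Y.shape
    theta, beta = torch.zeros(N), torch.zeros(M)
    lp = log_posterior(theta, beta, Y)
    samples = []
    for _ in range(n_samples):
        # Propose a joint Gaussian perturbation of all parameters
        theta_prop = theta + step * torch.randn(N)
        beta_prop = beta + step * torch.randn(M)
        lp_prop = log_posterior(theta_prop, beta_prop, Y)
        # Accept with probability min(1, posterior ratio)
        if torch.rand(()) < torch.exp(lp_prop - lp):
            theta, beta, lp = theta_prop, beta_prop, lp_prop
        samples.append((theta.clone(), beta.clone()))
    return samples
```

Posterior means and standard deviations are then computed from the retained samples after discarding a burn-in period.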
The posterior standard deviations quantify our uncertainty about each parameter. Parameters with more information (e.g., items answered by many models, models who answered many questions) have smaller posterior uncertainty.
3.6 Regularization and Model Selection
3.6.1 L2 Regularization as Bayesian Prior
We have seen that MAP estimation with Gaussian priors is equivalent to L2 regularization. The regularization strength \(\lambda\) relates to the prior variance as \(\lambda = 1/\sigma^2\).
The regularized objective is:
\[ \ell_{\text{reg}}(\theta, \beta) = \ell(\theta, \beta) - \frac{\lambda_\theta}{2}\|\theta\|^2 - \frac{\lambda_\beta}{2}\|\beta\|^2 \]
Regularization prevents overfitting, especially when:
- Some persons have few responses (sparse data)
- Some items have extreme difficulty (near 0% or 100% pass rates)
- The model is complex (many parameters relative to data)
3.6.2 Cross-Validation for Hyperparameter Selection
How do we choose the regularization strength? Cross-validation provides a principled answer: we hold out some data, train on the rest, and evaluate prediction performance.
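A sketch of entry-wise \(k\)-fold cross-validation over a grid of candidate \(\lambda\) values (the grid, fold count, and helper names are ours):

```python
import torch
import torch.nn.functional as F

def penalized_nll(theta, beta, Y, mask, lam):
    """Negative log-likelihood on masked entries plus L2 penalty."""
    logits = theta.unsqueeze(1) - beta.unsqueeze(0)
    nll = F.binary_cross_entropy_with_logits(logits[mask], Y[mask],
                                             reduction="sum")
    return nll + 0.5 * lam * ((theta ** 2).sum() + (beta ** 2).sum())

def cv_select_lambda(Y, lambdas=(0.01, 0.1, 1.0, 10.0), n_folds=5):
    N, M = Y.shape
    fold = torch.randint(0, n_folds, (N, M))  # assign each entry to a fold
    scores = {}
    for lam in lambdas:
        losses = []
        for k in range(n_folds):
            train, test = fold != k, fold == k
            theta = torch.zeros(N, requires_grad=True)
            beta = torch.zeros(M, requires_grad=True)
            opt = torch.optim.LBFGS([theta, beta], lr=0.5, max_iter=100,
                                    line_search_fn="strong_wolfe")
            def closure():
                opt.zero_grad()
                loss = penalized_nll(theta, beta, Y, train, lam)
                loss.backward()
                return loss
            opt.step(closure)
            # Held-out predictive loss for this fold
            with torch.no_grad():
                logits = theta.unsqueeze(1) - beta.unsqueeze(0)
                losses.append(F.binary_cross_entropy_with_logits(
                    logits[test], Y[test], reduction="mean").item())
        scores[lam] = sum(losses) / len(losses)
    best = min(scores, key=scores.get)
    return best, scores
```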
3.7 Active Learning: Computerized Adaptive Testing
So far we have discussed passive learning: given a fixed dataset, estimate all parameters. But in many AI evaluation scenarios, we can choose which questions to ask. This is active learning, and Computerized Adaptive Testing (CAT) is its primary instantiation in psychometrics.
3.7.1 The CAT Framework
The key insight of CAT is that not all questions are equally informative for all test-takers. A very easy question provides little information about a high-ability model—we already know it will likely answer correctly. Similarly, a very hard question provides little information about a low-ability model.
The most informative questions are those where the model has roughly a 50% chance of success. CAT iterates as follows:
1. Select the most informative question given the current ability estimate.
2. Administer the question and observe the response.
3. Update the ability estimate based on the response.
4. Check whether the stopping criterion is met; if not, return to step 1.
Fisher information measures how much a response to item \(j\) tells us about \(\theta\):
- High information: The item is well-matched to the ability level
- Low information: The item is too easy or too hard
Intuitively, asking a genius to solve \(1+1\) or a beginner to prove the Riemann hypothesis provides little information. The most informative items are those where the model has about 50% chance of success.
3.7.2 Fisher Information for Item Selection
The Fisher information for item \(j\) at ability \(\theta\) in the Rasch model is:
\[ I_j(\theta) = P_j(\theta) \cdot (1 - P_j(\theta)) \tag{3.10}\]
where \(P_j(\theta) = \sigma(\theta - \beta_j)\).
This is maximized when \(P_j(\theta) = 0.5\), which occurs when \(\theta = \beta_j\). Thus, the optimal item to administer is the one whose difficulty most closely matches the current ability estimate.
3.7.3 CAT Implementation
Let us implement a complete CAT procedure:
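```python
# A sketch assuming pre-calibrated difficulties `beta`; max_items and
# se_target are illustrative. Item selection maximizes Fisher information
# (Eq. 3.10); the ability update is MAP under a N(0,1) prior.
import torch

def simulate_cat(theta_true, beta, max_items=30, se_target=0.3):
    """Simulate CAT for one test-taker with known true ability."""
    administered, responses = [], []
    theta_hat = torch.tensor(0.0)
    for _ in range(max_items):
        # 1. Select the most informative remaining item at theta_hat
        P = torch.sigmoid(theta_hat - beta)
        info = P * (1 - P)
        if administered:
            info[administered] = -1.0  # exclude items already used
        j = int(torch.argmax(info))
        # 2. Administer the item: sample a response from the true ability
        y = torch.bernoulli(torch.sigmoid(theta_true - beta[j]))
        administered.append(j)
        responses.append(y)
        # 3. Update theta_hat: Newton steps on the log posterior
        b, r = beta[administered], torch.stack(responses)
        for _ in range(10):
            p = torch.sigmoid(theta_hat - b)
            grad = (r - p).sum() - theta_hat     # likelihood score + prior
            hess = -(p * (1 - p)).sum() - 1.0    # observed information + prior
            theta_hat = theta_hat - grad / hess
        # 4. Stop once the posterior standard error is small enough
        se = 1.0 / torch.sqrt((p * (1 - p)).sum() + 1.0)
        if se < se_target:
            break
    return theta_hat, administered, se
```

For example, `simulate_cat(torch.tensor(1.2), beta_true)` reuses the calibrated difficulties from the earlier synthetic example.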
3.7.4 Stopping Rules
CAT can use various stopping criteria:
- Reliability threshold: Stop when measurement precision reaches a target (e.g., \(R \geq 0.95\))
- Standard error threshold: Stop when \(SE(\hat{\theta}) \leq 0.3\)
- Fixed length: Administer exactly \(K\) items
- Information threshold: Stop when additional items would provide negligible information
For AI evaluation, practical constraints also matter:
- Cost: Each API call has a cost
- Time: Evaluation must complete within a deadline
- Contamination: Administering too many items risks benchmark leakage
Traditional CAT assumes deterministic responses: a human test-taker gives the same answer if asked the same question twice. AI models may or may not be deterministic depending on temperature and sampling settings.
For deterministic evaluation (temperature=0), CAT works directly. For stochastic evaluation, we may need multiple samples per item, or methods that account for response variability.
CAT also requires pre-calibrated item parameters. In a cold-start scenario (new benchmark), we must first collect data on a pilot sample of models before CAT can be deployed.
3.8 Generalization Experiments
To evaluate the robustness and transferability of learned factor models, we train and test them under various masking schemes, each representing a different notion of generalization. These masks determine which parts of the response matrix \(Y\) are visible during training and which are held out for evaluation.
3.8.1 Masking Schemes for Evaluation
| Masking Type | Train Set | Test Set | Purpose |
|---|---|---|---|
| Entry-wise random | 80% random entries | 20% random entries | Interpolation under missing-at-random |
| Row holdout (random) | 80% of models, all items | 20% of models, all items | Generalization to unseen models |
| Row holdout (shifted) | Slice of models (small→large) | Disjoint slice | Covariate-shift generalization |
| Column holdout (random) | All models, 80% of items | All models, 20% of items | Generalization to unseen items |
| Column holdout (shifted) | Subset of benchmarks | Held-out benchmarks | Cross-domain transfer |
| Row-column block (L-mask) | \(R_{tr} \times C_{tr}\) | \(R_{te} \times C_{te}\) | Compositional generalization |
| Temporal split | Models before cutoff | Models after cutoff | Temporal generalization |
These settings parallel psychometric validation tests where new examinees, items, or contexts probe the invariance of latent constructs.
3.8.2 Implementation of Masking Functions
```python
import torch

def random_mask(data_idtor, pct=0.8):
    """Entry-wise random masking; `data_idtor` is a 0/1 float indicator
    of observed entries."""
    train_idtor = torch.bernoulli(data_idtor * pct).int()
    test_idtor = data_idtor.int() - train_idtor
    return train_idtor, test_idtor

def model_mask(data_idtor, pct_models=0.8, exposure_rate=0.3):
    """Row holdout: hold out unseen models."""
    train_row_mask = torch.bernoulli(torch.ones(data_idtor.shape[0]) * pct_models).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    train_idtor[train_row_mask, :] = data_idtor[train_row_mask, :].int()
    train_idtor[~train_row_mask, :], _ = random_mask(data_idtor[~train_row_mask, :], pct=exposure_rate)
    test_idtor = data_idtor.int() - train_idtor
    return train_idtor, test_idtor

def item_mask(data_idtor, pct_items=0.8, exposure_rate=0.3):
    """Column holdout: hold out unseen items."""
    train_col_mask = torch.bernoulli(torch.ones(data_idtor.shape[1]) * pct_items).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    train_idtor[:, train_col_mask] = data_idtor[:, train_col_mask].int()
    train_idtor[:, ~train_col_mask], _ = random_mask(data_idtor[:, ~train_col_mask], pct=exposure_rate)
    test_idtor = data_idtor.int() - train_idtor
    return train_idtor, test_idtor

def L_mask(data_idtor, pct_models=0.8, pct_items=0.8):
    """Row-column block (L-mask): compositional generalization."""
    train_row_mask = torch.bernoulli(torch.ones(data_idtor.shape[0]) * pct_models).bool()
    train_col_mask = torch.bernoulli(torch.ones(data_idtor.shape[1]) * pct_items).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    # Chained masked indexing would assign into a copy, so build the
    # train-block mask explicitly
    block = train_row_mask.unsqueeze(1) & train_col_mask.unsqueeze(0)
    train_idtor[block] = data_idtor[block].int()
    test_idtor = data_idtor.int() - train_idtor
    test_idtor[train_row_mask, :] = 0
    test_idtor[:, train_col_mask] = 0
    return train_idtor, test_idtor
```

3.8.3 Two-Stage Training for Holdout Generalization
To avoid data contamination in row and column holdout experiments, we use a two-stage training procedure:
3.8.3.1 Row Holdout: Estimating Parameters for Unseen Models
When testing generalization to unseen models, we:
- Stage 1: Train on known models to learn item parameters \((V, Z)\)
- Stage 2: Freeze \((V, Z)\) and estimate ability parameters \(U\) for held-out models using their limited exposed responses
This ensures item parameters are learned without information from test models.
```python
# Stage 1: Train on known models. `train_model` is a helper that fits
# LogisticFM on masked data; V_fixed / Z_fixed freeze item parameters.
test_row = test_idtor.max(dim=1).values.bool()  # Identify held-out models
model_stage1 = train_model(Y[~test_row, :], mask=train_idtor[~test_row, :])

# Freeze V, Z from Stage 1
V_frozen = model_stage1.V.detach()
Z_frozen = model_stage1.Z.detach()

# Stage 2: Estimate U for unseen models with frozen item parameters
model_stage2 = train_model(Y[test_row, :], mask=train_idtor[test_row, :],
                           V_fixed=V_frozen, Z_fixed=Z_frozen)
```

3.8.3.2 Column Holdout: Estimating Parameters for Unseen Items
When testing generalization to unseen items, we:
- Stage 1: Train on known items to learn model parameters \(U\)
- Stage 2: Freeze \(U\) and estimate item parameters \((V, Z)\) for held-out items
```python
# Stage 1: Train on known items
test_col = test_idtor.max(dim=0).values.bool()  # Identify held-out items
model_stage1 = train_model(Y[:, ~test_col], mask=train_idtor[:, ~test_col])

# Freeze U from Stage 1
U_frozen = model_stage1.U.detach()

# Stage 2: Estimate V, Z for unseen items with frozen model parameters
model_stage2 = train_model(Y[:, test_col], mask=train_idtor[:, test_col],
                           U_fixed=U_frozen)
```

The two-stage procedure prevents information leakage:
- Row holdout: Item parameters learned from training models should not contain information about test models
- Column holdout: Model parameters learned from training items should not contain information about test items
This mirrors the real-world scenario where we want to evaluate new models on pre-calibrated items, or calibrate new items using established models.
3.8.4 Evaluation Across Masking Schemes
For each masking scheme, we compute AUC on the held-out entries:
```python
from torchmetrics import AUROC

masking_schemes = {
    "entry_random": random_mask,
    "row_holdout": model_mask,
    "col_holdout": item_mask,
    "L_mask": L_mask,
}

results = {}
auroc = AUROC(task="binary")
for name, mask_fn in masking_schemes.items():
    train_mask, test_mask = mask_fn(data_idtor)
    # Train model (two-stage for row/col holdout); helper assumed from above
    model = train_with_appropriate_stages(Y, train_mask, test_mask, name)
    # Evaluate on held-out entries (binary AUROC expects integer targets)
    P_hat = model().detach()
    auc = auroc(P_hat[test_mask.bool()], Y[test_mask.bool()].int())
    results[name] = auc.item()
    print(f"{name}: AUC = {auc:.3f}")
```

The factor model typically achieves AUC of 92-97% on random masking across benchmarks, demonstrating strong predictive power. Performance on row and column holdout tests the model's ability to generalize to new models and new items, respectively.
3.9 Discussion Questions
Identifiability and Interpretation: In AI evaluation, should we anchor the ability scale by fixing one model (e.g., GPT-4 = 0) or by centering all models? What are the implications for interpreting ability scores over time as new models are released?
Bayesian vs Frequentist: When is Bayesian inference preferred over MLE for AI benchmark analysis? Consider scenarios with limited data, extreme scores, or the need for uncertainty quantification.
Adaptive Testing for AI: Current AI benchmarks test all models on all questions. What are the practical challenges in implementing CAT for AI evaluation? Consider: determinism of model responses, cost of API calls, benchmark contamination.
Transfer of Item Parameters: If we calibrate item difficulties on one set of models (e.g., 2023 models), can we use these parameters to evaluate 2024 models? What assumptions does this require, and when might they fail?
Multidimensional Extensions: The chapter focused on unidimensional models (single ability). How would the learning procedures change for multidimensional factor models? What additional challenges arise?
3.10 Bibliographic Notes
3.10.1 Maximum Likelihood Estimation
The theory of maximum likelihood for IRT models is developed comprehensively in Lord and Novick (1968) and Birnbaum (1968). The joint MLE approach and its limitations (incidental parameter problem) are discussed in Neyman and Scott (1948). For modern computational approaches, see Baker and Kim (2004).
3.10.2 Conditional and Marginal MLE
Conditional MLE for the Rasch model was developed by Andersen (1970), who proved consistency and derived the elementary symmetric functions needed for computation. Marginal MLE was introduced by Bock and Lieberman (1970) and popularized by Bock and Aitkin (1981) using the EM algorithm.
3.10.3 EM Algorithm
The general EM algorithm was formalized by Dempster, Laird, and Rubin (1977). Its application to IRT is detailed in Bock and Aitkin (1981). For modern treatments, see McLachlan and Krishnan (2007).
3.10.4 Bayesian IRT
Bayesian approaches to IRT were pioneered by Swaminathan and Gifford (1982) and advanced using Gibbs sampling by Albert (1992). Modern references include Fox (2010) and the software documentation for Stan (Carpenter et al. 2017).
3.10.5 Computerized Adaptive Testing
CAT has a rich history beginning with Lord (1970). The Fisher information criterion for item selection was developed by Birnbaum (1968). For multidimensional CAT, see Segall (1996) and van der Linden (1999). Applications to AI evaluation are emerging; see Polo et al. (2024) for recent work.
3.10.6 Optimization Methods
L-BFGS is described in Nocedal and Wright (2006). For deep learning optimizers applied to psychometric models, see Kingma and Ba (2015) for Adam.
3.11 Exercises
3.11.1 Theoretical Exercises
Exercise 3.1 (\(\star\)): Derive the gradient of the Rasch model log-likelihood with respect to \(\theta_i\). Show that it equals the sum of residuals: \(\frac{\partial \ell}{\partial \theta_i} = \sum_j (Y_{ij} - P_{ij})\).
Exercise 3.2 (\(\star\star\)): Prove that the Hessian matrix of the Rasch log-likelihood is negative semi-definite, ensuring the log-likelihood is concave.
Exercise 3.3 (\(\star\star\)): Show that for the Rasch model, the Fisher information for item \(j\) at ability \(\theta\) is \(I_j(\theta) = P_j(1-P_j)\), and that this is maximized when \(\theta = \beta_j\).
Exercise 3.4 (\(\star\star\star\)): Derive the EM algorithm for the 2PL model. What additional complications arise compared to the Rasch model due to the discrimination parameters?
Exercise 3.5 (\(\star\star\)): Show that L2 regularization on the parameters is equivalent to MAP estimation with Gaussian priors. What is the relationship between the regularization strength \(\lambda\) and the prior variance \(\sigma^2\)?
3.11.2 Computational Exercises
Exercise 3.6 (\(\star\star\)): Implement conditional MLE for the Rasch model. Use the fact that the conditional likelihood depends only on item parameters and can be computed using elementary symmetric functions.
Exercise 3.7 (\(\star\star\star\)): Implement a Gibbs sampler for the Rasch model that alternates between:
- Sampling \(\theta_i \mid Y, \beta\) for each person (using slice sampling)
- Sampling \(\beta_j \mid Y, \theta\) for each item
Compare the posterior estimates to those from Metropolis-Hastings.
Exercise 3.8 (\(\star\star\star\)): Extend the CAT simulation to handle a multidimensional factor model with \(K=2\) dimensions. Implement D-optimal item selection using \(j^* = \arg\max_j \det(\sum_\tau I_j^{(\tau)})\).
3.11.3 Discussion Exercises
Exercise 3.9: Compare the convergence of gradient descent, L-BFGS, and Adam on a Rasch model estimation problem. Which converges fastest? Which is most robust to different initializations?
Exercise 3.10: Design a stopping rule for CAT that balances measurement precision with evaluation cost. How would you adapt this for AI evaluation where API calls have monetary costs?
Exercise 3.11: Investigate the sensitivity of CAT to misspecification of item parameters. If the calibration sample differs systematically from the test population, how does CAT performance degrade? Simulate this scenario and quantify the effect.