3  Learning

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Derive the log-likelihood function for the Rasch model and explain the role of person and item parameters.
  2. Implement maximum likelihood estimation (MLE) for IRT models using gradient descent and L-BFGS optimization.
  3. Explain the identifiability problem in IRT and describe standard solutions (sum-to-zero, fixed anchor).
  4. Distinguish between joint MLE, conditional MLE, and marginal MLE, and articulate when each is appropriate.
  5. Implement the Expectation-Maximization (EM) algorithm for Rasch model estimation and explain the E-step and M-step.
  6. Describe Bayesian inference for IRT models and specify appropriate priors for ability and item parameters.
  7. Implement MAP estimation and MCMC sampling for IRT models.
  8. Explain regularization in IRT as a Bayesian prior and apply cross-validation for hyperparameter selection.
  9. Apply MLE and Bayesian methods to real AI benchmark data and compare their efficiency.
  10. Evaluate the generalization of learned factor models under various masking schemes (entry-wise, row holdout, column holdout).

This chapter can be covered in 3 lectures (75-90 minutes each):

Lecture 1: Foundations of Estimation

  • Why learning matters for AI measurement (15 min)
  • Likelihood and log-likelihood for Rasch model (20 min)
  • Gradient derivation and interpretation (20 min)
  • Hands-on: MLE with synthetic data (20 min)

Lecture 2: Advanced Estimation Methods

  • Identifiability and conditional vs marginal MLE (20 min)
  • EM algorithm for IRT (30 min)
  • Hands-on: EM implementation (25 min)

Lecture 3: Bayesian Approaches and Generalization

  • Prior specification for IRT (15 min)
  • MAP estimation and MCMC (30 min)
  • Regularization and cross-validation (15 min)
  • Generalization experiments and masking schemes (15 min)

Notation

This chapter introduces estimation notation: \(\ell(\theta, \beta)\) (log-likelihood), \(\hat{\theta}_{\text{MLE}}\) and \(\hat{\theta}_{\text{MAP}}\) (point estimates), \(\pi(\theta)\) (priors), and \(\eta\) (learning rate). See ?sec-notation for the complete notation reference.

Video Overview

A visual tour of the key concepts in this chapter, from maximum likelihood estimation and the EM algorithm to Bayesian inference and generalization.

3.1 Why Learning Matters for AI Measurement

Chapter 1 introduced the measurement models (Rasch, 2PL, factor models) that describe how latent abilities generate observed responses. But knowing the form of a model is not enough. To actually use these models for AI evaluation, we must estimate their parameters from data.

The Central Learning Problem in AI Measurement

Given a response matrix \(Y \in \{0,1\}^{N \times M}\) where \(Y_{ij} = 1\) indicates model \(i\) answered question \(j\) correctly:

\[ \text{Find } \hat{\theta}, \hat{\beta} = \arg\max_{\theta, \beta} P(Y \mid \theta, \beta) \]

This optimization problem underlies all psychometric estimation and forms the foundation for trustworthy AI evaluation.

Parameter estimation serves several critical purposes in AI measurement:

  1. Fair comparison: Calibrated item difficulties allow us to compare models tested on different question subsets. If we know that question A is harder than question B, we can appropriately weight their contributions to the final score.

  2. Uncertainty quantification: Estimation procedures provide not just point estimates but standard errors, telling us how confident we should be in our measurements.

  3. Adaptive testing: Once we have calibrated item parameters, we can select the most informative questions for each model, dramatically reducing evaluation costs (see the next chapter for a full treatment of Computerized Adaptive Testing).

  4. Prediction: With learned parameters, we can predict how a model will perform on questions it has never seen, enabling efficient evaluation of new benchmarks.

This chapter focuses on passive learning: given a fixed dataset, estimate all parameters simultaneously. This includes maximum likelihood estimation (MLE), expectation-maximization (EM), and Bayesian inference. We also introduce generalization experiments that evaluate how well learned models transfer to unseen data. The design of which items to include in a benchmark, including active item selection, is the subject of the next chapter.

3.2 Maximum Likelihood Estimation

Maximum likelihood estimation is the foundation of parameter estimation in IRT. The principle is simple: find the parameter values that make the observed data most probable.

3.2.1 The Likelihood Function

Recall from Chapter 1 that the Rasch model specifies the probability of a correct response as:

\[ P(Y_{ij} = 1 \mid \theta_i, \beta_j) = \sigma(\theta_i - \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}} \tag{3.1}\]

where \(\theta_i\) is the ability of model \(i\) and \(\beta_j\) is the difficulty of item \(j\).

Under the assumption of local independenceβ€”that responses are conditionally independent given the latent parametersβ€”the likelihood of the entire response matrix is:

\[ L(\theta, \beta \mid Y) = \prod_{i=1}^N \prod_{j=1}^M P(Y_{ij} \mid \theta_i, \beta_j)^{Y_{ij}} [1 - P(Y_{ij} \mid \theta_i, \beta_j)]^{1-Y_{ij}} \tag{3.2}\]

Taking the logarithm (for computational stability and mathematical convenience):

\[ \ell(\theta, \beta) = \sum_{i=1}^N \sum_{j=1}^M \left[ Y_{ij}(\theta_i - \beta_j) - \log(1 + e^{\theta_i - \beta_j}) \right] \tag{3.3}\]

This is the objective function we want to maximize.
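As a quick sanity check, Equation 3.3 can be evaluated in a numerically stable way using the identity \(\log(1 + e^{\eta}) = \mathrm{softplus}(\eta)\). A minimal sketch (the function name and tensor layout are illustrative):

```python
import torch
import torch.nn.functional as F

def rasch_loglik(Y, theta, beta):
    """Rasch log-likelihood of Eq. 3.3, computed stably via softplus."""
    eta = theta[:, None] - beta[None, :]     # N x M matrix of logits theta_i - beta_j
    return (Y * eta - F.softplus(eta)).sum()
```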

3.2.2 Gradient Derivation

To optimize the log-likelihood, we need its gradients. Taking partial derivatives:

\[ \frac{\partial \ell}{\partial \theta_i} = \sum_{j=1}^M \left[ Y_{ij} - \sigma(\theta_i - \beta_j) \right] \tag{3.4}\]

\[ \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^N \left[ \sigma(\theta_i - \beta_j) - Y_{ij} \right] \tag{3.5}\]

Intuitive Interpretation of the Gradient

The gradient \(\frac{\partial \ell}{\partial \theta_i} = \sum_j [Y_{ij} - P_{ij}]\) has a beautiful interpretation:

  • \(Y_{ij}\) is the observed response (0 or 1)
  • \(P_{ij} = \sigma(\theta_i - \beta_j)\) is the predicted probability

The gradient is simply the sum of residuals: observed minus predicted. If model \(i\) performs better than expected (more correct answers than predicted), the residuals are positive, and we increase \(\theta_i\). If it performs worse than expected, we decrease \(\theta_i\). This is the essence of gradient ascent.

3.2.3 Implementation with Gradient Descent

Let us implement MLE via gradient descent on synthetic data. First, we generate a response matrix from known parameters:
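A minimal sketch of that generation step (the sizes and the seed are illustrative choices):

```python
import torch

torch.manual_seed(0)
N, M = 50, 100                       # models x items

# True parameters: abilities and difficulties drawn from standard normals
theta_true = torch.randn(N)
beta_true = torch.randn(M)
beta_true -= beta_true.mean()        # center difficulties (identifiability)

# Response probabilities under the Rasch model, then simulated 0/1 responses
P = torch.sigmoid(theta_true[:, None] - beta_true[None, :])
Y = torch.bernoulli(P)
```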

Now we implement MLE via gradient descent:
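A sketch of what that implementation might look like, following the gradients in Equations 3.4 and 3.5; the step size, iteration count, and the re-centering used to impose the sum-to-zero constraint are implementation choices:

```python
import torch

def fit_rasch_gd(Y, n_iters=500, lr=0.05):
    """Joint MLE for the Rasch model via gradient ascent on Eq. 3.3."""
    N, M = Y.shape
    theta = torch.zeros(N)
    beta = torch.zeros(M)
    for _ in range(n_iters):
        P = torch.sigmoid(theta[:, None] - beta[None, :])
        resid = Y - P                           # observed minus predicted
        theta = theta + lr * resid.sum(dim=1)   # Eq. 3.4 (ascent step)
        beta = beta - lr * resid.sum(dim=0)     # Eq. 3.5 (note the sign flip)
        c = beta.mean()                         # re-center: shifting both theta and
        theta, beta = theta - c, beta - c       # beta leaves the likelihood unchanged
    return theta, beta
```

The re-centering line anchors the scale each step without changing the fit, since the likelihood depends only on the differences \(\theta_i - \beta_j\).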

3.2.4 The Identifiability Problem

The Identifiability Problem

The Rasch model has a fundamental identifiability issue: if we add a constant \(c\) to all abilities and all difficulties, the likelihood is unchanged:

\[ P(Y_{ij}=1 \mid \theta_i + c, \beta_j + c) = \sigma((\theta_i + c) - (\beta_j + c)) = \sigma(\theta_i - \beta_j) \]

The parameters are only identified up to an additive constant. This means infinitely many parameter values produce the same likelihood.

Common Solutions:

  1. Sum-to-zero constraint: Set \(\sum_i \theta_i = 0\) or \(\sum_j \beta_j = 0\)
  2. Fixed anchor: Set one parameter (e.g., \(\beta_1 = 0\)) as reference
  3. Prior constraint: Use Bayesian priors centered at zero

For AI benchmarks, we typically use sum-to-zero: a model with \(\theta = 0\) has β€œaverage” ability relative to the calibration sample.

Without addressing identifiability, gradient descent may drift indefinitely. The re-centering step in our implementation ensures parameters remain anchored.

3.2.5 L-BFGS Optimization

While gradient descent is intuitive, quasi-Newton methods like L-BFGS converge much faster by approximating second-order information:
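A sketch using PyTorch's `torch.optim.LBFGS` (hyperparameters are illustrative; the closure returns the negative log-likelihood because LBFGS minimizes):

```python
import torch
import torch.nn.functional as F
from torch.optim import LBFGS

def fit_rasch_lbfgs(Y, max_iter=100):
    """Joint MLE for the Rasch model via the L-BFGS quasi-Newton method."""
    N, M = Y.shape
    theta = torch.zeros(N, requires_grad=True)
    beta = torch.zeros(M, requires_grad=True)
    opt = LBFGS([theta, beta], lr=1.0, max_iter=max_iter,
                line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        eta = theta[:, None] - beta[None, :]
        nll = -(Y * eta - F.softplus(eta)).sum()  # negative of Eq. 3.3
        nll.backward()
        return nll

    opt.step(closure)
    with torch.no_grad():            # anchor the scale: sum-to-zero difficulties
        c = beta.mean()
        beta -= c
        theta -= c                   # same shift keeps theta - beta unchanged
    return theta.detach(), beta.detach()
```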

3.3 Joint, Conditional, and Marginal MLE

The MLE approach we have discussed so far is called joint maximum likelihood estimation (JMLE). It treats both person parameters \(\theta\) and item parameters \(\beta\) as fixed unknowns to be estimated. However, JMLE has theoretical limitations that motivate alternative approaches.

3.3.1 Joint MLE (JMLE)

JMLE simultaneously estimates all parameters by maximizing Equation 3.3. While intuitive, JMLE suffers from the incidental parameter problem: as the number of items \(M\) remains fixed and the number of persons \(N\) grows, the item parameter estimates \(\hat{\beta}\) are inconsistent; they do not converge to the true values.

This happens because each person parameter \(\theta_i\) is estimated from only \(M\) observations (their responses to \(M\) items), and these "incidental" person parameters introduce bias into the item estimates.

For AI benchmarks with many items (typically \(M > 100\)), this bias is small in practice. But for smaller tests, JMLE can be problematic.

3.3.2 Conditional MLE (CMLE)

Georg Rasch discovered an elegant solution to the incidental parameter problem. For the Rasch model specifically, the sum score \(S_i = \sum_j Y_{ij}\) is a sufficient statistic for \(\theta_i\). This means all information about \(\theta_i\) in the data \(Y_i\) is captured by \(S_i\).

By conditioning on the sufficient statistics, we can eliminate the person parameters entirely:

\[ P(Y_i \mid S_i, \beta) = \frac{\exp(-\sum_j Y_{ij}\beta_j)}{\gamma_{S_i}(\beta)} \tag{3.6}\]

where \(\gamma_r(\beta) = \sum_{A: |A|=r} \exp(-\sum_{j \in A} \beta_j)\) is the elementary symmetric function of order \(r\), summing over all subsets \(A\) of items of size \(r\).

The conditional likelihood depends only on \(\beta\), so we can estimate item parameters without any person parameters. This produces consistent estimates of \(\beta\) regardless of how \(N\) grows.
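The \(\gamma_r\) functions look expensive (a sum over \(\binom{M}{r}\) subsets), but they satisfy a simple recurrence: incorporating item \(j\) updates \(\gamma_r \leftarrow \gamma_r + e^{-\beta_j}\gamma_{r-1}\). A sketch of the computation (function name is illustrative):

```python
import torch

def elementary_symmetric(beta):
    """gamma_r(beta) for r = 0..M, via the one-item-at-a-time recurrence."""
    eps = torch.exp(-beta)                    # epsilon_j = exp(-beta_j)
    gamma = torch.zeros(len(beta) + 1, dtype=beta.dtype)
    gamma[0] = 1.0
    for e in eps:
        # Subsets of size r either exclude the new item (gamma[r])
        # or include it (e * gamma[r-1])
        gamma = torch.cat([gamma[:1], gamma[1:] + e * gamma[:-1]])
    return gamma
```

For \(\beta = 0\) the \(\gamma_r\) reduce to binomial coefficients, which makes a handy correctness check.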

Rasch’s Insight

The sufficiency of sum scores is unique to the Rasch model. For the 2PL or 3PL models, sum scores are not sufficient, and CMLE cannot be applied. This mathematical property is one reason the Rasch model holds special status in measurement theory.

3.3.3 Marginal MLE (MMLE)

An alternative approach is to treat person parameters as random variables from a population distribution:

\[ \theta_i \sim \mathcal{N}(\mu_\theta, \sigma^2_\theta) \]

The marginal likelihood integrates out the person parameters:

\[ L(\beta) = \prod_{i=1}^N \int P(Y_i \mid \theta, \beta) \, p(\theta) \, d\theta \tag{3.7}\]

This approach:

  • Treats item parameters as fixed and person parameters as random
  • Produces consistent estimates of \(\beta\) as \(N \to \infty\)
  • Naturally extends to more complex IRT models (2PL, 3PL)
  • Forms the basis for the EM algorithm (next section)
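The integral in Equation 3.7 has no closed form, but with a standard normal prior it is well approximated by Gauss-Hermite quadrature. A sketch (the number of quadrature points is an illustrative choice):

```python
import numpy as np
import torch
import torch.nn.functional as F

def marginal_loglik(Y, beta, n_quad=21):
    """Marginal Rasch log-likelihood (Eq. 3.7) under a N(0,1) ability prior."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = torch.tensor(np.sqrt(2.0) * x).float()      # quadrature abilities
    logw = torch.tensor(w / np.sqrt(np.pi)).float().log()
    eta = nodes[:, None] - beta[None, :]                # K x M logits
    # logp[i, k] = log P(Y_i | theta = nodes[k], beta)
    logp = Y @ F.logsigmoid(eta).T + (1 - Y) @ F.logsigmoid(-eta).T
    return torch.logsumexp(logp + logw, dim=1).sum()    # integrate out theta
```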
Comparison of MLE Approaches

| Method | Person Parameters  | Item Parameters    | Consistency              | Applicability |
|--------|--------------------|--------------------|--------------------------|---------------|
| JMLE   | Estimated directly | Estimated directly | Inconsistent for fixed M | Any IRT model |
| CMLE   | Conditioned out    | Estimated directly | Consistent               | Rasch only    |
| MMLE   | Integrated out     | Estimated directly | Consistent               | Any IRT model |

For AI benchmarks with many questions (\(M > 100\)), JMLE works well in practice. For smaller tests or when statistical properties are important, CMLE or MMLE is preferred.

3.4 The EM Algorithm

The Expectation-Maximization (EM) algorithm is a general method for maximum likelihood estimation with latent variables. In IRT, the latent variables are the person abilities \(\theta\).

3.4.1 The EM Framework

The EM algorithm iterates between two steps:

E-step (Expectation): Compute the expected value of the complete-data log-likelihood, given the observed data and current parameter estimates:

\[ Q(\beta \mid \beta^{(t)}) = \mathbb{E}_{\theta \mid Y, \beta^{(t)}} \left[ \log P(Y, \theta \mid \beta) \right] \]

M-step (Maximization): Find the parameter values that maximize the expected log-likelihood:

\[ \beta^{(t+1)} = \arg\max_\beta Q(\beta \mid \beta^{(t)}) \]

The EM algorithm guarantees that the marginal likelihood increases (or stays the same) at each iteration, converging to a local maximum.

3.4.2 EM for the Rasch Model

For the Rasch model with a standard normal prior on abilities, the EM algorithm takes a specific form:

E-step: For each person \(i\), compute the posterior distribution of \(\theta_i\) given their responses \(Y_i\) and current item parameters \(\beta^{(t)}\):

\[ p(\theta_i \mid Y_i, \beta^{(t)}) \propto p(Y_i \mid \theta_i, \beta^{(t)}) \cdot p(\theta_i) \]

This posterior is not available in closed form, so we use numerical integration (Gauss-Hermite quadrature).

M-step: Update each item parameter by solving:

\[ \sum_{i=1}^N \mathbb{E}_{\theta_i}[\sigma(\theta_i - \beta_j)] = \sum_{i=1}^N Y_{ij} \]

The left side is the expected number of correct responses to item \(j\); the right side is the observed number. We equate these.
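Putting the two steps together for the Rasch model, with the E-step posterior approximated on a Gauss-Hermite grid and the M-step solved by a few Newton iterations (a sketch; grid size and iteration counts are illustrative):

```python
import numpy as np
import torch
import torch.nn.functional as F

def em_rasch(Y, n_quad=21, n_iters=50):
    """EM estimation of Rasch item difficulties with a N(0,1) ability prior."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = torch.tensor(np.sqrt(2.0) * x).float()
    log_prior = torch.tensor(w / np.sqrt(np.pi)).float().log()
    Y = Y.float()
    beta = torch.zeros(Y.shape[1])
    for _ in range(n_iters):
        # E-step: posterior weight of each quadrature node for each person
        eta = nodes[:, None] - beta[None, :]                        # K x M
        logp = Y @ F.logsigmoid(eta).T + (1 - Y) @ F.logsigmoid(-eta).T
        post = torch.softmax(logp + log_prior, dim=1)               # N x K
        # M-step: match expected to observed correct counts per item
        nk = post.sum(dim=0)      # expected number of persons at each node
        rj = Y.sum(dim=0)         # observed correct responses per item
        for _ in range(5):        # Newton iterations on each beta_j
            Pkj = torch.sigmoid(nodes[:, None] - beta[None, :])
            f = (nk[:, None] * Pkj).sum(dim=0) - rj
            fp = -(nk[:, None] * Pkj * (1 - Pkj)).sum(dim=0)
            beta = beta - f / fp
    return beta
```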

3.4.3 Multidimensional Extension: The Logistic Factor Model

The methods above focused on the Rasch model, which assumes a single latent dimension. For AI benchmarks that measure multiple capabilities, we extend to the Logistic Factor Model:

\[ P(Y_{ij} = 1 \mid U_i, V_j, Z_j) = \sigma(U_i^\top V_j + Z_j) \]

where:

  • \(U_i \in \mathbb{R}^K\) is the \(K\)-dimensional latent ability vector for model \(i\)
  • \(V_j \in \mathbb{R}^K\) is the factor loading vector for item \(j\)
  • \(Z_j \in \mathbb{R}\) is the item intercept (capturing overall difficulty)

When \(K=1\) and \(V_j = 1\) for all \(j\), this reduces to the Rasch model.

3.4.3.1 Implementation

import torch
import torch.nn as nn
from torch.optim import LBFGS
import torch.nn.functional as F

class LogisticFM(nn.Module):
    """Logistic Factor Model for binary response data."""
    def __init__(self, N, M, K):
        super().__init__()
        self.U = nn.Parameter(torch.randn(N, K))  # Model abilities
        self.V = nn.Parameter(torch.randn(M, K))  # Item loadings
        self.Z = nn.Parameter(torch.randn(M, 1))  # Item intercepts

    def forward(self):
        return torch.sigmoid(self.U @ self.V.T + self.Z.T)

Interpretation

  • \(U_i\): latent ability vector of model \(i\) (position in \(K\)-dimensional capability space)
  • \(V_j\): latent property vector of item \(j\) (which capabilities the item measures)
  • \(Z_j\): overall item difficulty (independent of capability dimensions)
  • \(\sigma\): sigmoid function ensuring probabilities in \([0,1]\)

3.4.3.2 Training with LBFGS

We train the model by minimizing binary cross-entropy loss:

# Training setup
N, M = Y.shape
K = 2  # Number of latent dimensions
model = LogisticFM(N, M, K)

opt = LBFGS(
    model.parameters(),
    lr=0.1,
    max_iter=20,
    history_size=10,
    line_search_fn="strong_wolfe"
)

def closure():
    opt.zero_grad()
    probs = model()
    loss = F.binary_cross_entropy(probs[train_mask.bool()], Y[train_mask.bool()].float())
    loss.backward()
    return loss

# Training loop
for iteration in range(20):
    loss = opt.step(closure)

The model learns to decompose the response matrix into latent factors that capture the underlying structure of model capabilities and item characteristics.

3.5 Bayesian Inference

Bayesian inference provides an alternative to maximum likelihood that naturally incorporates prior information and quantifies uncertainty. Instead of finding a single point estimate, we characterize the entire posterior distribution over parameters.

3.5.1 Prior Specification

The first step in Bayesian inference is specifying prior distributions that encode our beliefs before seeing the data:

Standard Priors for IRT

For abilities (persons/models): \[\theta_i \sim \mathcal{N}(0, \sigma^2_\theta), \quad \sigma_\theta = 1 \text{ (standard choice)}\]

For difficulties (items/questions): \[\beta_j \sim \mathcal{N}(0, \sigma^2_\beta), \quad \sigma_\beta = 1\text{-}2 \text{ (depending on expected range)}\]

For discrimination (2PL model): \[a_j \sim \text{LogNormal}(0, 0.5) \text{ or } a_j \sim \text{Gamma}(2, 0.5)\]

These priors are weakly informative: they regularize estimates without dominating the data. They encode the belief that most abilities and difficulties are within a few units of zero, which is appropriate when the scale is defined by convention.
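A quick prior predictive check makes the "weakly informative" claim concrete: sampling \(\theta\) and \(\beta\) from these priors and pushing them through the model shows the range of success probabilities the priors consider plausible (a sketch; the sample size is illustrative):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(10_000)          # theta ~ N(0, 1)
beta = torch.randn(10_000)           # beta  ~ N(0, 1)
p = torch.sigmoid(theta - beta)      # implied P(correct) under the priors

print(f"median P(correct): {p.median().item():.2f}")
print(f"central 90% interval: [{p.quantile(0.05).item():.2f}, "
      f"{p.quantile(0.95).item():.2f}]")
```

The implied probabilities span most of \([0,1]\) without piling up at the extremes, which is what "regularize without dominating the data" means in practice.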

3.5.2 Posterior Computation

Bayes’ theorem gives us the posterior distribution:

\[ p(\theta, \beta \mid Y) \propto p(Y \mid \theta, \beta) \cdot p(\theta) \cdot p(\beta) \tag{3.8}\]

The posterior combines the likelihood (data) with the priors (beliefs). Unfortunately, this posterior is not available in closed form, so we need computational methods.

3.5.3 MAP Estimation

The simplest Bayesian approach is maximum a posteriori (MAP) estimation, which finds the mode of the posterior:

\[ \hat{\theta}_{\text{MAP}}, \hat{\beta}_{\text{MAP}} = \arg\max_{\theta, \beta} \left[ \ell(\theta, \beta \mid Y) + \log p(\theta) + \log p(\beta) \right] \tag{3.9}\]

With Gaussian priors, this is equivalent to L2-regularized MLE:

\[ \hat{\theta}_{\text{MAP}}, \hat{\beta}_{\text{MAP}} = \arg\max_{\theta, \beta} \left[ \ell(\theta, \beta) - \frac{1}{2\sigma^2_\theta}\sum_i \theta_i^2 - \frac{1}{2\sigma^2_\beta}\sum_j \beta_j^2 \right] \]
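A sketch of MAP fitting via this penalized objective, reusing L-BFGS; the prior standard deviations are the tuning knobs (function name and defaults are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.optim import LBFGS

def fit_rasch_map(Y, sigma_theta=1.0, sigma_beta=1.0):
    """MAP estimation for the Rasch model via the penalized objective (Eq. 3.9)."""
    N, M = Y.shape
    theta = torch.zeros(N, requires_grad=True)
    beta = torch.zeros(M, requires_grad=True)
    opt = LBFGS([theta, beta], max_iter=100, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        eta = theta[:, None] - beta[None, :]
        nll = -(Y * eta - F.softplus(eta)).sum()
        # Gaussian priors act as L2 penalties (and also identify the scale)
        penalty = (theta ** 2).sum() / (2 * sigma_theta ** 2) \
                + (beta ** 2).sum() / (2 * sigma_beta ** 2)
        loss = nll + penalty
        loss.backward()
        return loss

    opt.step(closure)
    return theta.detach(), beta.detach()
```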

Bayesian Shrinkage

Notice that MAP estimates have smaller variance than MLE estimates. This is shrinkage toward the prior mean (zero).

For extreme scores, where a model answers all questions correctly or incorrectly, MLE gives infinite or very large estimates. MAP regularizes these to finite, sensible values. This is crucial for AI benchmarks where some models may achieve near-perfect scores on easy subsets.

The amount of shrinkage is controlled by the prior variance: smaller \(\sigma^2\) means stronger shrinkage toward zero.

3.5.4 MCMC Sampling

To characterize the full posterior distribution (not just its mode), we use Markov Chain Monte Carlo (MCMC) sampling. The Metropolis-Hastings algorithm is a simple but effective approach:
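A minimal random-walk Metropolis-Hastings sketch (in practice one would update parameters blockwise or use a gradient-based sampler such as NUTS; the step size and chain length here are illustrative):

```python
import torch
import torch.nn.functional as F

def mh_rasch(Y, n_samples=2000, step=0.05, sigma=1.0):
    """Random-walk Metropolis-Hastings over the joint Rasch posterior (Eq. 3.8)."""
    N, M = Y.shape
    theta, beta = torch.zeros(N), torch.zeros(M)

    def log_post(th, be):
        eta = th[:, None] - be[None, :]
        ll = (Y * eta - F.softplus(eta)).sum()
        return ll - (th ** 2).sum() / (2 * sigma ** 2) \
                  - (be ** 2).sum() / (2 * sigma ** 2)

    lp = log_post(theta, beta)
    theta_draws = []
    for _ in range(n_samples):
        th_prop = theta + step * torch.randn(N)   # symmetric Gaussian proposal
        be_prop = beta + step * torch.randn(M)
        lp_prop = log_post(th_prop, be_prop)
        # Accept with probability min(1, posterior ratio)
        if torch.log(torch.rand(1)).item() < (lp_prop - lp).item():
            theta, beta, lp = th_prop, be_prop, lp_prop
        theta_draws.append(theta.clone())
    return torch.stack(theta_draws)               # n_samples x N ability draws
```

Posterior summaries come from the draws after discarding burn-in, e.g. `draws[500:].mean(dim=0)` and `draws[500:].std(dim=0)`.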

The posterior standard deviations quantify our uncertainty about each parameter. Parameters with more information (e.g., items answered by many models, models that answered many questions) have smaller posterior uncertainty.

3.6 Regularization and Model Selection

3.6.1 L2 Regularization as Bayesian Prior

We have seen that MAP estimation with Gaussian priors is equivalent to L2 regularization. The regularization strength \(\lambda\) relates to the prior variance as \(\lambda = 1/\sigma^2\).

The regularized objective is:

\[ \ell_{\text{reg}}(\theta, \beta) = \ell(\theta, \beta) - \frac{\lambda_\theta}{2}\|\theta\|^2 - \frac{\lambda_\beta}{2}\|\beta\|^2 \]

Regularization prevents overfitting, especially when:

  • Some persons have few responses (sparse data)
  • Some items have extreme difficulty (near 0% or 100% pass rates)
  • The model is complex (many parameters relative to data)

3.6.2 Cross-Validation for Hyperparameter Selection

How do we choose the regularization strength? Cross-validation provides a principled answer: we hold out some data, train on the rest, and evaluate prediction performance.
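One way to sketch this for the Rasch model: hold out a random set of entries, fit with each candidate \(\lambda\), and score by held-out log-likelihood (the grid, holdout fraction, and optimizer settings are all illustrative choices):

```python
import torch
import torch.nn.functional as F

def map_fit(Y, mask, lam, n_iters=300, lr=0.1):
    """L2-penalized Rasch fit using only the entries where mask == 1."""
    N, M = Y.shape
    theta, beta = torch.zeros(N), torch.zeros(M)
    for _ in range(n_iters):
        P = torch.sigmoid(theta[:, None] - beta[None, :])
        resid = (Y - P) * mask
        theta = theta + lr * (resid.sum(dim=1) - lam * theta)
        beta = beta + lr * (-resid.sum(dim=0) - lam * beta)
    return theta, beta

def cv_select_lambda(Y, lambdas, holdout=0.2):
    """Choose the regularization strength by held-out log-likelihood."""
    test = torch.bernoulli(torch.full_like(Y, holdout))
    train = 1 - test
    scores = {}
    for lam in lambdas:
        theta, beta = map_fit(Y, train, lam)
        eta = theta[:, None] - beta[None, :]
        ll = ((Y * eta - F.softplus(eta)) * test).sum() / test.sum()
        scores[lam] = ll.item()      # average held-out log-likelihood
    return max(scores, key=scores.get), scores
```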

3.7 Generalization Experiments

The estimation methods developed in this chapter produce learned factor models with latent parameters \((U, V, Z)\). But how well do these models generalize? To evaluate the robustness and transferability of learned factor models, we train and test them under various masking schemes, each representing a different notion of generalization. These masks determine which parts of the response matrix \(Y\) are visible during training and which are held out for evaluation.

3.7.1 Masking Schemes for Evaluation

| Masking Type              | Train Set                        | Test Set                 | Purpose                               |
|---------------------------|----------------------------------|--------------------------|---------------------------------------|
| Entry-wise random         | 80% random entries               | 20% random entries       | Interpolation under missing-at-random |
| Row holdout (random)      | 80% of models, all items         | 20% of models, all items | Generalization to unseen models       |
| Row holdout (shifted)     | Slice of models (small → large)  | Disjoint slice           | Covariate-shift generalization        |
| Column holdout (random)   | All models, 80% of items         | All models, 20% of items | Generalization to unseen items        |
| Column holdout (shifted)  | Subset of benchmarks             | Held-out benchmarks      | Cross-domain transfer                 |
| Row-column block (L-mask) | \(R_{tr} \times C_{tr}\)         | \(R_{te} \times C_{te}\) | Compositional generalization          |
| Temporal split            | Models before cutoff             | Models after cutoff      | Temporal generalization               |
These settings parallel psychometric validation tests where new examinees, items, or contexts probe the invariance of latent constructs.

3.7.2 Implementation of Masking Functions

import torch

def random_mask(data_idtor, pct=0.8):
    """Entry-wise random masking: each observed entry enters the train set with probability pct."""
    train_idtor = torch.bernoulli(data_idtor.float() * pct).int()
    test_idtor = data_idtor.int() - train_idtor
    return train_idtor, test_idtor

def model_mask(data_idtor, pct_models=0.8, exposure_rate=0.3):
    """Row holdout: hold out unseen models."""
    train_row_mask = torch.bernoulli(torch.ones(data_idtor.shape[0]) * pct_models).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    train_idtor[train_row_mask, :] = data_idtor[train_row_mask, :]
    train_idtor[~train_row_mask, :], _ = random_mask(data_idtor[~train_row_mask, :], pct=exposure_rate)
    test_idtor = data_idtor - train_idtor
    return train_idtor, test_idtor

def item_mask(data_idtor, pct_items=0.8, exposure_rate=0.3):
    """Column holdout: hold out unseen items."""
    train_col_mask = torch.bernoulli(torch.ones(data_idtor.shape[1]) * pct_items).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    train_idtor[:, train_col_mask] = data_idtor[:, train_col_mask]
    train_idtor[:, ~train_col_mask], _ = random_mask(data_idtor[:, ~train_col_mask], pct=exposure_rate)
    test_idtor = data_idtor - train_idtor
    return train_idtor, test_idtor

def L_mask(data_idtor, pct_models=0.8, pct_items=0.8):
    """Row-column block (L-mask): compositional generalization."""
    train_row_mask = torch.bernoulli(torch.ones(data_idtor.shape[0]) * pct_models).bool()
    train_col_mask = torch.bernoulli(torch.ones(data_idtor.shape[1]) * pct_items).bool()
    train_idtor = torch.zeros_like(data_idtor).int()
    # Chained indexing (x[rows][:, cols] = ...) writes to a copy and is lost;
    # index the training block with a boolean outer product instead.
    block = train_row_mask[:, None] & train_col_mask[None, :]
    train_idtor[block] = data_idtor[block].int()
    test_idtor = data_idtor.int() - train_idtor
    test_idtor[train_row_mask, :] = 0
    test_idtor[:, train_col_mask] = 0
    return train_idtor, test_idtor

3.7.3 Two-Stage Training for Holdout Generalization

To avoid data contamination in row and column holdout experiments, we use a two-stage training procedure:

3.7.3.1 Row Holdout: Estimating Parameters for Unseen Models

When testing generalization to unseen models, we:

  1. Stage 1: Train on known models to learn item parameters \((V, Z)\)
  2. Stage 2: Freeze \((V, Z)\) and estimate ability parameters \(U\) for held-out models using their limited exposed responses

This ensures item parameters are learned without information from test models.

# Stage 1: Train on known models
test_row = test_idtor.max(dim=1).values.bool()  # Identify held-out models
model_stage1 = train_model(Y[~test_row, :], mask=train_idtor[~test_row, :])

# Freeze V, Z from Stage 1
V_frozen = model_stage1.V.detach()
Z_frozen = model_stage1.Z.detach()

# Stage 2: Estimate U for unseen models with frozen item parameters
model_stage2 = train_model(Y[test_row, :], mask=train_idtor[test_row, :],
                           V_fixed=V_frozen, Z_fixed=Z_frozen)

3.7.3.2 Column Holdout: Estimating Parameters for Unseen Items

When testing generalization to unseen items, we:

  1. Stage 1: Train on known items to learn model parameters \(U\)
  2. Stage 2: Freeze \(U\) and estimate item parameters \((V, Z)\) for held-out items

# Stage 1: Train on known items
test_col = test_idtor.max(dim=0).values.bool()  # Identify held-out items
model_stage1 = train_model(Y[:, ~test_col], mask=train_idtor[:, ~test_col])

# Freeze U from Stage 1
U_frozen = model_stage1.U.detach()

# Stage 2: Estimate V, Z for unseen items with frozen model parameters
model_stage2 = train_model(Y[:, test_col], mask=train_idtor[:, test_col],
                           U_fixed=U_frozen)

Why Two-Stage Training?

The two-stage procedure prevents information leakage:

  • Row holdout: Item parameters learned from training models should not contain information about test models
  • Column holdout: Model parameters learned from training items should not contain information about test items

This mirrors the real-world scenario where we want to evaluate new models on pre-calibrated items, or calibrate new items using established models.

3.7.4 Evaluation Across Masking Schemes

For each masking scheme, we compute AUC on the held-out entries:

from torchmetrics import AUROC

masking_schemes = {
    "entry_random": random_mask,
    "row_holdout": model_mask,
    "col_holdout": item_mask,
    "L_mask": L_mask,
}

results = {}
auroc = AUROC(task="binary")

for name, mask_fn in masking_schemes.items():
    train_mask, test_mask = mask_fn(data_idtor)

    # Train model (with two-stage for row/col holdout)
    model = train_with_appropriate_stages(Y, train_mask, test_mask, name)

    # Evaluate on held-out entries
    P_hat = model().detach()
    auc = auroc(P_hat[test_mask.bool()], Y[test_mask.bool()])
    results[name] = auc.item()
    print(f"{name}: AUC = {auc:.3f}")

The factor model typically achieves AUC of 92-97% on random masking across benchmarks, demonstrating strong predictive power. Performance on row and column holdout tests the model's ability to generalize to new models and new items, respectively.

Application: Item Response Scaling Laws

The separability of model ability from item difficulty, the core property of IRT, has a powerful application to scaling laws. Truong et al. (2025) show that by embedding IRT within the scaling law framework, one can factorize scaling law estimation from \(O(M \times N)\) to \(O(M + N)\), where \(M\) is the number of models (or checkpoints) and \(N\) is the number of questions.

Their key finding is that the IRT ability parameter \(\theta\) scales linearly with the logarithm of pre-training compute: \(\theta \approx a \cdot \log(\text{FLOP}) + b\). Combined with calibrated item parameters, this yields per-question scaling predictions: \(\hat{R}_{ij}(x) = \sigma(d_j(\theta_i(x) - z_j))\). Because item parameters transfer across benchmarks that share the same measurement objective, ability estimated on one benchmark can predict performance on another, directly validating the cross-benchmark transfer tested in the masking experiments above.

In a study of 6,612 model checkpoints and 37,682 questions, this approach achieves comparable or superior decision accuracy to traditional scaling laws using only 50 questions per benchmark, a 99.9% reduction in queries. The approach uses Beta-IRT, which models empirical probability responses (token probabilities, pass rates) rather than binary correctness, capturing richer scaling signals.

3.8 Discussion Questions

  1. Identifiability and Interpretation: In AI evaluation, should we anchor the ability scale by fixing one model (e.g., GPT-4 = 0) or by centering all models? What are the implications for interpreting ability scores over time as new models are released?

  2. Bayesian vs Frequentist: When is Bayesian inference preferred over MLE for AI benchmark analysis? Consider scenarios with limited data, extreme scores, or the need for uncertainty quantification.

  3. From Learning to Design: This chapter focuses on estimating parameters from a fixed dataset. The next chapter considers how to design the evaluation: choosing which items to include, how to allocate testing effort, and how to ensure robustness. What aspects of the estimation methods learned here would inform good benchmark design?

  4. Transfer of Item Parameters: If we calibrate item difficulties on one set of models (e.g., 2023 models), can we use these parameters to evaluate 2024 models? What assumptions does this require, and when might they fail?

  5. Multidimensional Extensions: The chapter focused on unidimensional models (single ability). How would the learning procedures change for multidimensional factor models? What additional challenges arise?

3.9 Bibliographic Notes

3.9.1 Maximum Likelihood Estimation

The theory of maximum likelihood for IRT models is developed comprehensively in Lord and Novick (1968) and Birnbaum (1968). The joint MLE approach and its limitations (incidental parameter problem) are discussed in Neyman and Scott (1948). For modern computational approaches, see Baker and Kim (2004).

3.9.2 Conditional and Marginal MLE

Conditional MLE for the Rasch model was developed by Andersen (1970), who proved consistency and derived the elementary symmetric functions needed for computation. Marginal MLE was introduced by Bock and Lieberman (1970) and popularized by Bock and Aitkin (1981) using the EM algorithm.

3.9.3 EM Algorithm

The general EM algorithm was formalized by Dempster, Laird, and Rubin (1977). Its application to IRT is detailed in Bock and Aitkin (1981). For modern treatments, see McLachlan and Krishnan (2007).

3.9.4 Bayesian IRT

Bayesian approaches to IRT were pioneered by Swaminathan and Gifford (1982) and advanced using Gibbs sampling by Albert (1992). Modern references include Fox (2010) and the documentation for Stan (Carpenter et al. 2017).

3.9.5 Beta-IRT and Continuous Responses

Standard IRT models assume binary (correct/incorrect) responses, but language models provide richer signals: token probabilities in pre-training and empirical pass rates in repeated sampling. Truong et al. (2025) introduce Beta-IRT, which replaces the Bernoulli loss with a Beta loss parameterized by the IRT logistic mean \(\sigma(d_j(\theta_i - z_j))\) and a precision parameter \(\phi\). This achieves reliable calibration with as few as 2 test takers (30–60\(\times\) fewer than Binary-IRT), making it practical for scaling law studies where the number of model checkpoints far exceeds what binary IRT requires. The connection between per-problem exponential scaling and aggregate power-law scaling is formalized by Schaeffer et al. (2025), who show the power-law exponent is controlled by the left-tail shape of the success probability distribution, which is precisely the item difficulty distribution in IRT terms.

3.9.6 Optimization Methods

L-BFGS is described in Nocedal and Wright (2006). For deep learning optimizers applied to psychometric models, see Kingma and Ba (2015) for Adam.

3.10 Exercises

3.10.1 Theoretical Exercises

Exercise 3.1 (\(\star\)): Derive the gradient of the Rasch model log-likelihood with respect to \(\theta_i\). Show that it equals the sum of residuals: \(\frac{\partial \ell}{\partial \theta_i} = \sum_j (Y_{ij} - P_{ij})\).

Exercise 3.2 (\(\star\star\)): Prove that the Hessian matrix of the Rasch log-likelihood is negative semi-definite, ensuring the log-likelihood is concave.

Exercise 3.3 (\(\star\star\)): Show that for the Rasch model, the Fisher information for item \(j\) at ability \(\theta\) is \(I_j(\theta) = P_j(1-P_j)\), and that this is maximized when \(\theta = \beta_j\).

Exercise 3.4 (\(\star\star\star\)): Derive the EM algorithm for the 2PL model. What additional complications arise compared to the Rasch model due to the discrimination parameters?

Exercise 3.5 (\(\star\star\)): Show that L2 regularization on the parameters is equivalent to MAP estimation with Gaussian priors. What is the relationship between the regularization strength \(\lambda\) and the prior variance \(\sigma^2\)?

3.10.2 Computational Exercises

Exercise 3.6 (\(\star\star\)): Implement conditional MLE for the Rasch model. Use the fact that the conditional likelihood depends only on item parameters and can be computed using elementary symmetric functions.

Exercise 3.7 (\(\star\star\star\)): Implement a Gibbs sampler for the Rasch model that alternates between:

  • Sampling \(\theta_i \mid Y, \beta\) for each person (using slice sampling)
  • Sampling \(\beta_j \mid Y, \theta\) for each item

Compare the posterior estimates to those from Metropolis-Hastings.

Exercise 3.8 (\(\star\star\star\)): Implement marginal MLE using numerical quadrature to integrate out the ability parameters. Compare convergence and parameter recovery to the EM approach from this chapter.

Exercise 3.9 (\(\star\star\)): Implement Beta-IRT estimation. Given a response matrix \(P_{ij} \in [0, 1]\) of empirical probabilities (e.g., token probabilities or pass rates), maximize the Beta log-likelihood \(\sum_{i,j} \log p(P_{ij}; \mu_{ij}, \phi)\) where \(\mu_{ij} = \sigma(d_j(\theta_i - z_j))\) is the IRT-predicted mean and \(\phi > 0\) is a precision parameter. (a) Derive the gradient with respect to \(\theta_i\) and \(z_j\). (b) Simulate data with \(M = 50\) models and \(N = 200\) items, where \(P_{ij} \sim \text{Beta}(\mu_{ij} \phi, (1 - \mu_{ij})\phi)\) with \(\phi = 10\). (c) Compare parameter recovery of Beta-IRT vs. binary IRT (where responses are thresholded at 0.5) as a function of \(M\). Verify the finding of Truong et al. (2025) that Beta-IRT achieves reliable calibration with far fewer models than Binary-IRT.

3.10.3 Discussion Exercises

Exercise 3.10: Compare the convergence of gradient descent, L-BFGS, and Adam on a Rasch model estimation problem. Which converges fastest? Which is most robust to different initializations?

Exercise 3.11: Investigate the sensitivity of MLE and Bayesian estimation to model misspecification. Generate data from a 2PL model but fit a Rasch model. How do the estimated abilities compare? When does the misspecification matter most?

Exercise 3.12: Implement a cross-validation procedure for selecting between the Rasch, 2PL, and factor models. Apply it to benchmark data with different numbers of items and models. When does the additional complexity of the 2PL or factor model improve out-of-sample prediction?

References

Lord, Frederic M., and Melvin R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Schaeffer, Rylan, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. 2025. "How Do Large Language Monkeys Get Their Power (Laws)?" arXiv preprint arXiv:2502.17578.
Truong, Sang, Yuheng Tu, Rylan Schaeffer, and Sanmi Koyejo. 2025. "Item Response Scaling Laws." arXiv preprint.