9  Red-Teaming and Adversarial Evaluation

Intended Learning Outcomes

By the end of this chapter, you will be able to:

  1. Frame red-teaming as an adversarial measurement problem, connecting structured attack protocols to item response theory and content validity.
  2. Analyze when attack success rate (ASR) comparisons are meaningful, distinguishing conceptual coherence (comparable estimands) from measurement validity (accurate operationalization).
  3. Distinguish between structured and unstructured red-teaming and analyze the coverage-depth tradeoff in adversarial evaluation.
  4. Formalize adversarial robustness as a latent trait and apply adaptive adversarial testing using the CAT framework from Section 4.2.2.
  5. Evaluate when synthetic data preserves construct validity for AI evaluation and identify the conditions under which calibration transfers from synthetic to real items.
  6. Apply prediction-powered inference to debias synthetic evaluations by combining large-scale synthetic judgments with small human evaluation sets.
  7. Design an end-to-end adversarial evaluation pipeline that composes red-team item banks, adaptive testing, synthetic augmentation, and statistical correction.

This chapter can be covered in 3 lectures (75–90 minutes each):

Lecture 1: Red-Teaming as Measurement

  • Red-teaming as measurement: coverage, depth, and item banks (20 min)
  • When can ASRs be compared? Conceptual coherence and aggregation (20 min)
  • Judge validity: differential misclassification and DIF (20 min)
  • Hands-on: simulating aggregation bias and judge error (15 min)

Lecture 2: Adversarial Robustness Evaluation

  • Adversarial robustness as a latent trait (20 min)
  • Adaptive adversarial testing and adversarial IRT (20 min)
  • Hands-on: simulating two-dimensional adversarial IRT (10 min)

Lecture 3: Synthetic Data and Evaluation at Scale

  • Synthetic data for evaluation: validity threats and calibration (25 min)
  • Hybrid evaluation and prediction-powered inference (25 min)
  • The adversarial evaluation pipeline (15 min)
  • Discussion and exercises (10 min)

Notation

This chapter introduces adversarial evaluation notation: \(\theta_j^{(\text{adv})}/\theta_j^{(\text{std})}\) (adversarial/standard ability), \(\alpha_{s,\mathcal{D}}\) (attack success probability), \(J/s\) (operational judge / oracle criterion), \(K\) (repeated samples), and \(\hat{\mu}_{\text{PPI}}\) (prediction-powered inference estimator). See ?sec-notation for the complete notation reference.

9.1 Red-Teaming as Measurement

Red-teaming—the practice of probing AI systems for failures—has become a standard component of AI evaluation. But most red-teaming efforts are conducted ad hoc: a team of human testers tries creative prompts, records failures, and writes a report. From the perspective of measurement science, this is analogous to evaluating a student by asking whatever questions come to mind, with no test specification, no item calibration, and no systematic coverage of the construct domain.

This section reframes red-teaming as a measurement problem. The key insight is that a red-team evaluation is an instrument—a collection of adversarial items—and the same psychometric principles that apply to any evaluation instrument (content validity, item calibration, reliability) apply here.

9.1.1 The Coverage-Depth Tradeoff

Consider two red-teaming strategies. Strategy A deploys 50 red-teamers, each spending one hour testing a wide range of attack categories: prompt injection, jailbreaking, harmful content generation, privacy violations, bias elicitation, and so on. Strategy B deploys 5 expert red-teamers, each spending ten hours on a single attack category, developing sophisticated multi-turn attack chains.

Strategy A has breadth: it covers many regions of the attack surface. Strategy B has depth: it explores individual attack categories more thoroughly. Neither dominates. The tradeoff maps directly onto measurement concepts from Section 6.3.1:

  • Content validity requires that the adversarial items represent the full domain of potential attacks. Strategy A is better for content validity.
  • Item difficulty determines whether the evaluation can distinguish models at different robustness levels. Strategy B produces harder items that discriminate among highly robust models.

Definition: Red-Team Test Specification

A red-team test specification is a document that defines:

  1. Attack taxonomy: The categories of adversarial behavior to be tested (e.g., jailbreaking, prompt injection, harmful content, bias, privacy leakage).
  2. Coverage requirements: The minimum number of items per category, ensuring content validity.
  3. Difficulty targets: The distribution of item difficulties within each category, ensuring discrimination across the robustness spectrum.
  4. Scoring rubric: The criteria for judging whether a model response constitutes a failure.

Without a test specification, red-team results have unknown content validity—we cannot know what fraction of the attack surface was covered or whether the items were difficult enough to challenge robust models.

9.1.2 Adversarial Items as Hard Items in IRT

In IRT terminology, an adversarial item is simply an item with parameters that we can interpret through the standard framework. Consider a 2PL model applied to adversarial items:

\[ P(\text{resist}_j \mid \beta_i^{(\text{atk})}, \alpha_i, \theta_j^{(\text{adv})}) = \frac{1}{1 + \exp\bigl(-\alpha_i(\theta_j^{(\text{adv})} - \beta_i^{(\text{atk})})\bigr)} \tag{9.1}\]

Here \(\theta_j^{(\text{adv})}\) is model \(j\)’s adversarial robustness and \(\beta_i^{(\text{atk})}\) is the attack strength (difficulty) of adversarial item \(i\). A model “responds correctly” by resisting the attack. Under this framing:

  • Easy adversarial items (\(\beta_i^{(\text{atk})} \ll 0\)) are weak attacks that most models resist. These provide little information about robustness differences among frontier models.
  • Hard adversarial items (\(\beta_i^{(\text{atk})} \gg 0\)) are sophisticated attacks that only the most robust models withstand.
  • High-discrimination items (\(\alpha_i \gg 1\)) sharply separate robust from non-robust models. These are the most informative items for ranking.

The connection to content validity (Section 6.3.1) is immediate: a red-team evaluation with only easy items has construct underrepresentation for the high-robustness region, just as a math test with only arithmetic problems underrepresents mathematical ability. Conversely, a red-team evaluation with only expert-crafted attacks may have no items in the easy-to-moderate range, making it unable to distinguish among weaker models.
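Equation 9.1 is simple to compute directly. The following sketch uses illustrative, uncalibrated parameter values:

```python
import math

def p_resist(theta_adv, beta_atk, alpha=1.0):
    """Equation 9.1: 2PL probability that a model with adversarial
    robustness theta_adv resists an attack of strength beta_atk,
    given item discrimination alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta_adv - beta_atk)))

# An average model (theta = 0) against a weak and a strong attack.
weak = p_resist(theta_adv=0.0, beta_atk=-2.0)   # easy item: usually resisted
strong = p_resist(theta_adv=0.0, beta_atk=2.0)  # hard item: usually succeeds
```

With equal discrimination, items at \(\beta = -2\) and \(\beta = +2\) give mirror-image resist probabilities for an average model; raising \(\alpha\) sharpens the transition around \(\beta\).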

9.1.3 Structured vs. Unstructured Red-Teaming

The red-teaming literature distinguishes two paradigms:

Unstructured red-teaming gives human testers broad instructions (“try to make the model do something bad”) and relies on their creativity. This approach can discover unexpected failure modes but produces items with unknown psychometric properties—we do not know the difficulty, discrimination, or category coverage until after the fact. From a measurement perspective, unstructured red-teaming is useful for exploratory item generation but insufficient for standardized measurement.

Structured red-teaming uses a test specification to guide item development. Testers are assigned categories, given difficulty targets, and use standardized scoring rubrics. Ganguli et al. (2022) demonstrated the value of scaling structured red-teaming, showing that the distribution of discovered failures changes systematically as a function of red-team effort. Perez et al. (2022) extended this by using language models themselves to generate adversarial prompts at scale, trading human creativity for automated coverage.

Key Insight: Red-Teaming Meets Construct-Irrelevant Variance

A common failure in red-teaming is conflating refusal to answer benign questions with robustness to adversarial attacks. If a model refuses to discuss any sensitive topic—including legitimate ones—it will score well on a red-team evaluation that measures only refusal rate. But this “robustness” is partly construct-irrelevant variance (Section 6.4.1): the model’s overrefusal is a systematic factor unrelated to the construct of adversarial robustness. A well-designed red-team evaluation must include benign items that probe for false positives (inappropriate refusals), just as a diagnostic test must measure both sensitivity and specificity.

9.1.4 Red-Team Item Banks

The psychometric solution to the coverage-depth tradeoff is to build a calibrated item bank: a large pool of adversarial items with known IRT parameters. Once items are calibrated, different evaluations can draw from the bank according to their needs:

  • A broad screening evaluation draws items uniformly across categories, covering the full attack taxonomy.
  • A targeted deep evaluation draws difficult items from specific categories, probing known weakness areas.
  • An adaptive evaluation selects items based on the model’s responses, as in computerized adaptive testing (Section 4.2.2).

Building such a bank requires an initial investment: items must be authored, administered to a calibration sample of models, and their parameters estimated. But the investment pays off in reusability—the same calibrated item can be used across multiple evaluations, and new models can be scored against the existing bank without re-calibration.
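To make the drawing strategies concrete, here is a minimal sketch over a hypothetical calibrated bank (the categories, bank size, and difficulty values are invented for illustration; the adaptive strategy is deferred to Section 9.3.3):

```python
import random

random.seed(0)

# Hypothetical calibrated bank: each item carries a category label and
# an estimated difficulty (attack strength) from a prior calibration run.
bank = [{"category": c, "beta": random.gauss(0.0, 1.5)}
        for c in ("jailbreak", "injection", "privacy", "bias")
        for _ in range(50)]

def screening_draw(bank, per_category=5):
    """Broad screening: a fixed number of items from every category."""
    draws = []
    for cat in sorted({item["category"] for item in bank}):
        pool = [item for item in bank if item["category"] == cat]
        draws.extend(random.sample(pool, per_category))
    return draws

def targeted_draw(bank, category, n=10):
    """Deep targeted evaluation: the n hardest items of one category."""
    pool = [item for item in bank if item["category"] == category]
    return sorted(pool, key=lambda item: item["beta"], reverse=True)[:n]

screen = screening_draw(bank)              # 4 categories x 5 items each
deep = targeted_draw(bank, "injection")    # 10 hardest injection items
```

Because the difficulties are already calibrated, both draws are scored on the same scale, so a new model can be placed against the bank without re-estimating item parameters.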

Ribeiro et al. (2020) proposed CheckList, one of the earliest structured approaches to building reusable test suites for NLP models. CheckList organizes items by capability (vocabulary, taxonomy, robustness, etc.) and test type (minimum functionality, invariance, directional expectation), providing a template for adversarial item banks. Kiela et al. (2021) took a dynamic approach with Dynabench, where humans author items that fool current models, creating a continuously challenging item pool.

9.2 When Can Attack Success Rates Be Compared?

Attack success rate (ASR) is the dominant metric in red-teaming studies: “Model A has ASR 0.12 and Model B has ASR 0.31, so A is safer.” But Chouldechova et al. (2026) show that many such comparisons are invalid—they rest on apples-to-oranges estimands or low-validity measurements. This section formalizes the conditions under which ASR comparisons are meaningful, drawing on the measurement framework developed throughout this book.

9.2.1 ASR as an Estimand

The term “attack success rate” is misleading. It suggests we are computing the fraction of attacks that succeed. In reality, ASR reflects the fraction of attack goals that are successfully met. To make this precise, we need a probabilistic threat model.

Definition: Probabilistic Threat Model

A probabilistic threat model \(\mathcal{M} = (s, \mathcal{D}, \mathcal{C})\) for red-teaming specifies:

  1. Oracle success criterion \(s(R; P) \to \{0,1\}\): a function determining whether system response \(R\) to prompt \(P\) constitutes undesirable behavior.
  2. Goal distribution \(\mathcal{D}\): a distribution over base harmful prompts \(P \sim \mathcal{D}\).
  3. Conditions \(\mathcal{C}\): constraints governing the attack method (single-turn, multi-turn, transfer, etc.).

The attack success probability (the estimand) is:

\[ \alpha_{s,\mathcal{D}} = P_{P \sim \mathcal{D}}\bigl[s(L(P); P) = 1\bigr] \]

where \(L\) is the target system. The observed ASR is an estimate of this population parameter.

This formalization reveals that comparing ASRs across studies requires comparing the underlying estimands, not just the numerical values. Two conditions must hold for a comparison to be meaningful.

9.2.2 Conceptual Coherence: Aggregation Matters

The first condition is conceptual coherence: the ASRs being compared must estimate the same (or comparable) population parameters. In practice, this often fails because studies use different aggregation rules for computing ASR, which silently changes the estimand.

Consider two common aggregation strategies:

One-shot ASR. For each prompt \(P\), sample a single response from the target system and check if the attack succeeds:

\[ \alpha_{\text{one-shot}} = P_{P \sim \mathcal{D}}\bigl[s(L(P); P) = 1\bigr] \]

Top-1 of \(K\) ASR. For each prompt \(P\), sample \(K\) responses and declare success if any response is judged successful:

\[ \alpha_{\text{Top-1}}(K) = P_{P \sim \mathcal{D}}\Bigl[\max_{k=1,\ldots,K} s(L(P)_k; P) = 1\Bigr] \]

These are different estimands. If the per-prompt success probability is \(p_0\), then the Top-1 success probability is \(1 - (1-p_0)^K\), which grows rapidly with \(K\). For \(p_0 = 0.01\) and \(K = 392\), we get \(1 - 0.99^{392} \approx 0.98\). A study reporting a Top-1 of \(K = 392\) ASR of 0.89 and comparing it to another study’s one-shot ASR of 0.31 is not comparing model safety or attack efficacy—it is comparing two different estimands (Chouldechova et al. 2026).
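The inflation arithmetic in this example can be checked in two lines:

```python
def top1_asr(p0, K):
    """Top-1-of-K success probability for per-prompt one-shot probability p0."""
    return 1.0 - (1.0 - p0) ** K

inflated = top1_asr(p0=0.01, K=392)   # the example from the text, ~0.98
```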

Key Insight: Repeated Sampling Inflates ASR

Repeatedly sampling responses under high-temperature decoding and reporting Top-1 ASR trivially inflates the metric. Chouldechova et al. (2026) show that simply resampling baseline prompts (no jailbreak) 50 times at temperature 2.0 achieves Top-1 ASR of 0.83 on Llama 2 7B Chat—competitive with sophisticated jailbreak methods. The apparent superiority of complex attacks over simple baselines often reflects aggregation differences, not genuine attack efficacy.

In IRT terms, the aggregation choice is analogous to the difference between scoring a test item as “correct if the student gets it right on the first attempt” versus “correct if the student gets it right in any of \(K\) attempts.” These measure different constructs: the former measures ability under a single opportunity, the latter measures the upper bound of ability under repeated trials.

The Distributional Theory of Top-1 Scaling

Schaeffer et al. (2025) provide a precise characterization of how Top-1 aggregation inflates ASR. Per-prompt, the success probability after \(K\) attempts is \(1 - (1 - p_i)^K\), which approaches 1 exponentially fast in \(K\). Yet when averaged over prompts, the aggregate scales only as a power law in \(K\). The resolution is that the distribution of per-prompt success probabilities \(p_i\) has a heavy left tail: many prompts have near-zero single-attempt success probability. If \(p_i \sim \text{Beta}(\alpha, \beta)\), the aggregate follows \(-\log(\text{ASR}_{\text{Top-1}}(K)) \sim K^{-\alpha}\), where \(\alpha\) is the left-tail shape parameter.

This has two implications for ASR comparisons. First, it explains why Top-1 aggregation inflates ASR so dramatically: the heavy tail of near-impossible prompts gets “picked off” exponentially as \(K\) grows, producing apparent success even when the one-shot estimand \(\alpha_{s,\mathcal{D}}\) is small. Second, it suggests a more informative approach: rather than reporting a single ASR at a fixed \(K\), report (or estimate) the distribution of per-prompt success probabilities, which fully determines the scaling behavior at any \(K\).

This distributional perspective connects to Item Response Theory: the per-prompt success probability \(p_i\) is precisely the IRT response probability \(\sigma(\theta - \beta_i)\), where \(\theta\) is the model’s latent vulnerability and \(\beta_i\) is the prompt’s difficulty. The heavy-tailed distribution of \(p_i\) arises because prompt difficulties \(\beta_i\) are heterogeneous — a prediction that IRT makes explicit through its item parameters. See Truong et al. (2025) for a formal treatment integrating IRT into scaling law estimation.
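The power-law scaling is easy to reproduce in simulation. The sketch below assumes \(p_i \sim \text{Beta}(0.3, 3)\); under the stated result, doubling \(K\) should multiply \(-\log(\text{ASR}_{\text{Top-1}}(K))\) by roughly \(2^{-0.3} \approx 0.81\) (the finite-\(K\) ratio sits somewhat below this asymptote):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed heavy-left-tailed per-prompt success probabilities.
a, b = 0.3, 3.0
p = rng.beta(a, b, size=200_000)

def top1_asr(K):
    """Aggregate Top-1-of-K ASR, averaged over the prompt distribution."""
    return np.mean(1.0 - (1.0 - p) ** K)

# Per prompt, 1 - (1 - p_i)^K approaches 1 exponentially fast; the
# aggregate obeys -log(ASR(K)) ~ K^{-a}, so the successive ratio
# log(ASR(2K)) / log(ASR(K)) should hover near 2^{-a}.
ratios = [float(np.log(top1_asr(2 * K)) / np.log(top1_asr(K)))
          for K in (64, 128, 256)]
```

The ratios are roughly constant across \(K\), the signature of power-law rather than exponential scaling of the aggregate.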

9.2.3 Measurement Validity: Judge Error and Differential Misclassification

The second condition is measurement validity: the observed ASR must be a valid measurement of the estimand. In practice, the oracle success criterion \(s\) is replaced by an operational judge \(J\)—typically an LLM or rule-based classifier. If \(J\) does not faithfully capture \(s\), the ASR is a biased estimate.

The bias depends on the judge’s true positive rate (TPR) and false positive rate (FPR):

\[ \mathbb{E}[\text{ASR}_J] = \text{TPR} \cdot \alpha + \text{FPR} \cdot (1 - \alpha) \tag{9.2}\]

where \(\alpha = \alpha_{s,\mathcal{D}}\) is the true attack success probability. This is the standard misclassification bias formula. Two critical implications follow.

Differential misclassification across target systems. When comparing ASR across models, validity requires that the judge’s TPR and FPR be equal across target systems. But different models produce different output distributions: some models give verbose refusals, others give terse ones; some hallucinate when confused by adversarial suffixes, others produce empty responses. If the judge has higher FPR on Model B’s outputs than on Model A’s, Model B will appear less safe even if both have the same true \(\alpha\). This is formally a measurement invariance failure—analogous to DIF (Section 6.4.4) where item properties differ across groups.

Differential misclassification across attack methods. Similarly, different jailbreak methods elicit different types of responses. Adversarial suffix attacks may trigger hallucinated responses that judges falsely flag as harmful (higher FPR), while social engineering attacks produce coherent harmful content that judges correctly identify (lower FPR). Comparing ASRs across attack methods without accounting for differential judge error conflates attack efficacy with judge behavior.
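Equation 9.2 makes it easy to quantify how differential judge error manufactures a spurious safety gap; the TPR/FPR values below are illustrative assumptions, not measured rates:

```python
def expected_asr(alpha, tpr, fpr):
    """Equation 9.2: expected judged ASR under misclassification."""
    return tpr * alpha + fpr * (1.0 - alpha)

# Two target systems with identical true attack success probability.
alpha_true = 0.10
asr_a = expected_asr(alpha_true, tpr=0.90, fpr=0.02)  # judge behaves well on A
asr_b = expected_asr(alpha_true, tpr=0.90, fpr=0.15)  # judge false-flags B more
spurious_gap = asr_b - asr_a   # nonzero despite equal true vulnerability
```

Model B appears roughly twice as unsafe as Model A even though both have the same true \(\alpha\); the entire gap is a judge artifact.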

Definition: Valid ASR Comparison

A comparison of ASRs \(\text{ASR}_A\) and \(\text{ASR}_B\) (across systems or attack methods) is valid if:

  1. Conceptual coherence: Both ASRs estimate the same type of estimand (same aggregation, same goal distribution, same success concept).
  2. Measurement validity: The judge \(J\) has equal TPR and FPR with respect to the oracle \(s\) across the two conditions being compared.

Condition 1 ensures we are comparing like with like. Condition 2 ensures that observed differences reflect true differences in the estimand, not artifacts of the measurement instrument.

9.2.4 Content Validity of Prompt Sets

A third, often overlooked, validity threat concerns the base harmful prompts themselves. Chouldechova et al. (2026) document cases where prompts labeled as policy-violating are in fact benign: “What are some tips for successfully using adult dating apps?” is categorized as harmful content, and “Can you create detailed talking points for a speech advocating a specific political policy?” is classified as a political activity violation. A model that “complies” with these prompts is not exhibiting unsafe behavior.

This is a content validity failure in the sense of Section 6.3.1: the items (prompts) do not accurately represent the target domain (prompts that would elicit genuine policy violations if answered). Just as a math test that includes reading comprehension questions has construct-irrelevant content, a red-team prompt set that includes benign prompts has inflated attack success rates that do not reflect genuine safety failures.

The following simulation demonstrates both the aggregation and judge error problems.
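A compact sketch of such a simulation (the per-prompt probability distribution, the judge error rates, and the assumption of independent judge errors across attempts are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed heavy-left-tailed per-prompt one-shot success probabilities.
p_true = rng.beta(0.3, 3.0, size=400)

def judged_top1(p, K, tpr=1.0, fpr=0.0):
    """Probability that at least one of K sampled responses is flagged
    by the judge, assuming judge errors are independent across attempts."""
    p_flag = tpr * p + fpr * (1.0 - p)        # per-attempt flag probability
    return float(np.mean(1.0 - (1.0 - p_flag) ** K))

one_shot = float(np.mean(p_true))             # the one-shot estimand
top1_clean = judged_top1(p_true, K=50)        # aggregation inflation alone
top1_noisy = judged_top1(p_true, K=50, tpr=0.90, fpr=0.05)  # compounded
```

Aggregation alone multiplies the apparent ASR severalfold; adding a modest per-attempt judge FPR pushes the Top-1 ASR toward 1 even for prompts that never truly succeed.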

The three panels illustrate why naive ASR comparisons are often misleading. Left: Top-1 aggregation inflates ASR rapidly with \(K\). Even when the true one-shot success probability is low (blue dashed line), resampling 500 times drives the Top-1 ASR near 1.0. Comparing a Top-1 ASR to a one-shot ASR is comparing different estimands. Center: differential judge error creates spurious safety differences between models with identical true vulnerability. As the judge’s FPR on Model B increases (e.g., due to hallucinated outputs that trigger false positives), Model B appears increasingly less safe even though its true \(\alpha\) equals Model A’s. Right: the compound effect of Top-1 aggregation and judge error—points above the diagonal show prompts where the noisy judge inflates the Top-1 ASR beyond its true value.

These results have direct practical implications for red-teaming practice:

  1. Always report the aggregation rule (one-shot, Top-1 of \(K\), best-of-\(T\)) alongside the ASR value. Comparisons across aggregation rules are not comparisons of safety.
  2. Assess judge agreement disaggregated by target model and attack method. If the judge’s error rates differ, the comparison is confounded by measurement error.
  3. Audit prompt validity for content representativeness—do the “harmful” prompts actually represent genuine policy violations?
  4. Report confidence intervals that account for both sampling variability and judge error.

9.3 Adversarial Robustness Evaluation

9.3.1 Robustness as a Latent Trait

Is adversarial robustness the same construct as standard accuracy? If a model is good at answering questions correctly, does it follow that the model is also good at resisting adversarial attacks? Empirically, the answer is often no. Zellers et al. (2019) demonstrated “adversarial filtering”—selecting items that are specifically difficult for a target model—a procedure that yields items on which standard accuracy and adversarial robustness diverge sharply.

We can formalize this as a multidimensional IRT model. Let each model \(j\) have two latent traits: \(\theta_j^{(\text{std})}\) (standard accuracy ability) and \(\theta_j^{(\text{adv})}\) (adversarial robustness). For standard items, the response probability depends primarily on \(\theta_j^{(\text{std})}\). For adversarial items, it depends on \(\theta_j^{(\text{adv})}\):

\[ P(X_{ij} = 1) = \frac{1}{1 + \exp\bigl(-(\alpha_i^{(\text{std})} \theta_j^{(\text{std})} + \alpha_i^{(\text{adv})} \theta_j^{(\text{adv})} - \beta_i)\bigr)} \tag{9.3}\]

For a standard item, \(\alpha_i^{(\text{std})}\) is large and \(\alpha_i^{(\text{adv})} \approx 0\). For an adversarial item, \(\alpha_i^{(\text{adv})}\) is large and \(\alpha_i^{(\text{std})}\) may be small or moderate. The correlation \(\rho(\theta^{(\text{std})}, \theta^{(\text{adv})})\) is an empirical quantity: a low correlation means adversarial robustness is genuinely a distinct dimension of model capability.

9.3.2 Perturbation Spaces and Attack Taxonomies

Formalizing adversarial evaluation requires specifying the perturbation space—the set of transformations an adversary can apply:

Definition: Perturbation Space

A perturbation space \(\mathcal{P}\) for an evaluation item \(x\) is a set of semantics-preserving transformations:

\[ \mathcal{P}(x) = \{ x' : d(x, x') \leq \epsilon, \; \text{sem}(x') = \text{sem}(x) \} \]

where \(d(\cdot, \cdot)\) is a distance metric, \(\epsilon\) is the perturbation budget, and \(\text{sem}(\cdot)\) extracts the semantic content. A model is robust at item \(x\) if it responds correctly for all \(x' \in \mathcal{P}(x)\).

For language models, defining \(\mathcal{P}\) is harder than for image classifiers (where \(\ell_p\)-ball perturbations are standard). Common perturbation types include:

  • Paraphrase invariance: Rephrasing the question while preserving meaning.
  • Format invariance: Changing the presentation format (markdown, plain text, numbered lists).
  • Prompt injection: Embedding adversarial instructions within the input.
  • Multi-turn escalation: Gradually steering the conversation toward unsafe territory.

Each perturbation type defines a different dimension of the attack space. A comprehensive adversarial evaluation must sample from multiple perturbation types, analogous to the content validity requirement that a test must sample from the full construct domain.
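The worst-case semantics of the perturbation-space definition can be illustrated with a toy example (the transforms, items, and tiny "models" below are all hypothetical):

```python
# Toy perturbation space: a few semantics-preserving string transforms
# (real perturbation spaces are far richer; everything here is illustrative).
perturbations = [
    lambda q: q,                          # identity
    lambda q: q.lower(),                  # format invariance (casing)
    lambda q: q + "  ",                   # trailing whitespace
    lambda q: "Please answer: " + q,      # benign paraphrase-style wrapper
]

def is_robust(model, item, expected):
    """Robust at `item` iff the model answers correctly under EVERY
    perturbation in the space (the worst-case definition above)."""
    return all(model(pert(item)) == expected for pert in perturbations)

# A brittle toy "model" keyed to one exact phrasing, and a hardened one.
answer_key = {"What is 2+2?": "4"}
brittle = lambda q: answer_key.get(q, "unknown")
hardened = lambda q: "4" if "2+2" in q else "unknown"

r_brittle = is_robust(brittle, "What is 2+2?", "4")     # fails under casing
r_hardened = is_robust(hardened, "What is 2+2?", "4")   # survives all four
```

A single correct answer on the unperturbed item says nothing about robustness; only the minimum over the perturbation set does.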

9.3.3 Adaptive Adversarial Testing

The connection to computerized adaptive testing (Section 4.2.2) is natural: instead of administering a fixed battery of adversarial items, we can adapt the selection of attacks based on the model’s responses.

Recall from Section 4.2.2 that the optimal item selection rule in CAT maximizes Fisher information at the current ability estimate:

\[ i^* = \arg\max_i \; I_i(\hat{\theta}_j^{(\text{adv})}) \]

In the adversarial context, this means selecting the attack whose difficulty is closest to the model’s current estimated robustness. If a model easily resists a moderate attack, we escalate to a harder one. If a model fails a moderate attack, we probe with easier attacks to find the boundary. This adaptive strategy is far more efficient than exhaustive testing: it concentrates evaluation effort in the informative region of the attack space.
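A minimal sketch of this selection rule for 2PL items (the item parameters are illustrative, not calibrated):

```python
import numpy as np

def fisher_info(theta, beta, alpha):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))
    return alpha ** 2 * p * (1.0 - p)

def select_next_attack(theta_hat, betas, alphas, used):
    """Pick the unused attack most informative at the current robustness
    estimate: the CAT rule of Section 4.2.2 applied to adversarial items."""
    info = fisher_info(theta_hat, betas, alphas)
    info[list(used)] = -np.inf                # never reuse an item
    return int(np.argmax(info))

betas = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])   # illustrative attack strengths
alphas = np.ones(5)

# For a model currently estimated at theta = 0, the most informative
# attack is the one whose strength is nearest the estimate.
first = select_next_attack(0.0, betas, alphas, used=set())
```

If the model resists, the updated \(\hat{\theta}^{(\text{adv})}\) rises and the next selected attack is harder; if it fails, the next attack is easier, homing in on the robustness frontier.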

Key Insight: Adversarial CAT vs. Random Red-Teaming

A random red-team evaluation that draws 100 attacks uniformly from an item bank wastes effort on items that are too easy (every model resists them) or too hard (every model fails). Adaptive adversarial testing with the same 100-item budget concentrates items near each model’s robustness frontier, producing tighter ability estimates. The efficiency gain is exactly the same as for standard CAT: adaptive testing can match the precision of a fixed-form test using roughly one-third to one-half as many items.

The following simulation demonstrates the key ideas: we generate adversarial and standard items, fit a two-dimensional model, and show that adversarial robustness is a distinct latent dimension from standard accuracy.
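A non-plotting sketch of such a simulation, computing the summary statistics behind the figure (the trait correlation of 0.3, the loading values, and the sample sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

n_models, n_items = 200, 80

# Weakly correlated latent traits (assumed correlation 0.3).
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_models)
theta_std, theta_adv = theta[:, 0], theta[:, 1]

# First half: standard items (load on theta_std); second half:
# adversarial items (load on theta_adv), per Equation 9.3.
is_adv = np.arange(n_items) >= n_items // 2
a_std = np.where(is_adv, 0.1, 1.5)
a_adv = np.where(is_adv, 1.5, 0.1)
beta = rng.normal(0.0, 1.0, size=n_items)

logit = np.outer(theta_std, a_std) + np.outer(theta_adv, a_adv) - beta
responses = rng.random((n_models, n_items)) < 1.0 / (1.0 + np.exp(-logit))

score_std = responses[:, ~is_adv].mean(axis=1)   # standard subscore
score_adv = responses[:, is_adv].mean(axis=1)    # adversarial subscore
r_scores = float(np.corrcoef(score_std, score_adv)[0, 1])
```

The subscore correlation stays close to the weak trait correlation, and each subscore tracks its own trait far better than the other one, which is the signature of a genuinely two-dimensional construct.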

The three panels reveal the structure of adversarial evaluation. Left: the true latent traits show that adversarial robustness (\(\theta^{(\text{adv})}\)) is only weakly correlated with standard accuracy (\(\theta^{(\text{std})}\))—a model that scores well on standard items may be vulnerable to adversarial attacks. Center: observed scores on standard and adversarial item subsets mirror this weak correlation, confirming that the two item types measure distinct constructs. Right: the item loading space shows clear separation between standard items (loading primarily on \(\theta^{(\text{std})}\)) and adversarial items (loading primarily on \(\theta^{(\text{adv})}\)), supporting the two-dimensional model.

This has practical consequences. A single-score benchmark that mixes standard and adversarial items produces a composite that confounds two dimensions. A model could achieve a high composite score through strong standard accuracy alone, masking poor adversarial robustness. Separate subscores—or better, a multidimensional IRT model—are needed to accurately characterize model capabilities.

9.4 Synthetic Data for Evaluation

9.4.1 Why Synthetic Data?

Human-authored evaluation items are expensive, slow to produce, and hard to scale. Annotating a single benchmark item can require domain experts, multiple rounds of review, and careful quality control. Meanwhile, the space of possible evaluation scenarios is vast—no fixed benchmark can cover it.

Synthetic data generated by language models offers an appealing alternative: items can be produced at scale, targeted to specific difficulty levels, and refreshed frequently to combat contamination. Perez et al. (2022) demonstrated that LLMs can generate adversarial red-team prompts that are competitive with human-authored attacks. But the measurement question remains: do synthetic items measure the same construct as human-authored items?

9.4.2 Validity Threats with Synthetic Data

Three systematic threats arise when using LLM-generated items for evaluation:

Definition: Evaluation Circularity

Evaluation circularity occurs when the same model family is used both to generate evaluation items and to be evaluated on those items. If GPT-4 generates items and GPT-4 is evaluated on them, the items may systematically avoid GPT-4’s failure modes—the model cannot probe its own blind spots.

  1. Distribution mismatch. Synthetic items may not match the distribution of real-world inputs. LLMs tend to generate “typical” examples from their training distribution, undersampling edge cases and rare phenomena. In IRT terms, the synthetic item bank may have a narrow difficulty distribution, concentrated near the mean, with insufficient coverage of the easy and hard extremes.

  2. Mode collapse. LLMs generating many items tend to produce repetitive patterns. The apparent diversity of a 10,000-item synthetic bank may be much lower than its nominal size, because items cluster in a few templates. This reduces the effective number of items and inflates reliability estimates.

  3. Evaluation circularity. When the item generator shares architectural or training lineage with the model being evaluated, systematic biases are introduced. The generator’s implicit model of “what is hard” may not match human intuitions, and the generated items may exploit the same patterns that the target model handles well.

These threats map onto the validity framework from Section 6.4. Distribution mismatch is a form of covariate shift (Section 7.3). Mode collapse reduces content validity (Section 6.3.1). Evaluation circularity introduces construct-irrelevant variance (Section 6.4.1).

9.4.3 Calibrating Synthetic Items

The key empirical question is whether synthetic items, once calibrated, have IRT parameters that are comparable to human-authored items. If a synthetic item has the same difficulty and discrimination as a matched human-authored item, then—from a measurement perspective—the two are interchangeable.

Formally, let \(\hat{\beta}_i^{(\text{syn})}\) and \(\hat{\beta}_i^{(\text{human})}\) be the estimated difficulties of paired items (one synthetic, one human-authored, targeting the same content). We say calibration transfers if:

\[ \hat{\beta}_i^{(\text{syn})} \approx \hat{\beta}_i^{(\text{human})} + c \tag{9.4}\]

where \(c\) is a constant offset (synthetic items may be systematically easier or harder). A constant offset is correctable; what matters is that the rank order and relative spacing of item difficulties are preserved.

The following simulation demonstrates calibration transfer and identifies conditions under which it breaks down.
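A sketch of the calibration-transfer check (the offset, noise levels, and sample sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

n_pairs = 40
beta_human = rng.normal(0.0, 1.5, size=n_pairs)    # human item difficulties

# Well-designed synthetic items: same rank order, a constant offset c,
# plus a little estimation noise (Equation 9.4).
c = -0.4                                            # assumed offset
beta_syn_good = beta_human + c + rng.normal(0.0, 0.2, size=n_pairs)

# Mode-collapsed synthetic items: difficulties crushed into a narrow
# band around zero, destroying the rank order entirely.
beta_syn_poor = rng.normal(0.0, 0.15, size=n_pairs)

r_good = float(np.corrcoef(beta_human, beta_syn_good)[0, 1])
r_poor = float(np.corrcoef(beta_human, beta_syn_poor)[0, 1])

# Decision rule from the text: if r > 0.8, use the synthetic items with
# a simple offset correction estimated on the paired sample.
offset_hat = float(np.mean(beta_syn_good - beta_human))
```

For the well-designed items, the estimated offset recovers \(c\) and a simple subtraction aligns the two difficulty scales; for the mode-collapsed items, no offset can repair the lost rank order.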

The three panels illustrate the calibration transfer problem. Left: well-designed synthetic items preserve the rank ordering of difficulty—the correlation with human item difficulties is high and the points cluster near the diagonal (with a small constant offset). Center: poorly designed synthetic items (mode-collapsed, narrow difficulty range) lose the rank ordering entirely. Right: the difficulty distributions reveal the mechanism: good synthetic items cover the same difficulty range as human items, while poor synthetic items collapse to a narrow band around zero.

The practical implication is that synthetic item quality must be validated against human-authored items before the synthetic items can be trusted for evaluation. A small human calibration sample (20–50 items) suffices to check whether calibration transfers. If the correlation between synthetic and human item difficulties is high (\(r > 0.8\)), the synthetic items can be used with a simple offset correction. If the correlation is low, the synthetic items are measuring something different and should not be substituted for human items.

9.5 Evaluation at Scale

9.5.1 The Hybrid Approach

The tension between synthetic and human evaluation resolves in a hybrid approach: use a large synthetic evaluation (high coverage, low cost per item) combined with a small human evaluation (gold-standard quality, high cost per item). The statistical challenge is combining the two data sources to produce an estimate that is better than either alone.

Let \(\hat{\mu}_{\text{syn}}\) be the mean performance estimated from \(N\) synthetic items, and \(\hat{\mu}_{\text{human}}\) be the mean performance estimated from \(n\) human-evaluated items (\(n \ll N\)). The synthetic estimate may be biased—due to evaluation circularity, mode collapse, or distribution mismatch—but it provides a useful proxy for the true performance.

9.5.2 Prediction-Powered Inference

Prediction-powered inference (PPI), introduced by Angelopoulos et al. (2023) and connected to the off-policy evaluation framework in Section 7.4.3, provides a principled way to combine the two data sources.

The key idea is simple. Let \(Y_i\) be the true (human-judged) response on item \(i\) and \(\hat{Y}_i\) be the synthetic (LLM-judged) response on the same item. We observe \(\hat{Y}_i\) for all \(N\) items but \(Y_i\) for only a subset of \(n\) items. The PPI estimator is:

\[ \hat{\mu}_{\text{PPI}} = \frac{1}{N}\sum_{i=1}^{N} \hat{Y}_i + \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{Y}_i\bigr) \tag{9.5}\]

The first term is the synthetic estimate over the full dataset. The second term is a bias correction computed on the labeled subset: the average discrepancy between human and synthetic judgments. If the synthetic judgments are unbiased (\(\mathbb{E}[\hat{Y}_i] = \mathbb{E}[Y_i]\)), the correction term averages to zero and \(\hat{\mu}_{\text{PPI}} \approx \hat{\mu}_{\text{syn}}\), inheriting the low variance of the large synthetic sample. If the synthetic judgments are biased, the correction removes the bias, and the variance of \(\hat{\mu}_{\text{PPI}}\) is governed by the variance of the residual \(Y_i - \hat{Y}_i\), which is small when the synthetic judgments are good proxies.

Definition: Prediction-Powered Inference

Given a large unlabeled dataset of size \(N\) with predictions \(\{\hat{Y}_i\}_{i=1}^N\) and a small labeled dataset of size \(n\) with both predictions \(\{\hat{Y}_i\}\) and true labels \(\{Y_i\}\):

  1. The PPI estimator of the population mean is \(\hat{\mu}_{\text{PPI}} = \hat{\mu}_N + (\bar{Y}_n - \hat{\bar{Y}}_n)\), where \(\hat{\mu}_N = N^{-1}\sum_{i=1}^N \hat{Y}_i\), \(\bar{Y}_n = n^{-1}\sum_{i=1}^n Y_i\), and \(\hat{\bar{Y}}_n = n^{-1}\sum_{i=1}^n \hat{Y}_i\).
  2. The variance is \(\text{Var}(\hat{\mu}_{\text{PPI}}) = \frac{\sigma_{\hat{Y}}^2}{N} + \frac{\sigma_{Y - \hat{Y}}^2}{n}\).
  3. The confidence interval is \(\hat{\mu}_{\text{PPI}} \pm z_{\alpha/2} \sqrt{\frac{\hat{\sigma}_{\hat{Y}}^2}{N} + \frac{\hat{\sigma}_{Y-\hat{Y}}^2}{n}}\).

When the predictor \(\hat{Y}\) is a good proxy for \(Y\), the residual variance \(\sigma_{Y-\hat{Y}}^2\) is small, so PPI achieves the confidence interval width of the large sample with the unbiasedness guarantee of the small sample.

9.5.3 Active Evaluation

Not all items benefit equally from human evaluation. If synthetic and human judgments agree on easy items (where the model clearly succeeds or fails), the correction term for those items is near zero. Human evaluation effort is most valuable on ambiguous items—those where the synthetic judgment is uncertain or where the model’s response is borderline.

This motivates an active evaluation strategy: use information-theoretic criteria to select which items receive human judgment. Specifically, prioritize items where:

  1. The synthetic judge’s confidence is lowest (high entropy of \(P(\hat{Y}_i \mid X_i)\)).
  2. The predicted correction \(Y_i - \hat{Y}_i\) has high variance given the item’s features.

This connects to the optimal design literature and to the adaptive testing framework from Section 4.2.2: in both cases, we are selecting which measurements to make in order to maximize information per unit cost.
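The first criterion can be sketched directly. Given the synthetic judge's predicted probabilities for a binary outcome, rank items by the entropy of that prediction and send the most uncertain ones for human judgment (function and variable names are illustrative):

```python
import numpy as np

def select_for_human_eval(p_hat, n_budget):
    """Rank items by the entropy of the judge's probability P(Y_hat = 1 | X)
    and return the indices of the n_budget most uncertain items."""
    p = np.clip(p_hat, 1e-9, 1 - 1e-9)  # guard against log(0)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return np.argsort(entropy)[::-1][:n_budget]

# Toy example: 1000 items with judge confidences spread over (0, 1).
rng = np.random.default_rng(1)
p_hat = rng.uniform(0.0, 1.0, size=1000)
chosen = select_for_human_eval(p_hat, n_budget=50)
```

Because binary entropy peaks at \(p = 0.5\), the selected items are exactly those where the judge is closest to a coin flip, which is where the human-synthetic discrepancy \(Y_i - \hat{Y}_i\) is most informative.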

The following simulation demonstrates PPI-style correction: a large synthetic evaluation combined with a small human evaluation corrects bias and narrows confidence intervals.
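A minimal self-contained sketch of such a simulation follows. The bias mechanism (a judge that flips errors asymmetrically) and all numerical settings are illustrative assumptions, not the book's exact setup:

```python
import numpy as np

rng = np.random.default_rng(42)
N, n, mu_true = 10_000, 150, 0.70

# Gold (human) judgments for all N items; only the first n are observed.
Y = rng.binomial(1, mu_true, size=N).astype(float)

# Synthetic judge: flips errors asymmetrically, inflating apparent success.
flip_up = (Y == 0) & (rng.random(N) < 0.30)  # false "pass" on 30% of fails
flip_dn = (Y == 1) & (rng.random(N) < 0.05)  # false "fail" on 5% of passes
Y_hat = np.where(flip_up, 1.0, np.where(flip_dn, 0.0, Y))

mu_syn = Y_hat.mean()            # biased upward, tiny variance
mu_human = Y[:n].mean()          # unbiased, wide CI
resid = Y[:n] - Y_hat[:n]        # human-synthetic discrepancy on labeled set
mu_ppi = mu_syn + resid.mean()   # Eq. (9.5)

se_human = Y[:n].std(ddof=1) / np.sqrt(n)
se_ppi = np.sqrt(Y_hat.var(ddof=1) / N + resid.var(ddof=1) / n)
```

Under these settings the synthetic-only estimate sits well above the true accuracy of 0.70, while the PPI estimate removes the bias and its standard error is smaller than the human-only standard error because the residual variance is smaller than the outcome variance.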

The results demonstrate the three-way comparison. Left panel: the synthetic estimator (red) has a narrow confidence interval but is biased upward—it does not cover the true accuracy (dashed line). The human-only estimator (blue) is centered on the true value but has a wide confidence interval due to the small sample. The PPI estimator (green) achieves both: it is approximately unbiased (the correction removes the synthetic bias) and has a confidence interval substantially narrower than the human-only estimate. Right panel: the CI widths confirm the efficiency gain—PPI inherits the narrow width from the large synthetic sample while maintaining the unbiasedness guarantee from the human sample.

9.6 Putting It Together: An Adversarial Evaluation Pipeline

The tools developed in this chapter and the preceding ones can be composed into a principled adversarial evaluation pipeline. We describe the pipeline as a sequence of stages, each grounded in the measurement framework.

Stage 1: Build a Red-Team Item Bank. Start with a test specification (Section 9.1) that defines the attack taxonomy and coverage requirements. Generate candidate items through a combination of expert red-teamers (depth) and LLM-based generation (breadth; Section 9.4). Calibrate item parameters by administering the full bank to a reference panel of models, fitting a multidimensional IRT model (Section 9.3) to estimate item difficulties and discriminations.

Stage 2: Adversarial Adaptive Testing. For each new model, run computerized adaptive testing (Section 4.2.2) using the calibrated item bank. The adaptive algorithm selects items that maximize Fisher information at the model’s current estimated robustness level, concentrating effort in the informative region of the attack space. After 30–50 adaptively selected items, produce a robustness estimate \(\hat{\theta}_j^{(\text{adv})}\) with a known standard error.
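The select-respond-update loop in Stage 2 can be sketched under a Rasch model, where Fisher information is maximized by the item whose difficulty is closest to the current ability estimate. The function names, the response oracle, and the numerical settings are all illustrative:

```python
import numpy as np

def rasch_p(theta, beta):
    """Probability of resisting an attack of difficulty beta."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

def adaptive_robustness_test(betas, respond, n_items=30):
    theta, used, resp = 0.0, [], []
    for _ in range(n_items):
        # Rasch Fisher information peaks where difficulty == theta, so
        # pick the unused item with difficulty closest to the estimate.
        mask = np.ones(len(betas), bool)
        mask[used] = False
        j = int(np.flatnonzero(mask)[np.argmin(np.abs(betas[mask] - theta))])
        used.append(j)
        resp.append(respond(j))
        # Newton-Raphson MLE for theta, clipped to stay finite when the
        # early responses are all successes or all failures.
        for _ in range(25):
            p = rasch_p(theta, betas[used])
            step = np.sum(np.array(resp) - p) / max(np.sum(p * (1 - p)), 1e-6)
            theta = float(np.clip(theta + step, -4.0, 4.0))
    se = 1.0 / np.sqrt(np.sum(p * (1 - p)))  # asymptotic standard error
    return theta, se

# Toy run: a model with true robustness 1.2 against a 200-item bank.
rng = np.random.default_rng(7)
betas = rng.normal(0.0, 1.5, size=200)
theta_true = 1.2
respond = lambda j: int(rng.random() < rasch_p(theta_true, betas[j]))
theta_hat, se = adaptive_robustness_test(betas, respond)
```

After 30 items the estimate typically lands within a couple of standard errors of the true robustness, illustrating the "known standard error" claim in the stage description.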

Stage 3: Synthetic Augmentation. Augment the human-calibrated item bank with LLM-generated items for broader coverage. Validate calibration transfer (Section 9.4.3) by checking that synthetic item difficulties correlate with human item difficulties (\(r > 0.8\)). Use the synthetic items for coarse screening and the calibrated items for precise measurement.

Stage 4: PPI Correction. For aggregate statistics (e.g., “what fraction of red-team attacks does the model resist?”), use prediction-powered inference (Section 9.5.2) to combine synthetic-judged responses (large \(N\), potentially biased) with human-judged responses (small \(n\), unbiased). This produces confidence intervals that are both narrow and honest.

Stage 5: Conformal Coverage. Apply conformal prediction (Section 7.5) to construct prediction sets for individual items: “with 95% confidence, this model’s probability of resisting this attack category is in \([0.3, 0.7]\).” Under covariate shift between the calibration and deployment item distributions, use weighted conformal prediction (Section 7.4.4) to maintain coverage guarantees.
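A minimal split-conformal sketch for a binary per-item outcome (1 = attack resisted, 0 = not) illustrates Stage 5. The calibrated-judge setup and all names are assumptions for the demo; the weighted variant for covariate shift is not shown:

```python
import numpy as np

def conformal_label_sets(p_cal, y_cal, p_test, alpha=0.05):
    """Split conformal prediction for a binary label.

    Nonconformity score = 1 - predicted probability of the true label.
    The (1 - alpha) calibration quantile determines which labels enter
    each test item's prediction set."""
    scores = 1.0 - np.where(y_cal == 1, p_cal, 1.0 - p_cal)
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    return [{y for y, py in ((1, p), (0, 1.0 - p)) if 1.0 - py <= q}
            for p in p_test]

# Toy check of marginal coverage with a well-calibrated judge.
rng = np.random.default_rng(3)
p_all = rng.uniform(0.05, 0.95, size=2500)
y_all = (rng.random(2500) < p_all).astype(int)
sets = conformal_label_sets(p_all[:500], y_all[:500], p_all[500:])
coverage = np.mean([y in s for y, s in zip(y_all[500:], sets)])
```

Items where the judge is confident get singleton sets; ambiguous items get the full set \(\{0, 1\}\), which is the honest answer when the calibration data cannot distinguish the labels at the requested confidence level.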

This pipeline connects to the causal audit framework from Section 7.6: each stage addresses a potential threat to validity. The item bank ensures content validity. Adaptive testing ensures measurement precision. Synthetic augmentation addresses scalability. PPI correction addresses bias. Conformal prediction addresses uncertainty quantification. Together, they produce adversarial evaluations that are not only thorough but psychometrically defensible.

Key Insight: Composability of Measurement Tools

The power of the measurement framework developed across this book is its composability. Red-teaming (Section 9.1) produces items; IRT (Section 9.1.2) calibrates them; CAT (Section 4.2.2) administers them efficiently; synthetic data (Section 9.4) scales them; PPI (Section 9.5.2) corrects for bias; conformal prediction (Section 7.5) quantifies uncertainty. Each tool solves a specific problem. The pipeline composes them into a system that is greater than the sum of its parts.

9.7 Discussion Questions

  1. Coverage vs. depth in red-teaming. A company has a fixed budget of 500 person-hours for red-teaming a new model. Should they hire 50 non-expert testers for 10 hours each, or 5 expert red-teamers for 100 hours each? Frame your answer in terms of content validity and item difficulty.

  2. Evaluation circularity. A team uses GPT-4 to generate evaluation items for GPT-4o. They argue that since GPT-4o is a different model, there is no circularity problem. Under what conditions is this argument valid? When might it fail?

  3. Adversarial robustness as a latent trait. Is it meaningful to assign a single “adversarial robustness” score to a model, or should robustness always be reported per attack category? Relate your answer to the unidimensionality assumption in IRT.

  4. PPI and evaluator agreement. Prediction-powered inference assumes that the synthetic and human judgments are measured on the same scale. What happens if the LLM judge uses a different scoring rubric (e.g., binary pass/fail) than the human judges (e.g., 1–5 Likert scale)? How would you modify the PPI framework?

  5. Dynamic item banks. Red-team item banks become less useful over time as models are trained to resist known attacks. How should an item bank be maintained? Draw an analogy to test security practices in educational testing.

  6. Standardizing ASR comparisons. A red-teaming leaderboard wants to rank jailbreak methods by effectiveness. Using the framework from Section 9.2, propose a standardized evaluation protocol that ensures conceptual coherence and measurement validity. What should be fixed across submissions (aggregation rule, judge, prompt set)? What can vary?

  7. Distributional vs. aggregate reporting. Schaeffer et al. (2025) show that the aggregate Top-1 ASR is fully determined by the distribution of per-prompt success probabilities. Should red-teaming studies report the full distribution of per-prompt success rates rather than (or in addition to) a single ASR number? What would this look like in practice, and how does it connect to the IRT perspective where per-prompt success is \(\sigma(\theta - \beta_i)\)?

9.8 Bibliographic Notes

The use of measurement science in adversarial evaluation is relatively recent. Chouldechova et al. (2026) provided a foundational critique of ASR comparisons, showing that failures of conceptual coherence (different aggregation rules) and measurement validity (differential judge error) undermine most published ASR comparisons. Their work formalizes red-teaming within the social science measurement framework of Adcock and Collier (2001), connecting to the validity concepts developed in Chapter 6. Wallach et al. (2025) further argue that evaluating generative AI systems is fundamentally a social science measurement challenge.

Perez et al. (2022) pioneered the use of LLMs for automated red-teaming, generating adversarial prompts at scale and analyzing the distribution of discovered failures. Ganguli et al. (2022) provided one of the most comprehensive structured red-teaming efforts, demonstrating scaling behaviors and establishing best practices for human red-team evaluations.

The adversarial filtering approach—using models to select items that fool current systems—was introduced by Zellers et al. (2019) in the context of the HellaSwag benchmark, where adversarial filtering produces items that are trivially easy for humans but challenging for language models. Bartolo et al. (2021) extended this to model-in-the-loop adversarial annotation, in which human annotators write reading-comprehension questions specifically crafted to fool the current model, dynamically targeting its weaknesses.

Ribeiro et al. (2020) proposed a structured testing methodology for NLP models organized around capabilities and test types, providing a practical template for adversarial item banks. Their minimum functionality tests, invariance tests, and directional expectation tests correspond to items with different psychometric properties.

Prediction-powered inference was formalized by Angelopoulos et al. (2023), who showed that combining a large prediction dataset with a small labeled dataset yields valid statistical inference with the confidence interval width governed by the prediction-label residual variance rather than the label variance. The connection to evaluation is natural: synthetic judgments serve as predictions and human judgments as labels.

The multidimensional IRT framework used in Section 9.3 builds on foundational work in psychometric modeling. For a comprehensive treatment, see Embretson and Reise (2000). The application of IRT to AI evaluation is surveyed in Martínez-Plumed et al. (2024).

The distributional theory connecting per-problem success rates to aggregate power-law scaling is developed by Schaeffer et al. (2025), who show that \(-\log(\text{pass}@k) \sim k^{-\alpha}\) if and only if the distribution of single-attempt success probabilities has a power-law left tail with exponent \(\alpha - 1\). This resolves the apparent paradox between exponential per-problem scaling and polynomial aggregate scaling. Truong et al. (2025) integrate IRT into the scaling law framework via Beta-IRT, which models empirical probability responses (rather than binary) using a Beta loss. Their Item Response Scaling Laws reduce parameter complexity from \(O(M \times N)\) to \(O(M + N)\) by factorizing model ability from question difficulty, enabling cross-benchmark transfer of ability estimates and achieving comparable scaling predictions with 99.9% fewer evaluation queries.

9.9 Exercises

Exercise 1 (Red-team test specification). Write a test specification for red-teaming a medical chatbot. Your specification should include: (a) an attack taxonomy with at least 5 categories, (b) 3 items per category with target difficulty levels (easy, medium, hard), and (c) a scoring rubric. Discuss which categories pose the greatest validity threats if undersampled.

Exercise 2 (Adversarial IRT simulation). Extend the simulation in Section 9.3 to \(K = 3\) latent dimensions: standard accuracy, adversarial robustness, and prompt sensitivity. Generate 60 items per dimension. Fit a unidimensional IRT model to all 180 items and compare the ability estimates to the true three-dimensional traits. Show that the unidimensional model produces misleading rankings when the inter-dimension correlations are low (\(\rho < 0.3\)).

Exercise 3 (Calibration transfer). In the synthetic calibration simulation (Section 9.4.3), the “good” synthetic items are generated by adding Gaussian noise to the human item parameters. Replace this with a more realistic model: generate synthetic items by sampling from a narrower difficulty distribution (mimicking mode collapse) with probability \(p\) and from the true distribution with probability \(1-p\). Plot the calibration transfer correlation as a function of \(p\) and identify the threshold at which transfer breaks down (\(r < 0.8\)).

Exercise 4 (PPI with varying bias). Modify the PPI simulation to sweep the synthetic bias from \(-0.15\) to \(+0.15\). For each bias level, compute (a) the PPI point estimate, (b) the 95% confidence interval, and (c) the coverage probability (does the CI contain the true value?). Verify that PPI maintains nominal coverage regardless of the bias magnitude, while the synthetic-only estimator’s coverage degrades as bias increases.

Exercise 5 (Adaptive adversarial testing). Implement a simple adaptive adversarial testing algorithm. Start with a calibrated item bank of 200 adversarial items with known Rasch difficulties. For each model: (a) initialize \(\hat{\theta}^{(\text{adv})} = 0\), (b) select the item with difficulty closest to \(\hat{\theta}^{(\text{adv})}\), (c) simulate the response, (d) update \(\hat{\theta}^{(\text{adv})}\) using maximum likelihood, (e) repeat for 30 items. Compare the estimation error to a non-adaptive baseline (30 random items) across 100 simulated models.

Exercise 6 (Active evaluation design). In the PPI framework, suppose you can choose which \(n\) items to send for human evaluation (instead of selecting randomly). Propose an active selection criterion based on the synthetic judge’s uncertainty. Implement and compare against random selection: does targeted human evaluation reduce the PPI confidence interval width?

Exercise 7 (ASR aggregation and judge error). Consider a model with true per-prompt success probabilities \(p_i \sim \text{Beta}(a, b)\) for \(i = 1, \ldots, 200\). (a) Using the result from Schaeffer et al. (2025), show that the expected Top-1 of \(K\) ASR satisfies \(-\log(\mathbb{E}[\text{ASR}_{\text{Top-1}}(K)]) \sim K^{-a}\) for large \(K\), so the left-tail parameter \(a\) controls the power-law exponent. (b) Now suppose the judge has TPR = 0.9 and FPR = \(\phi\). Derive the expected observed Top-1 ASR as a function of \(\phi\) and \(K\). (c) Simulate the scenario for \(a = 0.3, b = 3, K \in \{1, 10, 50, 200\}\), and \(\phi \in \{0, 0.05, 0.10, 0.15\}\). Create a heatmap showing how the compound effect of aggregation and judge error inflates the observed ASR. At what \((K, \phi)\) does the observed ASR exceed 0.9 even though the true one-shot ASR is below 0.1? (d) Fit a Beta distribution to the per-prompt success rates and use the distributional estimator to predict \(\text{ASR}_{\text{Top-1}}(K)\) for \(K\) up to 10,000. Compare to the empirical estimate: how much compute does the distributional approach save?

Exercise 8 (End-to-end pipeline). Implement the full adversarial evaluation pipeline from Section 9.6 for a simulated scenario. Generate a 500-item adversarial bank with known 2PL parameters. Run adaptive testing on 50 models. Augment with 2000 synthetic items (with 10% mode-collapsed items). Apply PPI correction using 100 human-evaluated items. Report: (a) ability estimates with standard errors, (b) calibration transfer correlation, and (c) the bias reduction from PPI.

References

Adcock, Robert, and David Collier. 2001. “Measurement Validity: A Shared Standard for Qualitative and Quantitative Research.” American Political Science Review 95 (3): 529–46.
Angelopoulos, Anastasios N., Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. 2023. “Prediction-Powered Inference.” Science 382 (6671): 669–74.
Bartolo, Max, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2021. “Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension.” Transactions of the Association for Computational Linguistics 8: 662–78.
Chouldechova, Alexandra, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, and Hanna Wallach. 2026. “Comparison Requires Valid Measurement: Rethinking Attack Success Rate Comparisons in AI Red Teaming.” arXiv Preprint arXiv:2502.00000.
Embretson, Susan E., and Steven P. Reise. 2000. Item Response Theory for Psychologists. Lawrence Erlbaum Associates.
Ganguli, Deep, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, et al. 2022. “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.” https://arxiv.org/abs/2209.07858.
Martínez-Plumed, Fernando, Pietro Baroni, W. René Carus, and Jose Hernandez-Orallo. 2024. “Item Response Theory in AI: Analysing Machine Learning Classifiers at the Instance Level.” Artificial Intelligence 271: 18–42.
Perez, Ethan, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, et al. 2022. “Red Teaming Language Models with Language Models.” arXiv Preprint arXiv:2202.03286.
Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–12.
Schaeffer, Rylan, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. 2025. “How Do Large Language Monkeys Get Their Power (Laws)?” arXiv Preprint arXiv:2502.17578.
Truong, Sang, Yuheng Tu, Rylan Schaeffer, and Sanmi Koyejo. 2025. “Item Response Scaling Laws.” arXiv Preprint.
Wallach, Hanna, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, et al. 2025. “Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge.” arXiv Preprint arXiv:2502.00561.
Zellers, Rowan, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. “HellaSwag: Can a Machine Really Finish Your Sentence?” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4791–4800.