Surrogate Public Data

Surrogate Public Data for differential privacy (DP) on tabular data: tabular datasets that LLMs synthesize from schema-level specs, at zero privacy budget, and that serve as drop-in public data for DP mechanisms. The benchmark compares surrogate generation methods (direct CSV vs SCM agents, across LLMs) on DP pretraining utility, measured as AUC Advantage, across privacy budgets.

Items: 3
Subjects: 21
Observed: 86%
Subject type: Model
License: MIT
Domains: general, ml_engineering
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 21 subjects × 3 items, 86% of cells evaluated.
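The ordering and coverage figures described above can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: the subject names and scores in `toy` are hypothetical, unobserved cells are assumed to be stored as NaN, and "strongest" and "easiest" are assumed to mean the highest mean score over observed cells. For the real matrix, 21 × 3 gives 63 cells, so 86% coverage would correspond to roughly 54 evaluated cells (an inference from the stated figures, not a number published on the page).

```python
import math

NAN = math.nan


def nanmean(values):
    """Mean of the observed (non-NaN) entries; -inf if nothing was observed."""
    seen = [v for v in values if not math.isnan(v)]
    return sum(seen) / len(seen) if seen else -math.inf


def observed_fraction(matrix):
    """Share of cells in the subject-by-item matrix that were evaluated."""
    cells = [v for row in matrix.values() for v in row]
    return sum(not math.isnan(v) for v in cells) / len(cells)


def order_matrix(matrix):
    """Sort subjects (rows) and items (columns) by mean score, highest first."""
    n_items = len(next(iter(matrix.values())))
    row_order = sorted(matrix, key=lambda s: nanmean(matrix[s]), reverse=True)
    col_order = sorted(
        range(n_items),
        key=lambda j: nanmean([matrix[s][j] for s in matrix]),
        reverse=True,
    )
    return row_order, col_order


# Hypothetical 3 x 3 matrix with one unobserved cell (values are made up).
toy = {
    "subj_a": [-0.10, -0.05, NAN],
    "subj_b": [-0.30, -0.20, -0.06],
    "subj_c": [-0.08, -0.05, -0.02],
}

rows, cols = order_matrix(toy)
coverage = observed_fraction(toy)
print(rows, cols, f"{coverage:.0%}")
```

Sorting both axes by their means is only one way to push strong subjects and easy items toward one corner; the page may use a different ordering rule.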


[Figure: Surrogate Public Data response matrix, AI models (rows) against items (columns). Color scale runs from low to high; unobserved cells are marked separately.]

Scale: AUC Advantage in [−1, 1], reported per privacy budget ε. It measures how distinguishable a differentially private model trained on LLM-surrogate data is from one trained on real data; 0 means indistinguishable.

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 (solve rate: -19%)

Differentially private synthetic-data / DP-pretraining task on private tabular dataset family: we

Subject outcomes

  • surrogate:gen-claude-MIX-MAX, score -0.084
  • surrogate:gen-claude-MIX-UNIF, score -0.087
  • surrogate:baseline_domain, score -0.282

Item 2 (solve rate: -12%)

Differentially private synthetic-data / DP-pretraining task on private tabular dataset family: edad

Subject outcomes

  • surrogate:2020, score -0.054
  • surrogate:csv-claude, score -0.057
  • surrogate:baseline_domain, score -0.241

Item 3 (solve rate: -3%)

Differentially private synthetic-data / DP-pretraining task on private tabular dataset family: acs

Subject outcomes

  • surrogate:csv-gpt, score -0.023
  • surrogate:csv-claude, score -0.024
  • surrogate:baseline_univariate, score -0.061

Subjects

The models, agents, and reward models evaluated.

21 subjects, ranked by mean response across this benchmark's items.

  1. surrogate:massachusetts_upsampled, mean -0.027
  2. surrogate:gen-claude-MIX-MAX, mean -0.082
  3. surrogate:gen-claude-MIX-UNIF, mean -0.083
  4. surrogate:2020, mean -0.085
  5. surrogate:csv-claude, mean -0.089
  6. surrogate:gen-MIX-UNIF, mean -0.095
  7. surrogate:csv-gpt, mean -0.097
  8. surrogate:gen-MIX-MAX, mean -0.102
  9. surrogate:csv-llama, mean -0.105
  10. surrogate:gen-llama-MIX-UNIF, mean -0.115
  11. surrogate:gen-gpt-MIX-UNIF, mean -0.116
  12. surrogate:gen-llama-MIX-MAX, mean -0.12
  13. surrogate:sdscm-llama-3-8b, mean -0.122
  14. surrogate:gen-gpt-MIX-MAX, mean -0.126
  15. surrogate:arbitrary, mean -0.132
  16. surrogate:no-pre, mean -0.133
  17. surrogate:sdscm-gpt2, mean -0.135
  18. surrogate:sdscm-olmo-1b-hf, mean -0.136
  19. surrogate:baseline_univariate, mean -0.147
  20. surrogate:baseline_domain, mean -0.154
  21. surrogate:2018, mean -0.159
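
Because only 86% of cells are evaluated, a ranking by mean response can average different subjects over different sets of items. The sketch below shows one way to keep that visible by reporting the number of observed items next to each mean. It assumes the mean is taken over observed cells only, which the page does not confirm; the subject names and scores are hypothetical.

```python
import math


def rank_by_mean(scores):
    """Return (subject, mean, n_observed) tuples, highest mean first.

    Subjects with no observed items are left out of the ranking.
    """
    rows = []
    for subject, values in scores.items():
        seen = [v for v in values if not math.isnan(v)]
        if seen:
            rows.append((subject, sum(seen) / len(seen), len(seen)))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Hypothetical scores: both subjects tie on the only item they share.
scores = {
    "subject_full": [-0.20, -0.10, -0.03],
    "subject_partial": [math.nan, math.nan, -0.03],
}

ranking = rank_by_mean(scores)
for subject, mean, n_observed in ranking:
    print(f"{subject}: mean {mean:.3f} over {n_observed} item(s)")
```

In this toy case the partially evaluated subject ranks first even though it matches the other subject on the shared item, so it helps to read means together with item coverage.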