Stanford University AI Measurement Science

Preference

MT-Bench

MT-Bench GPT-4 single-answer judgment scores (34 models, 80 questions, 2 turns).

Items: 160

Subjects: 34

License: CC-BY-4.0

Domain: preference

Modality: text

item-level responses released

Saturation status: No


Response matrix

MT-Bench response matrix: AI models (rows) against items (columns). Color legend: low, high, unobserved.
Scale: {1, 2, ..., 10}
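
As a rough illustration of the layout described above, the sketch below builds a matrix with the dimensions stated on this page (34 models, 80 questions x 2 turns = 160 items) and averages each row while skipping unobserved cells. The scores are synthetic placeholders, not MT-Bench data. The names `make_placeholder_matrix`, `row_means`, and `item_index`, the use of `None` for unobserved cells, and the column ordering are all assumptions made for the example, not the released file format.

```python
import random
from statistics import mean

# Counts taken from the dataset description on this page.
N_MODELS, N_QUESTIONS, N_TURNS = 34, 80, 2
N_ITEMS = N_QUESTIONS * N_TURNS  # 160 (question, turn) items


def make_placeholder_matrix(missing_rate=0.05, seed=0):
    """Build a 34 x 160 matrix of SYNTHETIC scores in {1..10}.

    None marks an unobserved cell (an assumed encoding).
    """
    rng = random.Random(seed)
    return [
        [None if rng.random() < missing_rate else rng.randint(1, 10)
         for _ in range(N_ITEMS)]
        for _ in range(N_MODELS)
    ]


def row_means(matrix):
    """Mean observed score per model (row), ignoring unobserved cells."""
    result = []
    for row in matrix:
        observed = [score for score in row if score is not None]
        result.append(mean(observed) if observed else None)
    return result


def item_index(question, turn):
    """Hypothetical column layout: both turns of a question sit side by side."""
    return question * N_TURNS + turn


matrix = make_placeholder_matrix()
means = row_means(matrix)
```

Averaging only over observed cells keeps a model with a few missing judgments comparable to the others, although the result is biased if the missing items are systematically harder or easier.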

Subjects

Full data is available on Hugging Face.