Human Preference & Alignment

InterMT

52.6k multi-turn multimodal dialogues with 32.4k human-labeled preference pairs.

12,557 items
2 subjects
100% observed
Subject type: Model

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 2 subjects × 12,557 items, 100% of cells evaluated.
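The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation. It assumes that "strongest" and "easiest" both mean "highest mean score", that unobserved cells are stored as NaN, and that the hypothetical helper `order_response_matrix` and the toy ratings are stand-ins for the real data.

```python
import numpy as np

def order_response_matrix(scores):
    """Sort rows (subjects) and columns (items) by descending mean score.

    `scores` has shape (subjects, items); NaN marks an unobserved cell.
    Returns the reordered matrix, the row and column permutations, and
    the fraction of cells that were observed.
    """
    scores = np.asarray(scores, dtype=float)
    observed = ~np.isnan(scores)

    # Means ignore unobserved cells (assumption: NaN means "not evaluated").
    row_means = np.nanmean(scores, axis=1)
    col_means = np.nanmean(scores, axis=0)

    # Stable sort on the negated means keeps ties in their original order.
    row_order = np.argsort(-row_means, kind="stable")
    col_order = np.argsort(-col_means, kind="stable")

    ordered = scores[np.ix_(row_order, col_order)]
    return ordered, row_order, col_order, observed.mean()

# Toy data (made up): 2 subjects x 4 items, ratings on the 0 to 3 scale.
toy = [[1, 3, 0, 2],
       [2, 3, 1, 3]]
ordered, row_order, col_order, coverage = order_response_matrix(toy)
print(ordered)
print(f"{coverage:.0%} of cells observed")
```

With this ordering the highest-scoring subject and the highest-scoring items end up in the top-left corner. A different rule, such as ordering by a fitted ability or difficulty parameter, would change the permutation but not the layout idea.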


[Figure: InterMT response matrix, AI models (rows) against items (columns). Color scale runs from low to high; unobserved cells are marked separately.]

Scale: human ratings from 0 to 3, given per dimension across nine multimodal-dialogue quality dimensions (helpfulness, coherence, image–text consistency, …); higher is better.

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Subjects

The models, agents, and reward models evaluated.

2 subjects, ranked by mean response across this benchmark's items.

  1. 14a8db3fc2.2
  2. 22b13d0f82.178