Human Preference & Alignment

Tulu Human Evaluation

Tulu 1 human evaluation: 332 prompt instances, each with completions from three open instruction-tuned models (tulu_65B, tulu_7B, tulu_human_only_65B) and ChatGPT, graded by human annotators in pairwise comparisons. Subjects are the AI models; the response is the per-(model, item) human acceptability label (yes/no, coded as 1/0). Only the human-evaluation slice is released at the item level; the paper's automatic benchmark predictions (MMLU, GSM8k, BBH, etc.) are not released and are excluded.

332 items · 4 subjects · 100% observed · license: Apache-2.0 · domains: general, preference · modality: text
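
As a rough illustration of the yes/no to 1/0 coding described above, the sketch below builds a per-(model, item) response matrix from long-format records. The record layout, the `LABEL_TO_SCORE` mapping, and both helper functions are assumptions made for illustration; the released files may be organized differently.

```python
# Hypothetical long-format records: (model, item_id, acceptability label).
# The real release may use different field names and layout.
records = [
    ("chatgpt", 0, "yes"),
    ("chatgpt", 1, "no"),
    ("tulu_7B", 0, "no"),
    ("tulu_7B", 1, "no"),
]

# Assumed coding, following the page's "yes/no -> 1/0" description.
LABEL_TO_SCORE = {"yes": 1, "no": 0}


def build_response_matrix(rows):
    """Return {model: {item_id: 0 or 1}} from (model, item_id, label) rows."""
    matrix = {}
    for model, item_id, label in rows:
        matrix.setdefault(model, {})[item_id] = LABEL_TO_SCORE[label]
    return matrix


def observed_fraction(matrix, n_items):
    """Share of (model, item) cells that carry a label."""
    total_cells = len(matrix) * n_items
    filled_cells = sum(len(row) for row in matrix.values())
    return filled_cells / total_cells


matrix = build_response_matrix(records)
coverage = observed_fraction(matrix, n_items=2)
```

With every model labeled on every item, `coverage` comes out as 1.0, which corresponds to the "100% observed" figure reported for this benchmark.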


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 4 subjects × 332 items, 100% of cells evaluated.
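
One plausible way to produce that ordering is to sort rows by mean score and columns by solve rate, both descending, so that the 1s collect in the top-left corner. The sketch below assumes this mean-based ordering; the gallery's build script may use a different rule, and the toy model names and scores are hypothetical.

```python
def sort_response_matrix(names, rows):
    """Order rows by mean score and columns by solve rate, both descending.

    names: list of subject names, one per row.
    rows:  list of equal-length lists of 0/1 scores.
    Returns (sorted_names, sorted_rows, col_order).
    """
    n_items = len(rows[0])
    row_means = [sum(row) / n_items for row in rows]
    col_means = [sum(row[j] for row in rows) / len(rows) for j in range(n_items)]

    # Strongest subjects first, easiest items first.
    row_order = sorted(range(len(rows)), key=lambda i: -row_means[i])
    col_order = sorted(range(n_items), key=lambda j: -col_means[j])

    sorted_names = [names[i] for i in row_order]
    sorted_rows = [[rows[i][j] for j in col_order] for i in row_order]
    return sorted_names, sorted_rows, col_order


# Toy data (hypothetical, not taken from the benchmark).
names = ["model_a", "model_b", "model_c"]
rows = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
]
sorted_names, sorted_rows, col_order = sort_response_matrix(names, rows)
```

Python's `sorted` is stable, so subjects or items with identical means keep their original relative order.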

[Figure: Tulu Human Evaluation response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]
Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate

A list of all movies that meet the criteria given should be compiled.

movies directed by Spike Lee by release date

Subject outcomes

tulu_65B incorrect
tulu_7B incorrect
chatgpt incorrect
tulu_human_only_65B incorrect

Item 2 · 33% solve rate

Find sentences from reliable sources such as the Guardian or Forbes that contain the exact match for the given sentence or phrase. Also, mention the source of each sentence.

There are great options

Subject outcomes

tulu_65B correct
chatgpt correct
tulu_7B incorrect
tulu_human_only_65B incorrect

Item 3 · 50% solve rate

On the basis of the subject of the email, determine whether the email should be treated as spam or not.

Dermatologists don't like her!

Subject outcomes

tulu_7B correct
chatgpt correct
tulu_65B incorrect
tulu_human_only_65B incorrect

Item 4 · 67% solve rate

Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).

Subject outcomes

chatgpt correct
tulu_65B correct
tulu_7B incorrect
tulu_human_only_65B incorrect

Item 5 · 75% solve rate

Using a given amount, determine an appropriate tip.

14% tip on $47.50

Subject outcomes

chatgpt correct
tulu_65B correct
tulu_7B incorrect
tulu_human_only_65B incorrect

Item 6 · 83% solve rate

Provide an example of how a table of contents can be generated automatically in a LATEX document. Make sure your example contains the necessary commands.

Subject outcomes

tulu_65B correct
chatgpt correct
tulu_7B correct
tulu_human_only_65B incorrect

Item 7 · 83% solve rate

Give examples of popular shows and movies in the genre.

Genre: Crime TV Shows

Subject outcomes

chatgpt correct
tulu_65B correct
tulu_human_only_65B correct
tulu_7B incorrect

Item 8 · 100% solve rate

Explain the process of natural selection and how it contributes to the evolution and adaptation of species.

Subject outcomes

chatgpt correct
tulu_65B correct
tulu_7B correct
tulu_human_only_65B correct

Item 9 · 100% solve rate

What if Isaac Newton had focused on biology instead of physics?

Subject outcomes

tulu_65B correct
tulu_7B correct
chatgpt correct
tulu_human_only_65B correct

Item 10 · 100% solve rate

List the personality traits that are required to be successful in the given job.

Social Media Marketer

Subject outcomes

tulu_7B correct
tulu_65B correct
chatgpt correct
tulu_human_only_65B correct

Item 11 · 100% solve rate

Consider the best time of year to visit the given city, and provide your reasons for choosing that time.

Sydney, Australia

Subject outcomes

chatgpt correct
tulu_65B correct
tulu_7B correct
tulu_human_only_65B correct

Item 12 · 100% solve rate

We need to write a thematic tweet to share company news and facts. Please take a careful look at the facts and details provided and create a tweet based on them.

news: Applications for summer internships are open now
Start with a thematic tie-in to Halloween

Subject outcomes

tulu_65B correct
chatgpt correct
tulu_7B correct
tulu_human_only_65B correct

Subjects

The models, agents, and reward models evaluated.

4 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. chatgpt · 0.9086
2. tulu_65B · 0.8035
3. tulu_human_only_65B · 0.7265
4. tulu_7B · 0.6889
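
A ranking of this kind can be reproduced by averaging each subject's per-item scores and sorting in descending order. The sketch below is a minimal version under that assumption; the function name and the toy scores are hypothetical and are not the benchmark's data.

```python
def rank_subjects(responses):
    """Rank subjects by mean response across items.

    responses: {subject_name: list of per-item scores}.
    Returns a list of (rank, subject_name, mean rounded to 4 decimals).
    """
    means = {name: sum(scores) / len(scores) for name, scores in responses.items()}
    ordered = sorted(means.items(), key=lambda pair: pair[1], reverse=True)
    return [
        (rank, name, round(mean, 4))
        for rank, (name, mean) in enumerate(ordered, start=1)
    ]


# Toy scores (hypothetical): four items per subject.
toy_responses = {
    "model_a": [1, 0, 0, 1],
    "model_b": [1, 1, 1, 0],
    "model_c": [0, 0, 0, 1],
}
ranking = rank_subjects(toy_responses)
```

Applied to the full 4 × 332 matrix, the same computation should yield means comparable to those listed above, assuming each cell holds the per-(model, item) score.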
