Human Preference & Alignment

Tulu Human Evaluation

Tulu 1 human evaluation: 332 prompt instances, each with completions from three open instruction-tuned models (tulu_65B, tulu_7B, tulu_human_only_65B) and ChatGPT, graded by human annotators in pairwise comparisons. The subjects are the AI models; the response is the human acceptability label for each (model, item) pair, encoded as yes = 1 and no = 0. Only the human-evaluation slice is public at the item level; the paper's automatic benchmark predictions (MMLU, GSM8K, BBH, and others) are not released and are excluded.
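As a rough illustration of the yes/no to 1/0 encoding described above, the sketch below turns a list of annotation records into per-(model, item) responses. The record layout `(model, item_id, label)`, the function name `build_response_matrix`, and the toy records are all hypothetical; the page does not describe the released file format. Averaging repeated labels for the same pair is likewise an assumption made for this sketch, not something the page states.

```python
from collections import defaultdict

# Encoding stated on the page: yes -> 1, no -> 0.
LABEL_TO_SCORE = {"yes": 1, "no": 0}


def build_response_matrix(records):
    """Map (model, item_id, label) records to a {(model, item_id): score} dict.

    The record layout is hypothetical. If the same (model, item) pair has
    several labels, they are averaged; that aggregation rule is an
    assumption of this sketch.
    """
    cells = defaultdict(list)
    for model, item_id, label in records:
        cells[(model, item_id)].append(LABEL_TO_SCORE[label.strip().lower()])
    return {key: sum(scores) / len(scores) for key, scores in cells.items()}


# Toy records, invented for illustration only.
toy_records = [
    ("chatgpt", "item_1", "yes"),
    ("chatgpt", "item_2", "Yes"),
    ("tulu_7B", "item_1", "no"),
    ("tulu_7B", "item_2", "yes"),
    ("tulu_7B", "item_2", "no"),
]
responses = build_response_matrix(toy_records)
```

An unknown label raises a `KeyError` here rather than being silently coerced, which seems the safer default for human annotations.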

332 items
4 subjects
100% observed
Apache-2.0 license
general domain
preference domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 4 subjects × 332 items, 100% of cells evaluated.
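One way to produce the ordering described above is to sort rows by descending subject mean and columns by descending item mean, which pushes the 1s toward the top-left corner. This is a minimal sketch under that assumption; the page does not say which sorting rule it actually uses, and the function name and toy matrix are hypothetical.

```python
def order_response_matrix(matrix, subjects):
    """Reorder a subjects x items matrix so strong subjects and easy items come first.

    Rows are sorted by descending row mean, columns by descending column mean.
    Python's sort is stable, so ties keep their original relative order.
    """
    n_subjects = len(matrix)
    n_items = len(matrix[0])
    row_means = [sum(row) / n_items for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_subjects for j in range(n_items)]

    row_order = sorted(range(n_subjects), key=lambda i: -row_means[i])
    col_order = sorted(range(n_items), key=lambda j: -col_means[j])

    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], col_order


# Toy 3 x 4 matrix, invented for illustration only.
toy_subjects = ["model_a", "model_b", "model_c"]
toy_matrix = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
ordered, ordered_subjects, item_order = order_response_matrix(toy_matrix, toy_subjects)
```

With the toy input, the reordered matrix is upper-left triangular: every row's 1s sit to the left of its 0s, and stronger rows sit above weaker ones.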

[Figure: Tulu Human Evaluation response matrix, AI models (rows) against items (columns). Legend: 1 = correct, 0 = incorrect, blank = unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 (0% solve rate)

A list of all movies that meet the criteria given should be compiled.

movies directed by Spike Lee by release date

Subject outcomes

  • tulu_65B incorrect
  • tulu_7B incorrect
  • chatgpt incorrect
  • tulu_human_only_65B incorrect
Item 2 (33% solve rate)

Find sentences from reliable sources such as the Guardian or Forbes that contain the exact match for the given sentence or phrase. Also, mention the source of each sentence.

There are great options

Subject outcomes

  • tulu_65B correct
  • chatgpt correct
  • tulu_7B incorrect
  • tulu_human_only_65B incorrect
Item 3 (50% solve rate)

On the basis of the subject of the email, determine whether the email should be treated as spam or not.

Dermatologists don't like her!

Subject outcomes

  • tulu_7B correct
  • chatgpt correct
  • tulu_65B incorrect
  • tulu_human_only_65B incorrect
Item 4 (67% solve rate)

Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).

Subject outcomes

  • chatgpt correct
  • tulu_65B correct
  • tulu_7B incorrect
  • tulu_human_only_65B incorrect
Item 5 (75% solve rate)

Using a given amount, determine an appropriate tip.

14% tip on $47.50

Subject outcomes

  • chatgpt correct
  • tulu_65B correct
  • tulu_7B incorrect
  • tulu_human_only_65B incorrect
Item 6 (83% solve rate)

Provide an example of how a table of contents can be generated automatically in a LATEX document. Make sure your example contains the necessary commands.

Subject outcomes

  • tulu_65B correct
  • chatgpt correct
  • tulu_7B correct
  • tulu_human_only_65B incorrect
Item 7 (83% solve rate)

Give examples of popular shows and movies in the genre.

Genre: Crime TV Shows

Subject outcomes

  • chatgpt correct
  • tulu_65B correct
  • tulu_human_only_65B correct
  • tulu_7B incorrect
Item 8 (100% solve rate)

Explain the process of natural selection and how it contributes to the evolution and adaptation of species.

Subject outcomes

  • chatgpt correct
  • tulu_65B correct
  • tulu_7B correct
  • tulu_human_only_65B correct
Item 9 (100% solve rate)

What if Isaac Newton had focused on biology instead of physics?

Subject outcomes

  • tulu_65B correct
  • tulu_7B correct
  • chatgpt correct
  • tulu_human_only_65B correct
Item 10 (100% solve rate)

List the personality traits that are required to be successful in the given job.

Social Media Marketer

Subject outcomes

  • tulu_7B correct
  • tulu_65B correct
  • chatgpt correct
  • tulu_human_only_65B correct
Item 11 (100% solve rate)

Consider the best time of year to visit the given city, and provide your reasons for choosing that time.

Sydney, Australia

Subject outcomes

  • chatgpt correct
  • tulu_65B correct
  • tulu_7B correct
  • tulu_human_only_65B correct
Item 12 (100% solve rate)

We need to write a thematic tweet to share company news and facts. Please take a careful look at the facts and details provided and create a tweet based on them.

  • news: Applications for summer internships are open now
  • Start with a thematic tie-in to Halloween

Subject outcomes

  • tulu_65B correct
  • chatgpt correct
  • tulu_7B correct
  • tulu_human_only_65B correct

Subjects

The models, agents, and reward models evaluated.

4 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. chatgpt: 0.9086
  2. tulu_65B: 0.8035
  3. tulu_human_only_65B: 0.7265
  4. tulu_7B: 0.6889