A list of all movies that meet the criteria given should be compiled.
movies directed by Spike Lee by release date
Subject outcomes
- tulu_65B incorrect
- tulu_7B incorrect
- chatgpt incorrect
- tulu_human_only_65B incorrect
Human Preference & Alignment
Tulu 1 human evaluation: 332 prompt instances, each with completions from open instruction-tuned models (tulu_65B, tulu_7B, tulu_human_only_65B) and ChatGPT, graded by human annotators in pairwise comparisons. The subjects are the AI models; the response is the per-(model, item) human acceptability label (yes/no, encoded as 1/0). Only the human-evaluation slice is publicly released at the item level; the paper's automatic benchmark predictions (MMLU, GSM8K, BBH, etc.) are not released and are therefore excluded.
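The yes/no to 1/0 encoding described above can be sketched as follows. This is a minimal illustration, not the benchmark's own loader: the record layout, the names `records`, `LABEL_TO_SCORE`, and `build_response_matrix`, and the toy subjects and items are all assumptions made for the example.

```python
# Hypothetical long-format records: (subject, item_id, human acceptability label).
# The subjects, items, and labels below are invented for illustration only.
records = [
    ("model_a", "item_1", "yes"),
    ("model_a", "item_2", "no"),
    ("model_b", "item_1", "no"),
    ("model_b", "item_2", "no"),
]

# Acceptability labels map to binary responses: yes -> 1, no -> 0.
LABEL_TO_SCORE = {"yes": 1, "no": 0}


def build_response_matrix(records):
    """Return {subject: {item_id: 0 or 1}} from (subject, item_id, label) records."""
    matrix = {}
    for subject, item_id, label in records:
        # Normalize case and whitespace before looking up the label.
        score = LABEL_TO_SCORE[label.strip().lower()]
        matrix.setdefault(subject, {})[item_id] = score
    return matrix


matrix = build_response_matrix(records)
```

An unknown label raises a `KeyError` here rather than being silently coerced, which seems the safer default for a sketch like this.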
Response matrix
Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 4 subjects × 332 items, 100% of cells evaluated.
Scale: 1 = correct · 0 = incorrect
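One plausible way to produce the ordering described for the response matrix is to sort rows by mean response and columns by the share of subjects that succeeded. The viewer's actual ordering method is not stated, so the sketch below is an assumption; the function name `order_matrix` and the toy 4 × 5 data are hypothetical.

```python
# Hypothetical toy data: 4 subjects x 5 items, 1 = correct, 0 = incorrect.
# These values are invented and are not taken from the benchmark.
responses = {
    "model_a": [1, 1, 0, 1, 0],
    "model_b": [1, 0, 0, 1, 0],
    "model_c": [1, 1, 1, 1, 0],
    "model_d": [0, 0, 0, 1, 0],
}


def order_matrix(responses):
    """Sort rows by subject accuracy and columns by item easiness, both descending."""
    subjects = list(responses)
    n_items = len(responses[subjects[0]])

    # Mean response per subject (row) and per item (column).
    row_mean = {s: sum(responses[s]) / n_items for s in subjects}
    col_mean = [
        sum(responses[s][j] for s in subjects) / len(subjects)
        for j in range(n_items)
    ]

    # Strongest subjects first, easiest items first; ties keep their input order.
    row_order = sorted(subjects, key=lambda s: -row_mean[s])
    col_order = sorted(range(n_items), key=lambda j: -col_mean[j])

    ordered = [[responses[s][j] for j in col_order] for s in row_order]
    return row_order, col_order, ordered


row_order, col_order, ordered = order_matrix(responses)
for name, row in zip(row_order, ordered):
    print(name, row)
```

With this toy data the ones collect in the upper-left corner, which is the pattern the description refers to.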
Sample items
A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
Find sentences from reliable sources such as the Guardian or Forbes that contain the exact match for the given sentence or phrase. Also, mention the source of each sentence.
There are great options
Subject outcomes
On the basis of the subject of the email, determine whether the email should be treated as spam or not.
Dermatologists don't like her!
Subject outcomes
Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).
Subject outcomes
Using a given amount, determine an appropriate tip.
14% tip on $47.50
Subject outcomes
Provide an example of how a table of contents can be generated automatically in a LATEX document. Make sure your example contains the necessary commands.
Subject outcomes
Give examples of popular shows and movies in the genre.
Genre: Crime TV Shows
Subject outcomes
Explain the process of natural selection and how it contributes to the evolution and adaptation of species.
Subject outcomes
What if Isaac Newton had focused on biology instead of physics?
Subject outcomes
List the personality traits that are required to be successful in the given job.
Social Media Marketer
Subject outcomes
Consider the best time of year to visit the given city, and provide your reasons for choosing that time.
Sydney, Australia
Subject outcomes
We need to write a thematic tweet to share company news and facts. Please take a careful look at the facts and details provided and create a tweet based on them.
Subject outcomes
Subjects
4 subjects, ranked by mean response (accuracy) across this benchmark's items.
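The ranking by mean response can be sketched in a few lines. The function name `rank_subjects` and the toy scores are hypothetical and do not reflect the benchmark's actual results.

```python
# Hypothetical binary responses per subject (1 = correct, 0 = incorrect).
toy_matrix = {
    "model_a": [1, 0, 1, 1],
    "model_b": [0, 0, 1, 0],
    "model_c": [1, 1, 1, 1],
}


def rank_subjects(matrix):
    """Return [(subject, accuracy)] sorted from highest to lowest mean response."""
    accuracy = {subject: sum(row) / len(row) for subject, row in matrix.items()}
    return sorted(accuracy.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank_subjects(toy_matrix)
for rank, (subject, acc) in enumerate(ranking, start=1):
    print(f"{rank}. {subject}: {acc:.2f}")
```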