what is the word for clarified butter in the balkans and middle east
Subject outcomes
- FiD incorrect
- GPT-3.5 (text-davinci-003) incorrect
- ChatGPT (gpt-3.5-turbo) incorrect
- New Bing incorrect
Reasoning & Knowledge
EVOUNA (QA-Eval): per-(model, question) human correctness judgments for five Open-QA systems (FiD, GPT-3.5/text-davinci-003, ChatGPT, GPT-4, New Bing) answering the same open-domain questions from Natural Questions (3,610) and TriviaQA (2,000). A human annotator manually judged each model answer as correct or incorrect.
Response matrix
Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 5 subjects × 5,176 items, 98% of cells evaluated.
Scale: 1 = correct · 0 = incorrect
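The ordering described for the response matrix can be sketched as follows. This is a minimal illustration, not the page's actual method: it assumes rows and columns are sorted by mean accuracy, and that unevaluated cells (the page reports 98% coverage) are stored as `None` and skipped when averaging. The function name `order_response_matrix` and the toy subjects and items are hypothetical.

```python
from statistics import mean


def order_response_matrix(matrix, subjects, items):
    """Sort rows by descending subject accuracy and columns by descending item accuracy.

    matrix[i][j] is 1 (correct), 0 (incorrect), or None (cell not evaluated).
    Assumption: mean accuracy is the sort key; the source does not state its method.
    """
    def avg(cells):
        # Ignore unevaluated cells so they do not count as incorrect.
        scored = [c for c in cells if c is not None]
        return mean(scored) if scored else 0.0

    row_acc = [avg(row) for row in matrix]
    col_acc = [avg([row[j] for row in matrix]) for j in range(len(items))]

    # Strongest subjects first, easiest items first (ties keep their original order).
    row_order = sorted(range(len(subjects)), key=lambda i: -row_acc[i])
    col_order = sorted(range(len(items)), key=lambda j: -col_acc[j])

    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], [items[j] for j in col_order]


# Toy data (hypothetical, not taken from the benchmark).
subjects = ["model_a", "model_b", "model_c"]
items = ["q1", "q2", "q3", "q4"]
matrix = [
    [0, 1, 0, None],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]

ordered, ordered_subjects, ordered_items = order_response_matrix(matrix, subjects, items)
for name, row in zip(ordered_subjects, ordered):
    print(name, row)
```

With this ordering, correct answers (1) cluster toward the top-left corner and incorrect answers (0) toward the bottom-right, which is what makes the difficulty gradient visible at a glance.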
Sample items
A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
what is the word for clarified butter in the balkans and middle east
Subject outcomes
By the end of the 17th series of Who Wants To Be A Millionaire in 2005, how many people had sat in the hot seat?
Subject outcomes
what is the meaning of lord shiva lingam
Subject outcomes
locations for the film an englishman who went up a hill
Subject outcomes
where does the last name waller come from
Subject outcomes
who was the first executive president of guyana
Subject outcomes
How many 'Pockets' would you find on a pool table?
Subject outcomes
What is the term for nutrient enrichment of lakes?
Subject outcomes
what is the function of the pericardial sac
Subject outcomes
In 1998 at the age of 77 who became the oldest person to fly in space?
Subject outcomes
Sir William Walton's 'Crown Imperial Coronation March' was written for whose coronation?
Subject outcomes
when did united states enter world war i
Subject outcomes
Subjects
5 subjects, ranked by mean response (accuracy) across this benchmark's items.