
Reasoning & Knowledge

EVOUNA

EVOUNA (QA-Eval): per-(model, question) human correctness judgments for five Open-QA systems (FiD, GPT-3.5/text-davinci-003, ChatGPT, GPT-4, New Bing) answering the same open-domain questions from Natural Questions (3,610) and TriviaQA (2,000). A human annotator manually judged each model answer as correct or incorrect.

5,176 items
5 subjects
98% observed
Apache-2.0 license
general domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 5 subjects × 5,176 items, 98% of cells evaluated.
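The ordering described above can be sketched as a sort on row and column means. This is a minimal illustration on a toy matrix, not the page's actual implementation: the names `mean_observed`, `order_matrix`, and `observed_fraction` are hypothetical, and it assumes unobserved cells are simply skipped when averaging.

```python
from typing import Optional

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def mean_observed(cells: list[Cell]) -> float:
    """Mean of the observed cells; 0.0 if nothing was observed (assumption)."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else 0.0


def order_matrix(matrix: dict[str, list[Cell]]) -> tuple[list[str], list[int]]:
    """Return (row order, column order): strongest subject and easiest item first."""
    n_items = len(next(iter(matrix.values())))
    rows = sorted(matrix, key=lambda s: mean_observed(matrix[s]), reverse=True)
    cols = sorted(
        range(n_items),
        key=lambda j: mean_observed([matrix[s][j] for s in matrix]),
        reverse=True,
    )
    return rows, cols


def observed_fraction(matrix: dict[str, list[Cell]]) -> float:
    """Share of cells that carry a judgment (the 'observed' percentage)."""
    cells = [c for row in matrix.values() for c in row]
    return sum(c is not None for c in cells) / len(cells)


# Toy data, not taken from EVOUNA.
toy = {
    "model_a": [0, 1, 1, None],
    "model_b": [0, 0, 1, 1],
    "model_c": [1, 1, 1, 1],
}
rows, cols = order_matrix(toy)
print(rows)  # ['model_c', 'model_a', 'model_b']
print(cols)  # [2, 3, 1, 0]
```

Ties keep their original order because Python's `sorted` is stable, including with `reverse=True`.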


[Figure: EVOUNA response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · blank = unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
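The page does not define the solve rate shown on each item. One plausible reading is the share of observed subjects judged correct. A minimal sketch under that assumption, using the outcomes of the first two sample items (GPT-4 does not appear in the first item's outcome list and is treated here as unobserved; `solve_rate` is a hypothetical helper, not part of any EVOUNA tooling):

```python
from typing import Optional


def solve_rate(outcomes: dict[str, Optional[bool]]) -> Optional[float]:
    """Fraction of observed subjects that answered correctly.

    None marks an unobserved (subject, item) cell and is left out of the
    denominator; this exclusion is an assumption, not stated on the page.
    """
    observed = [v for v in outcomes.values() if v is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)


# First sample item (answer: smen): four observed subjects, none correct.
item_1 = {
    "FiD": False,
    "GPT-3.5 (text-davinci-003)": False,
    "ChatGPT (gpt-3.5-turbo)": False,
    "New Bing": False,
    "GPT-4": None,  # assumed unobserved
}

# Second sample item (answer: 1132): only New Bing correct.
item_2 = {
    "New Bing": True,
    "FiD": False,
    "GPT-3.5 (text-davinci-003)": False,
    "ChatGPT (gpt-3.5-turbo)": False,
    "GPT-4": False,
}

print(solve_rate(item_1))  # 0.0
print(solve_rate(item_2))  # 0.2
```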

Item 1 · 0% solve rate · answer: smen

what is the word for clarified butter in the balkans and middle east

Subject outcomes

  • FiD incorrect
  • GPT-3.5 (text-davinci-003) incorrect
  • ChatGPT (gpt-3.5-turbo) incorrect
  • New Bing incorrect
Item 2 · 20% solve rate · answer: 1132/1,132/one thousand, one hundred and thirty-two

By the end of the 17th series of Who Wants To Be A Millionaire in 2005, how many people had sat in the hot seat?

Subject outcomes

  • New Bing correct
  • FiD incorrect
  • GPT-3.5 (text-davinci-003) incorrect
  • ChatGPT (gpt-3.5-turbo) incorrect
  • GPT-4 incorrect
Item 3 · 40% solve rate · answer: reminds of the Omnipotent Lord

what is the meaning of lord shiva lingam

Subject outcomes

  • ChatGPT (gpt-3.5-turbo) correct
  • New Bing correct
  • FiD incorrect
  • GPT-3.5 (text-davinci-003) incorrect
  • GPT-4 incorrect
Item 4 · 60% solve rate · answer: Llanrhaeadr-ym-Mochnant/Llansilin in Powys

locations for the film an englishman who went up a hill

Subject outcomes

  • FiD correct
  • New Bing correct
  • GPT-4 correct
  • ChatGPT (gpt-3.5-turbo) incorrect
  • GPT-3.5 (text-davinci-003) incorrect
Item 5 · 80% solve rate · answer: Anglo-Saxon origin

where does the last name waller come from

Subject outcomes

  • GPT-3.5 (text-davinci-003) correct
  • GPT-4 correct
  • ChatGPT (gpt-3.5-turbo) correct
  • New Bing correct
  • FiD incorrect
Item 6 · 80% solve rate · answer: Arthur Chung/Forbes Burnham

who was the first executive president of guyana

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing incorrect
Item 7 · 100% solve rate · answer: six/6

How many 'Pockets' would you find on a pool table?

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing correct
Item 8 · 100% solve rate · answer: Eutrophication/Hypertrophication/Cultural eutrophication/Nutrient loading/Eutrofication/Euthrophication

What is the term for nutrient enrichment of lakes?

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing correct
Item 9 · 100% solve rate · answer: Lubricates the heart

what is the function of the pericardial sac

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing correct
Item 10 · 100% solve rate · answer: John Herschel Glenn/John Glenn, Jr./John H. Glenn/John H. Glenn Jr./John Glrnn/John Glenn/John Glenn Junior/J H Glenn/John Herschel Glenn Jr./John Herschel Glenn, Jr./Glenn, Jr., John Herschell/John Herschell Glenn, Jr./John H. Glenn, Jr.

In 1998 at the age of 77 who became the oldest person to fly in space?

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing correct
Item 11 · 100% solve rate · answer: Albert, Duke of York/HM King George VI/Prince Albert of Cornwall and York/George VI, King of England/Albert of York/George VI of Scotland/George VI of Britain/GVIR/Prince Albert of Wales/George VI of the United Kingdom/George VI of Great Britain and Ireland/George VI of England/Albert of Cornwall/King George VI/George vi/George VI of Ireland/Albert Frederick Arthur George/King george the 6th/George VI of the UK/Albert of Cornwall and York/George vi of the united kingdom/George VI of India/George VI of Great Britain/George VI, Emperor/George Vi/King George V1/GEORGE VI/George the Sixth/Albert of Wales/George VI/Prince Albert of York/George VI of Canada/King George VI of the United Kingdom/Prince Albert, Duke of York/Prince Albert of Cornwall

Sir William Walton's 'Crown Imperial Coronation March' was written for whose coronation?

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing correct
Item 12 · 100% solve rate · answer: April 1917

when did united states enter world war i

Subject outcomes

  • FiD correct
  • GPT-3.5 (text-davinci-003) correct
  • ChatGPT (gpt-3.5-turbo) correct
  • GPT-4 correct
  • New Bing correct

Subjects

The models, agents, and reward models evaluated.

5 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. New Bing: 0.8297
  2. GPT-4: 0.8294
  3. ChatGPT (gpt-3.5-turbo): 0.7583
  4. FiD: 0.7337
  5. GPT-3.5 (text-davinci-003): 0.7005
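To illustrate how a mean-response ranking like this is computed, the sketch below applies it to the 12 sample items shown earlier on this page. It assumes unobserved cells are excluded from each subject's mean, and that GPT-4 was unobserved on the first sample item because it is missing from that item's outcome list. The sample items are a hand-picked difficulty spread, so the values will not match the full-benchmark figures above.

```python
# Outcomes on the 12 sample items, in page order.
# 1 = correct, 0 = incorrect, None = unobserved (assumed for GPT-4 on the first item).
SAMPLE = {
    "FiD":                        [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
    "GPT-3.5 (text-davinci-003)": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "ChatGPT (gpt-3.5-turbo)":    [0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "GPT-4":                      [None, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "New Bing":                   [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
}


def mean_response(cells):
    """Accuracy over the observed cells only."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen)


# Rank subjects by mean response, highest first; ties keep dictionary order.
ranking = sorted(
    ((name, mean_response(cells)) for name, cells in SAMPLE.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.4f}")
# 1. New Bing: 0.8333
# 2. GPT-4: 0.8182
# 3. ChatGPT (gpt-3.5-turbo): 0.7500
# 4. FiD: 0.6667
# 5. GPT-3.5 (text-davinci-003): 0.6667
```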