Reasoning & Knowledge

RealTime QA

RealTime QA is a dynamic multiple-choice news QA benchmark in which new questions are released weekly, drawn from CNN and THE WEEK news quizzes. We ingest the published granular per-(model, question) baseline predictions from the official realtimeqa_public repository. Each response is one model's correctness on one weekly multiple-choice question, scored in {0,1}: 1 if the chosen 0-based choice index matches the gold answer, 0 otherwise. The data covers all weeks with released baselines (2022, 2023, 2026) and both the standard (qa) and none-of-the-above (nota) multiple-choice settings, with the retrieval method (closed-book / DPR / GCS) recorded as an evaluation condition. Free-text generation (_gen) outputs are excluded because they lack a gold choice index.
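The scoring rule described above can be sketched in a few lines. This is a minimal illustration only: the `Prediction` record and all of its field names are assumptions made for the example, not the actual schema of the realtimeqa_public files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    # All field names are hypothetical; the repository's real format may differ.
    model: str         # subject name, e.g. "T5 (closed-book QA)"
    question_id: str   # made-up identifier for one weekly question
    setting: str       # "qa" (standard) or "nota" (none-of-the-above)
    retrieval: str     # "closed-book", "DPR", or "GCS"
    chosen_index: int  # 0-based index of the choice the model picked
    gold_index: int    # 0-based index of the gold answer


def score(pred: Prediction) -> int:
    """Return 1 if the chosen choice index equals the gold index, else 0."""
    return int(pred.chosen_index == pred.gold_index)


hit = Prediction("T5 (closed-book QA)", "example-q1", "qa", "closed-book", 2, 2)
miss = Prediction("RAG (DPR reader)", "example-q1", "qa", "DPR", 0, 2)
print(score(hit), score(miss))
```

Keeping the setting and retrieval method on the record, rather than folding them into the model name, lets the same model contribute several responses to one question under different evaluation conditions.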

4,743 items
16 subjects
23% observed
MIT license
general domain
question_answering domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 16 subjects × 4,743 items, 23% of cells evaluated.
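As a rough illustration of the layout described above, the sketch below computes the observed fraction of a sparse response matrix and reorders rows and columns by their mean over observed cells. The page does not state the exact ordering rule it uses, so sorting by mean is an assumption, and the toy matrix is invented for the example.

```python
from typing import Optional

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def mean_observed(cells: list[Cell]) -> float:
    """Mean of the observed cells; 0.0 if nothing was observed."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else 0.0


def coverage(matrix: list[list[Cell]]) -> float:
    """Fraction of cells that hold an observed response."""
    total = sum(len(row) for row in matrix)
    observed = sum(c is not None for row in matrix for c in row)
    return observed / total if total else 0.0


def order_matrix(matrix: list[list[Cell]]):
    """Sort rows by descending accuracy and columns by descending solve rate."""
    n_cols = len(matrix[0])
    row_order = sorted(range(len(matrix)), key=lambda r: -mean_observed(matrix[r]))
    col_order = sorted(
        range(n_cols), key=lambda c: -mean_observed([row[c] for row in matrix])
    )
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


# Toy 3 x 3 matrix (invented data). The real matrix has 16 * 4743 = 75888 cells.
toy = [
    [0, None, 1],
    [1, 1, None],
    [None, 0, 1],
]
ordered, row_order, col_order = order_matrix(toy)
print(f"coverage: {coverage(toy):.0%}")
print(ordered)
```

With this ordering the strongest subjects and easiest items end up in the top-left corner, and unobserved cells stay `None` instead of being counted as incorrect.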

[Figure: RealTime QA response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate · answer: Sharm el-Sheikh, Egypt

King Charles has reportedly been asked by Liz Truss not to attend this year’s Cop27 climate conference, which is being held where?

Choices:

  0. Sharm el-Sheikh, Egypt
  1. Dubai, United Arab Emirates
  2. Lima, Peru
  3. None of the above

Subject outcomes

  • GPT-3 (text-davinci) incorrect
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 2 · 0% solve rate · answer: Limited resources and low enrollment

Why is the Red Clay Consolidated School District planning to "phase-out" Spanish immersion at Marbrook Elementary and Skyline Middle School?

Choices:

  0. High enrollment and demand for other programs
  1. Limited resources and low enrollment
  2. Changes in district leadership
  3. A decline in the quality of Spanish immersion programs

Subject outcomes

  • GPT-3 (text-davinci) incorrect
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 3 · 0% solve rate · answer: Elizabeth Gilbert

Which US author is pulling a novel set in Russia from publication?

Choices:

  0. Sue Monk Kidd
  1. Elizabeth Gilbert
  2. Helen DeWitt
  3. None of the above

Subject outcomes

  • GPT-3 (text-davinci) incorrect
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 4 · 17% solve rate · answer: May

The pandemic-era border restriction known as Title 42 is set to expire in which month?

Choices:

  0. March
  1. May
  2. October
  3. December

Subject outcomes

  • GPT-3 (text-davinci) incorrect
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 5 · 17% solve rate · answer: A good sauce

King Charles reportedly said that the “secret” to a good kebab is what?

Choices:

  0. A spicy marinade
  1. Organic meat
  2. A good sauce
  3. Gluten-free bread

Subject outcomes

  • T5 (closed-book QA) correct
  • GPT-3 (text-davinci) incorrect
  • RAG (DPR reader) incorrect

Item 6 · 33% solve rate · answer: Harry Styles

The 2023 Grammy Awards honored some of the world’s most talented musicians. Which artist won the show’s biggest prize, Album of the Year?

Choices:

  0. Harry Styles
  1. Adele
  2. Beyoncé
  3. Bad Bunny

Subject outcomes

  • GPT-3 (text-davinci) incorrect
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 7 · 38% solve rate · answer: New York City

Which American city got its first “full” casino this week?

Choices:

  0. Chicago
  1. San Francisco
  2. Los Angeles
  3. New York City

Subject outcomes

  • openai/gpt-5.4 correct
  • meta-llama/llama-4-maverick correct
  • anthropic/claude-opus-4.6 incorrect
  • google/gemini-2.5-pro incorrect
  • meta-llama/llama-4-scout incorrect
  • openai/gpt-5.3-chat incorrect

Item 8 · 50% solve rate · answer: None of the above

Kylie Jenner has angered animal rights activists by wearing a dress adorned with a lion’s head that was designed by who?

Choices:

  0. Valentino
  1. Viktor & Rolf
  2. Balmain
  3. None of the above

Subject outcomes

  • GPT-3 (text-davinci) correct
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 9 · 50% solve rate · answer: Dish Network

Which Satellite TV provider was hit with a $150,000 fine for failing to properly dispose of one of its satellites in space?

Choices:

  0. Cox
  1. Dish Network
  2. Spectrum
  3. DirecTV

Subject outcomes

  • GPT-3 (text-davinci) correct
  • T5 (closed-book QA) correct
  • RAG (DPR reader) correct

Item 10 · 67% solve rate · answer: 7.8

Search and rescue efforts continue in Turkey and Syria after a devastating earthquake this week. What was the magnitude of the quake that struck the region?

Choices:

  0. 5.1
  1. 6.3
  2. 7.8
  3. 9.5

Subject outcomes

  • GPT-3 (text-davinci) correct
  • T5 (closed-book QA) incorrect
  • RAG (DPR reader) incorrect

Item 11 · 83% solve rate · answer: Recent high inflation and increased repair costs caused by severe weather events.

What is one of the reasons cited by Delmarva Power for requesting the rate increase?

Choices:

  0. To fund grid improvements.
  1. Recent high inflation and increased repair costs caused by severe weather events.
  2. To provide a bonus to employees for their hard work.
  3. None of the above

Subject outcomes

  • GPT-3 (text-davinci) correct
  • RAG (DPR reader) correct
  • T5 (closed-book QA) incorrect

Item 12 · 83% solve rate · answer: 1,000

Approximately how many people have been arrested during the current protests in Iran?

Choices:

  0. 100
  1. 500
  2. 700
  3. 1,000

Subject outcomes

  • GPT-3 (text-davinci) correct
  • RAG (DPR reader) correct
  • T5 (closed-book QA) incorrect

Subjects

The models, agents, and reward models evaluated.

16 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. openai/gpt-5.2: 0.7395
  2. google/gemini-3-pro-preview: 0.7026
  3. openai/gpt-4.1: 0.6342
  4. anthropic/claude-sonnet-4.5: 0.6237
  5. anthropic/claude-opus-4.6: 0.6071
  6. anthropic/claude-sonnet-4.6: 0.5893
  7. openai/gpt-5.3-chat: 0.585
  8. meta-llama/llama-4-maverick: 0.5746
  9. openai/gpt-5.4: 0.5675
  10. anthropic/claude-haiku-4.5: 0.55
  11. meta-llama/llama-4-scout: 0.5484
  12. google/gemini-2.5-pro: 0.5295
  13. google/gemini-3.1-pro-preview: 0.5155
  14. GPT-3 (text-davinci): 0.4647
  15. T5 (closed-book QA): 0.3604
  16. RAG (DPR reader): 0.3318
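
A ranking by mean response can be reproduced from long-format records along the following lines. This is a sketch under stated assumptions: the `(subject, item_id, correct)` tuple layout and the toy records are invented for illustration, and the page does not say whether repeated responses to one item (for example under different retrieval conditions) are pooled the way they are here.

```python
from collections import defaultdict


def rank_subjects(records):
    """Rank subjects by mean correctness over their observed responses.

    records: iterable of (subject, item_id, correct) with correct in {0, 1}.
    Unobserved (subject, item) pairs simply do not appear in the records.
    """
    totals = defaultdict(lambda: [0, 0])  # subject -> [num_correct, num_observed]
    for subject, _item_id, correct in records:
        totals[subject][0] += correct
        totals[subject][1] += 1
    means = {s: c / n for s, (c, n) in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy records; subject and item names are placeholders.
records = [
    ("model-a", "q1", 1), ("model-a", "q2", 1), ("model-a", "q3", 0),
    ("model-b", "q1", 0), ("model-b", "q2", 1),
    ("model-c", "q3", 0),
]
ranking = rank_subjects(records)
for rank, (subject, mean) in enumerate(ranking, start=1):
    print(f"{rank}. {subject}: {round(mean, 4)}")
```

Because only 23% of cells are observed, subjects may be averaged over different subsets of items, so small gaps between adjacent ranks are best read with caution.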