Reasoning & Knowledge

RealTime QA

RealTime QA is a dynamic multiple-choice news QA benchmark in which new questions are released weekly, drawn from CNN and THE WEEK news quizzes. We ingest the published granular per-(model, question) baseline predictions from the official realtimeqa_public repository. Each response is one model's correctness in {0, 1} on one weekly multiple-choice question: 1 if the chosen 0-based choice index matches the gold answer, 0 otherwise. The data covers all weeks with released baselines (2022, 2023, 2026) in both the standard (qa) and none-of-the-above (nota) multiple-choice settings, with the retrieval method (closed-book, DPR, or GCS) recorded as an evaluation condition. Free-text generation (_gen) outputs are excluded because they lack a gold choice index.
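
As a minimal sketch of the per-item scoring described above, assuming each prediction and gold answer is available as a 0-based integer index. The function name and the record's field names are hypothetical and are not taken from the realtimeqa_public schema.

```python
from typing import Optional


def score_prediction(predicted_index: Optional[int], gold_index: int) -> int:
    """Return 1 if the chosen 0-based choice index equals the gold index, else 0.

    A missing prediction (None) is scored as incorrect; that is an assumption
    of this sketch, not a documented rule of the benchmark.
    """
    return int(predicted_index is not None and predicted_index == gold_index)


# Hypothetical record for one (model, question) pair; field names are illustrative.
record = {"question_id": "q-example", "prediction": 1, "answer": 1}
correct = score_prediction(record["prediction"], record["answer"])
```

Free-text (_gen) outputs cannot be scored this way, which is consistent with their exclusion: there is no gold index to compare against.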

4,743 items
16 subjects
23% observed
License: MIT
Domains: general, question_answering
Modality: text


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 16 subjects × 4,743 items, 23% of cells evaluated.

[Figure: RealTime QA response matrix, AI models (rows) against items (columns). Legend: Correct (1) · Incorrect (0) · Unobserved]
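
One way such an ordering could be produced is to sort rows and columns by their mean over observed cells. The sketch below assumes that approach; the page does not state the actual ordering method, and the toy matrix, subject names, and item ids are made up.

```python
from statistics import mean

# Toy response matrix: rows are subjects, columns are items.
# 1 = correct, 0 = incorrect, None = unobserved. All values are invented.
matrix = {
    "model_a": {"q1": 1, "q2": 1, "q3": None, "q4": 0},
    "model_b": {"q1": 1, "q2": 0, "q3": 0, "q4": None},
    "model_c": {"q1": None, "q2": 1, "q3": None, "q4": None},
}
items = ["q1", "q2", "q3", "q4"]


def observed_mean(values):
    """Mean over observed (non-None) cells; 0.0 if nothing was observed."""
    seen = [v for v in values if v is not None]
    return mean(seen) if seen else 0.0


# Strongest subjects first, easiest items first.
row_order = sorted(
    matrix, key=lambda s: observed_mean(matrix[s].values()), reverse=True
)
col_order = sorted(
    items, key=lambda i: observed_mean(matrix[s][i] for s in matrix), reverse=True
)

# Share of cells that were actually evaluated.
cells = [matrix[s][i] for s in matrix for i in items]
observed_fraction = sum(v is not None for v in cells) / len(cells)
```

For scale, 16 subjects × 4,743 items gives 75,888 cells, so 23% coverage corresponds to roughly 17,000 observed cells.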

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
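
A minimal sketch of how a per-item solve rate could be computed, assuming it is the share of observed responses to that item that are correct. The page does not define the statistic, so this definition and the function name are assumptions.

```python
def solve_rate(responses):
    """Fraction of observed 0/1 responses that are correct.

    Unobserved responses are passed as None and ignored.
    Returns None when the item has no observed responses.
    """
    observed = [r for r in responses if r is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)


# Six hypothetical responses to one item, one of them correct.
rate = solve_rate([1, 0, 0, 0, 0, 0])
print(f"{rate:.0%}")  # prints "17%"
```

Under this definition an item's solve rate can be based on more responses than the subject outcomes listed for it, for example when the same subject is evaluated under several retrieval conditions.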

Item 1 · 0% solve rate · answer: Sharm el-Sheikh, Egypt

King Charles has reportedly been asked by Liz Truss not to attend this year’s Cop27 climate conference, which is being held where?

Choices:
0. Sharm el-Sheikh, Egypt
1. Dubai, United Arab Emirates
2. Lima, Peru
3. None of the above

Subject outcomes

GPT-3 (text-davinci) incorrect
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 2 · 0% solve rate · answer: Limited resources and low enrollment

Why is the Red Clay Consolidated School District planning to "phase-out" Spanish immersion at Marbrook Elementary and Skyline Middle School?

Choices:
0. High enrollment and demand for other programs
1. Limited resources and low enrollment
2. Changes in district leadership
3. A decline in the quality of Spanish immersion programs

Subject outcomes

GPT-3 (text-davinci) incorrect
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 3 · 0% solve rate · answer: Elizabeth Gilbert

Which US author is pulling a novel set in Russia from publication?

Choices:
0. Sue Monk Kidd
1. Elizabeth Gilbert
2. Helen DeWitt
3. None of the above

Subject outcomes

GPT-3 (text-davinci) incorrect
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 4 · 17% solve rate · answer: May

The pandemic-era border restriction known as Title 42 is set to expire in which month?

Choices:
0. March
1. May
2. October
3. December

Subject outcomes

GPT-3 (text-davinci) incorrect
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 5 · 17% solve rate · answer: A good sauce

King Charles reportedly said that the “secret” to a good kebab is what?

Choices:
0. A spicy marinade
1. Organic meat
2. A good sauce
3. Gluten-free bread

Subject outcomes

T5 (closed-book QA) correct
GPT-3 (text-davinci) incorrect
RAG (DPR reader) incorrect

Item 6 · 33% solve rate · answer: Harry Styles

The 2023 Grammy Awards honored some of the world’s most talented musicians. Which artist won the show’s biggest prize, Album of the Year?

Choices:
0. Harry Styles
1. Adele
2. Beyoncé
3. Bad Bunny

Subject outcomes

GPT-3 (text-davinci) incorrect
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 7 · 38% solve rate · answer: New York City

Which American city got its first “full” casino this week?

Choices:
0. Chicago
1. San Francisco
2. Los Angeles
3. New York City

Subject outcomes

openai/gpt-5.4 correct
meta-llama/llama-4-maverick correct
anthropic/claude-opus-4.6 incorrect
google/gemini-2.5-pro incorrect
meta-llama/llama-4-scout incorrect
openai/gpt-5.3-chat incorrect

Item 8 · 50% solve rate · answer: None of the above

Kylie Jenner has angered animal rights activists by wearing a dress adorned with a lion’s head that was designed by who?

Choices:
0. Valentino
1. Viktor & Rolf
2. Balmain
3. None of the above

Subject outcomes

GPT-3 (text-davinci) correct
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 9 · 50% solve rate · answer: Dish Network

Which Satellite TV provider was hit with a $150,000 fine for failing to properly dispose of one of its satellites in space?

Choices:
0. Cox
1. Dish Network
2. Spectrum
3. DirecTV

Subject outcomes

GPT-3 (text-davinci) correct
T5 (closed-book QA) correct
RAG (DPR reader) correct

Item 10 · 67% solve rate · answer: 7.8

Search and rescue efforts continue in Turkey and Syria after a devastating earthquake this week. What was the magnitude of the quake that struck the region?

Choices:
0. 5.1

Subject outcomes

GPT-3 (text-davinci) correct
T5 (closed-book QA) incorrect
RAG (DPR reader) incorrect

Item 11 · 83% solve rate · answer: Recent high inflation and increased repair costs caused by severe weather events.

What is one of the reasons cited by Delmarva Power for requesting the rate increase?

Choices:
0. To fund grid improvements.
1. Recent high inflation and increased repair costs caused by severe weather events.
2. To provide a bonus to employees for their hard work.
3. None of the above

Subject outcomes

GPT-3 (text-davinci) correct
RAG (DPR reader) correct
T5 (closed-book QA) incorrect

Item 12 · 83% solve rate · answer: 1,000

Approximately how many people have been arrested during the current protests in Iran?

Choices:
0. 100
1. 500
2. 700
3. 1,000

Subject outcomes

GPT-3 (text-davinci) correct
RAG (DPR reader) correct
T5 (closed-book QA) incorrect

Subjects

The models, agents, and reward models evaluated.

16 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. openai/gpt-5.2 · 0.7395
2. google/gemini-3-pro-preview · 0.7026
3. openai/gpt-4.1 · 0.6342
4. anthropic/claude-sonnet-4.5 · 0.6237
5. anthropic/claude-opus-4.6 · 0.6071
6. anthropic/claude-sonnet-4.6 · 0.5893
7. openai/gpt-5.3-chat · 0.585
8. meta-llama/llama-4-maverick · 0.5746
9. openai/gpt-5.4 · 0.5675
10. anthropic/claude-haiku-4.5 · 0.55
11. meta-llama/llama-4-scout · 0.5484
12. google/gemini-2.5-pro · 0.5295
13. google/gemini-3.1-pro-preview · 0.5155
14. GPT-3 (text-davinci) · 0.4647
15. T5 (closed-book QA) · 0.3604
16. RAG (DPR reader) · 0.3318
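
The ranking above can be reproduced in outline as follows. This is a sketch that assumes responses arrive as (subject, item, score) triples and that the mean is taken over each subject's observed responses only; the function name and the toy data are hypothetical.

```python
def rank_subjects(responses):
    """Rank subjects by mean correctness over their observed responses.

    responses: iterable of (subject, item_id, score) with score in {0, 1}.
    Returns a list of (subject, mean_accuracy), highest mean first.
    """
    totals = {}
    for subject, _item, score in responses:
        correct, count = totals.get(subject, (0, 0))
        totals[subject] = (correct + score, count + 1)
    means = {s: c / n for s, (c, n) in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: the two subjects were scored on partly different items.
toy = [
    ("model_a", "q1", 1),
    ("model_a", "q2", 0),
    ("model_b", "q1", 1),
    ("model_b", "q3", 1),
]
ranking = rank_subjects(toy)
```

Because only 23% of cells are observed, subjects may have been scored on different subsets of items, so their means are not necessarily computed over the same questions.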
