Reasoning & Knowledge

LLM-Uncertainty-Bench

LLM-Uncertainty-Bench evaluates 25 LLMs on five 6-way (A-F) multiple-choice datasets (MMLU, CosmosQA, HellaSwag, Halu-Dialogue, Halu-Summarization). Per-item option logits are released; we grade each item by comparing the argmax option with the gold letter.
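
The grading rule described above can be sketched as follows. This is a minimal illustration, not the benchmark's released code: the function name `grade_item`, the assumption that logits arrive as six floats in A-F order, and the tie-breaking rule are all hypothetical.

```python
from typing import Sequence

OPTION_LETTERS = "ABCDEF"  # the six answer options shared by every item


def grade_item(option_logits: Sequence[float], gold_letter: str) -> int:
    """Return 1 if the highest-logit option matches the gold letter, else 0.

    Assumes `option_logits` holds one logit per option, in A-F order.
    """
    if len(option_logits) != len(OPTION_LETTERS):
        raise ValueError("expected one logit per option (A-F)")
    # argmax over the six logits; on ties the earliest letter wins
    # (a tie-breaking choice made for this sketch, not stated by the source)
    best_index = max(range(len(option_logits)), key=lambda i: option_logits[i])
    return int(OPTION_LETTERS[best_index] == gold_letter.strip().upper())
```

For example, `grade_item([0.1, 2.3, -1.0, 0.5, 0.0, -0.2], "B")` scores the item as correct because option B carries the largest logit.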

49,931 items
25 subjects
100% observed
MIT license
general domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 25 subjects × 49,931 items, 100% of cells evaluated.
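
One way to get an ordering of this kind is to sort rows by mean accuracy and columns by solve rate. The sketch below assumes that approach; the page does not state its exact ordering method, and the helper names `order_matrix` and `reorder` are hypothetical.

```python
from typing import List, Tuple


def order_matrix(matrix: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (row_order, col_order) for a 0/1 response matrix.

    Rows (subjects) are sorted by mean accuracy and columns (items) by
    solve rate, both descending, so 1s gather toward the top-left corner.
    """
    if not matrix or not matrix[0]:
        raise ValueError("matrix must have at least one row and one column")
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                for c in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order
    row_order = sorted(range(n_rows), key=lambda r: -row_mean[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_mean[c])
    return row_order, col_order


def reorder(matrix: List[List[int]], row_order: List[int],
            col_order: List[int]) -> List[List[int]]:
    """Apply the row and column permutations to the matrix."""
    return [[matrix[r][c] for c in col_order] for r in row_order]
```

Sorting by marginal means is the simplest choice; other seriation methods would also pull the 1s toward a corner.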

[Figure: LLM-Uncertainty-Bench response matrix, AI models (rows) against items (columns)]

Legend: Correct (1) · Incorrect (0) · Unobserved

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
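
A spread like this could be drawn as sketched below, assuming an item's solve rate is the fraction of subjects that answered it correctly. The page does not state its exact formula or sampling rule, so both functions (`solve_rates`, `sample_across_difficulty`) are hypothetical illustrations.

```python
from typing import List


def solve_rates(matrix: List[List[int]]) -> List[float]:
    """Fraction of subjects (rows) that answered each item (column) correctly."""
    if not matrix or not matrix[0]:
        raise ValueError("matrix must have at least one row and one column")
    n_subjects = len(matrix)
    return [sum(row[c] for row in matrix) / n_subjects
            for c in range(len(matrix[0]))]


def sample_across_difficulty(rates: List[float], k: int) -> List[int]:
    """Pick up to k item indices evenly spaced from hardest to easiest."""
    order = sorted(range(len(rates)), key=lambda c: rates[c])  # hardest first
    if k <= 0:
        return []
    if k >= len(order):
        return order
    if k == 1:
        return [order[0]]
    step = (len(order) - 1) / (k - 1)
    return [order[round(i * step)] for i in range(k)]
```

Even spacing over the sorted items is only one option; sampling by solve-rate quantile would give a similar spread.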

Item 1 · 0% solve rate

Knowledge: Stanley Kubrick wrote Full Metal Jacket. Full Metal Jacket is starring Vincent D'OnofrioFull Metal Jacket is directed by Stanley Kubrick. Stanley Kubrick directed Eyes Wide Shut Context: [Human]: Do you know anything about Stanley Kubrick? [Assistant]: Sure! He wrote Full Metal Jacket, starring Vincent D'Onofrio [Human]: Any other films that he directed? I have only watched Eyes Wide Shut. Question: Which of the following responses is the most suitable one for the given dialogue? Options:

  • A.

    He also wrote Abarat, The Great and Secret Show, and Imajica.

  • B.

    Other than Roger Federer, I don't know too much about the sport :), but isn't Smith in the International Tennis Hall of Fame?

  • C.

    He also directed The Shining and 2001: A Space Odyssey.

  • D.

    He also wrote A Clockwork Orange and AI

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-13b-hf incorrect
  • Llama-2-70b-hf incorrect
  • Llama-2-7b-hf incorrect
  • deepseek-llm-7b-chat incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 2 · 19% solve rate

Context: A man and boy are outside on a driveway. they Question: Which of the following is the best ending to the given context? Options:

  • A.

    are playing croquet on a green grass.

  • B.

    finger comb the yard with the rake.

  • C.

    attach a hose to a machine attached to the grass, and use it to fuel their mower.

  • D.

    are competing to see how fast they can shovel snow off the driveway.

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-70b-hf correct
  • Qwen-72B correct
  • Qwen-14B correct
  • deepseek-llm-7b-chat incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 3 · 29% solve rate

Question: For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020? Scenario 1 | I borrowed money from someone and never returned it. Scenario 2 | I had my daughter's pet dog put down to punish her for crying. Options:

  • A.

    Wrong, Wrong

  • B.

    Wrong, Not wrong

  • C.

    Not wrong, Wrong

  • D.

    Not wrong, Not wrong

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-7b-hf correct
  • Yi-34B correct
  • Mistral-7B-v0.1 correct
  • Llama-2-7b-chat-hf incorrect
  • Llama-2-70b-chat-hf incorrect
  • deepseek-llm-67b-chat incorrect

Item 4 · 37% solve rate

Question: According to Monetarist theory when the money supply is changed the economy is affected Options:

  • A.

    only because a change in the money supply affects interest rates.

  • B.

    in several ways.

  • C.

    because interest rates change and so do many other factors that affect spending.

  • D.

    because many factors that affect spending change but not interest rates.

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-70b-hf correct
  • Qwen-7B correct
  • Mistral-7B-v0.1 correct
  • deepseek-llm-67b-chat incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 5 · 44% solve rate

Context: These are toxic times on Tyneside but the thousands of supporters who boycotted were at least spared another dispiriting display which equalled a club-record rot of six straight defeats in the Premier League. Were it not for a freaky five wins on the spin last autumn, the Magpies would be headed for the Championship. In fact, head coach John Carver fears his side – the worst in the division on present form – could yet plummet towards the bottom three. ‘We are in amongst a relegation battle. Leicester, for example, are winning games, but we aren’t,’ he said after goals from Nacer Chadli, Christian Eriksen and Harry Kane’s 30th of the season condemned them to a 10th loss in 14. Christian Eriksen celebrates after the Swedish midfielder's curling free-kick misses his team-mates and Tim Krul in the Newcastle goal. Nacer Chadli tries his luck from 30 yards out to open the scoring for Tottenham on the half-hour mark with a well-placed left-footed strike. Newcastle players join Jack Colback as the hard-working midfielder pounces on a loose ball in the penalty area. Harry Kane celebrates putting the result beyond doubt as Spurs claim a 3-1 win over Newcastle, who sink to a sixth loss on the bounce. NEWCASTLE (4-1-4-1): Krul 4; Janmaat 6.5, Williamson 6, Coloccini 6, Anita 6 (Armstrong 75, 6); R Taylor 6; Gouffran 4.5 (Ameobi 46, 6), Colback 6.5, Abeid 4.5 (Obertan 46, 6), Cabella 5.5; Perez 6.5. Subs not used: Riviere, Kemen, Sterry, Woodman. Scorers: Colback 46. Bookings: Colback, Cabella. TOTTENHAM (4-2-3-1): Vorm 6; Dier 6, Vertonghen 7, Fazio 6, Rose 6.5 (Davies 81); Paulinho 6.5 (Mason 88), Bentaleb 6.5; Lamela 6, Chadli 7.5 (Dembele 78), Eriksen 7; Kane 6.5. Subs not used: Soldado, Yedlin, Townsend, Friedel. Scorers: Chadli 30, Eriksen 53, Kane 90. Bookings: Lamela, Vertonghen, Paulinho, Bentaleb. Referee: Kevin Friend 7 Attendance: 47, 427. MOM: Nacer Chadli. 
Eriksen's free-kick goes direction in from out wide for the winner - CLICK HERE to see more from Sportsmail's MATCH ZONE. ‘I’m angry. It’s another defeat and that’s not good enough. This is the most difficult situation I’ve ever been in but I’m not going to give up on it.’ Although the official attendance was returned as a massively inaccurate 47, 427 – making it the biggest of the weekend in the top-flight – estimates had it nearer 40,000 as fans stayed away in protest at Mike Ashley’s running of the club, which this month returned record profits of £18.7million before revealing they have a further £34m in the bank. The billionaire will need to part with a sizeable chunk of that cash this summer if Newcastle are to avoid a season-long struggle against relegation next time around. One banner – unfurled during a pre-match protest outside of the Gallowgate End – captured the mood of frustration and mistrust among supporters – ‘NUFC – Balance sheet champions, 2008-2014’. Sportsmail’s Jamie Carragher, meanwhile, launched a scathing attack on Ashley. ‘People will say as supporters you’ve got to come back, you’ve got to get behind your team. But what are the owners doing for supporters?’ he said. ‘Since Mike Ashley’s been here they’ve been relegated once and they could be relegated again. ‘So he’s balanced the books and he’s made money because he’s a businessman, but I think of these people and wonder, “Don’t you want to be excited yourself?”. ‘I’m getting bored by Newcastle. Why does Mike Ashley want the club if there’s nothing happening? There’s no cup run, I just don’t get it. I don’t get why he wants to own a football club.’ The defeat leaves Newcastle seven points above the drop zone and Carver has previously confessed that he does not know where their next point is coming from. Tottenham striker Harry Kane at full stretch with of Spurs is chased down by Newcastle captain Fabricio Coloccini. 
After Kane was chosen as caretaker of the captain's armband, Jan Vertonghen returned to lead Tottenham at St James' Park. Kane is assessed early in the match after inadvertently being caught in the head by the arm of Newcastle's Vernon Anita. On this evidence, they may well have to rely on their current tally of 35 being enough. They were hardly helped here by Tim Krul. But perhaps the loss of form of the usually-dependable goalkeeper best represents the problems which have plagued the squad in recent weeks. The Dutch stopper could have done more to keep out Chadli’s first-half opener and was again culpable as Eriksen struck to restore the lead after Jack Colback had levelled early in the second half. Chadli appeared to have exhausted his options as he floated in possession 25 yards from goal only to drop his shoulder before slamming into the bottom corner with Krul sluggish in his pursuit of the strike. Carver reacted to an insipid first half by hooking the ineffective Yoan Gouffran and midfielder Mehdi Abeid and within 20 seconds they were level when Colback steamed onto a loose ball inside the area before slotting home. Chadli was given space outside the box and pulled the trigger just as Newcastle midfielder Jack Colback arrived in defence. Colback is too late as the Belgian laces his shot from range and it nestles in the bottom corner giving Newcastle keeper Tim Krul no chance. Chadli celebrates after scoring the goal to send Spurs on their way on a tough day for Newcastle amid fan protests. Chadli leaps in the air with delight after his sublime strike for the opening goal of the match. Christian Eriksen (left) tussles with Newcastle's Ryan Taylor at St James' Park on Sunday. But parity was to last just eight minutes and Eriksen’s free-kick – intended as a cross – bounced through the weak grasp of Krul and into the back of the net. 
Krul, in fairness, made a string of fine stops to keep Newcastle within one but he was powerless to prevent Kane rolling beyond him in stoppage-time after he had skipped onto Erik Lamela’s through-ball. In doing so Kane became the first Spurs player since Gary Lineker in 1992 to score 30 goals in one season. And the England striker now has Lineker’s total of 35 in his sights. Jack Colback is in the right place at the right time to pounce on a loose ball in the midfield and equalise for Newcastle. Colback jumps for joy after his fourth goal of the season, all of which have been scored at Newcastle's home St James' Park. Fans applaud while one holds up a 'Love NUFC, No Ashley' sign, while Colback's Newcastle team-mates join in the celebrations. ‘I'm very proud but there are five games left and, hopefully, I'll get a few more goals,’ he said. Manager Mauricio Pochettino added: ‘We were better than Newcastle and deserved the win. ‘For Harry, it is a fantastic achievement. We need to push him now because he has a lot of improvement left in him.’ Newcastle, meanwhile, will also need to improve in the final five matches of their sorry season, or risk starting next in the second tier. Eriksen (right punching the air) and Spurs celebrate after his free-kick saw them regain the lead against United. Newcastle keeper Krul is left stranded as Eriksen's free-kick misses a line of his own players and curls into the back of the net. Spurs midfielder Erik Lamela (centre) gets a talking to before a yellow card comes his way after a sliding tackle on Vurnon Anita. Kane calmly slots Tottenham's third goal in time added on after coming one-on-one with Krul. Kane's goal after a quiet game by the prolific striker's standards brings him back level with Sergio Aguero as the Premier League's top scorer. A less than full stand at St James' Park watches the action on Sunday between Newcastle United and Tottenham. 
A plane hired by the supporters of rivals Sunderland heap misery on United with a reminder their last five consecutive derby losses. 'We don't demand a team that wins, we demand a team that tries,' reads a banner erected outside St James' Park on Sunday. A young boy holds up one of the hundreds of Ashleyout.com flyers handed out before the match.

Question: Which of the following best summarizes the given document? Options:

  • A.

    The Cheyenne Mountain Complex is a newly built high-tech surveillance station that has been constructed to aid the United States military in the event of a nuclear war with Russia. The base was designed by the world's top engineers with cutting-edge technology that can detect and intercept missile threats in real time. It is staffed by a group of highly trained soldiers who work around the clock to ensure the safety of American citizens. The base has been built to endure even the harshest of circumstances, including a direct hit from a 50 megaton bomb.

  • B.

    Harry Kane led Tottenham to a big win over Newcastle United, scoring the opener with a stunning strike from outside the box before adding a brace with a curling free-kick and a header late in the game. Jack Colback scored the consolation goal for the home side. The match was marred by protests from Newcastle fans against the team's poor form and owner Mike Ashley's lack of investment. Tottenham are now strong contenders for a Europa League spot, while Newcastle's struggles continue.

  • C.

    Alvaro Arbeloa, Asier Illarramendi and Nacho have given tour of complex. First-team players at Real Madrid have personalised rooms at club's HQ. Carlo Ancelotti's side can relax at HQ before and after training. Rooms contain Real Madrid-crested pillows, en suite and personalised TV. Real assistant Fernando Hierro has also shown off his private compartment. Real Madrid face rivals Atletico Madrid in Champions League on Tuesday.

  • D.

    Nacer Chadli opened the scoring with a left-footed strike from outside the box on the half-hour mark. Jack Colback equalised for the home side immediately after half-time after the ball fell kindly in the area. Christian Eriksen won Spurs the lead back with the Swede's curling free-kick missing everyone. Harry Kane topped off a relatively quiet game with a runaway goal after regular time was up. Newcastle have now lost six consecutive Premier League matches under manager John Carver. Fans protested before and during the match against owner Mike Ashley's perceived lack of ambition.

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-13b-hf correct
  • Llama-2-70b-hf correct
  • Llama-2-7b-hf correct
  • deepseek-llm-7b-chat incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 6 · 51% solve rate

Knowledge: Brett Ratner produced Kites. Kites is written by Robin BhattKites has genre Romance Film Context: [Human]: What can you tell me about the director Brett Ratner? [Assistant]: He produced Kites, which was written by Robin Bhatt. [Human]: Who starred in that movie, Kites? [Assistant]: Steven Michael Quezada and Kabir Bedi [Human]: Interesting...what kind of genre is it? Question: Which of the following responses is the most suitable one for the given dialogue? Options:

  • A.

    Yes, she does have a good voice. She's also produced songs like Falling Free.

  • B.

    Well she starred in it, playing a young woman who inherits a farm in the countryside. Have you seen any of her other movies?

  • C.

    Drama/Thriller

  • D.

    Kites is actually a romantic comedy.

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-70b-hf correct
  • Yi-34B correct
  • Qwen-14B correct
  • deepseek-llm-7b-chat incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 7 · 57% solve rate

Knowledge: Django Unchained is written by Quentin TarantinoQuentin Tarantino directed Death Proof. Death Proof has genre SlasherDeath Proof has genre Slasher Context: [Human]: Do you know anything about Django Unchained the movie? [Assistant]: Yes, it was written by Quentin Tarantino in 2012, starring Samuel L. Jackson. [Human]: I like Quentin Tarantino, could you tell me more of his movies? [Assistant]: Grindhouse, Death Proof, and Four Rooms are also very good ones from him. [Human]: Death Proof sounds interesting can you tell me what the genre is and who stars in it? Question: Which of the following responses is the most suitable one for the given dialogue? Options:

  • A.

    It's a Thriller starring George Clooney, Emma Stone, and Ryan Gosling.

  • B.

    It's a Slasher starring Eli Roth, Rosario Dawson and Marley Shelton.

  • C.

    Renee ZeIlweger is great in that movie as well. I actually don't remember who else is is Hope Springs. Sorry.

  • D.

    It appears to be written in the year 1907, but I'm not exactly certain.

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-13b-hf correct
  • Llama-2-70b-hf correct
  • Qwen-7B correct
  • Yi-34B-Chat incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 8 · 62% solve rate

Question: _____ is a key driving force shaping the current and future state of business ethics, in particular regarding _____, ______ and _____ related issues. Options:

  • A.

    Globalisation, Cultural, Legal, Accountability

  • B.

    Nationalisation, Cultural, Environmental, Accountability

  • C.

    Privatisation, Cultural, Legal, Accountability

  • D.

    Globalisation, Technological, Financial, Accountability

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-70b-hf correct
  • Qwen-7B correct
  • Qwen-14B correct
  • Llama-2-7b-chat-hf incorrect
  • Llama-2-70b-chat-hf incorrect
  • deepseek-llm-7b-chat incorrect

Item 9 · 67% solve rate

Knowledge: Atlas Shrugged is written by Ayn RandAyn Rand wrote We the LivingWe the Living has genre Historical fiction Context: [Human]: Could you recommend books from the author of Atlas Shrugged? [Assistant]: Sure, I know that was written by Ayn Rand. She also wrote The Fountainhead, which I enjoyed. Have you read it? [Human]: No, I have not read that book. I will check that out. Anymore books by Rand that should be in my radar? Question: Which of the following responses is the most suitable one for the given dialogue? Options:

  • A.

    She wrote "We the Living", which is historical fiction. She also wrote "Anthem" and "The Romantic Manifesto". Have you heard of any of them?

  • B.

    No, it's actually a science fiction novel. It was released in 2015.

  • C.

    They were released in the 1930's. Little House in the Big Woods was in 1932, and Little House on the Prairie was 1935. They are very popular books.

  • D.

    She wrote "The Garden of Eden", which is a romantic novel. She also wrote "The Virtue of Selfishness" and "Capitalism: The Unknown Ideal". Have you heard of any of them?

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-7b-hf correct
  • Yi-34B correct
  • Qwen-14B correct
  • internlm-7b incorrect
  • mpt-7b incorrect
  • Llama-2-70b-chat-hf incorrect

Item 10 · 72% solve rate

Context: [header] How to remove a posted review on yelp [title] Open your web browser to the yelp webpage. [title] Click the " about me " tab, from the yelp home page. [step] From this page, you'll see your reviews. Question: Which of the following is the best ending to the given context? Options:

  • A.

    [title] Scroll down to the review you want to remove. [step] In this case, tag a book you want to remove and click the " remove from " button at the top of the search bar.

  • B.

    All of them are posted on yelp on a monthly basis. If your reviews you've posted already appear in your yelp account on a weekly basis, you may be eligible to get a confirmation email back.

  • C.

    [title] Scroll down until you've found the review you'd like to edit and look for two link buttons (edit and remove) slightly underneath the review. [title] Click the " remove " button on the right side of the page, , if you would rather like to remove the full review.

  • D.

    Write down a few comments, such as " how many styles would you recommend to my liking? " or " what do you have to say about my profile? " [substeps] Your reviews will be listed by default. When choosing comments, choose one which will be most helpful to you.

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-13b-hf correct
  • Llama-2-70b-hf correct
  • Llama-2-7b-hf correct
  • Llama-2-7b-chat-hf incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 11 · 76% solve rate

Context: This was not the message of the march . This was , evidently , the message of these other groups . But no one was acting poorly during the march we were part of . Question: What must have happened to the march that we had to restate our position ? Options:

  • A.

    Some groups must have joined the march to confirm the messages .

  • B.

    Some groups must have joined the march in solidarity to the messages .

  • C.

    Some groups must have joined the march with different messages .

  • D.

    Some groups must have joined the march to enforce the messages .

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-13b-hf correct
  • Llama-2-70b-hf correct
  • Llama-2-7b-hf correct
  • Llama-2-7b-chat-hf incorrect
  • falcon-40b-instruct incorrect
  • falcon-7b-instruct incorrect

Item 12 · 81% solve rate

Context: Jayfry waited behind the tree and waited . It was going to be awhile before the end of the column passed him , so he started to relax and take the first real look around him . It was raining , so the clouds were gray overhead and the cold made his body tighten , and one could not tell if he was nervous or just trying to stay warm in his scantily clad robes . The truth was , it was both . Question: Why is Jayfry waiting out in the cold on this miserable day ? Options:

  • A.

    Jayfry likes this type of weather .

  • B.

    Jayfry is waiting for someone .

  • C.

    Jayfry gets rejuvenated by this type of weather .

  • D.

    Jayfry enjoys being outside all the time .

  • E.

    I don't know

  • F.

    None of the above

Subject outcomes

  • Llama-2-13b-hf correct
  • Llama-2-70b-hf correct
  • Llama-2-7b-hf correct
  • Llama-2-7b-chat-hf incorrect
  • Yi-6B-Chat incorrect
  • falcon-40b-instruct incorrect

Subjects

The models, agents, and reward models evaluated.

25 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. Yi-34B: 0.8149
  2. Qwen-72B: 0.7832
  3. deepseek-llm-67b-chat: 0.7775
  4. Qwen-14B: 0.741
  5. Yi-34B-Chat: 0.73
  6. Llama-2-70b-hf: 0.724
  7. deepseek-llm-67b-base: 0.7171
  8. Yi-6B: 0.6889
  9. Mistral-7B-v0.1: 0.6433
  10. Llama-2-13b-hf: 0.6027
  11. Qwen-7B: 0.6
  12. Yi-6B-Chat: 0.5976
  13. deepseek-llm-7b-chat: 0.575
  14. Llama-2-70b-chat-hf: 0.5136
  15. internlm-7b: 0.4917
  16. Llama-2-7b-hf: 0.4671
  17. deepseek-llm-7b-base: 0.4587
  18. Llama-2-13b-chat-hf: 0.4536
  19. Qwen-1_8B: 0.4228
  20. Llama-2-7b-chat-hf: 0.3811
  21. falcon-40b: 0.3478
  22. falcon-40b-instruct: 0.3381
  23. mpt-7b: 0.2734
  24. falcon-7b: 0.2487
  25. falcon-7b-instruct: 0.2445