Reasoning & Knowledge

PertEval

PertEval measures the real knowledge capacity of LLMs by applying knowledge-invariant perturbations to MMLU questions, that is, rephrasings that preserve the knowledge required to answer. Each MMLU item is recast as an option-judgement task: the model labels each of the 4 options True or False, and an item counts as correct if and only if the full True/False vector matches the gold vector.
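The all-or-nothing scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's actual grader: the function names `parse_judgement` and `is_correct` are hypothetical, and the lenient regex parsing is an assumption made because the sample responses on this page mix quoted labels (`"True"`), bare labels (`True`), and trailing explanations.

```python
import re

# Option labels as they appear in the judgement-style prompts on this page.
OPTION_KEYS = ("U)", "V)", "W)", "X)")


def parse_judgement(response: str):
    """Extract one True/False label per option from a raw model response.

    Returns a list of four booleans, or None if any option label is missing.
    Both quoted ("True") and bare (True) values are accepted (an assumption).
    """
    labels = []
    for key in OPTION_KEYS:
        pattern = r'"' + re.escape(key) + r'"\s*:\s*"?(True|False)"?'
        match = re.search(pattern, response)
        if match is None:
            return None
        labels.append(match.group(1) == "True")
    return labels


def is_correct(response: str, gold) -> bool:
    """All-or-nothing rule: every option label must match the gold vector."""
    parsed = parse_judgement(response)
    return parsed is not None and parsed == list(gold)
```

For example, with gold `[True, False, False, False]`, a response that labels both U) and W) as True is scored incorrect even though three of its four labels match, because the rule gives no partial credit.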

4,885 items
6 subjects
50% observed
Apache-2.0 license
general domain
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 6 subjects × 4,885 items, 50% of cells evaluated.


PertEval response matrix: AI models (rows) against items (columns)
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect
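The page does not state exactly how the matrix is ordered. One plausible reading, shown here as a sketch under that assumption, is to sort rows by each model's mean score and columns by each item's solve rate, computing both over observed cells only. The names `mean_observed` and `order_matrix` are hypothetical, and `None` stands in for an unobserved cell.

```python
def mean_observed(values):
    """Mean of the observed (non-None) entries, or None if nothing was observed."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed) if observed else None


def order_matrix(matrix, models):
    """Reorder a 0/1/None response matrix so strong models and easy items come first.

    matrix[i][j] is 1 (correct), 0 (incorrect), or None (unobserved) for
    model i on item j. Returns (ordered model names, column order, ordered matrix).
    """
    n_items = len(matrix[0])
    row_means = [mean_observed(row) for row in matrix]
    col_means = [mean_observed([row[j] for row in matrix]) for j in range(n_items)]

    # Rows or columns with no observations sort last.
    def sort_key(mean):
        return -1.0 if mean is None else mean

    row_order = sorted(range(len(matrix)), key=lambda i: sort_key(row_means[i]), reverse=True)
    col_order = sorted(range(n_items), key=lambda j: sort_key(col_means[j]), reverse=True)

    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return [models[i] for i in row_order], col_order, ordered


# Toy data, not taken from the benchmark.
models = ["model_a", "model_b", "model_c"]
matrix = [
    [0, None, 1, 0],
    [1, 1, None, 1],
    [None, 0, 1, 1],
]
names, col_order, ordered = order_matrix(matrix, models)
```

Averaging over observed cells only matters here because half of the cells are unobserved; treating a missing cell as 0 would penalise models that were simply evaluated on fewer items.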

Sample items

What the questions look like, and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: [false, false, true, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) Increased lymphocyte count V) Increased basophil count W) Decreased macrophage activity X) Decreased AM serum cortisol concentration Question: A researcher is examining how frequently medical students catch the common cold at different intervals throughout the academic year. Findings indicate a rise in upper respiratory tract infections in these students throughout the finals period. The theory suggests that the strain linked to exam preparation may weaken immune responses, thereby heightening the students' vulnerability to infections. During finals week, which laboratory result from the medical students would best validate the hypothesis that exam stress weakens the immune system and raises susceptibility to infections? Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • Meta-Llama-3-8B-Instruct incorrect

    {"U)": "True", "V)": "False", "W)": "True", "X)": "True"}

    Explanation:

    • Increased lymphocyte count (U) would be expected if the immune system is responding to an infection, which is the opposite of what the hypothesis suggests. Therefore, it is incorrect.
    • Increased basophil count (V) is not directly related to the immune response to infection, so it is unlikely to be relevant to the hypothesis. Therefore, it is incorrect.
    • Decreased macrophage activity (W) would be consistent with a weakened immune response, which is in line with the hypothesis. Therefore, it is correct.
    • Decreased AM serum cortisol concentration (X) is a common response to stress, and low cortisol levels can …
  • Mistral-7B-Instruct-v0.2 incorrect

    {"U)": False, "V)": False, "W)": False, "X)": True}

    Explanation: The question asks for a laboratory result that would validate the hypothesis that exam stress weakens the immune system and raises susceptibility to infections. Decreased lymphocyte count (U) and basophil count (V) would not necessarily indicate a weakened immune system, as they can have various causes. Decreased macrophage activity (W) could potentially be a sign of a weakened immune system, but it's not explicitly stated in the question that an increase in infections is associated with decreased macrophage activity during finals week. However, the question mentions an increase in upper respiratory tract infections, which …

Item 2 · 0% solve rate · answer: [true, false, false, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) Salmonella typhi V) Giardia lamblia W) Entamoeba histolytica X) Clostridium difficile Question: A young man, aged 22, visited the doctor reporting five days of diarrhea following the conclusion of his third antibiotic treatment for mastoiditis. Upon physical assessment, a mild, diffuse abdominal discomfort was observed when the abdomen was palpated. Growth of bacteria on hektoen enteric agar shows a positive result. The probable causative organism responsible for the diarrhea is Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • Meta-Llama-3-8B-Instruct incorrect

    {"U)": False, "V)": False, "W)": False, "X)": True}

  • Mistral-7B-Instruct-v0.2 incorrect

    {"U)": False, "V)": False, "W)": False, "X)": "True"}

    The most likely causative organism for the described symptoms and lab results is Clostridium difficile (X). The patient is a young man, and the diarrhea occurred following antibiotic treatment. C. difficile is a common cause of diarrhea in these circumstances. Salmonella typhi (U), Giardia lamblia (V), and Entamoeba histolytica (W) typically cause different types of diarrhea and don't typically present with these symptoms following antibiotic treatment.

Item 3 · 0% solve rate · answer: [true, false, false, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) acknowledging the limits of their data or conclusions V) providing access to psychometric information W) refusing to testify if they have had 2 prior professional relationship with one of the parties X) stating conclusions in tentative terms Question: In line with the ethical guidelines issued by both the American and Canadian Psychological Associations, psychologists involved in forensic roles must steer clear of presenting deceptive evidence or documents by Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • gemini-1.0-pro incorrect

    {"U)": "True", "V)": "False", "W)": "True", "X)": "True"}

  • glm-3-turbo incorrect

    {"U)": "False", "V)": "False", "W)": "False", "X)": "True"}

  • gpt-3.5-turbo incorrect

    {"U)": "True", "V)": "False", "W)": "True", "X)": "True"}

  • gpt-4-turbo incorrect

    {"U)": "True", "V)": "False", "W)": "False", "X)": "True"}

Item 4 · 0% solve rate · answer: [true, false, false, false]

Please select the correct option(s) from the following options given the question: Question: In the xy-plane, the graph of x^(log y) = y^log x is Options: A the open first quadrant B a closed curve C a ray in the open first quadrant D a single point Your output must strictly follow this format: {"answer": <the list of selected options, e.g., ["A", "B", "C", "D"]>}

How subjects answered

  • gemini-1.0-pro incorrect

    {"answer": ["D"]}

  • glm-3-turbo incorrect

    {"answer": ["C", "D"]}

Item 5 · 25% solve rate · answer: [true, false, false, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) The chartering of companies to assume costs and share profits V) Martial law enforced by sizable armies W) Complete privatization of colonial authority X) Direct rule by the crown Question: The inquiry pertains to the information provided below. Deciding whether to maintain or end the tradition of sati carries with it a tremendous burden of responsibility. Agreeing to allow, year after year, the senseless sacrifice of numerous innocent lives to a harsh and premature fate, when there is the capacity to stop it, represents a dilemma that fills any moral being with dread. Conversely, to risk the stability of the British Empire in India by pursuing an opposite action could be viewed as an even more severe detriment. In times when formidable adversaries and significant doubts about our security prevailed, a more prudent approach might have been advisable; however, now that we hold paramount power, I strongly advocate for a comprehensive and explicit ban. In his 1829 address "On the Suppression of Sati," William Bentinck, the Governor-General of India, stated that after closely observing the temperament and dispositions of the Indians [engaged in commerce with us], he has reached the conclusion, despite it clashing with our own values, that they require strict governance to maintain proper order, asserting that the best method to ensure this is by making them realize their reliance on [the provisions and industrial products provided by us]. George Simpson, Chief of the Northern Division, Hudson's Bay Company, 1826 The writers of both excerpts were affiliated with organizations that employed which of the following strategies typically utilized by Western colonial powers? Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • gemini-1.0-pro correct

    {"U)": True, "V)": False, "W)": False, "X)": False}

  • glm-3-turbo incorrect

    {"U)": "False", "V)": "False", "W)": "False", "X)": "True"}

  • gpt-3.5-turbo incorrect

    { "U)": True, "V)": False, "W)": False, "X)": True }

  • gpt-4-turbo incorrect

    {"U)": "True", "V)": "False", "W)": "False", "X)": "True"}

Item 6 · 25% solve rate · answer: [true, false, false, false]

Please select the correct option(s) from the following options given the question: Question: Suppose A and B are n × n matrices with real entries. Which of the follow are TRUE? I. The trace of A^2 is nonnegative. II. If A^2 = A, then the trace of A is nonnegative. III. The trace of AB is the product of the traces of A and B. Options: A II only B III only C I and II only D II and III only Your output must strictly follow this format: {"answer": <the list of selected options, e.g., ["A", "B", "C", "D"]>} Your output:

How subjects answered

  • gpt-4-turbo correct

    {"answer": ["A"]}

  • gpt-3.5-turbo incorrect

    {"answer": ["D"]}

  • Meta-Llama-3-8B-Instruct incorrect

    {"answer": ["C"]}

  • Mistral-7B-Instruct-v0.2 incorrect

    {"answer": ["I", "II and III"]}

    Explanation:

    I. The trace of a square matrix A is equal to the sum of its diagonal entries. The square of a matrix A results in a matrix where each element is the square of the corresponding entry in A. Since any number squared is nonnegative, the sum of the squares of the diagonal entries is nonnegative, making the trace of A^2 nonnegative.

    II. If A^2 = A, then A is idempotent. An idempotent matrix has the property that multiplying it by itself results in the same matrix. For any matrix A, the trace of its power is equal to the sum of the eigenvalues of the matrix. For an idempotent matrix A, all eigenvalues are either 0 or 1. Since the sum of non-zero …

Item 7 · 50% solve rate · answer: [true, false, false, false]

Please select the correct option(s) from the following options given the question: Question: A 33-year-old woman comes to the physician because of a 2-day history of mild nausea, increased urinary urgency and frequency, and constipation. She also has had a 4.5-kg (10-lb) weight loss during the past 2 weeks and a 3-week history of vaginal bleeding. Pelvic examination shows a nodular cervix with an irregular, friable posterior lip, and a rock-hard, irregular, immobile pelvic mass that extends across the pelvis. Examination of biopsy specimens from the cervix and anterior wall of the vagina show well-differentiated keratinizing squamous cell carcinoma. Which of the following best describes the pathogenesis of this patient's disease? Options: A Inactivation of cellular p53 B Insertion of viral promotors adjacent to cellular growth factor genes C Specialized transduction D Transactivation of cellular growth factor genes by TAX Your output must strictly follow this format: {"answer": <the list of selected options, e.g., ["A", "B", "C", "D"]>}

How subjects answered

  • gemini-1.0-pro correct

    {"answer": ["A"]}

  • glm-3-turbo incorrect

    {"answer": ["A", "B"]}

Item 8 · 50% solve rate · answer: [true, false, false, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) content validity V) empirical validity W) predictive validity X) concurrent validity Question: The proficiency test items for two machinists were derived from job analyses at fifty manufacturing facilities, each with at least five machinists performing similar tasks. This type of method would have a direct impact on the effectiveness of the exam's Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • gemini-1.0-pro correct

    {"U)": True, "V)": False, "W)": False, "X)": False}

  • gpt-4-turbo correct

    {"U)": "True", "V)": "False", "W)": "False", "X)": "False"}

  • glm-3-turbo incorrect

    {"U)": "False", "V)": "False", "W)": "False", "X)": "True"}

  • gpt-3.5-turbo incorrect

    {"U)": "True", "V)": "False", "W)": "True", "X)": "False"}

Item 9 · 50% solve rate · answer: [false, false, true, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) The African slave trade V) The importation of food crops from the New World W) The development of joint-stock companies X) The conquest of India by rival Muslim empires Question: The inquiry pertains to the information presented below. "THE CORE TEACHINGS OF MAHATMA GANDHI. MAINTAIN PEACEFULNESS The communicated declaration from Mahatma Gandhi in Waltair mentioned: Maulana Mohamed Ali was detained at Waltair under sections 107 and 108, required to provide security and ensure proper conduct for the duration of one year. The location and time of the trial have not been disclosed. Following his arrest, Mr. Hayat and the Begum Saheba were granted permission to visit him. He and I had planned to speak at a gathering just outside the train station. His arrest took place. After his arrest, I proceeded to the meeting and delivered the address. While there is no reason to grieve, there is every reason to celebrate. No strike ought to be held. Utter tranquility and serenity must be maintained. I view the detention as a precursor to achieving Swaraj and rectifying the injustices of Khilafat and Punjab, provided we maintain our commitment to non-violence. Maintain unity between Hindus and Muslims even amid the extreme actions of certain Moplahs, and carry out the Swadeshi agenda. I urge all Indians, regardless of gender, to fully forsake foreign textiles and dedicate any available time to spinning or weaving. Like Maulana, steadfastly advocate for your religious and national freedoms. We should strive to face incarceration. I am aware of the Maulana's blamelessness and firmly believe that incarcerating the guiltless will aid our country in achieving its esteemed objectives. In "The Pilgrims' March" penned by Mohandas Gandhi in 1921, which historical events between the years 1450 and 1750 C.E. influenced the directive to shun foreign textiles as highlighted in the document's fifth paragraph? Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • glm-3-turbo correct

    {"U)": "False", "V)": "False", "W)": "True", "X)": "False"}

  • gpt-4-turbo correct

    {"U)": "False", "V)": "False", "W)": "True", "X)": "False"}

  • gemini-1.0-pro incorrect

    {"U)": "False", "V)": "False", "W)": "False", "X)": "False"}

  • gpt-3.5-turbo incorrect

    {"U)": "False", "V)": "True", "W)": "False", "X)": "False"}

Item 10 · 75% solve rate · answer: [false, false, false, true]

Please select the correct option(s) from the following options given the question: Question: A 67-year-old woman comes to the physician because of easy bruising for 4 months. She has a history of lung cancer treated with radiation therapy 6 months ago. She has a 2-year history of hypertension treated with a thiazide diuretic and an angiotensin-converting enzyme (ACE) inhibitor. Examination, including neurologic examination, shows no abnormalities except for multiple ecchymoses. Her hemoglobin concentration is 13 g/dL, leukocyte count is 5000/mm3 , and platelet count is 35,000/mm3 . A serum antiplatelet antibody assay is negative. Which of the following is the most appropriate next step in diagnosis? Options: A CT scan of the abdomen B CT scan of the chest C Bronchoscopy D Bone marrow aspiration Your output must strictly follow this format: {"answer": <the list of selected options, e.g., ["A", "B", "C", "D"]>} Your output:

How subjects answered

  • gpt-3.5-turbo correct

    {"answer": ["D"]}

  • gpt-4-turbo correct

    {"answer": ["D"]}

  • Mistral-7B-Instruct-v0.2 correct

    Based on the information provided, the most appropriate next step in diagnosis would be a bone marrow aspiration. {"answer": ["D"]}

    Explanation: The patient's history of lung cancer treated with radiation therapy, easy bruising, and low platelet count are indicative of a potential bone marrow disorder such as thrombocytopenia. A bone marrow aspiration would help to determine the cause of the low platelet count, which could be caused by radiation-induced damage, medications such as the thiazide diuretic or ACE inhibitor, or another underlying condition. CT scans of the abdomen and chest, as well as a bronchoscopy, would not provide information about the patient's platelet count or bone …

  • Meta-Llama-3-8B-Instruct incorrect

    {"answer": ["B"]}

Item 11 · 100% solve rate · answer: [false, false, false, true]

Please judge whether each of the options is correct or incorrect given the question: Options: U) implosion V) flooding W) response-prevention X) stimulus satiation Question: A technique where a stimulus is continuously shown to a client until its appeal diminishes is most accurately characterized as Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • Meta-Llama-3-8B-Instruct correct

    {"U)": False, "V)": False, "W)": False, "X)": True}

  • Mistral-7B-Instruct-v0.2 correct

    {"U)": False, "V)": False, "W)": False, "X)": True}

    Explanation: The technique described in the question, where a stimulus is continuously shown to a client until its appeal diminishes, is most accurately characterized as stimulus satiation (option X). Therefore, all the other options (U, V, and W) are incorrect.

Item 12 · 100% solve rate · answer: [true, false, false, false]

Please judge whether each of the options is correct or incorrect given the question: Options: U) feelings of hopelessness V) insight W) severity of symptoms X) coping skills Question: In the initial stages of psychotherapy, as outlined in the phase model by Howard and colleagues (1996), which aspect is most typically impacted? Your output must strictly follow this format: {"U)": <"True" or "False">, "V)": <"True" or "False">, "W)": <"True" or "False">, "X)": <"True" or "False">} Your output:

How subjects answered

  • Meta-Llama-3-8B-Instruct correct

    Based on the phase model by Howard and colleagues (1996), here is the judgment:

    {"U)": "True", "V)": "False", "W)": "False", "X)": "False"}

    According to the phase model, in the initial stages of psychotherapy, patients typically experience feelings of hopelessness (U). The other options, insight (V), severity of symptoms (W), and coping skills (X), are not typically impacted in the initial stages of psychotherapy.

  • Mistral-7B-Instruct-v0.2 correct

    {"U)": "True", "V)": "False", "W)": "False", "X)": "False"}

    Explanation: In the initial stages of psychotherapy as outlined in the phase model by Howard and colleagues (1996), the aspect that is most typically impacted is feelings of hopelessness (U). The other options, insight (V), severity of symptoms (W), and coping skills (X), are not typically the focus during this initial stage. Instead, the focus is on establishing a therapeutic relationship, understanding the client's problem, and developing a therapeutic plan. Once these initial goals have been achieved, the focus of therapy may shift to other aspects, such as insight, symptoms, or coping skills.

Subjects

The models, agents, and reward models evaluated.

6 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gpt-4-turbo: 0.6953
  2. gemini-1.0-pro: 0.4848
  3. gpt-3.5-turbo: 0.4328
  4. Meta-Llama-3-8B-Instruct: 0.421
  5. glm-3-turbo: 0.387
  6. Mistral-7B-Instruct-v0.2: 0.2768