CTIBench

CTIBench: a benchmark for evaluating LLMs on cyber threat intelligence tasks, including multiple-choice CTI knowledge (CTI-MCQ), CVE-to-CWE root-cause mapping (CTI-RCM), CVSS vulnerability severity prediction (CTI-VSP), and threat-actor attribution (CTI-TAA). Each task provides per-(model, item) predictions for five LLMs graded against gold answers.

5,483 items
5 subjects
100% observed
License: MIT
Domain: cybersecurity
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 5 subjects × 5,483 items, 100% of cells evaluated.
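The ordering described above can be sketched as follows. This is a minimal illustration that assumes rows are sorted by mean accuracy and columns by solve rate, both descending; the page does not state the exact sort keys. The function name `order_matrix` and the toy data are hypothetical and are not CTIBench values.

```python
def order_matrix(matrix, subjects):
    """Sort rows by mean accuracy and columns by solve rate, both descending.

    matrix[i][j] is 1 if subject i answered item j correctly, else 0.
    Returns the reordered subject names, the column order, and the matrix.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda i: -row_mean[i])
    col_order = sorted(range(n_cols), key=lambda j: -col_mean[j])
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return [subjects[i] for i in row_order], col_order, ordered


# Toy 3 x 4 example (made-up values).
subjects = ["model_a", "model_b", "model_c"]
matrix = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
names, cols, ordered = order_matrix(matrix, subjects)
for name, row in zip(names, ordered):
    print(name, row)
```

After sorting, the correct cells cluster toward the top-left corner, which is the visual effect the matrix view relies on.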

[Figure: CTIBench response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like, and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
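The solve rate shown with each item is the fraction of subjects graded correct on it. With five subjects it can only take the values 0%, 20%, 40%, 60%, 80%, or 100%. A minimal sketch, using the graded outcomes listed on this page for Item 7 (the function name `solve_rate` is hypothetical):

```python
def solve_rate(results):
    """Fraction of subjects whose answer was graded correct."""
    return sum(results.values()) / len(results)


# Graded outcomes for Item 7 as listed on this page (True = correct).
item_7 = {
    "gpt-3.5-turbo": True,
    "gpt-4-turbo": True,
    "Llama-3-70B": True,
    "Gemini-1.5": False,
    "Llama-3-8B": False,
}
print(f"{solve_rate(item_7):.0%}")  # prints 60%
```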

Item 1 · 0% solve rate · answer: B

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D. Question: Which mitigation strategy aims to create an allowlist for valid input? Options: A) Use canonicalized data B) Assume all input is malicious C) Test your decoding process against malicious input D) Perform regular security audits Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

  • gpt-3.5-turbo incorrect

    A

  • gpt-4-turbo incorrect

    A

  • Gemini-1.5 incorrect

    A

  • Llama-3-70B incorrect

    A

  • Llama-3-8B incorrect

    A
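The MCQ prompt asks for the final line to contain only the answer letter. The benchmark's actual parser is not shown on this page; the sketch below is one strict way to extract and grade such an answer, and the helper names `extract_mcq_answer` and `grade_mcq` are hypothetical.

```python
import re


def extract_mcq_answer(response):
    """Return the letter on the last non-empty line, or None if it is not A-D."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"[A-D]", lines[-1])
    return match.group(0) if match else None


def grade_mcq(response, gold):
    """1 if the extracted letter equals the gold letter, else 0."""
    return int(extract_mcq_answer(response) == gold)


# Hypothetical response text ending in the letter every subject chose above.
response = "Canonicalization normalizes input before validation.\nA"
print(extract_mcq_answer(response), grade_mcq(response, "B"))  # prints: A 0
```

A strict parser like this scores any response that ignores the formatting instruction as incorrect; a more lenient one would need a fallback search, which is a design choice this page does not document.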

Item 2 · 0% solve rate · answer: B

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D. Question: What technique ID corresponds to System Information Discovery in the MITRE ATT&CK framework? Options: A) T1425 B) T1426 C) T1427 D) T1428 Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

  • gpt-3.5-turbo incorrect

    A

  • gpt-4-turbo incorrect

    C

  • Gemini-1.5 incorrect

    D

  • Llama-3-70B incorrect

    A

  • Llama-3-8B incorrect

    A

Item 3 · 0% solve rate · answer: CWE-1188

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: A vulnerability in SonicWall Global Management System (GMS), allow a remote user to gain access to the appliance using existing SSH key. This vulnerability affects GMS versions 9.1, 9.0, 8.7, 8.6, 8.4, 8.3 and earlier.

How subjects answered

  • gpt-3.5-turbo incorrect

    CWE-798

  • gpt-4-turbo incorrect

    CWE-294

  • Gemini-1.5 incorrect

    CWE-284

  • Llama-3-70B incorrect

    CWE-798

  • Llama-3-8B incorrect

    CWE-306
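For the CVE-to-CWE items, the outcomes shown here are consistent with exact-match grading on the CWE ID: every answer above names a different weakness than the gold CWE-1188 and is marked incorrect. The following is a sketch under that assumption, not the benchmark's own scoring code; `extract_cwe` and `grade_rcm` are hypothetical names.

```python
import re

CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def extract_cwe(response):
    """Return a normalized 'CWE-<n>' from the last non-empty line, or None."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    match = CWE_PATTERN.search(lines[-1])
    return f"CWE-{int(match.group(1))}" if match else None


def grade_rcm(response, gold):
    """1 only when the predicted ID equals the gold ID exactly."""
    return int(extract_cwe(response) == gold)


# Hypothetical response ending in the ID that gpt-3.5-turbo gave for Item 3.
response = "The appliance ships with a reusable SSH key.\ncwe-798"
print(extract_cwe(response), grade_rcm(response, "CWE-1188"))  # prints: CWE-798 0
```

Exact matching gives no partial credit for a semantically close CWE, which may explain part of the low solve rate on items like this one.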

Item 4 · 20% solve rate · answer: CWE-22

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of data in real-time. There is a COPY FROM function in the CrateDB database that is used to import file data into database tables. This function has a flaw, and authenticated attackers can use the COPY FROM function to import arbitrary file content into database tables, resulting in information leakage. This vulnerability is patched in 5.3.9, 5.4.8, 5.5.4, and 5.6.1.

How subjects answered

  • Llama-3-8B correct

    CWE-22

  • gpt-3.5-turbo incorrect

    CWE-918

  • gpt-4-turbo incorrect

    CWE-284

  • Gemini-1.5 incorrect

    CWE-434

  • Llama-3-70B incorrect

    CWE-73

Item 5 · 20% solve rate · answer: D

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D. Question: Which CWE ID is not directly related to the attack pattern described in CAPEC-13? Options: A) CWE-285: Improper Authorization B) CWE-74: Injection C) CWE-302: Authentication Bypass by Assumed-Immutable Data D) CWE-89: SQL Injection Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

  • Gemini-1.5 correct

    D

  • gpt-3.5-turbo incorrect

    B

  • gpt-4-turbo incorrect

    B

  • Llama-3-70B incorrect

    A

  • Llama-3-8B incorrect

    A

Item 6 · 40% solve rate · answer: CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H

Analyze the following CVE description and calculate the CVSS v3.1 Base Score. Determine the values for each base metric: AV, AC, PR, UI, S, C, I, and A. Summarize each metric's value and provide the final CVSS v3.1 vector string. Valid options for each metric are as follows: - Attack Vector (AV): Network (N), Adjacent (A), Local (L), Physical (P) - Attack Complexity (AC): Low (L), High (H) - Privileges Required (PR): None (N), Low (L), High (H) - User Interaction (UI): None (N), Required (R) - Scope (S): Unchanged (U), Changed (C) - Confidentiality (C): None (N), Low (L), High (H) - Integrity (I): None (N), Low (L), High (H) - Availability (A): None (N), Low (L), High (H) Summarize each metric's value and provide the final CVSS v3.1 vector string. Ensure the final line of your response contains only the CVSS v3 Vector String in the following format: Example format: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H CVE Description: In ged, there is a possible out of bounds write due to an integer overflow. This could lead to local denial of service with System execution privileges needed. User interaction is not needed for exploitation Patch ID: ALPS07835901; Issue ID: ALPS07835901.

How subjects answered

  • gpt-4-turbo correct

    AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H

  • Gemini-1.5 correct

    AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H

  • gpt-3.5-turbo incorrect

    AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:N

  • Llama-3-70B incorrect

    AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:H

  • Llama-3-8B incorrect

    AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:H
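To make the relation between a vector string and its base score concrete, the sketch below applies the published CVSS v3.1 base-score equations to the vectors from this item. The function names are hypothetical, and nothing here is taken from the benchmark's own grader. The parser drops the `CVSS:3.1/` prefix because the subject answers above are displayed without it.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def parse_vector(vector):
    """Split 'AV:L/AC:L/...' into a dict, ignoring a leading 'CVSS:3.x' part."""
    parts = vector.strip().split("/")
    if parts[0].upper().startswith("CVSS:"):
        parts = parts[1:]
    return dict(p.split(":") for p in parts)


def roundup(x):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(vector):
    m = parse_vector(vector)
    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


gold = "CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H"
gpt35 = "AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:N"
llama = "AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:H"
print(base_score(gold), base_score(gpt35), base_score(llama))  # 4.4 4.4 6.0
```

The gpt-3.5-turbo vector yields the same base score as the gold vector (4.4) yet is marked incorrect above, which suggests that the binary grade on this page compares metric values rather than scores. That reading is an inference from the displayed outcomes, not a documented rule.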

Item 7 · 60% solve rate · answer: CWE-119

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: In netdagent, there is a possible information disclosure due to an incorrect bounds check. This could lead to local escalation of privilege with System execution privileges needed. User interaction is not needed for exploitation. Patch ID: ALPS07944011; Issue ID: ALPS07944011.

How subjects answered

  • gpt-3.5-turbo correct

    CWE-119

  • gpt-4-turbo correct

    CWE-119

  • Llama-3-70B correct

    CWE-119

  • Gemini-1.5 incorrect

    CWE-20

  • Llama-3-8B incorrect

    CWE-20

Item 8 · 80% solve rate · answer: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

Analyze the following CVE description and calculate the CVSS v3.1 Base Score. Determine the values for each base metric: AV, AC, PR, UI, S, C, I, and A. Summarize each metric's value and provide the final CVSS v3.1 vector string. Valid options for each metric are as follows: - Attack Vector (AV): Network (N), Adjacent (A), Local (L), Physical (P) - Attack Complexity (AC): Low (L), High (H) - Privileges Required (PR): None (N), Low (L), High (H) - User Interaction (UI): None (N), Required (R) - Scope (S): Unchanged (U), Changed (C) - Confidentiality (C): None (N), Low (L), High (H) - Integrity (I): None (N), Low (L), High (H) - Availability (A): None (N), Low (L), High (H) Summarize each metric's value and provide the final CVSS v3.1 vector string. Ensure the final line of your response contains only the CVSS v3 Vector String in the following format: Example format: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H CVE Description: A vulnerability, which was classified as critical, has been found in Tenda AC10 16.03.10.13/16.03.10.20. Affected by this issue is the function fromSetSysTime of the file /goform/SetSysTimeCfg. The manipulation of the argument timeZone leads to stack-based buffer overflow. The attack may be launched remotely. The exploit has been disclosed to the public and may be used. The identifier of this vulnerability is VDB-257780. NOTE: The vendor was contacted early about this disclosure but did not respond in any way.

How subjects answered

  • gpt-3.5-turbo correct

    AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

  • gpt-4-turbo correct

    AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

  • Gemini-1.5 correct

    AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

  • Llama-3-70B correct

    AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

  • Llama-3-8B incorrect

    AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H

Item 9 · 80% solve rate · answer: CWE-502

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: In createFromParcel of GpsNavigationMessage.java, there is a possible Parcel serialization/deserialization mismatch. This could lead to local escalation of privilege with no additional execution privileges needed. User interaction is not needed for exploitation.Product: AndroidVersions: Android-10 Android-11 Android-12 Android-9Android ID: A-196970023

How subjects answered

  • gpt-3.5-turbo correct

    CWE-502

  • gpt-4-turbo correct

    CWE-502

  • Gemini-1.5 correct

    CWE-502

  • Llama-3-70B correct

    CWE-502

  • Llama-3-8B incorrect

    CWE-20

Item 10 · 100% solve rate · answer: C

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D. Question: Which of the following best defines a CSF Target Profile as per the NIST Cybersecurity Framework? Options: A) It describes how an organization currently achieves desired cybersecurity outcomes B) It includes specific tools and techniques used for threat mitigation C) It outlines the desired outcomes an organization has selected and prioritized for its cybersecurity objectives D) It sets baseline security measures for third-party vendors Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

  • gpt-3.5-turbo correct

    C

  • gpt-4-turbo correct

    C

  • Gemini-1.5 correct

    C

  • Llama-3-70B correct

    C

  • Llama-3-8B correct

    C

Item 11 · 100% solve rate · answer: B

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D. Question: What potential consequences can arise from CWE-942? Options: A) Bypass firewall rules B) Execute unauthorized code or commands C) Trigger denial of service attacks D) None of the above Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

  • gpt-3.5-turbo correct

    B

  • gpt-4-turbo correct

    B

  • Gemini-1.5 correct

    B

  • Llama-3-70B correct

    B

  • Llama-3-8B correct

    B

Item 12 · 100% solve rate · answer: CWE-79

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: Maccms 10 contains a cross-site scripting (XSS) vulnerability in the Editing function under the Member module. This vulnerability is exploited via a crafted payload in the nickname text field.

How subjects answered

  • gpt-3.5-turbo correct

    CWE-79

  • gpt-4-turbo correct

    CWE-79

  • Gemini-1.5 correct

    CWE-79

  • Llama-3-70B correct

    CWE-79

  • Llama-3-8B correct

    CWE-79

Subjects

The models, agents, and reward models evaluated.

5 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gpt-4-turbo: 0.6189
  2. Gemini-1.5: 0.5834
  3. Llama-3-70B: 0.5673
  4. gpt-3.5-turbo: 0.5092
  5. Llama-3-8B: 0.4559