Safety & Security

CTIBench

CTIBench is a benchmark for evaluating LLMs on cyber threat intelligence tasks, including multiple-choice CTI knowledge (CTI-MCQ), CVE-to-CWE root-cause mapping (CTI-RCM), CVSS vulnerability severity prediction (CTI-VSP), and threat-actor attribution (CTI-TAA). Each task provides per-(model, item) predictions for five LLMs, graded against gold answers.

5,483 items

5 subjects

100% observed

MIT license

cybersecurity domain

text modality


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 5 subjects × 5,483 items, 100% of cells evaluated.
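The ordering described above can be sketched as follows. This is a minimal illustration, assuming rows are sorted by mean accuracy and columns by solve rate; the gallery's actual build script may order the matrix differently. The function name `order_response_matrix`, the model names, and the toy data are all hypothetical.

```python
def order_response_matrix(matrix, models):
    """Sort rows by descending model accuracy and columns by descending solve rate.

    matrix: list of rows, one per model, each a list of 0/1 grades.
    models: list of model names, aligned with the rows of matrix.
    Returns (ordered model names, original column indices in new order, ordered matrix).
    """
    n_models = len(matrix)
    n_items = len(matrix[0])
    # Mean response per model (row) and solve rate per item (column).
    row_means = [sum(row) / n_items for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_models for j in range(n_items)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_models), key=lambda i: -row_means[i])
    col_order = sorted(range(n_items), key=lambda j: -col_means[j])
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return [models[i] for i in row_order], col_order, ordered


# Toy data (invented for illustration, not taken from CTIBench).
toy_models = ["model_a", "model_b", "model_c"]
toy_matrix = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
names, cols, ordered = order_response_matrix(toy_matrix, toy_models)
for name, row in zip(names, ordered):
    print(name, row)
```

With this ordering, the strongest models and easiest items end up in the top-left corner of the matrix.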

Figure: CTIBench response matrix, with AI models as rows and items as columns. Scale: 1 = correct · 0 = incorrect.

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: B

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D.
Question: Which mitigation strategy aims to create an allowlist for valid input?
Options:
A) Use canonicalized data
B) Assume all input is malicious
C) Test your decoding process against malicious input
D) Perform regular security audits
Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

gpt-3.5-turbo · incorrect · A
gpt-4-turbo · incorrect · A
Gemini-1.5 · incorrect · A
Llama-3-70B · incorrect · A
Llama-3-8B · incorrect · A
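The prompts ask for the final answer on the last line of the response, which suggests a simple way to grade MCQ and CWE-mapping items. The sketch below is an assumption about how such grading could work, not the benchmark's confirmed scoring code; `extract_final_answer`, `grade`, the task keys, and the sample response are hypothetical.

```python
import re

# Expected shape of the final line for two task types (assumed, for illustration).
PATTERNS = {
    "mcq": re.compile(r"^[ABCD]$"),      # single uppercase option letter
    "rcm": re.compile(r"^CWE-\d+$"),     # CWE identifier such as CWE-79
}


def extract_final_answer(response, task):
    """Return the last non-empty line if it matches the task's format, else None."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1]
    return last if PATTERNS[task].match(last) else None


def grade(response, gold, task):
    """Return 1 if the extracted answer equals the gold answer, otherwise 0."""
    return int(extract_final_answer(response, task) == gold)


# Invented response text; only the gold answer "B" comes from the item above.
sample = "Canonicalization normalizes input before validation.\nA"
print(grade(sample, "B", "mcq"))
```

A response whose last line does not match the expected format is graded as incorrect here, which is one possible design choice among several.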

Item 2 · 0% solve rate · answer: B

How subjects answered

gpt-3.5-turbo · incorrect · A
gpt-4-turbo · incorrect · C
Gemini-1.5 · incorrect · D
Llama-3-70B · incorrect · A
Llama-3-8B · incorrect · A

Item 3 · 0% solve rate · answer: CWE-1188

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: A vulnerability in SonicWall Global Management System (GMS), allow a remote user to gain access to the appliance using existing SSH key. This vulnerability affects GMS versions 9.1, 9.0, 8.7, 8.6, 8.4, 8.3 and earlier.

How subjects answered

gpt-3.5-turbo · incorrect · CWE-798
gpt-4-turbo · incorrect · CWE-294
Gemini-1.5 · incorrect · CWE-284
Llama-3-70B · incorrect · CWE-798
Llama-3-8B · incorrect · CWE-306

Item 4 · 20% solve rate · answer: CWE-22

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of data in real-time. There is a COPY FROM function in the CrateDB database that is used to import file data into database tables. This function has a flaw, and authenticated attackers can use the COPY FROM function to import arbitrary file content into database tables, resulting in information leakage. This vulnerability is patched in 5.3.9, 5.4.8, 5.5.4, and 5.6.1.

How subjects answered

Llama-3-8B · correct · CWE-22
gpt-3.5-turbo · incorrect · CWE-918
gpt-4-turbo · incorrect · CWE-284
Gemini-1.5 · incorrect · CWE-434
Llama-3-70B · incorrect · CWE-73
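The 20% solve rate for this item follows from its graded answers: one of the five subjects is marked correct. A minimal sketch of that calculation, assuming the solve rate is the plain fraction of subjects graded correct (the helper name `solve_rate` is hypothetical):

```python
def solve_rate(verdicts):
    """Fraction of subjects graded correct (1) on a single item."""
    return sum(verdicts.values()) / len(verdicts)


# Grades for the CrateDB item, as listed above (1 = correct, 0 = incorrect).
crate_db_item = {
    "Llama-3-8B": 1,
    "gpt-3.5-turbo": 0,
    "gpt-4-turbo": 0,
    "Gemini-1.5": 0,
    "Llama-3-70B": 0,
}
print(f"{solve_rate(crate_db_item):.0%}")  # prints 20%
```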

Item 5 · 20% solve rate · answer: D

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D.
Question: Which CWE ID is not directly related to the attack pattern described in CAPEC-13?
Options:
A) CWE-285: Improper Authorization
B) CWE-74: Injection
C) CWE-302: Authentication Bypass by Assumed-Immutable Data
D) CWE-89: SQL Injection
Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

Gemini-1.5 · correct · D
gpt-3.5-turbo · incorrect · B
gpt-4-turbo · incorrect · B
Llama-3-70B · incorrect · A
Llama-3-8B · incorrect · A

Item 6 · 40% solve rate · answer: CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H

Analyze the following CVE description and calculate the CVSS v3.1 Base Score. Determine the values for each base metric: AV, AC, PR, UI, S, C, I, and A. Summarize each metric's value and provide the final CVSS v3.1 vector string.
Valid options for each metric are as follows:
- Attack Vector (AV): Network (N), Adjacent (A), Local (L), Physical (P)
- Attack Complexity (AC): Low (L), High (H)
- Privileges Required (PR): None (N), Low (L), High (H)
- User Interaction (UI): None (N), Required (R)
- Scope (S): Unchanged (U), Changed (C)
- Confidentiality (C): None (N), Low (L), High (H)
- Integrity (I): None (N), Low (L), High (H)
- Availability (A): None (N), Low (L), High (H)
Summarize each metric's value and provide the final CVSS v3.1 vector string. Ensure the final line of your response contains only the CVSS v3 Vector String in the following format:
Example format: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
CVE Description: In ged, there is a possible out of bounds write due to an integer overflow. This could lead to local denial of service with System execution privileges needed. User interaction is not needed for exploitation Patch ID: ALPS07835901; Issue ID: ALPS07835901.

How subjects answered

gpt-4-turbo · correct · AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H
Gemini-1.5 · correct · AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H
gpt-3.5-turbo · incorrect · AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:N
Llama-3-70B · incorrect · AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:H
Llama-3-8B · incorrect · AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:H
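To see how far apart these vectors are in severity, they can be turned into CVSS v3.1 base scores. The sketch below follows the base-score equations and metric weights from the CVSS v3.1 specification; it is an illustration only, and this page does not say whether CTIBench grades the vector string, the numeric score, or both. The function names are hypothetical.

```python
import math

# Metric weights from the CVSS v3.1 specification.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on Scope (U = unchanged, C = changed).
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value):
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score from a vector string, with or without the CVSS:3.1/ prefix."""
    parts = [p for p in vector.split("/") if not p.startswith("CVSS")]
    m = dict(p.split(":") for p in parts)
    scope = m["S"]
    iss = 1 - (1 - WEIGHTS["C"][m["C"]]) * (1 - WEIGHTS["I"][m["I"]]) * (1 - WEIGHTS["A"][m["A"]])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * PR_WEIGHTS[scope][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    if scope == "U":
        return roundup(min(impact + exploitability, 10))
    return roundup(min(1.08 * (impact + exploitability), 10))


gold = base_score("CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:N/A:H")
llama_70b = base_score("AV:L/AC:L/PR:H/UI:N/S:U/C:N/I:H/A:H")
print(gold, llama_70b)  # 4.4 6.0
```

Under these equations the gold vector scores 4.4, while the vector returned by Llama-3-70B and Llama-3-8B scores 6.0, because it adds a high integrity impact.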

Item 7 · 60% solve rate · answer: CWE-119

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: In netdagent, there is a possible information disclosure due to an incorrect bounds check. This could lead to local escalation of privilege with System execution privileges needed. User interaction is not needed for exploitation. Patch ID: ALPS07944011; Issue ID: ALPS07944011.

How subjects answered

gpt-3.5-turbo · correct · CWE-119
gpt-4-turbo · correct · CWE-119
Llama-3-70B · correct · CWE-119
Gemini-1.5 · incorrect · CWE-20
Llama-3-8B · incorrect · CWE-20

Item 8 · 80% solve rate · answer: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

Analyze the following CVE description and calculate the CVSS v3.1 Base Score. Determine the values for each base metric: AV, AC, PR, UI, S, C, I, and A. Summarize each metric's value and provide the final CVSS v3.1 vector string.
Valid options for each metric are as follows:
- Attack Vector (AV): Network (N), Adjacent (A), Local (L), Physical (P)
- Attack Complexity (AC): Low (L), High (H)
- Privileges Required (PR): None (N), Low (L), High (H)
- User Interaction (UI): None (N), Required (R)
- Scope (S): Unchanged (U), Changed (C)
- Confidentiality (C): None (N), Low (L), High (H)
- Integrity (I): None (N), Low (L), High (H)
- Availability (A): None (N), Low (L), High (H)
Summarize each metric's value and provide the final CVSS v3.1 vector string. Ensure the final line of your response contains only the CVSS v3 Vector String in the following format:
Example format: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
CVE Description: A vulnerability, which was classified as critical, has been found in Tenda AC10 16.03.10.13/16.03.10.20. Affected by this issue is the function fromSetSysTime of the file /goform/SetSysTimeCfg. The manipulation of the argument timeZone leads to stack-based buffer overflow. The attack may be launched remotely. The exploit has been disclosed to the public and may be used. The identifier of this vulnerability is VDB-257780. NOTE: The vendor was contacted early about this disclosure but did not respond in any way.

How subjects answered

gpt-3.5-turbo · correct · AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
gpt-4-turbo · correct · AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
Gemini-1.5 · correct · AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
Llama-3-70B · correct · AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
Llama-3-8B · incorrect · AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H

Item 9 · 80% solve rate · answer: CWE-502

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: In createFromParcel of GpsNavigationMessage.java, there is a possible Parcel serialization/deserialization mismatch. This could lead to local escalation of privilege with no additional execution privileges needed. User interaction is not needed for exploitation.Product: AndroidVersions: Android-10 Android-11 Android-12 Android-9Android ID: A-196970023

How subjects answered

gpt-3.5-turbo · correct · CWE-502
gpt-4-turbo · correct · CWE-502
Gemini-1.5 · correct · CWE-502
Llama-3-70B · correct · CWE-502
Llama-3-8B · incorrect · CWE-20

Item 10 · 100% solve rate · answer: C

You are given a multiple-choice question (MCQ) from a Cyber Threat Intelligence (CTI) knowledge benchmark dataset. Your task is to choose the best option among the four provided. Return your answer as a single uppercase letter: A, B, C, or D.
Question: Which of the following best defines a CSF Target Profile as per the NIST Cybersecurity Framework?
Options:
A) It describes how an organization currently achieves desired cybersecurity outcomes
B) It includes specific tools and techniques used for threat mitigation
C) It outlines the desired outcomes an organization has selected and prioritized for its cybersecurity objectives
D) It sets baseline security measures for third-party vendors
Important: The last line of your answer should contain only the single letter corresponding to the best option, with no additional text.

How subjects answered

gpt-3.5-turbo · correct · C
gpt-4-turbo · correct · C
Gemini-1.5 · correct · C
Llama-3-70B · correct · C
Llama-3-8B · correct · C

Item 11 · 100% solve rate · answer: B

How subjects answered

gpt-3.5-turbo · correct · B
gpt-4-turbo · correct · B
Gemini-1.5 · correct · B
Llama-3-70B · correct · B
Llama-3-8B · correct · B

Item 12 · 100% solve rate · answer: CWE-79

Analyze the following CVE description and map it to the appropriate CWE. Provide a brief justification for your choice. Ensure the last line of your response contains only the CWE ID. CVE Description: Maccms 10 contains a cross-site scripting (XSS) vulnerability in the Editing function under the Member module. This vulnerability is exploited via a crafted payload in the nickname text field.

How subjects answered

gpt-3.5-turbo · correct · CWE-79
gpt-4-turbo · correct · CWE-79
Gemini-1.5 · correct · CWE-79
Llama-3-70B · correct · CWE-79
Llama-3-8B · correct · CWE-79

Subjects

The models, agents, and reward models evaluated.

5 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. gpt-4-turbo · 0.6189
2. Gemini-1.5 · 0.5834
3. Llama-3-70B · 0.5673
4. gpt-3.5-turbo · 0.5092
5. Llama-3-8B · 0.4559
