MTBBench

MTBBench evaluates multimodal LLMs on sequential clinical decision-making in oncology. It draws on two cohorts: HANCOCK (multimodal pathology: TMA images, H&E slides, cell density) and MSK-CHORD (longitudinal genomics: genomic profiles, clinical timelines). Models act as agentic AI physicians, optionally using external tools (DrugBank, CONCH, PubMed). Each case contains several multiple-choice questions, and each answer is scored as binary: 1 if it matches the ground truth, 0 otherwise.
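The binary scoring described above can be sketched as follows. This is a minimal illustration, not the benchmark's own grader: the page does not state the exact matching rule, so the sketch assumes that only the leading option letter of the "response" field is compared with a gold letter. The function names `extract_choice` and `score` and the `gold_letter` parameter are hypothetical.

```python
import json
import re
from typing import Optional

# Assumption: an answer starts with an option letter followed by ")", ".",
# whitespace, or the end of the string (e.g. "B) No" or just "A").
_CHOICE = re.compile(r"([A-Fa-f])(?:[).\s]|$)")


def extract_choice(raw_record: str) -> Optional[str]:
    """Return the option letter that starts the "response" field, or None."""
    response = json.loads(raw_record)["response"].strip()
    match = _CHOICE.match(response)
    return match.group(1).upper() if match else None


def score(raw_record: str, gold_letter: str) -> int:
    """Binary correctness: 1 if the extracted letter equals the gold letter, else 0."""
    return int(extract_choice(raw_record) == gold_letter.upper())


if __name__ == "__main__":
    # Records shaped like the sample answers shown on this page.
    print(score('{"response": "B) No", "files_accessed": ["timeline1.txt"]}', "B"))
    print(score('{"response": "A", "files_accessed": ["timeline1.txt"]}', "B"))
```

Letter-only matching ignores the answer text, so a response whose letter is right but whose text belongs to a different option would still count as correct under this rule. A stricter grader could also compare the text after the letter.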

516 items
23 subjects
64% observed
License: MIT
Domain: medicine
Modalities: text, image

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 23 subjects × 516 items, 64% of cells evaluated.
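One way to produce such an ordering and the coverage figure is sketched below. The page does not say how rows and columns are sorted, so sorting by mean score over observed cells is an assumption, as are the names `mean_observed`, `coverage`, and `order_matrix` and the toy data. Unobserved cells are represented as `None`.

```python
from typing import Dict, List, Optional, Tuple

# subject name -> per-item scores: 1 (correct), 0 (incorrect), None (unobserved)
Matrix = Dict[str, List[Optional[int]]]


def mean_observed(values: List[Optional[int]]) -> float:
    """Mean over observed cells only; 0.0 if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else 0.0


def coverage(matrix: Matrix) -> float:
    """Fraction of cells that hold a score (assumes a non-empty matrix)."""
    cells = [v for row in matrix.values() for v in row]
    return sum(v is not None for v in cells) / len(cells)


def order_matrix(matrix: Matrix) -> Tuple[List[str], List[int]]:
    """Sort subjects by accuracy and items by solve rate, both descending."""
    n_items = len(next(iter(matrix.values())))
    rows = sorted(matrix, key=lambda s: mean_observed(matrix[s]), reverse=True)
    cols = sorted(
        range(n_items),
        key=lambda j: mean_observed([matrix[s][j] for s in matrix]),
        reverse=True,
    )
    return rows, cols


if __name__ == "__main__":
    # Hypothetical 3 x 3 matrix, not taken from the benchmark.
    toy: Matrix = {
        "model_a": [1, 0, None],
        "model_b": [1, 1, 1],
        "model_c": [0, None, None],
    }
    print(order_matrix(toy))
    print(f"coverage: {coverage(toy):.2f}")
```

With this ordering the strongest subjects and easiest items meet in the top-left corner. Because means are taken over observed cells only, a sparsely evaluated subject can rank high or low on the basis of very few items.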

[Figure: MTBBench response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 (0% solve rate)

Based on the H&E and IHC slides, what is the most likely explanation for the observed immune cell infiltration pattern in the tumor?

  • A. The tumor has an immune-excluded phenotype with limited infiltration in both tumor center and invasion front

  • B. The tumor shows an immune-desert phenotype with almost no immune infiltration

  • C. The tumor demonstrates an inflamed phenotype with significant immune infiltration in both tumor center and invasion front

  • D. The tumor exhibits an inflamed phenotype restricted to the invasion front, with limited infiltration in the tumor center

  • E. The tumor shows a predominantly M2 macrophage-driven immune response

  • F. The tumor has no notable immune microenvironment

How subjects answered

  • gemma3-12b incorrect

    {"response": "C) The tumor demonstrates an inflamed phenotype with significant immune infiltration in both tumor center and invasion front"}

  • gemma3-12b-with-tools incorrect

    {"response": "C) The tumor demonstrates an inflamed phenotype with significant immune infiltration in both tumor center and invasion front"}

  • gemma3-27b incorrect

    {"response": "D) The tumor exhibits an inflamed phenotype restricted to the invasion front, with limited infiltration in the tumor center"}

  • qwen25-7b-with-tools incorrect

    {"response": "C) The tumor demonstrates an inflamed phenotype with significant immune infiltration in both tumor center and invasion front"}

  • qwen25-32b incorrect

    {"response": "D) The tumor exhibits an inflamed phenotype restricted to the invasion front, with limited infiltration in the tumor center"}

  • qwen25-32b-with-tools incorrect

    {"response": "C) The tumor demonstrates an inflamed phenotype with significant immune infiltration in both tumor center and invasion front"}

Item 2 (13% solve rate)

Integrating both H&E and IHC findings, what is the most likely immune evasion mechanism in this tumor?

  • A. Low MHC1 expression in both tumor center and invasion front

  • B. High CD8+ T-cell infiltration in the tumor center

  • C. Presence of lymphovascular and vascular invasion

  • D. High CD163+ macrophage infiltration in the invasion front

  • E. Uniformly high CD3+ T-cell infiltration across all regions

  • F. Perinodal invasion with high immune cell density

How subjects answered

  • gemma3-27b-with-tools correct

    {"response": "A) Low MHC1 expression in both tumor center and invasion front"}

  • llama90b correct

    {"response": "A) Low MHC1 expression in both tumor center and invasion front", "files_accessed": ["TMA_IHC_TumorCenter_MHC1_0.png", "TMA_IHC_TumorCenter_MHC1_1.png", "TMA_IHC_InvasionFront_MHC1_0.png", "TMA_IHC_InvasionFront_MHC1_1.png"]}

  • gemma3-12b-with-tools incorrect

    {"response": "D) High CD163+ macrophage infiltration in the invasion front"}

  • qwen25-7b-with-tools incorrect

    {"response": "C) Squamous cell carcinoma, keratinizing", "files_accessed": ["PrimaryTumor_HE_176.jpg", "PrimaryTumor_HE_176_roi.jpg"]}

  • qwen25-32b incorrect

    {"response": "D) High CD163+ macrophage infiltration in the invasion front"}

  • qwen25-32b-with-tools incorrect

    {"response": "D) High CD163+ macrophage infiltration in the invasion front"}

Item 3 (29% solve rate)

Given that the patient is 54 years old, will the cancer progress in the next 12 months?

  • A. Yes

  • B. No

How subjects answered

  • qwen3-32b correct

    {"response": "B) No", "files_accessed": ["timeline1.txt"]}

  • qwen3-32b-with-tools correct

    {"response": "B) No", "files_accessed": ["timeline1.txt"]}

  • qwen3-8b correct

    {"response": "B) No", "files_accessed": ["timeline1.txt"]}

  • llama33-70b-with-tools incorrect

    {"response": "A", "files_accessed": ["timeline1.txt"]}

  • llama33-70b incorrect

    {"response": "A", "files_accessed": ["timeline1.txt"]}

  • llama31-8b-with-tools incorrect

    {"response": "A", "files_accessed": ["timeline0.txt", "timeline1.txt"]}

Item 4 (40% solve rate)

Is there any evidence of perineural invasion observed on the H&E slides?

  • A. Yes, perineural invasion is present

  • B. No, no perineural invasion is observed

  • C. Yes, along with vascular invasion

  • D. No, only lymphovascular invasion is present

  • E. Yes, but with no associated lymphovascular invasion

  • F. No, and no signs of vascular invasion

How subjects answered

  • gemma3-27b-with-tools correct

    {"response": "A) Yes, perineural invasion is present", "files_accessed": ["PrimaryTumor_HE_225.jpg", "PrimaryTumor_HE_225_roi.jpg"]}

  • gpt4o correct

    {"response": "A) Yes, perineural invasion is present", "files_accessed": ["PrimaryTumor_HE_225.jpg", "PrimaryTumor_HE_225_roi.jpg"]}

  • internvl3-78b correct

    {"response": "A) Yes, perineural invasion is present", "files_accessed": ["PrimaryTumor_HE_225_roi.jpg", "PrimaryTumor_HE_225.jpg"]}

  • mistrallsmall incorrect

    {"response": "B) No, no perineural invasion is observed", "files_accessed": ["PrimaryTumor_HE_225.jpg", "PrimaryTumor_HE_225_roi.jpg"]}

  • qwen25-7b-with-tools incorrect

    {"response": "B) Squamous Cell Carcinoma (SCC), Conventional-Keratinizing", "files_accessed": ["PrimaryTumor_HE_225_roi.jpg", "PrimaryTumor_HE_225.jpg"]}

  • qwen25-7b incorrect

    {"response": "B) No, no perineural invasion is observed", "files_accessed": ["PrimaryTumor_HE_225_roi.jpg", "PrimaryTumor_HE_225.jpg"]}

Item 5 (50% solve rate)

Given that the patient is 61 years old, will the cancer have a recurrence in the next 6 months?

  • A. Yes

  • B. No

How subjects answered

  • gemma3-12b correct

    {"response": "A) Yes", "files_accessed": ["specimen.txt", "timeline0.txt"]}

  • gemma3-12b-with-tools correct

    {"response": "A) Yes", "files_accessed": ["timeline0.txt", "specimen.txt"]}

  • gemma3-27b correct

    {"response": "A) Yes", "files_accessed": ["specimen.txt", "timeline0.txt"]}

  • gemma3-27b-with-tools incorrect

    {"response": "A) Yes", "files_accessed": ["specimen.txt", "timeline0.txt"]}

  • o4-mini incorrect

    {"response": "B) No", "files_accessed": ["timeline0.txt", "specimen.txt"]}

  • llama31-8b-with-tools incorrect

    {"response": "B", "files_accessed": ["specimen.txt", "timeline0.txt", "timeline0.txt"]}

Item 6 (60% solve rate)

Based on the IHC slides for CD3, is there a higher density of CD3+ T cells in the Tumor Center or the Invasion Front?

  • A. Tumor Center

  • B. Invasion Front

How subjects answered

  • gemma3-12b correct

    {"response": "B) Invasion Front", "files_accessed": ["TMA_IHC_InvasionFront_CD3_1.png", "TMA_IHC_TumorCenter_CD3_1.png", "TMA_IHC_InvasionFront_CD3_0.png", "TMA_IHC_TumorCenter_CD3_0.png"]}

  • gemma3-27b correct

    {"response": "B) Invasion Front", "files_accessed": ["TMA_IHC_InvasionFront_CD3_1.png", "TMA_IHC_TumorCenter_CD3_1.png", "TMA_IHC_InvasionFront_CD3_0.png", "TMA_IHC_TumorCenter_CD3_0.png"]}

  • gpt4o correct

    {"response": "B) Invasion Front", "files_accessed": ["TMA_IHC_InvasionFront_CD3_1.png", "TMA_IHC_TumorCenter_CD3_1.png", "TMA_IHC_TumorCenter_CD3_0.png", "TMA_IHC_InvasionFront_CD3_0.png"]}

  • qwen25-7b incorrect

    {"response": "A) Tumor Center", "files_accessed": ["TMA_IHC_TumorCenter_CD3_1.png", "TMA_IHC_InvasionFront_CD3_0.png", "TMA_IHC_TumorCenter_CD3_0.png", "TMA_IHC_InvasionFront_CD3_1.png"]}

  • qwen25-7b-with-tools incorrect

    {"response": "A) Tumor Center", "files_accessed": ["TMA_IHC_InvasionFront_CD3_0.png", "TMA_IHC_InvasionFront_CD3_1.png", "TMA_IHC_TumorCenter_CD3_1.png", "TMA_IHC_TumorCenter_CD3_0.png"]}

  • qwen25-32b-with-tools incorrect

    {"response": "A) Tumor Center", "files_accessed": ["TMA_IHC_InvasionFront_CD3_1.png", "TMA_IHC_InvasionFront_CD3_0.png", "TMA_IHC_TumorCenter_CD3_1.png", "TMA_IHC_TumorCenter_CD3_0.png"]}

Item 7 (67% solve rate)

What can be concluded about CD8+ cytotoxic T-cell infiltration in the invasion front based on the IHC images?

  • A. High CD8+ T-cell density throughout the invasion front

  • B. CD8+ T-cell density is highest in the second invasion front image

  • C. Minimal CD8+ T-cell infiltration in both invasion front images

  • D. CD8+ T-cell infiltration is higher than CD3+ T-cell infiltration in the invasion front

  • E. CD8+ T-cell infiltration is higher in the tumor center compared to the invasion front

  • F. No CD8+ T-cell infiltration is observed

How subjects answered

  • gemma3-12b-with-tools correct

    {"response": "C) Minimal CD8+ T-cell infiltration in both invasion front images", "files_accessed": ["TMA_IHC_InvasionFront_CD8_1.png", "TMA_IHC_InvasionFront_CD8_0.png"]}

  • internvl3-38b correct

    {"response": "C) Minimal CD8+ T-cell infiltration in both invasion front images", "files_accessed": ["TMA_IHC_InvasionFront_CD8_1.png", "TMA_IHC_InvasionFront_CD8_0.png"]}

  • gemma3-27b-with-tools correct

    {"response": "C) Minimal CD8+ T-cell infiltration in both invasion front images", "files_accessed": ["TMA_IHC_InvasionFront_CD8_1.png", "TMA_IHC_InvasionFront_CD8_0.png"]}

  • gemma3-27b incorrect

    {"response": "B) CD8+ T-cell density is highest in the second invasion front image"}

  • qwen25-7b incorrect

    {"response": "B) CD8+ T-cell density is highest in the second invasion front image", "files_accessed": ["TMA_IHC_InvasionFront_CD8_0.png", "TMA_IHC_InvasionFront_CD8_1.png"]}

  • qwen25-32b-with-tools incorrect

    {"response": "B) CD8+ T-cell density is highest in the second invasion front image", "files_accessed": ["TMA_IHC_TumorCenter_CD8_1.png", "TMA_IHC_InvasionFront_CD8_1.png", "TMA_IHC_TumorCenter_CD8_0.png", "TMA_IHC_InvasionFront_CD8_0.png"]}

Item 8 (73% solve rate)

Upon reviewing the H&E images, is there evidence of lymphovascular invasion (LVI) in the tumor tissue?

  • A. Yes, there is lymphovascular invasion

  • B. No, lymphovascular invasion is absent

  • C. LVI cannot be determined from the H&E images

  • D. LVI is present only in the lymph nodes

  • E. LVI is present only in the tumor center

  • F. LVI is present only at the invasion front

How subjects answered

  • gemma3-12b-with-tools correct

    {"response": "B) No, lymphovascular invasion is absent", "files_accessed": ["TMA_IHC_InvasionFront_HE_1.png", "TMA_IHC_InvasionFront_HE_0.png"]}

  • gpt4o correct

    {"response": "B) No, lymphovascular invasion is absent", "files_accessed": ["PrimaryTumor_HE_740_roi.jpg", "PrimaryTumor_HE_740.jpg"]}

  • gemma3-27b-with-tools correct

    {"response": "B) No, lymphovascular invasion is absent", "files_accessed": ["PrimaryTumor_HE_740_roi.jpg", "TMA_IHC_InvasionFront_HE_1.png", "TMA_IHC_TumorCenter_HE_1.png", "TMA_IHC_TumorCenter_HE_0.png", "TMA_IHC_InvasionFront_HE_0.png", "PrimaryTumor_HE_740.jpg"]}

  • gemma3-27b incorrect

    {"response": "A) Yes, there is lymphovascular invasion", "files_accessed": ["PrimaryTumor_HE_740.jpg", "PrimaryTumor_HE_740_roi.jpg"]}

  • gemma3-12b incorrect

    {"response": "C) LVI cannot be determined from the H&E images", "files_accessed": ["PrimaryTumor_HE_740.jpg", "PrimaryTumor_HE_740_roi.jpg"]}

  • internvl3-78b incorrect

    {"response": "A) Yes, there is lymphovascular invasion", "files_accessed": ["PrimaryTumor_HE_740.jpg", "PrimaryTumor_HE_740_roi.jpg"]}

Item 9 (80% solve rate)

Is there any evidence of vascular invasion on the histopathology image from the primary site?

  • A. Yes, significant vascular invasion is present

  • B. No, vascular invasion is not identified

  • C. Possibly, but further staining is required to confirm

  • D. Yes, mild vascular invasion is noted

  • E. Yes, with perineural invasion also present

  • F. No, but lymphovascular invasion is present

How subjects answered

  • gemma3-12b correct

    {"response": "B) No, vascular invasion is not identified", "files_accessed": ["PrimaryTumor_HE_476_roi.jpg", "PrimaryTumor_HE_476.jpg"]}

  • internvl3-38b correct

    {"response": "B) No, vascular invasion is not identified", "files_accessed": ["PrimaryTumor_HE_476.jpg", "PrimaryTumor_HE_476_roi.jpg"]}

  • gpt4o correct

    {"response": "B) No, vascular invasion is not identified", "files_accessed": ["PrimaryTumor_HE_476.jpg", "PrimaryTumor_HE_476_roi.jpg"]}

  • gemma3-27b-with-tools incorrect

    {"response": "D) Yes, mild vascular invasion is noted", "files_accessed": ["PrimaryTumor_HE_476.jpg", "TMA_IHC_InvasionFront_HE_0.png", "TMA_IHC_InvasionFront_HE_1.png"]}

  • gemma3-12b-with-tools incorrect

    {"response": "C) Possibly, but further staining is required to confirm", "files_accessed": ["PrimaryTumor_HE_476.jpg"]}

  • gemma3-27b incorrect

    {"response": "D) Yes, mild vascular invasion is noted", "files_accessed": ["TMA_IHC_TumorCenter_HE_1.png", "PrimaryTumor_HE_476_roi.jpg", "TMA_IHC_TumorCenter_HE_0.png", "PrimaryTumor_HE_476.jpg", "TMA_IHC_InvasionFront_HE_0.png", "TMA_IHC_InvasionFront_HE_1.png"]}

Item 10 (87% solve rate)

Which hematological finding from the blood tests suggests an increased risk of infection or immune compromise during the perioperative period?

  • A. Elevated leukocyte count and low lymphocyte percentage

  • B. Normal leukocyte count and high granulocyte percentage

  • C. Elevated monocyte count and normal lymphocyte percentage

  • D. Normal leukocyte count and low lymphocyte percentage

  • E. Low leukocyte count and normal granulocyte percentage

  • F. Elevated lymphocyte count and low granulocyte percentage

How subjects answered

  • gemma3-12b correct

    {"response": "D) Normal leukocyte count and low lymphocyte percentage", "files_accessed": ["patient_blood_data.json"]}

  • gemma3-12b-with-tools correct

    {"response": "D) Normal leukocyte count and low lymphocyte percentage"}

  • gemma3-27b correct

    {"response": "D) Normal leukocyte count and low lymphocyte percentage", "files_accessed": ["patient_blood_data.json"]}

  • qwen25-32b correct

    {"response": "D) Normal leukocyte count and low lymphocyte percentage"}

  • qwen25-7b-with-tools incorrect

    {"response": "C) Elevated monocyte count and normal lymphocyte percentage", "files_accessed": ["patient_blood_data.json"]}

  • qwen25-7b incorrect

    {"response": "C) High granulocyte count and low lymphocyte percentage", "files_accessed": ["patient_blood_data.json"]}

Item 11 (93% solve rate)

Based on the renal function panel, which of the following findings indicates potential renal dysfunction that may influence perioperative medication clearance?

  • A. Elevated urea with a normal creatinine level.

  • B. Elevated creatinine with a reduced glomerular filtration rate (GFR).

  • C. Normal urea and creatinine levels but reduced GFR.

  • D. Reduced urea with a normal creatinine level.

  • E. Normal creatinine with elevated calcium levels.

  • F. Reduced urea and creatinine levels with a normal GFR.

How subjects answered

  • gemma3-12b correct

    {"response": "B) Elevated creatinine with a reduced glomerular filtration rate (GFR)", "files_accessed": ["patient_blood_data.json"]}

  • gemma3-12b-with-tools correct

    {"response": "B) Elevated creatinine with a reduced glomerular filtration rate (GFR)"}

  • gemma3-27b correct

    {"response": "B) Elevated creatinine with a reduced glomerular filtration rate (GFR)", "files_accessed": ["patient_blood_data.json"]}

  • qwen25-32b correct

    {"response": "B) Elevated creatinine with a reduced glomerular filtration rate (GFR)."}

  • qwen25-32b-with-tools correct

    {"response": "B) Elevated creatinine with a reduced glomerular filtration rate (GFR)."}

  • qwen25-7b-with-tools incorrect

    {"response": "C) Normal urea and creatinine levels but reduced GFR"}

Item 12 (100% solve rate)

What is the histologic type of the primary tumor as observed on the H&E-stained sections?

  • A. Adenocarcinoma

  • B. Squamous Cell Carcinoma (Basaloid subtype)

  • C. Small Cell Carcinoma

  • D. Neuroendocrine Tumor

  • E. Mucinous Adenocarcinoma

  • F. Sarcoma

How subjects answered

  • gemma3-12b correct

    {"response": "B) Squamous Cell Carcinoma (Basaloid subtype)", "files_accessed": ["PrimaryTumor_HE_761.jpg"]}

  • gemma3-12b-with-tools correct

    {"response": "B) Squamous Cell Carcinoma (Basaloid subtype)", "files_accessed": ["PrimaryTumor_HE_761.jpg"]}

  • gemma3-27b correct

    {"response": "B) Squamous Cell Carcinoma (Basaloid subtype)", "files_accessed": ["PrimaryTumor_HE_761.jpg", "PrimaryTumor_HE_761_roi.jpg"]}

  • qwen25-7b-with-tools correct

    {"response": "B) Squamous Cell Carcinoma (Basaloid subtype)", "files_accessed": ["PrimaryTumor_HE_761.jpg", "PrimaryTumor_HE_761_roi.jpg"]}

  • qwen25-32b correct

    {"response": "B) Squamous Cell Carcinoma (Basaloid subtype)", "files_accessed": ["PrimaryTumor_HE_761_roi.jpg", "PrimaryTumor_HE_761.jpg"]}

  • qwen25-32b-with-tools correct

    {"response": "B) Squamous Cell Carcinoma (Basaloid subtype)", "files_hallucinated": ["TMA_IHC_TumorCenter_p40_0.png", "TMA_IHC_TumorCenter_CK5_6_0.png", "TMA_IHC_TumorCenter_p63_0.png", "TMA_IHC_TumorCenter_MUC6_0.png", "TMA_IHC_TumorCenter_MUC6_1.png", "TMA_IHC_TumorCenter_MUC5AC_1.png", "TMA_IHC_TumorCenter_p63_1.png", "TMA_IHC_TumorCenter_p40_1.png", "TMA_IHC_TumorCenter_MUC5AC_0.png", "TMA_IHC_TumorCenter_CK5_6_1.png"]}

Subjects

The models, agents, and reward models evaluated.

23 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. qwen3-32b-with-tools: 0.6995
  2. qwen3-32b: 0.6831
  3. llama33-70b-with-tools: 0.6721
  4. internvl3-78b: 0.6718
  5. llama33-70b: 0.6667
  6. gpt4o: 0.6597
  7. gemma3-12b-with-tools: 0.6527
  8. mistrallsmall: 0.6462
  9. internvl3-38b-with-tools: 0.641
  10. o4-mini: 0.6283
  11. gemma3-27b-with-tools: 0.6195
  12. llama90b: 0.6187
  13. internvl3-38b: 0.6154
  14. qwen25-32b-with-tools: 0.6154
  15. gemma3-12b: 0.6003
  16. qwen25-32b: 0.6
  17. qwen3-8b-with-tools: 0.5902
  18. llama31-8b-with-tools: 0.5683
  19. qwen3-8b: 0.5683
  20. qwen25-7b-with-tools: 0.5641
  21. gemma3-27b: 0.5567
  22. llama31-8b: 0.5246
  23. qwen25-7b: 0.4897
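Nine of the listed subjects appear both with and without the "-with-tools" suffix, so the ranking supports a rough comparison of tool use. The sketch below pairs each tool-using variant with its base model and reports the difference in mean accuracy. The scores are copied from the list above; the function name `tool_deltas` is hypothetical, and the pairing by name suffix is an assumption about how the variants relate.

```python
from typing import Dict

SUFFIX = "-with-tools"

# Mean accuracies as listed in the ranking above.
SCORES: Dict[str, float] = {
    "qwen3-32b-with-tools": 0.6995, "qwen3-32b": 0.6831,
    "llama33-70b-with-tools": 0.6721, "internvl3-78b": 0.6718,
    "llama33-70b": 0.6667, "gpt4o": 0.6597,
    "gemma3-12b-with-tools": 0.6527, "mistrallsmall": 0.6462,
    "internvl3-38b-with-tools": 0.641, "o4-mini": 0.6283,
    "gemma3-27b-with-tools": 0.6195, "llama90b": 0.6187,
    "internvl3-38b": 0.6154, "qwen25-32b-with-tools": 0.6154,
    "gemma3-12b": 0.6003, "qwen25-32b": 0.6,
    "qwen3-8b-with-tools": 0.5902, "llama31-8b-with-tools": 0.5683,
    "qwen3-8b": 0.5683, "qwen25-7b-with-tools": 0.5641,
    "gemma3-27b": 0.5567, "llama31-8b": 0.5246, "qwen25-7b": 0.4897,
}


def tool_deltas(scores: Dict[str, float]) -> Dict[str, float]:
    """Map each base model to (with-tools accuracy minus base accuracy)."""
    deltas = {}
    for name, value in scores.items():
        if name.endswith(SUFFIX):
            base = name[: -len(SUFFIX)]
            if base in scores:  # skip variants whose base model is not listed
                deltas[base] = round(value - scores[base], 4)
    return deltas


if __name__ == "__main__":
    ranked = sorted(tool_deltas(SCORES).items(), key=lambda kv: kv[1], reverse=True)
    for base, delta in ranked:
        print(f"{base:15s} {delta:+.4f}")
```

All nine differences are positive on the listed numbers. Since only 64% of cells are observed, the two variants of a model may have been scored on different item subsets, so these differences are indicative rather than a strictly paired comparison.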