AfriMed-QA

AfriMed-QA is a medical question-answering benchmark for African healthcare contexts. Each response is scored for binary correctness on multiple-choice (MCQ) items.

6,910 items
30 subjects
54% observed
Subject type: Model
License: CC-BY-NC-SA-4.0
Domains: medicine, multilingual
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. The matrix covers 30 subjects × 6,910 items, with 54% of cells evaluated. The heatmap shows a representative 90 of the 6,910 items, evenly sampled across difficulty, so each cell stays square and legible.
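The ordering and sampling described above can be sketched in a few lines. This is a minimal illustration, not the page's actual implementation: the function names are hypothetical, unobserved cells are represented as `None`, and "evenly sampled across difficulty" is assumed to mean evenly spaced positions in the difficulty-sorted column order.

```python
def mean_observed(values):
    """Mean of the observed (non-None) entries, or None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def _sort_key(value):
    # Assumption: rows or columns with no observations sort last.
    return -1.0 if value is None else value


def order_matrix(matrix, subjects):
    """Sort rows by descending accuracy and columns by descending solve rate.

    matrix[i][j] is 1 (correct), 0 (incorrect), or None (unobserved).
    Returns the reordered matrix, the reordered subject names, and the
    original item indices in their new column order.
    """
    n_items = len(matrix[0])
    row_acc = [mean_observed(row) for row in matrix]
    col_rate = [mean_observed([row[j] for row in matrix]) for j in range(n_items)]
    row_order = sorted(range(len(matrix)), key=lambda i: _sort_key(row_acc[i]), reverse=True)
    col_order = sorted(range(n_items), key=lambda j: _sort_key(col_rate[j]), reverse=True)
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], col_order


def observed_fraction(matrix):
    """Share of cells that were actually evaluated."""
    cells = [v for row in matrix for v in row]
    return sum(v is not None for v in cells) / len(cells)


def evenly_sample(n_total, k):
    """Pick k evenly spaced positions from 0..n_total-1 (columns already sorted)."""
    if k >= n_total:
        return list(range(n_total))
    if k == 1:
        return [0]
    return [round(i * (n_total - 1) / (k - 1)) for i in range(k)]


# Toy example: 3 subjects x 3 items.
toy = [[0, None, 1], [1, 1, 1], [0, 1, None]]
ordered, names, cols = order_matrix(toy, ["a", "b", "c"])
```

With the page's numbers, `evenly_sample(6910, 90)` would pick 90 column positions spanning the easiest to the hardest item.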

[Heatmap: AfriMed-QA response matrix, AI models (rows) against items (columns). Legend: correct (1) · incorrect (0) · unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate · answer: D
An 80-year-old man is referred from a nursing home with fever, rigors, nausea and vomiting. He is bedridden and has sacral decubitus ulcers.
Blood cultures grow Streptococcus milleri. The most likely site of infection is?

A) Sacral bedsores
B) Urinary tract infection
C) osteomyelitis
D) liver abscess
E) Endocarditis

How subjects answered

  • gemini_pro incorrect

    A

  • gemini_ultra incorrect

    A

  • gpt-3.5-turbo incorrect

    E

Item 2 · 35% solve rate · answer: option2
Theoretically, what kind of injury related to laparoscopic entry should be reduced by the Hasson (open) technique, compared to a Veress needle entry?

A. Bladder injury
B. Bowel injury
C. Major vessel injury
D. Splenic injury
E. Ureteric injury

How subjects answered

  • BioMistral-7B correct

    B

  • gpt-4 correct

    B

  • Meta-Llama-3-70B-Instruct incorrect

    C

Item 3 · 65% solve rate · answer: option3
Typhoid fever

A. can be caused by any of the Salmonella species
B. low infective dose of organism is required
C. carriage is associated with gall bladder disease
D. main stay of diagnosis is Widal test
E. chloramphenicol is the main drug of choice

How subjects answered

  • claude-3-opus-20240229 correct

    C

  • claude-3-5-sonnet-20240620 correct

    C

  • Meditron-7B incorrect

    B

Item 4 · 86% solve rate · answer: E
A 54 year old male patient has a history of cough productive of green sputum with shortness of breath and left pleuritic chest pain. His temperature is 38.4 C, respiratory rate 22 bpm, pulse 110 bpm, WBC 12.4 × 10^9/L and his CXR shows consolidation at the left base. What is the most likely causative organism for this presentation?

A) Haemophilus influenzae
B) Klebsiella pneumonia
C) Mycoplasma pneumoniae
D) Staphylococcus aureus
E) Streptococcus pneumoniae

How subjects answered

  • gemini_pro correct

    E

  • gemini_ultra correct

    E

  • gpt-4-turbo correct

    E

Item 5 · 100% solve rate · answer: B
Which respiratory condition is characterized by recurring episodes of wheezing, shortness of breath, and chest tightness?

A) Chronic obstructive pulmonary disease (COPD)
B) Asthma
C) Pneumonia
D) Tuberculosis
E) Emphysema

How subjects answered

  • gemini_pro correct

    B

  • gemini_ultra correct

    B

  • gpt-3.5-turbo correct

    B

Item 6 · 100% solve rate · answer: option3
A 46-year-old woman with a history of relapsing and remitting multiple sclerosis comes to the clinic for review. She has been stable for the past 6 months and is currently treated with oral fingolimod. Neurological examination reveals poor coordination, more marked on the left than the right, although there are no other significant signs. Which of the following correctly represents the mode of action of fingolimod?

A. B-cell cytotoxic
B. Mimics myelin peptides
C. Prevents lymphocyte movement across the blood–brain barrier
D. T-cell cytotoxic
E. Increases the expression and concentration of anti-inflammatory agent

How subjects answered

  • BioMistral-7B correct

    C

  • claude-3-sonnet-20240229 correct

    C

  • Meditron-7B correct

    C
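The sample items use two answer-key styles: a bare letter (`D`, `E`, `B`) and an `optionN` label (`option2`, `option3`). A minimal sketch of binary scoring that handles both is shown here. It assumes `optionN` is 1-based, which is inferred only from the samples above (answer B is marked correct against `option2`, and C against `option3`); the function names are hypothetical and the benchmark's own scoring code may differ.

```python
import re

LETTERS = "ABCDE"  # the sample items all have five choices


def normalize_choice(raw):
    """Map 'D', 'd)', 'option2', etc. to an uppercase letter, or None if unparseable."""
    text = raw.strip()
    match = re.fullmatch(r"option\s*(\d+)", text, flags=re.IGNORECASE)
    if match:
        index = int(match.group(1)) - 1  # assumption: option1 -> A
        return LETTERS[index] if 0 <= index < len(LETTERS) else None
    match = re.fullmatch(r"([A-Ea-e])[.)]?", text)
    return match.group(1).upper() if match else None


def score(prediction, answer_key):
    """Return 1 if the prediction matches the key, else 0 (binary correctness)."""
    key = normalize_choice(answer_key)
    if key is None:
        raise ValueError(f"unrecognized answer key: {answer_key!r}")
    # An unparseable prediction normalizes to None and is scored incorrect.
    return int(normalize_choice(prediction) == key)


# Mirrors Item 2 above: key 'option2', one model answers B, another answers C.
item2_scores = [score("B", "option2"), score("C", "option2")]
```

Scoring an unparseable prediction as incorrect is a design choice of this sketch; a stricter pipeline might record it as unobserved instead.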

Subjects

The models, agents, and reward models evaluated.

30 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gpt-4-turbo: 0.824
  2. gemini_ultra: 0.797
  3. gpt-4o: 0.793
  4. claude-3-5-sonnet-20240620: 0.777
  5. llama3-405b-instruct-maas: 0.763
  6. gpt-4: 0.757
  7. claude-3-opus-20240229: 0.746
  8. medpalm2: 0.742
  9. Meta-Llama-3-70B-Instruct: 0.738
  10. gpt-4o-mini: 0.718
  11. medlm: 0.702
  12. gemini_pro: 0.684
  13. gpt-3.5-turbo: 0.683
  14. Phi-3-medium-128k-instruct: 0.671
  15. Meta-Llama-3-8B-Instruct: 0.668
  16. claude-3-haiku-20240307: 0.664
  17. claude-3-sonnet-20240229: 0.65
  18. Meta-Llama-3.1-8B-Instruct: 0.619
  19. Phi-3-mini-4k-instruct: 0.604
  20. Mixtral-8x7B-Instruct-v0.1: 0.603
  21. Phi-3-mini-128k-instruct: 0.59
  22. Llama3-OpenBioLLM-70B: 0.586
  23. JSL-MedLlama-3-8B-v2.0: 0.573
  24. gpt-3.5-turbo-1106: 0.563
  25. Meditron-7B: 0.51
  26. Mistral-7B-Instruct-v0.3: 0.508
  27. Mistral-7B-Instruct-v0.2: 0.485
  28. Meta-Llama-3-8B: 0.472
  29. PMC-LLaMA-7B: 0.463
  30. BioMistral-7B: 0.44
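A ranking by mean response can be sketched as follows. Because only 54% of cells are evaluated, this sketch assumes each subject's mean is taken over the items it was actually scored on; the page does not confirm that detail, and the function name and toy data are hypothetical.

```python
def rank_subjects(responses):
    """Rank subjects by mean accuracy over their observed items.

    responses maps a subject name to a list of 1 (correct), 0 (incorrect),
    or None (unobserved). Returns (rank, name, accuracy, n_observed) tuples,
    best first. Subjects with no observed items are skipped.
    """
    rows = []
    for name, values in responses.items():
        seen = [v for v in values if v is not None]
        if seen:
            rows.append((name, sum(seen) / len(seen), len(seen)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(i + 1, name, round(acc, 3), n) for i, (name, acc, n) in enumerate(rows)]


# Toy data, not taken from the benchmark.
toy_responses = {
    "model_a": [1, 0, None, 1],
    "model_b": [1, 1, 1, None],
    "model_c": [None, None, None, None],
    "model_d": [0, 0, 1, 0],
}
ranking = rank_subjects(toy_responses)
```

Returning the observed-item count alongside each accuracy makes it visible when two subjects were scored on different numbers of items, in which case their means are not directly comparable.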