Multimodal

SugarCrepe

SugarCrepe is a vision-language compositionality benchmark, presented as non-gameable, of 7,512 (image, positive caption, hard-negative caption) triples across 7 hard-negative types (replace_obj, replace_att, replace_rel, swap_obj, swap_att, add_obj, add_att). Models must pick the caption that matches the COCO-2017 image. This build ingests the released per-item GPT-4V responses from the official repo, in two caption-presentation orders (positive-first and negative-first). GPT-4V is the only model with per-instance outputs released; the 17 CLIP models in the paper have only aggregate accuracy.
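As a rough illustration of the two caption-presentation orders, the sketch below builds both presentations for one triple and records which option is the correct one in each. The `Triple` class, the `two_orders` and `is_correct` helpers, and the "A"/"B" option labels are hypothetical names chosen for this example; the actual prompt wording and response format used in the official repo are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One benchmark triple (field names are illustrative, not the repo's schema)."""
    image: str      # COCO-2017 file name
    positive: str   # caption that matches the image
    negative: str   # hard-negative caption
    neg_type: str   # hard-negative type, e.g. "add_att"


def two_orders(t: Triple):
    """Return both presentations as (order_name, option_a, option_b, correct_label)."""
    return [
        ("positive-first", t.positive, t.negative, "A"),
        ("negative-first", t.negative, t.positive, "B"),
    ]


def is_correct(chosen_label: str, correct_label: str) -> bool:
    """Score one response: 1 if the chosen option is the positive caption."""
    return chosen_label.strip().upper() == correct_label


# Example using the first sample item listed on this page.
triple = Triple(
    image="000000295713.jpg",
    positive="A woman sitting on a bench near a statue, checking her phone.",
    negative="A woman sitting on a wooden bench near a statue, checking her phone.",
    neg_type="add_att",
)
for order_name, option_a, option_b, correct in two_orders(triple):
    print(order_name, "| A:", option_a, "| B:", option_b, "| correct:", correct)
```

Presenting each triple in both orders makes it possible to separate a genuine preference for the positive caption from a preference for whichever option is listed first.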

7,512 items

1 subject

100% observed

MIT license

general domain

image modality

text modality


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 1 subject × 7,512 items, 100% of cells evaluated.
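One way to get the ordering described above is to sort rows by each model's mean score and columns by each item's mean score, both descending. The sketch below is a minimal illustration under that assumption. The `order_matrix` function and the toy subjects and items are hypothetical, and the page does not state which ordering rule it actually uses.

```python
def order_matrix(matrix, subjects, items):
    """Sort rows by subject mean and columns by item mean, both descending.

    matrix[i][j] is 1 (correct), 0 (incorrect) or None (unobserved).
    Unobserved cells are ignored when computing means.
    """
    def mean(values):
        seen = [v for v in values if v is not None]
        return sum(seen) / len(seen) if seen else 0.0

    row_means = [mean(row) for row in matrix]
    col_means = [mean([row[j] for row in matrix]) for j in range(len(items))]

    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(len(subjects)), key=lambda i: -row_means[i])
    col_order = sorted(range(len(items)), key=lambda j: -col_means[j])

    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], [items[j] for j in col_order]


# Toy data with made-up subjects and items.
toy = [
    [0, 1, 0],  # "weak"
    [1, 1, 0],  # "mid"
    [1, 1, 1],  # "strong"
]
ordered, row_names, col_names = order_matrix(
    toy, ["weak", "mid", "strong"], ["i1", "i2", "i3"]
)
for name, row in zip(row_names, ordered):
    print(name, row)
```

With this rule the correct cells cluster in the top-left corner. With a single subject, as in this build, only the column order carries information.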

[Figure: SugarCrepe response matrix, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
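The solve rates shown for each item can be read as the fraction of subjects that answered the item correctly. The sketch below assumes that definition, which the page does not spell out, and then picks items evenly spaced along the difficulty ranking. `solve_rates` and `spread` are hypothetical helpers, and the sampling rule is an illustration, not necessarily the one used to choose the items listed here.

```python
def solve_rates(responses):
    """Map {item_id: {subject: score}} to {item_id: mean score over subjects}."""
    return {
        item: sum(scores.values()) / len(scores)
        for item, scores in responses.items()
        if scores  # skip items with no observed responses
    }


def spread(rates, k):
    """Pick up to k items evenly spaced along the ranking, hardest first."""
    ranked = sorted(rates, key=lambda item: rates[item])
    if k >= len(ranked):
        return ranked
    if k <= 1:
        return ranked[:max(k, 0)]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Toy responses with made-up item and subject names.
toy = {
    "item_a": {"m1": 0, "m2": 0},
    "item_b": {"m1": 1, "m2": 0},
    "item_c": {"m1": 1, "m2": 1},
}
rates = solve_rates(toy)
print(rates)
print(spread(rates, 2))
```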

Item 1 · 0% solve rate · answer: A woman sitting on a bench near a statue, checking her phone.

Vision-language compositionality task (SugarCrepe, hard-negative type: add_att). Image: 000000295713.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A woman sitting on a bench near a statue, checking her phone." Hard-negative caption: "A woman sitting on a wooden bench near a statue, checking her phone."

Subject outcomes

GPT-4V incorrect

Item 2 · 50% solve rate · answer: a man wearing a hat holding a dog wearing a hat

Vision-language compositionality task (SugarCrepe, hard-negative type: add_obj). Image: 000000236166.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "a man wearing a hat holding a dog wearing a hat" Hard-negative caption: "A man wearing a hat and scarf is holding a dog wearing a hat."

Subject outcomes

GPT-4V incorrect

Item 3 · 100% solve rate · answer: A teddy bear sitting with a Finding Nemo stuffed animal on a bench.

Vision-language compositionality task (SugarCrepe, hard-negative type: replace_att). Image: 000000322844.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A teddy bear sitting with a Finding Nemo stuffed animal on a bench." Hard-negative caption: "A teddy bear sitting with a Toy Story stuffed animal on a bench."

Subject outcomes

GPT-4V correct

Item 4 · 100% solve rate · answer: A large pizza is shown on a table, with bottles and glasses of wine.

Vision-language compositionality task (SugarCrepe, hard-negative type: replace_obj). Image: 000000116208.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A large pizza is shown on a table, with bottles and glasses of wine." Hard-negative caption: "A large pasta dish is shown on a table, with bottles and glasses of wine."

Subject outcomes

GPT-4V correct

Item 5 · 100% solve rate · answer: A baby elephant stands near to four larger adult elephants.

Vision-language compositionality task (SugarCrepe, hard-negative type: swap_att). Image: 000000268375.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A baby elephant stands near to four larger adult elephants." Hard-negative caption: "Four baby elephants stand near to a larger adult elephant."

Subject outcomes

GPT-4V correct

Item 6 · 100% solve rate · answer: A cross country skier listens to music as she glides across the snow.

Vision-language compositionality task (SugarCrepe, hard-negative type: replace_obj). Image: 000000439522.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A cross country skier listens to music as she glides across the snow." Hard-negative caption: "An ice skater listens to music as she glides across the ice."

Subject outcomes

GPT-4V correct

Item 7 · 100% solve rate · answer: A little cat is sitting on the table and watching TV.

Vision-language compositionality task (SugarCrepe, hard-negative type: add_att). Image: 000000098839.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A little cat is sitting on the table and watching TV." Hard-negative caption: "A little cat is sitting on the glass table and watching TV."

Subject outcomes

GPT-4V correct

Item 8 · 100% solve rate · answer: A plate of vegetables, chicken, and white rice.

Vision-language compositionality task (SugarCrepe, hard-negative type: replace_att). Image: 000000003845.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A plate of vegetables, chicken, and white rice." Hard-negative caption: "A plate of vegetables, chicken, and brown rice."

Subject outcomes

GPT-4V correct

Item 9 · 100% solve rate · answer: A motorcycle is on esplanade at the car show.

Vision-language compositionality task (SugarCrepe, hard-negative type: add_obj). Image: 000000456394.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A motorcycle is on esplanade at the car show." Hard-negative caption: "Balloons are floating above a motorcycle on the esplanade at the car show."

Subject outcomes

GPT-4V correct

Item 10 · 100% solve rate · answer: A desk with a computer, computer keyboard, mug, empty plastic bottle, red pen, and a laptop.

Vision-language compositionality task (SugarCrepe, hard-negative type: swap_att). Image: 000000226171.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A desk with a computer, computer keyboard, mug, empty plastic bottle, red pen, and a laptop." Hard-negative caption: "A desk with a computer, computer keyboard, mug, red plastic bottle, empty pen, and a laptop."

Subject outcomes

GPT-4V correct

Item 11 · 100% solve rate · answer: A group of people walking past a tall clock.

Vision-language compositionality task (SugarCrepe, hard-negative type: swap_att). Image: 000000409542.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A group of people walking past a tall clock." Hard-negative caption: "A tall group of people walking past a clock."

Subject outcomes

GPT-4V correct

Item 12 · 100% solve rate · answer: A newly married couple standing next to a four layer wedding cake.

Vision-language compositionality task (SugarCrepe, hard-negative type: replace_rel). Image: 000000363207.jpg (COCO-2017 val). Which caption correctly describes the image? Positive caption: "A newly married couple standing next to a four layer wedding cake." Hard-negative caption: "A newly married couple sitting at a table in front of a four layer wedding cake."

Subject outcomes

GPT-4V correct

Subjects

The models, agents, and reward models evaluated.

1 subject, ranked by mean response (accuracy) across this benchmark's items.

1. GPT-4V · 0.9219
