Vision-language compositionality task (SugarCrepe, hard-negative type: add_att).
Image: 000000295713.jpg (COCO-2017 val).
Which caption correctly describes the image?
Positive caption: "A woman sitting on a bench near a statue, checking her phone."
Hard-negative caption: "A woman sitting on a wooden bench near a statue, checking her phone."
Subject outcomes:
- GPT-4V: incorrect
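
In this item the hard-negative caption differs from the positive caption only by an inserted attribute. The following is a minimal sketch of how that inserted attribute could be recovered from the two captions, assuming simple whitespace tokenization. The function name `added_tokens` is hypothetical and is not part of SugarCrepe or any evaluation harness.

```python
import difflib
from typing import List


def added_tokens(positive: str, negative: str) -> List[str]:
    """Return the tokens inserted into `negative` relative to `positive`.

    Assumption: captions are split on whitespace, so punctuation stays
    attached to its word (e.g. "statue,").
    """
    pos = positive.split()
    neg = negative.split()
    # autojunk=False disables difflib's popularity heuristic, which is
    # intended for much longer sequences than a single caption.
    matcher = difflib.SequenceMatcher(a=pos, b=neg, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            added.extend(neg[j1:j2])
    return added


positive = "A woman sitting on a bench near a statue, checking her phone."
negative = "A woman sitting on a wooden bench near a statue, checking her phone."

print(added_tokens(positive, negative))  # ['wooden']
```

The single recovered token, "wooden", is the attribute that the image does not support and that separates the two captions.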
