Multimodal

FineGRAIN (T2I)

FineGRAIN is a failure-mode benchmark for text-to-image (T2I) generation: ~17 text-to-image models × 760 prompts, each prompt tagged with one of 27 fine-grained failure modes (counting, colour/shape/texture attribute binding, spatial relations, physics, text rendering, negation, perspective, ...) across 11 categories. Each generated image carries a human label for whether the prompt's failure mode is present. The response is the human success verdict: 1 = no failure (prompt rendered correctly), 0 = failure.

Items: 760
Subjects: 17
Observed: 99%
Subject type: Model
License: MIT
Domain: general
Modalities: text, image


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 17 subjects × 760 items, 99% of cells evaluated.

[Figure: FineGRAIN (T2I) response matrix, AI models (rows) against items (columns). Legend: 1 = correct, 0 = incorrect, blank = unobserved.]
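The ordering described above can be sketched as a sort of rows and columns by their mean response. This is a minimal illustration, not the gallery's actual build script: the function names, the use of `None` for unobserved cells, and the toy 3 × 3 matrix are all assumptions made for the example.

```python
def mean_observed(values):
    """Mean over observed cells; None marks an unobserved cell."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def order_response_matrix(matrix):
    """Sort rows (subjects) and columns (items) by descending mean response."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [mean_observed(row) for row in matrix]
    col_mean = [
        mean_observed([matrix[r][c] for r in range(n_rows)]) for c in range(n_cols)
    ]

    # Rows or columns with no observed cell sort last; ties keep input order.
    def key(mean):
        return (mean is None, -(mean or 0.0))

    row_order = sorted(range(n_rows), key=lambda r: key(row_mean[r]))
    col_order = sorted(range(n_cols), key=lambda c: key(col_mean[c]))
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


def observed_fraction(matrix):
    """Share of cells that carry a response (the "observed" figure)."""
    cells = [v for row in matrix for v in row]
    return sum(v is not None for v in cells) / len(cells)


# Hypothetical toy data: 3 subjects x 3 items, one unobserved cell.
toy = [
    [0, 1, 0],
    [1, 1, None],
    [1, 1, 0],
]
ordered, row_order, col_order = order_response_matrix(toy)
print(ordered)                 # strongest subject and easiest item come first
print(observed_fraction(toy))  # 8 of 9 cells observed
```

With this ordering, correct cells cluster toward the top-left corner and incorrect cells toward the bottom-right.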

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
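The solve rate shown with each item is presumably the share of subjects with an observed response who rendered the prompt correctly. The sketch below works under that assumption. The response counts are made up for illustration and are not taken from the benchmark data.

```python
def solve_rate(responses):
    """Fraction of observed responses that are correct (1); None = unobserved."""
    observed = [r for r in responses if r is not None]
    if not observed:
        return None  # no subject was evaluated on this item
    return sum(observed) / len(observed)


# Hypothetical item: 14 of 17 subjects correct, all cells observed.
fully_observed = [1] * 14 + [0] * 3
print(f"{solve_rate(fully_observed):.0%}")  # 82%

# Hypothetical item: 13 correct, 2 incorrect, 2 unobserved.
# Unobserved cells drop out of the denominator (13 / 15).
partly_observed = [1] * 13 + [0] * 2 + [None] * 2
print(f"{solve_rate(partly_observed):.0%}")  # 87%
```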

Item 1 (82% solve rate)

An arrangement of exactly two red apples and precisely three yellow bananas on a circular plate. Blur background, product photography.

How subjects answered

qwen-image correct
images/qwen-image/counts_or_multiple_objects/0.png
gpt_image1 correct
images/gpt_image1/counts_or_multiple_objects/0.png
sd3.5_medium incorrect

Item 2 (87% solve rate)

A cyclist speeding down a hill, with frames capturing the downhill descent. Highlight the motion of the legs pedaling and the wheels spinning rapidly.

How subjects answered

wan22 correct
images/wan22/action_and_motion_representation/21.png
seeDream3 correct
images/seeDream3/action_and_motion_representation/21.png
sd3_xl incorrect

Item 3 (87% solve rate)

A fabric tree trunk with a velvet texture, standing in a forest of regular trees. The forest is lit by the soft glow of the sunset.

How subjects answered

wan22 correct
images/wan22/texture_attribute_binding/3.png
seeDream3 correct
images/seeDream3/texture_attribute_binding/3.png
sd3_xl incorrect

Item 4 (100% solve rate)

A circular clock with square numbers on the face, mounted on a triangular wall. The background is brightly lit.

How subjects answered

flux correct
seeDream3 correct
images/seeDream3/shape_attribute_binding/17.png
sd2.1 correct
images/sd2.1/shape_attribute_binding/17.png

Item 5 (100% solve rate)

A banner at a sports event that says 'Go Team!' with fans cheering in the background.

How subjects answered

flux correct
seeDream3 correct
images/seeDream3/text-based/18.png
sd2.1 correct
images/sd2.1/text-based/18.png

Item 6 (100% solve rate)

An oil painting depicting an event in ancient Rome. A long table shows clear social hierarchy. The painting should capture the subtle interplay of emotions, social status, and unspoken tensions typical of the era.

How subjects answered

flux correct
seeDream3 correct
images/seeDream3/social_relations/0.png
sd2.1 correct
images/sd2.1/social_relations/0.png

Subjects

The models, agents, and reward models evaluated.

17 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject        Mean response
1     gemini_image   1
2     wan22          1
3     sd2.1          1
4     qwen-image     1
5     hidream        1
6     seeDream3      1
7     sdv1.5         1
8     nano_banana2   1
9     flux2_pro      1
10    gpt_image15    1
11    gpt_image1     1
12    flux_kontext   1
13    flux           0.516
14    sd3.5_large    0.516
15    sd3_m          0.516
16    sd3_xl         0.516
17    sd3.5_medium   0.516
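A ranking by mean response can be sketched as follows. This is an illustration under stated assumptions, not the page's actual code: the function name and the toy subjects (`model_a`, `model_b`, `model_c`) are hypothetical. Tied subjects keep their input order here, which may differ from how the page breaks ties.

```python
def rank_subjects(responses_by_subject):
    """Rank subjects by mean observed response (accuracy), highest first."""
    scored = []
    for name, responses in responses_by_subject.items():
        observed = [r for r in responses if r is not None]  # skip unobserved
        if observed:
            scored.append((name, sum(observed) / len(observed)))
    # Stable sort: subjects with equal accuracy keep their input order.
    scored.sort(key=lambda pair: -pair[1])
    return [(rank, name, round(acc, 3)) for rank, (name, acc) in enumerate(scored, 1)]


# Hypothetical responses: 1 = correct, 0 = incorrect, None = unobserved.
toy = {
    "model_a": [1, 0, 1, None],
    "model_b": [1, 1, 1, 1],
    "model_c": [0, 0, 1, 0],
}
for rank, name, acc in rank_subjects(toy):
    print(rank, name, acc)
```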
