FineGRAIN (T2I)

FineGRAIN T2I failure-mode benchmark: ~17 text-to-image models × 760 prompts, each prompt tagged with one of 27 fine-grained failure modes (counting, colour/shape/texture attribute binding, spatial relations, physics, text rendering, negation, perspective, ...) across 11 categories. Each generated image carries a human label for whether the prompt's failure mode is present; the response is the human success verdict (1 = no failure / prompt rendered correctly, 0 = failure).

760 items
17 subjects
99% observed
Subject type: Model
License: MIT
Domain: general
Modality: text, image

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 17 subjects × 760 items, 99% of cells evaluated.
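
The page does not say how this ordering is computed. A minimal sketch of one plausible approach, assuming rows are sorted by each subject's mean response and columns by each item's solve rate, with unobserved cells stored as `None` and skipped. The function names and the toy data are hypothetical, not part of the benchmark.

```python
def observed_mean(values):
    """Mean of the observed (non-None) entries, or None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


def order_matrix(matrix, subjects, items):
    """Sort rows by subject accuracy and columns by item solve rate, both descending."""
    row_score = [observed_mean(row) for row in matrix]
    col_score = [observed_mean([row[j] for row in matrix]) for j in range(len(items))]

    # Rows or columns with no observed cells are pushed to the end.
    def key(score):
        return -1.0 if score is None else score

    row_order = sorted(range(len(subjects)), key=lambda i: key(row_score[i]), reverse=True)
    col_order = sorted(range(len(items)), key=lambda j: key(col_score[j]), reverse=True)

    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], [items[j] for j in col_order]


# Toy 3 x 3 matrix (assumed data): 1 = correct, 0 = incorrect, None = unobserved.
toy = [
    [1, 0, None],
    [1, 1, 1],
    [0, 0, 1],
]
ordered, ordered_subjects, ordered_items = order_matrix(
    toy, ["a", "b", "c"], ["i1", "i2", "i3"]
)
```

With this ordering the strongest subject and the easiest item end up in the top-left corner of the toy matrix.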

[Figure: FineGRAIN (T2I) response matrix, AI models (rows) against items (columns). Legend: Correct (1), Incorrect (0), Unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
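
A minimal sketch of how a per-item solve rate and a spread across the difficulty range could be computed. It assumes the solve rate is the share of observed responses equal to 1 and that the spread is taken at evenly spaced positions of the difficulty ranking. The page does not document either choice, and `solve_rate`, `spread_sample`, and the toy rates are hypothetical.

```python
def solve_rate(column):
    """Share of observed (non-None) responses equal to 1, or None if none observed."""
    observed = [r for r in column if r is not None]
    return sum(observed) / len(observed) if observed else None


def spread_sample(rates, k):
    """Pick k item ids at evenly spaced positions, ordered hardest to easiest."""
    ranked = sorted(
        (item for item, rate in rates.items() if rate is not None),
        key=lambda item: rates[item],
    )
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Toy example (assumed data): 14 correct and 3 incorrect out of 17 observed responses.
example_rate = solve_rate([1] * 14 + [0] * 3)
example_percent = round(100 * example_rate)

# Toy solve rates for five hypothetical items.
toy_rates = {"a": 0.10, "b": 0.50, "c": 0.82, "d": 0.87, "e": 1.00}
picked = spread_sample(toy_rates, 3)
```

Unobserved cells are excluded from the denominator, so two items answered correctly by the same number of subjects can still show different solve rates.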

Item 1: 82% solve rate

An arrangement of exactly two red apples and precisely three yellow bananas on a circular plate. Blur background, product photography.

How subjects answered

  • qwen-image correct

    images/qwen-image/counts_or_multiple_objects/0.png

  • gpt_image1 correct

    images/gpt_image1/counts_or_multiple_objects/0.png

  • sd3.5_medium incorrect

Item 2: 87% solve rate

A cyclist speeding down a hill, with frames capturing the downhill descent. Highlight the motion of the legs pedaling and the wheels spinning rapidly.

How subjects answered

  • wan22 correct

    images/wan22/action_and_motion_representation/21.png

  • seeDream3 correct

    images/seeDream3/action_and_motion_representation/21.png

  • sd3_xl incorrect

Item 3: 87% solve rate

A fabric tree trunk with a velvet texture, standing in a forest of regular trees. The forest is lit by the soft glow of the sunset.

How subjects answered

  • wan22 correct

    images/wan22/texture_attribute_binding/3.png

  • seeDream3 correct

    images/seeDream3/texture_attribute_binding/3.png

  • sd3_xl incorrect

Item 4: 100% solve rate

A circular clock with square numbers on the face, mounted on a triangular wall. The background is brightly lit.

How subjects answered

  • flux correct

  • seeDream3 correct

    images/seeDream3/shape_attribute_binding/17.png

  • sd2.1 correct

    images/sd2.1/shape_attribute_binding/17.png

Item 5: 100% solve rate

A banner at a sports event that says 'Go Team!' with fans cheering in the background.

How subjects answered

  • flux correct

  • seeDream3 correct

    images/seeDream3/text-based/18.png

  • sd2.1 correct

    images/sd2.1/text-based/18.png

Item 6: 100% solve rate

An oil painting depicting an event in ancient Rome. A long table shows clear social hierarchy. The painting should capture the subtle interplay of emotions, social status, and unspoken tensions typical of the era.

How subjects answered

  • flux correct

  • seeDream3 correct

    images/seeDream3/social_relations/0.png

  • sd2.1 correct

    images/sd2.1/social_relations/0.png
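
The image paths shown with the sample answers appear to follow the pattern `images/<subject>/<failure_mode>/<index>.png`. This is inferred from the examples above and is not a documented layout. A minimal sketch of a parser under that assumption, where `ImageRef` and `parse_image_path` are hypothetical names:

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ImageRef(NamedTuple):
    subject: str       # model folder, e.g. "sd2.1"
    failure_mode: str  # failure-mode folder, e.g. "shape_attribute_binding"
    index: int         # numeric file stem, assumed to identify the prompt within the folder


def parse_image_path(path):
    """Split an 'images/<subject>/<failure_mode>/<index>.png' path into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "images" or not parts[3].endswith(".png"):
        raise ValueError(f"unexpected image path layout: {path!r}")
    return ImageRef(parts[1], parts[2], int(PurePosixPath(parts[3]).stem))


ref = parse_image_path("images/sd2.1/shape_attribute_binding/17.png")
```

In the samples above, the same index appears under the same failure-mode folder for different subjects, which is consistent with the index identifying the prompt, though the page does not confirm this.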

Subjects

The models, agents, and reward models evaluated.

17 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gemini_image: 1
  2. wan22: 1
  3. sd2.1: 1
  4. qwen-image: 1
  5. hidream: 1
  6. seeDream3: 1
  7. sdv1.5: 1
  8. nano_banana2: 1
  9. flux2_pro: 1
  10. gpt_image15: 1
  11. gpt_image1: 1
  12. flux_kontext: 1
  13. flux: 0.516
  14. sd3.5_large: 0.516
  15. sd3_m: 0.516
  16. sd3_xl: 0.516
  17. sd3.5_medium: 0.516