
Multimodal

GeckoNum

GeckoNum: numerical reasoning in text-to-image models. 1,386 numeric prompts are rendered by 7 T2I models (5 seeds); humans annotate each generated image on three tasks (object counting, relational description choice, and yes/no DSG questions). Each row grades whether one model's generated image matched the numeric prompt.

2,027 items
7 subjects
100% observed
License: CC-BY-4.0
Domain: general
Modalities: image, text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 7 subjects × 2,027 items, 100% of cells evaluated.
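One way to produce the ordering described above is to sort rows by each subject's mean score and columns by each item's solve rate. The following is a minimal sketch under that assumption; the page does not state the exact sorting rule, and the function name, toy subjects, and toy matrix are hypothetical, not GeckoNum data.

```python
def order_response_matrix(matrix, subjects):
    """Sort rows by descending mean score and columns by descending solve rate.

    Assumption: `matrix` is a fully observed list of 0/1 rows, one row per subject.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Mean response per subject (row) and solve rate per item (column).
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # Strongest subjects first, easiest items first.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return [subjects[r] for r in row_order], col_order, ordered


# Hypothetical toy data: 3 subjects x 4 items.
toy_subjects = ["model_x", "model_y", "model_z"]
toy_matrix = [
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
]
names, col_order, ordered = order_response_matrix(toy_matrix, toy_subjects)
for name, row in zip(names, ordered):
    print(name, row)
```

With this ordering the correct cells cluster in the top-left corner, which makes it easy to spot items that weaker subjects solve but stronger ones miss.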


[Figure: GeckoNum response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like, and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate · answer: 2

Prompt: There is 1 burger below 2 flutes. Question: How many flutes are in the image?

Subject outcomes

  • dalle_3 incorrect
  • imagen_a incorrect
  • imagen_b incorrect
  • imagen_d incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 2 · 2% solve rate · answer: 5

Prompt: 5 corkscrews. Question: How many corkscrews are in the image?

Subject outcomes

  • dalle_3 incorrect
  • imagen_a incorrect
  • imagen_b incorrect
  • imagen_d incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 3 · 4% solve rate · answer: 2

Prompt: An image with some books and some cats. There are fewer books than cats. Task: choose the text description that best describes the image.

Subject outcomes

  • imagen_d correct
  • muse_b correct
  • dalle_3 incorrect
  • imagen_a incorrect
  • imagen_c incorrect
  • muse_a incorrect
Item 4 · 7% solve rate · answer: 4

Prompt: 2 samosas, two flutes and four spoons. Question: How many spoons are in the image?

Subject outcomes

  • dalle_3 incorrect
  • imagen_a incorrect
  • imagen_b incorrect
  • imagen_d incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 5 · 11% solve rate · answer: 5

Prompt: 5 pizzas, 4 books and three okras. Question: How many pizzas are in the image?

Subject outcomes

  • dalle_3 correct
  • imagen_b correct
  • imagen_c correct
  • imagen_d incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 6 · 15% solve rate · answer: 9

Prompt: 9 pizzas. Question: How many pizzas are in the image?

Subject outcomes

  • dalle_3 correct
  • muse_b correct
  • imagen_a incorrect
  • imagen_c incorrect
  • imagen_d incorrect
  • muse_a incorrect
Item 7 · 21% solve rate · answer: 4

Prompt: 3 black apples and four black mushrooms. Question: How many black apples are in the image?

Subject outcomes

  • dalle_3 correct
  • imagen_a incorrect
  • imagen_b incorrect
  • imagen_d incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 8 · 29% solve rate · answer: 4

Prompt: Four coconuts. Question: How many coconuts are in the image?

Subject outcomes

  • muse_b correct
  • muse_a correct
  • dalle_3 incorrect
  • imagen_a incorrect
  • imagen_d incorrect
  • imagen_c incorrect
Item 9 · 40% solve rate · answer: 3

Prompt: There are 3 seahorses below 3 pencils. Question: How many seahorses are in the image?

Subject outcomes

  • dalle_3 correct
  • imagen_d correct
  • imagen_a incorrect
  • imagen_c incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 10 · 53% solve rate · answer: 3

Prompt: A picture of 3 samosas. Question: How many samosas are in the image?

Subject outcomes

  • dalle_3 correct
  • imagen_a correct
  • imagen_b incorrect
  • imagen_d incorrect
  • muse_a incorrect
  • muse_b incorrect
Item 11 · 70% solve rate · answer: 3

Prompt: Three seahorses. Question: How many seahorses are in the image?

Subject outcomes

  • dalle_3 correct
  • imagen_a correct
  • imagen_c correct
  • imagen_d correct
  • imagen_b incorrect
  • muse_a incorrect
Item 12 · 91% solve rate · answer: 3

Prompt: Three green bottles. Question: How many bottles are in the image?

Subject outcomes

  • dalle_3 correct
  • imagen_a correct
  • imagen_b correct
  • imagen_d correct
  • muse_a correct
  • muse_b correct

Subjects

The models, agents, and reward models evaluated.

7 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. dalle_3: 0.4317
  2. muse_b: 0.3601
  3. imagen_c: 0.3382
  4. muse_a: 0.3299
  5. imagen_d: 0.2817
  6. imagen_a: 0.2683
  7. imagen_b: 0.266
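
A ranking by mean response like the one above can be reproduced from per-item 0/1 outcomes. This is a minimal sketch, assuming each subject's responses are available as a list of 0/1 values; the function name and the toy outcomes are hypothetical and are not taken from GeckoNum.

```python
def rank_subjects(responses):
    """Return (subject, mean response) pairs, highest mean first.

    Assumption: `responses` maps each subject name to a non-empty list of
    0/1 outcomes, one per item, with no unobserved cells.
    """
    scores = {name: sum(outcomes) / len(outcomes)
              for name, outcomes in responses.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy outcomes for three subjects on four items.
toy_responses = {
    "model_x": [1, 0, 0, 0],
    "model_y": [1, 1, 1, 0],
    "model_z": [1, 1, 0, 0],
}
ranking = rank_subjects(toy_responses)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

Because every cell is observed here, the mean response is simply the fraction of items scored correct; a partially observed matrix would need the mean taken over observed cells only.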