Multimodal

GeckoNum

GeckoNum tests numerical reasoning in text-to-image (T2I) models. Seven T2I models render 1,386 numeric prompts (5 seeds), and humans annotate each generated image on three tasks: object counting, relational description choice, and yes/no DSG questions. Each row grades whether one model's generated image matched the numeric prompt.

2,027 items

7 subjects

100% observed

CC-BY-4.0 license

general domain

image modality

text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 7 subjects × 2,027 items, 100% of cells evaluated.

Figure: GeckoNum response matrix, with AI models as rows and items as columns. Legend: Correct (1), Incorrect (0), Unobserved.
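The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the gallery's actual build script: the subject names, the toy values, and the helper `order_matrix` are all hypothetical, and it assumes a fully observed binary matrix sorted by row and column means.

```python
# Toy response matrix: rows are subjects, columns are items
# (1 = correct, 0 = incorrect). Names and values are made up for illustration.
responses = {
    "model_a": [1, 0, 0, 1],
    "model_b": [1, 1, 0, 1],
    "model_c": [0, 0, 0, 1],
}

def order_matrix(responses):
    """Sort rows by mean accuracy and columns by solve rate, both descending."""
    # Strongest subjects first.
    subjects = sorted(
        responses, key=lambda s: -sum(responses[s]) / len(responses[s])
    )
    n_items = len(next(iter(responses.values())))
    # Solve rate of an item = fraction of subjects that got it right.
    solve_rate = [
        sum(responses[s][j] for s in responses) / len(responses)
        for j in range(n_items)
    ]
    # Easiest items first.
    item_order = sorted(range(n_items), key=lambda j: -solve_rate[j])
    ordered = [[responses[s][j] for j in item_order] for s in subjects]
    return subjects, item_order, ordered

subjects, item_order, ordered = order_matrix(responses)
for name, row in zip(subjects, ordered):
    print(name, row)
```

With this toy input the correct cells collect in the upper-left corner, which is the visual effect the sorted matrix aims for.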

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate · answer: 2

Prompt: There is 1 burger below 2 flutes. Question: How many flutes are in the image?

Subject outcomes

dalle_3 incorrect
imagen_a incorrect
imagen_b incorrect
imagen_d incorrect
muse_a incorrect
muse_b incorrect

Item 2 · 2% solve rate · answer: 5

Prompt: 5 corkscrews. Question: How many corkscrews are in the image?

Subject outcomes

dalle_3 incorrect
imagen_a incorrect
imagen_b incorrect
imagen_d incorrect
muse_a incorrect
muse_b incorrect

Item 3 · 4% solve rate · answer: 2

Prompt: An image with some books and some cats. There are fewer books than cats. Task: choose the text description that best describes the image.

Subject outcomes

imagen_d correct
muse_b correct
dalle_3 incorrect
imagen_a incorrect
imagen_c incorrect
muse_a incorrect

Item 4 · 7% solve rate · answer: 4

Prompt: 2 samosas, two flutes and four spoons. Question: How many spoons are in the image?

Subject outcomes

dalle_3 incorrect
imagen_a incorrect
imagen_b incorrect
imagen_d incorrect
muse_a incorrect
muse_b incorrect

Item 5 · 11% solve rate · answer: 5

Prompt: 5 pizzas, 4 books and three okras. Question: How many pizzas are in the image?

Subject outcomes

dalle_3 correct
imagen_b correct
imagen_c correct
imagen_d incorrect
muse_a incorrect
muse_b incorrect

Item 6 · 15% solve rate · answer: 9

Prompt: 9 pizzas. Question: How many pizzas are in the image?

Subject outcomes

dalle_3 correct
muse_b correct
imagen_a incorrect
imagen_c incorrect
imagen_d incorrect
muse_a incorrect

Item 7 · 21% solve rate · answer: 4

Prompt: 3 black apples and four black mushrooms. Question: How many black apples are in the image?

Subject outcomes

dalle_3 correct
imagen_a incorrect
imagen_b incorrect
imagen_d incorrect
muse_a incorrect
muse_b incorrect

Item 8 · 29% solve rate · answer: 4

Prompt: Four coconuts. Question: How many coconuts are in the image?

Subject outcomes

muse_b correct
muse_a correct
dalle_3 incorrect
imagen_a incorrect
imagen_d incorrect
imagen_c incorrect

Item 9 · 40% solve rate · answer: 3

Prompt: There are 3 seahorses below 3 pencils. Question: How many seahorses are in the image?

Subject outcomes

dalle_3 correct
imagen_d correct
imagen_a incorrect
imagen_c incorrect
muse_a incorrect
muse_b incorrect

Item 10 · 53% solve rate · answer: 3

Prompt: A picture of 3 samosas. Question: How many samosas are in the image?

Subject outcomes

dalle_3 correct
imagen_a correct
imagen_b incorrect
imagen_d incorrect
muse_a incorrect
muse_b incorrect

Item 11 · 70% solve rate · answer: 3

Prompt: Three seahorses. Question: How many seahorses are in the image?

Subject outcomes

dalle_3 correct
imagen_a correct
imagen_c correct
imagen_d correct
imagen_b incorrect
muse_a incorrect

Item 12 · 91% solve rate · answer: 3

Prompt: Three green bottles. Question: How many bottles are in the image?

Subject outcomes

dalle_3 correct
imagen_a correct
imagen_b correct
imagen_d correct
muse_a correct
muse_b correct

Subjects

The models, agents, and reward models evaluated.

7 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. dalle_3: 0.4317
2. muse_b: 0.3601
3. imagen_c: 0.3382
4. muse_a: 0.3299
5. imagen_d: 0.2817
6. imagen_a: 0.2683
7. imagen_b: 0.266
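A ranking like the one above can be reproduced from a response matrix by averaging each subject's responses. The sketch below is an illustration under stated assumptions, not the gallery's own code: the helpers `mean_accuracy` and `rank_subjects` and the toy data are hypothetical, unobserved cells are assumed to be stored as `None` and skipped, and every subject is assumed to have at least one observed cell.

```python
def mean_accuracy(scores):
    """Mean of the observed responses; None marks an unobserved cell."""
    observed = [s for s in scores if s is not None]
    # Assumes at least one observed cell per subject.
    return sum(observed) / len(observed)

def rank_subjects(responses):
    """Return (rank, subject, accuracy) tuples, best subject first."""
    means = {name: mean_accuracy(scores) for name, scores in responses.items()}
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return [(i + 1, name, round(acc, 4)) for i, (name, acc) in enumerate(ordered)]

# Toy data, made up for illustration (1 = correct, 0 = incorrect).
toy = {
    "model_x": [1, 0, 1, 1],
    "model_y": [0, 0, 1, None],
    "model_z": [1, 1, 1, 1],
}

ranking = rank_subjects(toy)
for rank, name, acc in ranking:
    print(f"{rank}. {name}: {acc}")
```

For GeckoNum itself every cell is observed, so the `None` handling would never trigger; it is included only because the matrix legend allows for unobserved cells.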