Agents & Tool Use

FlySearch

Outdoor UAV object-search scenarios probing how vision-language models explore.

1,384 items

10 subjects

43% observed

Subject type: Agent


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 10 subjects × 1,384 items, 43% of cells evaluated.

Figure: FlySearch response matrix, AI models (rows) against items (columns). Legend: Correct (1), Incorrect (0), Unobserved.
Scale: 1 = target object found · 0 = not found (per-run success).
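The ordering described above can be sketched as follows. This is a minimal illustration, not the gallery's actual build script: the toy data, the names `observed_mean`, `order_matrix`, and `coverage`, and the choice to average only over observed cells are all assumptions.

```python
from statistics import mean

# Hypothetical toy matrix: 1 = target found, 0 = not found, None = unobserved.
matrix = {
    "model_a": [1, None, 0, 1],
    "model_b": [0, 0, None, 1],
    "model_c": [None, 0, 0, 0],
}

def observed_mean(values):
    """Mean over observed (non-None) cells; 0.0 if nothing was observed."""
    seen = [v for v in values if v is not None]
    return mean(seen) if seen else 0.0

def order_matrix(matrix):
    """Sort rows by descending subject mean and columns by descending item mean.

    Assumption: strongest subjects and easiest items are placed first, so
    successes cluster toward the top-left corner.
    """
    n_items = len(next(iter(matrix.values())))
    row_order = sorted(matrix, key=lambda s: observed_mean(matrix[s]), reverse=True)
    col_order = sorted(
        range(n_items),
        key=lambda j: observed_mean([matrix[s][j] for s in matrix]),
        reverse=True,
    )
    ordered = {s: [matrix[s][j] for j in col_order] for s in row_order}
    return row_order, col_order, ordered

def coverage(matrix):
    """Fraction of cells that were actually evaluated."""
    cells = [v for row in matrix.values() for v in row]
    return sum(v is not None for v in cells) / len(cells)

row_order, col_order, ordered = order_matrix(matrix)
print(row_order, col_order, coverage(matrix))
```

Because fewer than half of the cells are observed, each subject's mean may be computed over a different subset of items, so small gaps between subjects should be read with care.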

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

10 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject   Mean response
1     9b861f34  0.398
2     674ebfc3  0.291
3     a2a69af7  0.274
4     23824e3a  0.248
5     7495adbd  0.14
6     7e2fa19c  0.092
7     4992895d  0.025
8     1bb98d8f  0.005
9     e7d9db9a  0.005
10    2a312e92  0
