
Agents & Tool Use

FlySearch

Outdoor UAV object-search scenarios probing how vision-language models explore.

1,384 items
10 subjects
43% observed
Subject type: Agent

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 10 subjects × 1,384 items, 43% of cells evaluated.
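The page does not say how the rows and columns are ordered. One simple way to get the "strongest models and easiest items in one corner" layout is to sort both axes by their mean score over observed cells, as in the sketch below. The helper names (`observed_mean`, `order_matrix`), the toy data, and the use of `None` for unobserved cells are assumptions made for illustration, not the site's actual implementation.

```python
from typing import Optional

Cell = Optional[int]  # 1 = target found, 0 = not found, None = unobserved


def observed_mean(cells: list[Cell]) -> float:
    """Mean over observed cells only; -1.0 if nothing was observed."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else -1.0


def order_matrix(matrix: list[list[Cell]]) -> tuple[list[int], list[int]]:
    """Return (row_order, col_order): strongest rows and easiest columns first."""
    n_cols = len(matrix[0])
    row_means = [observed_mean(row) for row in matrix]
    col_means = [observed_mean([row[j] for row in matrix]) for j in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(len(matrix)), key=lambda i: -row_means[i])
    col_order = sorted(range(n_cols), key=lambda j: -col_means[j])
    return row_order, col_order


# Toy 3 x 4 matrix (hypothetical data, not taken from FlySearch).
toy = [
    [0, None, 1, 0],
    [1, 1, 1, None],
    [None, 0, 1, 0],
]
rows, cols = order_matrix(toy)
ordered = [[toy[i][j] for j in cols] for i in rows]
```

After reordering, successes cluster toward the top-left of `ordered` and failures toward the bottom-right. Unobserved cells are skipped when computing the means rather than counted as failures.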


Figure: FlySearch response matrix, AI models (rows) against items (columns).
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = target object found · 0 = not found (per-run success).

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

10 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. 9b861f34  0.398
  2. 674ebfc3  0.291
  3. a2a69af7  0.274
  4. 23824e3a  0.248
  5. 7495adbd  0.14
  6. 7e2fa19c  0.092
  7. 4992895d  0.025
  8. 1bb98d8f  0.005
  9. e7d9db9a  0.005
  10. 2a312e92  0
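A ranking by mean response can be computed from per-run records roughly as follows. This is a minimal sketch: the record layout, the subject and item names, and the helpers `rank_subjects` and `coverage` are hypothetical, and each subject's mean is assumed to be taken over the items it was actually evaluated on.

```python
from collections import defaultdict

# Hypothetical long-format records: (subject_id, item_id, score), one per evaluated run.
records = [
    ("model_a", "item_1", 1),
    ("model_a", "item_2", 0),
    ("model_a", "item_3", 1),
    ("model_b", "item_1", 0),
    ("model_b", "item_3", 1),
    ("model_c", "item_2", 0),
]


def rank_subjects(rows):
    """Rank subjects by mean score over their observed runs, best first."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [successes, observed runs]
    for subject, _item, score in rows:
        totals[subject][0] += score
        totals[subject][1] += 1
    means = {s: hits / n for s, (hits, n) in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


def coverage(rows, n_subjects, n_items):
    """Fraction of subject x item cells that have at least one record."""
    observed = {(s, i) for s, i, _ in rows}
    return len(observed) / (n_subjects * n_items)


ranking = rank_subjects(records)
cov = coverage(records, n_subjects=3, n_items=3)
```

Applied to this benchmark's shape, the full matrix has 10 × 1,384 = 13,840 cells, so the reported 43% coverage corresponds to roughly 5,950 evaluated cells (the percentage is rounded, so the exact count is not recoverable from the page).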