AndroidWorld task: MarkorTranscribeVideo
Subject outcomes
- DroidRun incorrect
- FinalRun incorrect
- AutoDevice incorrect
Agents & Tool Use
AndroidWorld: 116-task Android agent benchmark; per-task data from 3 public agent pages
Response matrix
Each row is a subject (here, an AI agent) and each column an item (a task), ordered so that the strongest subjects and easiest items gather toward one corner. The matrix covers 3 subjects × 116 items, with 100% of cells evaluated. The heatmap shows a representative 9 of the 116 items, evenly sampled across difficulty, so that each cell stays square and legible.

Scale: 1 = correct · 0 = incorrect
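
The ordering and the even sampling described above can be sketched as follows. This is a minimal illustration, assuming rows are sorted by mean accuracy and columns by pass rate; the page does not state its exact ordering or sampling rule, and the function names and the toy matrix are hypothetical.

```python
def order_matrix(matrix):
    """Return (row_order, col_order) for a 0/1 response matrix.

    Assumption: rows are ordered by descending mean (strongest subject first)
    and columns by descending mean (easiest item first), so the 1s gather
    toward the top-left corner.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    # sorted() is stable, so ties keep their original relative order.
    row_order = sorted(range(n_rows), key=lambda i: -row_mean[i])
    col_order = sorted(range(n_cols), key=lambda j: -col_mean[j])
    return row_order, col_order


def sample_evenly(n_items, k):
    """Pick k evenly spaced indices from 0..n_items-1, endpoints included."""
    if k < 2 or k > n_items:
        raise ValueError("need 2 <= k <= n_items")
    return [(i * (n_items - 1)) // (k - 1) for i in range(k)]


# Toy 3 x 3 matrix (made-up values, not benchmark data).
toy = [
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
]
rows, cols = order_matrix(toy)
ordered = [[toy[i][j] for j in cols] for i in rows]

# Positions of 9 representative columns in a difficulty-sorted list of 116.
shown = sample_evenly(116, 9)
```

Applying `sample_evenly` to the already sorted column order keeps both the easiest and the hardest item in the displayed subset.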
Sample items
A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
- AndroidWorld task: MarkorTranscribeVideo
- AndroidWorld task: TasksHighPriorityTasks
- AndroidWorld task: SimpleCalendarNextMeetingWithPerson
- AndroidWorld task: SimpleCalendarAddRepeatingEvent
- AndroidWorld task: MarkorMoveNote
- AndroidWorld task: RecipeDeleteSingleWithRecipeWithNoise
Subjects
3 subjects, ranked by mean response (accuracy) across this benchmark's items.
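
As a rough sketch of that ranking, the per-task "correct" / "incorrect" labels can be averaged per subject and sorted. The subject names, task names, and labels below are hypothetical placeholders, not results from this benchmark.

```python
def rank_subjects(outcomes):
    """Rank subjects by accuracy, best first.

    outcomes maps subject -> {task: "correct" | "incorrect"}.
    Accuracy is the fraction of a subject's tasks labelled "correct".
    """
    scores = {
        subject: sum(1 for label in tasks.values() if label == "correct") / len(tasks)
        for subject, tasks in outcomes.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example data for illustration only.
example = {
    "agent_a": {"task_1": "correct", "task_2": "incorrect"},
    "agent_b": {"task_1": "correct", "task_2": "correct"},
    "agent_c": {"task_1": "incorrect", "task_2": "incorrect"},
}
ranking = rank_subjects(example)
```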