Agents & Tool Use

AndroidWorld

AndroidWorld: 116-task Android agent benchmark; per-task data from 3 public agent pages

116 items
3 subjects
100% observed
Subject type: Agent
License: Apache-2.0
Domain: gui_agent
Modalities: text, gui_screenshot

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 3 subjects × 116 items, 100% of cells evaluated. The heatmap shows a representative 9 of 116 items — evenly sampled across difficulty — so each cell stays square and legible.

AndroidWorld response matrix: AI models (rows) against items (columns)
Legend: Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect
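The ordering and sampling described above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names `order_matrix` and `sample_evenly`, the tie-breaking behaviour, and the toy matrix are all assumptions made for the example.

```python
def order_matrix(matrix):
    """Order subjects by total score and items by solve count, both descending.

    matrix: dict mapping subject name -> list of 0/1 outcomes (equal lengths).
    Returns (ordered subject names, ordered item indices).
    Ties keep their original order because Python's sort is stable.
    """
    subjects = sorted(matrix, key=lambda s: sum(matrix[s]), reverse=True)
    n_items = len(next(iter(matrix.values())))
    solve_counts = [sum(matrix[s][j] for s in matrix) for j in range(n_items)]
    items = sorted(range(n_items), key=lambda j: solve_counts[j], reverse=True)
    return subjects, items


def sample_evenly(ordered_items, k):
    """Pick k items spread evenly along an already-ordered list (assumed scheme)."""
    n = len(ordered_items)
    if k >= n:
        return list(ordered_items)
    if k == 1:
        return [ordered_items[0]]
    positions = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return [ordered_items[p] for p in positions]


# Hypothetical toy matrix (not benchmark data): 3 subjects x 3 items.
toy = {"agent_a": [1, 0, 1], "agent_b": [1, 1, 1], "agent_c": [0, 0, 1]}
subjects, items = order_matrix(toy)
print(subjects, items)

# Choosing 9 display columns out of 116 ordered items.
print(sample_evenly(list(range(116)), 9))
```

Sampling along the difficulty-ordered axis, rather than at random, keeps both the easiest and hardest items visible in the reduced heatmap.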

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate
answer: Transcribe the contents of video {video_name} by watching it in VLC player (located in Download) and writing the sequence of strings shown on each frame to the text file {file_name} in Markor as a com
AndroidWorld task: MarkorTranscribeVideo

Subject outcomes

  • DroidRun incorrect
  • FinalRun incorrect
  • AutoDevice incorrect
Item 2 · 67% solve rate
answer: What are my high priority tasks in Tasks app? Answer with the titles only. If there are multiples titles, format your answer in a comma separated list.
AndroidWorld task: TasksHighPriorityTasks

Subject outcomes

  • DroidRun correct
  • AutoDevice correct
  • FinalRun incorrect
Item 3 · 100% solve rate
answer: When is my next meeting with {person} in Simple Calendar Pro? Express your answer in the format <month name> <day> <year> <hour in 24-hour format>:<minutes>.
AndroidWorld task: SimpleCalendarNextMeetingWithPerson

Subject outcomes

  • DroidRun correct
  • FinalRun correct
  • AutoDevice correct
Item 4 · 100% solve rate
answer: In Simple Calendar Pro, create a recurring calendar event titled '{event_title}' starting on {year}-{month}-{day} at {hour}h. The event recurs {repeat_rule}, forever, and lasts for {duration_mins} min
AndroidWorld task: SimpleCalendarAddRepeatingEvent

Subject outcomes

  • DroidRun correct
  • FinalRun correct
  • AutoDevice correct
Item 5 · 100% solve rate
answer: In Markor, move the note {file_name} from {source_folder} to {destination_folder}.
AndroidWorld task: MarkorMoveNote

Subject outcomes

  • DroidRun correct
  • FinalRun correct
  • AutoDevice correct
Item 6 · 100% solve rate
answer: Delete the following recipes from Broccoli app: {titles}.
AndroidWorld task: RecipeDeleteSingleWithRecipeWithNoise

Subject outcomes

  • DroidRun correct
  • FinalRun correct
  • AutoDevice correct
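
The solve rates shown for the sample items can be reproduced from the listed subject outcomes. The sketch below assumes a solve rate is simply the fraction of subjects marked correct on an item; the dictionary layout and the `solve_rate` helper are illustrative, not part of the benchmark's tooling.

```python
# Outcomes copied from the six sample items above (1 = correct, 0 = incorrect).
outcomes = {
    "MarkorTranscribeVideo": {"DroidRun": 0, "FinalRun": 0, "AutoDevice": 0},
    "TasksHighPriorityTasks": {"DroidRun": 1, "FinalRun": 0, "AutoDevice": 1},
    "SimpleCalendarNextMeetingWithPerson": {"DroidRun": 1, "FinalRun": 1, "AutoDevice": 1},
    "SimpleCalendarAddRepeatingEvent": {"DroidRun": 1, "FinalRun": 1, "AutoDevice": 1},
    "MarkorMoveNote": {"DroidRun": 1, "FinalRun": 1, "AutoDevice": 1},
    "RecipeDeleteSingleWithRecipeWithNoise": {"DroidRun": 1, "FinalRun": 1, "AutoDevice": 1},
}


def solve_rate(per_subject):
    """Fraction of subjects that solved the item (assumed definition)."""
    return sum(per_subject.values()) / len(per_subject)


rates = {task: solve_rate(results) for task, results in outcomes.items()}
for task, rate in rates.items():
    # Format as a whole-number percentage, as on the item headers.
    print(f"{task}: {rate:.0%}")
```

With only three subjects, an item's solve rate can take just four values: 0%, 33%, 67%, or 100%.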

Subjects

The models, agents, and reward models evaluated.

3 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. AutoDevice 0.948
  2. DroidRun 0.914
  3. FinalRun 0.767
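
As a sanity check on the ranking, each mean response can be converted back into an implied number of solved tasks. This sketch assumes every score is a plain fraction of 116 binary outcomes, rounded to three decimals; the derived counts are inferred from the reported means, not published figures.

```python
N_ITEMS = 116

# Mean responses as reported in the ranking above.
reported = {"AutoDevice": 0.948, "DroidRun": 0.914, "FinalRun": 0.767}


def implied_successes(accuracy, n_items):
    """Nearest whole number of solved items consistent with the mean (assumption)."""
    return round(accuracy * n_items)


implied = {name: implied_successes(acc, N_ITEMS) for name, acc in reported.items()}

for name, solved in implied.items():
    recomputed = solved / N_ITEMS
    # The recomputed mean should agree with the reported one to three decimals.
    consistent = abs(recomputed - reported[name]) < 5e-4
    print(f"{name}: about {solved} of {N_ITEMS} tasks (consistent: {consistent})")
```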