methodology

Benchmark Method

The fundamental shift replaces traditional sequential planning with dynamic feedback loops. The Manager creates high-level tasks, the Executor takes one concrete action toward the first task, then the Manager immediately reassesses.

← View Results View Leaderboard

core architecture
manager-executor loop

feedback loop

plan

Manager creates task

execute

Single action

observe

Read environment

reassess

Manager replans

Plan all stepsExecute in orderNo adaptation

new

Dynamic Feedback Loop

01

Manager creates a high-level task

02

Executor takes one concrete action

03

System observes environment changes

04

Manager reassesses and replans

six technical improvements

Key innovations that enable state-of-the-art performance on AndroidWorld.

01

Specialized Text Agent

Routes text-intensive tasks to a dedicated agent with Python shell access. Receives accessibility trees plus current text context and can atomically clear and replace content.

• Routes when Manager tags items as text tasks
• Receives accessibility trees and focused element context
• Atomic clear/replace for reliable text editing

02

Contextual Awareness

Enhanced through device date injection, 0.5-second screen stabilization waits, disabled pointer visualization, differential state tracking, and automatic app capability extraction.

• Device date injected into context
• 0.5s screen stabilization before reads
• Differential state: current vs. previous accessibility tree
• Automatic app capability extraction

03

Transparent Communication

Executors output three components — thought process, chosen action, and description — all injected into Manager context for full decision rationale.

04

Memory System

Guidance scattered throughout system prompts with repeated context in multiple sections for consistent availability. Injections into both system prompt and final user message.

05

Expanded Actions

Eight action primitives covering the full interaction surface of mobile devices, from simple taps to complex clipboard operations.

06

Prompt Engineering

Iterative refinement across system prompts through strategic distribution rather than concentration of instructions. Model-specific optimization patterns.

action primitives

Eight operations covering the full interaction surface of mobile devices.

01 click (by index)

02 long_press

03 type (with focus parameter)

04 system_button

05 swipe (coordinate-based)

06 open_app (by name)

07 copy (clipboard)

08 paste (with clear options)

core insight

Tight feedback loops, task-specific routing, rich state observability, and dynamic replanning outperform rigid plan-then-execute models for mobile UI automation.

feedback loopstask routingstate observabilitydynamic replanning

see the results.

View the full benchmark results with 91.4% success rate across 116 AndroidWorld tasks.

← View Results Explore Framework