当前位置：首页 > news >正文

【GUI-Agent】阿里通义MAI-UI 代码阅读（2）--- 实现

news 2026/5/8 20:34:55

【GUI-Agent】阿里通义MAI-UI 代码阅读（2）--- 实现

【GUI-Agent】阿里通义MAI-UI 代码阅读（2）--- 实现
- 0x00 摘要
- 0x01 工程实现特色
  - 1.1 特色1
  - 1.2 特色2
  - 1.3 特色 3
  - 1.4 小结
- 0x02 提示词
  - 2.1 提示词代码
    - MAI_MOBILE_SYS_PROMPT
    - MAI_MOBILE_SYS_PROMPT_NO_THINKING
    - MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP
    - MAI_MOBILE_SYS_PROMPT_GROUNDING
  - 2.2 移动系统提示词差异一览
  - 2.3 工具集成差异
- 0x03 输出
  - 3.1 输出格式区别
    - 非 MCP 版本（MAI_MOBILE_SYS_PROMPT）
    - MCP 版本（MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP）
  - 3.2 功能范围区别
    - 非 MCP 版本
    - MCP 版本
  - 3.3 实际应用场景
    - 标准 GUI 操作
    - MCP 工具调用
    - 代码实现中的处理
- 0x04 MAIUINaivigationAgent
  - 4.1 核心特色
  - 4.2 定义
  - 4.3 构建图像
  - 4.4 构建文字
  - 4.5 流程
  - 4.6 推理
    - 核心作用
    - 核心特色
    - 流程
      - predict 的流程如下
      - 时序图
    - 代码
  - 4.7 轨迹
    - TrajMemory / TrajStep 数据结构图
    - 派生视图
    - 序列化路径
    - 代码
- 0x05 MAIGroundingAgent
  - 5.1 核心特色
  - 5.2 定义
  - 5.3 数据流
  - 5.4 推理
  - 5.5 解析
- 0xEE 广告
- 购买链接
- 0xFF 参考

0x00 摘要

MAI-UI 是阿里通义实验室发布的一项重磅研究成果：是一个旨在 重塑人机交互方式 的“基础图形用户界面（GUI）智能体”，和阶跃星辰的思路非常类似，因此我们可以互相印证。

MAI-UI的信息如下：

https://arxiv.org/pdf/2512.22047

https://github.com/Tongyi-MAI/MAI-UI

MAI-UI 的两类核心Agent如下，本篇会介绍这两类Agent：

Agent	文件	任务	输出协议
MAIGroundingAgent	src/mai_grounding_agent.py	UI 元素定位（单步）	<grounding_think>.</grounding_think>{"coordinate":[x,y]}，坐标基于 SCALE_FACT0R=999 归一化
MAIUINavigationAgent	src/mai_navigation_agent.py	多步移动端GUI导航，支持ask_user与mcp_call	.<tool_call>{json}</tool_call>，多轮带历史截图

0x01 工程实现特色

MAI-UI 工程实现的三个特色如下。

1.1 特色1

特色1：三套系统提示词对应三种Agent形态：grounding / 纯导航 / ask_user + MCP 增强导航

src/prompt.py同时维护：

MAI_MOBILE_SYS_PROMPT_GROUNDING 一单步元素定位
MAI_MOBILE_SYS_PROMPT 一标准多步导航
MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 一在导航动作集里叠加两个特殊工具：
- ask_user（question）：模型主动反问用户、把任务“打回去"
- mcp_call（tool，args）：调外部MCP工具（如高德导航）补全设备端做不到的能力

意义：

这是"Agent-User Interaction +MCP Augmentation" 范式在代码层面的真实落点一不是新API、就是同一个模型在不同system prompt下解锁不同动作集。
新增交互类工具的正确姿势就是改prompt.py+parse_tagged_text的 schema，而不是另起一个Agent类。

1.2 特色2

特色2是：归一化坐标空间SCALE_FACTOR = 999 + XML标签输出协议（而非function-calling）。

src/mai_grounding_agent.py 与 src/mai_naivigation_agent.py 都硬编码为 SCALE_FACTOR=
999；模型永远输出[0，999]区间整数，由客户端按当前截图（W，H）反归一化。
输出不是OpenAI function-calling，而是裸文本里的 XML 标签：
- Grounding:<grounding_think>...</grounding_think>{"coordinate":[x,y]}
- Navigation:.<tool_call>{json}</tool_call>（兼容 thinking 模型的）
- 解析器：parse_grounding_response、parse_tagged_text，错误统一抛 ValueError。

意义：

跨分辨率泛化：同一个模型同一个权重无缝服务任意手机分辨率，不需要在 prompt里写屏幕尺寸；
协议无关于推理后端一VLLM0.11.0、HFtransformers 本地推理、DashScope都能用，因为只解析纯文本，不依赖任何后端的tool-call结构；
代价：解析鲁棒性必须由客户端自己保证（所以两个parser都做了容错+显式异常）

1.3 特色 3

特色 3：无状态服务端 +客户端自管TrajMemory，每步把历史截图重塞回 messages：

BaseAgent 持有 traj_memory：TrajMemory，每个 TrajStep 同时存 screenshot: Image 和 screenshot_bytes：bytes（渲染vs序列化双用）
MAIUINaivigationAgent._build_messages() 按 runtime_conf["history_n"] 把最近 N 步的“截图+模型回复“重组成多轮user/assistant对话再发给vLLM一一一vLLM 端零会话状态。
save_traj()/load_traj()走bytes，可被序列化/回放/做评测离线分析。
stept的请求体（每步独立、无状态）如下：

意义：

可回放、可评测、可断点续跑—save_traj出dict、load_traj直接灌回，离线replay，不需要真机/模拟器；
横向扩展友好一VLLM可以集群水平扩，因为没有会话粘性，这正契合 scaling parallel environments up to 512"的训l练形态在推理侧的对应做法；
代价：每步N张图都要重传，带宽与 prefill 成本随 history_n线性增长，调小 history_n是常见的省 token 技巧。

1.4 小结

MAI-UI的工程独到之处不是模型本身，而是这套客户端契约：分辨率无关的999坐标空间 + XML标签协议（与后端解耦）+ 无状态多轮重放（与历史长度解耦）+ 三档 prompt解锁的grounding/导航/ask_user+MCP
三种形态一一一后续任何二次开发都沿着这四条线走，而不是去改模型契约。

0x02 提示词

2.1 提示词代码

以下是提示词代码。

MAI_MOBILE_SYS_PROMPT

MAI_MOBILE_SYS_PROMPT = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.## Output Format
For each function call, return the thinking process in <thinking> </thinking> tags, and a json object with function name and arguments within <tool_call></tool_call> XML tags:
```
<thinking>
...
</thinking>
<tool_call>
{"name": "mobile_use", "arguments": <args-json-object>}
</tool_call>
```## Action Space{"action": "click", "coordinate": [x, y]}
{"action": "long_press", "coordinate": [x, y]}
{"action": "type", "text": ""}
{"action": "swipe", "direction": "up or down or left or right", "coordinate": [x, y]} # "coordinate" is optional. Use the "coordinate" if you want to swipe a specific UI element.
{"action": "open", "text": "app_name"}
{"action": "drag", "start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
{"action": "system_button", "button": "button_name"} # Options: back, home, menu, enter
{"action": "wait"}
{"action": "terminate", "status": "success or fail"}
{"action": "answer", "text": "xxx"} # Use escape characters \\', \\", and \\n in text part to ensure we can parse the text in normal python string format.## Note
- Write a small plan and finally summarize your next action (with its target element) in one sentence in <thinking></thinking> part.
- Available Apps: `["Camera","Chrome","Clock","Contacts","Dialer","Files","Settings","Markor","Tasks","Simple Draw Pro","Simple Gallery Pro","Simple SMS Messenger","Audio Recorder","Pro Expense","Broccoli APP","OSMand","VLC","Joplin","Retro Music","OpenTracks","Simple Calendar Pro"]`.
You should use the `open` action to open the app as possible as you can, because it is the fast way to open the app.
- You must follow the Action Space strictly, and return the correct json object within <thinking> </thinking> and <tool_call></tool_call> XML tags.
""".strip()

MAI_MOBILE_SYS_PROMPT_NO_THINKING

MAI_MOBILE_SYS_PROMPT_NO_THINKING = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.## Output Format
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
```
<tool_call>
{"name": "mobile_use", "arguments": <args-json-object>}
</tool_call>
```## Action Space{"action": "click", "coordinate": [x, y]}
{"action": "long_press", "coordinate": [x, y]}
{"action": "type", "text": ""}
{"action": "swipe", "direction": "up or down or left or right", "coordinate": [x, y]} # "coordinate" is optional. Use the "coordinate" if you want to swipe a specific UI element.
{"action": "open", "text": "app_name"}
{"action": "drag", "start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
{"action": "system_button", "button": "button_name"} # Options: back, home, menu, enter
{"action": "wait"}
{"action": "terminate", "status": "success or fail"}
{"action": "answer", "text": "xxx"} # Use escape characters \\', \\", and \\n in text part to ensure we can parse the text in normal python string format.## Note
- Available Apps: `["Camera","Chrome","Clock","Contacts","Dialer","Files","Settings","Markor","Tasks","Simple Draw Pro","Simple Gallery Pro","Simple SMS Messenger","Audio Recorder","Pro Expense","Broccoli APP","OSMand","VLC","Joplin","Retro Music","OpenTracks","Simple Calendar Pro"]`.
You should use the `open` action to open the app as possible as you can, because it is the fast way to open the app.
- You must follow the Action Space strictly, and return the correct json object within <thinking> </thinking> and <tool_call></tool_call> XML tags.
""".strip()

MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP

# Placeholder prompts for future features
MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP = Template("""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format
For each function call, return the thinking process in <thinking> </thinking> tags, and a json object with function name and arguments within <tool_call></tool_call> XML tags:
```
<thinking>
...
</thinking>
<tool_call>
{"name": "mobile_use", "arguments": <args-json-object>}
</tool_call>
```## Action Space{"action": "click", "coordinate": [x, y]}
{"action": "long_press", "coordinate": [x, y]}
{"action": "type", "text": ""}
{"action": "swipe", "direction": "up or down or left or right", "coordinate": [x, y]} # "coordinate" is optional. Use the "coordinate" if you want to swipe a specific UI element.
{"action": "open", "text": "app_name"}
{"action": "drag", "start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
{"action": "system_button", "button": "button_name"} # Options: back, home, menu, enter 
{"action": "wait"}
{"action": "terminate", "status": "success or fail"} 
{"action": "answer", "text": "xxx"} # Use escape characters \\', \\", and \\n in text part to ensure we can parse the text in normal python string format.
{"action": "ask_user", "text": "xxx"} # you can ask user for more information to complete the task.
{"action": "double_click", "coordinate": [x, y]}{% if tools -%}
## MCP Tools
You are also provided with MCP tools, you can use them to complete the task.
{{ tools }}If you want to use MCP tools, you must output as the following format:
```
<thinking>
...
</thinking>
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```
{% endif -%}## Note
- Available Apps: `["Contacts", "Settings", "Clock", "Maps", "Chrome", "Calendar", "files", "Gallery", "Taodian", "Mattermost", "Mastodon", "Mail", "SMS", "Camera"]`.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in <thinking></thinking> part.
""".strip()
)

MAI_MOBILE_SYS_PROMPT_GROUNDING

MAI_MOBILE_SYS_PROMPT_GROUNDING = """
You are a GUI grounding agent. 
## Task
Given a screenshot and the user's grounding instruction. Your task is to accurately locate a UI element based on the user's instructions.
First, you should carefully examine the screenshot and analyze the user's instructions,  translate the user's instruction into a effective reasoning process, and then provide the final coordinate.
## Output Format
Return a json object with a reasoning process in <grounding_think></grounding_think> tags, a [x,y] format coordinate within <answer></answer> XML tags:
<grounding_think>...</grounding_think>
<answer>
{"coordinate": [x,y]}
</answer>
""".strip()

2.2 移动系统提示词差异一览

只有 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板支持 MCP 工具集成，且通过 Jinja2 条件语法实现动态插入；其余提示词版本均不包含 MCP 功能。

提示词 ID	核心用途	思考标签	操作空间	特殊功能
MAI_MOBILE_SYS_PROMPT	标准 GUI 代理	`` 必须	点击/长按/输入/滑动等全功能	无
MAI_MOBILE_SYS_PROMPT_NO_THINKING	快速响应	无思考标签	同上	省略思考，直接返回 JSON
MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP	模板化+用户询问	可选	同上	ask_user、double_click、Jinja2 模板、MCP 工具集成
MAI_MOBILE_SYS_PROMPT_GROUNDING	纯定位专用	``	仅元素识别	输出 [x,y] 坐标，无操作命令

2.3 工具集成差异

MCP 功能只在 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板层集成，其余版本需外部桥接。

集成位置
- 仅 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 内置 MCP 工具调用入口（通过 Jinja2 模板动态注入）。
- 其余版本无 MCP 工具入口，需外部调用。
提示词层差异
- 标准版：无 MCP 占位符，纯 JSON 输出。
- MCP 版：模板内预留 {{mcp_tools}} 变量，运行时注入具体工具描述。
运行时差异
- 标准版：LLM 输出传统动作 JSON，由外部框架手动转发至 MCP。
- MCP 版：渲染后提示词包含完整 MCP 工具 JSON，LLM 可直接调用。
条件性集成（仅 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP）
- 使用 Jinja2 模板语法 {%if tools -%}...{%endif -%} 实现动态集成
- 独立 ## MCP Tools 区域存放 MCP 工具描述
- 通过 {{tools}} 变量动态插入可用工具信息
- 输出格式与标准移动操作不同：`` 内直接嵌入 MCP 函数调用

0x03 输出

3.1 输出格式区别

非 MCP 版本（MAI_MOBILE_SYS_PROMPT）

统一格式：所有操作通过 mobile_use 函数调用
固定结构：GUI 操作封装在 arguments 字段

示例：

<thinking>...</thinking>
<tool_call>
{"name":"mobile_use","arguments":<args-json-object>}
</tool_call>

MCP 版本（MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP）

双重格式：支持标准 GUI 操作和 MCP 工具调用
工具特定格式：MCP 工具调用使用实际函数名作为 name

示例：

<thinking>...</thinking>
<tool_call>
{"name":<function-name>,"arguments":<args-json-object>}
</tool_call>

下面代码把LLM的输出转换为结构化输出

def parse_action_to_structure_output(text: str) -> Dict[str, Any]:"""Parse model output text into structured action format.Args:text: Raw model output containing thinking and tool_call tags.Returns:Dictionary with keys:- "thinking": The model's reasoning process- "action_json": Parsed action with normalized coordinatesNote:Coordinates are normalized to [0, 1] range by dividing by SCALE_FACTOR."""text = text.strip()results = parse_tagged_text(text)thinking = results["thinking"]tool_call = results["tool_call"]action = tool_call["arguments"]# Normalize coordinates from SCALE_FACTOR range to [0, 1]if "coordinate" in action:coordinates = action["coordinate"]if len(coordinates) == 2:point_x, point_y = coordinateselif len(coordinates) == 4:x1, y1, x2, y2 = coordinatespoint_x = (x1 + x2) / 2point_y = (y1 + y2) / 2else:raise ValueError(f"Invalid coordinate format: expected 2 or 4 values, got {len(coordinates)}")point_x = point_x / SCALE_FACTORpoint_y = point_y / SCALE_FACTORaction["coordinate"] = [point_x, point_y]return {"thinking": thinking,"action_json": action,}

3.2 功能范围区别

非 MCP 版本

有限操作集：仅预定义 GUI 操作（点击、滑动、输入等）
移动设备专属：专注触摸屏界面交互
固定动作空间：无法扩展新操作类型

MCP 版本

扩展操作集：除 GUI 操作外，支持 MCP 工具
系统级功能：可通过 MCP 工具执行复杂系统操作
动态功能：依据配置工具动态扩展功能范围

3.3 实际应用场景

标准 GUI 操作

MCP 版本中标准 GUI 操作仍使用 mobile_use 函数
与非 MCP 版本行为基本相同

MCP 工具调用

需执行 MCP 工具时，使用工具名称作为函数名
可执行复杂任务（系统配置、数据处理等）

代码实现中的处理

在 MAIUIMobileAgent 类中：

若 self.tools 非空，使用 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板；
通过 render(tools=tools_str) 将工具列表注入提示词；
未配置工具时，回退到标准 MAI_MOBILE_SYS_PROMPT。

代码如下：

    @propertydef system_prompt(self) -> str:"""Generate the system prompt based on available MCP tools.Returns:System prompt string, with MCP tools section if tools are configured."""if self.tools:tools_str = "\n".join([json.dumps(tool, ensure_ascii=False) for tool in self.tools])return MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP.render(tools=tools_str)return MAI_MOBILE_SYS_PROMPT

MCP 版本提供更灵活的操作能力，允许智能体在标准 GUI 操作与 MCP 工具间切换，从而执行更复杂任务；非 MCP 版本则专注纯粹移动界面操作。

0x04 MAIUINaivigationAgent

MAIUINaivigationAgent（移动端 GUI 导航智能体） 是整个 MAI-GUI 智能体的 “底座模块”—— 它封装了 LLM 初始化、历史界面上下文管理、多模态消息构建等核心能力，专门为移动端 GUI 自动化场景设计，能基于任务指令和多步历史界面截图，构建标准化的多模态消息发送给 LLM，为后续动作生成提供统一的输入基础。

4.1 核心特色

MAIUINaivigationAgent 的核心逻辑如下：初始化（配置 + LLM 客户端）→ 图片预处理（历史 + 当前截图统一格式）→ 消息构建（按固定结构拼接多模态内容），全流程为 LLM 提供标准化、结构化的输入。

特色维度	具体说明
历史上下文智能管理	支持配置`history_n`参数（默认 3），自动截取最近 N 步的界面截图作为历史上下文，既保留关键操作轨迹，又避免上下文过长导致 LLM 推理效率下降；仅加载`history_n-1`条历史截图 + 当前截图，精准控制上下文长度
多格式图片兼容处理	`_prepare_images`方法支持字节流、PIL Image 等多种图片输入格式，自动转换为 RGB 格式的 PIL Image，解决不同来源截图的格式兼容问题，适配移动端截图的多样化场景
MCP 工具集成能力	初始化时支持传入 MCP 工具列表，为后续 LLM 调用 MCP 工具（如执行设备操作）预留扩展接口，兼容 MCP 协议生态
标准化多模态消息构建	`_build_messages`方法按 “系统提示词→用户指令→历史截图 + 历史响应→当前截图” 的固定逻辑构建消息，严格对齐 LLM 多模态输入格式，确保不同历史长度下消息结构统一
高度可配置化	支持自定义温度、top_k、top_p 等 LLM 推理参数，以及历史上下文长度（history_n），可根据不同移动端任务（如简单点击 / 复杂表单填写）调整配置

4.2 定义

MAIUINaivigationAgent 的定义如下。

class MAIUINaivigationAgent(BaseAgent):"""Mobile automation agent using vision-language models.This agent processes screenshots and natural language instructions togenerate GUI actions for mobile device automation.Attributes:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use for predictions.runtime_conf: Configuration dictionary for runtime parameters.history_n: Number of history steps to include in context."""def __init__(self,llm_base_url: str,model_name: str,runtime_conf: Optional[Dict[str, Any]] = None,tools: Optional[List[Dict[str, Any]]] = None,):"""Initialize the MAIMobileAgent.Args:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use.runtime_conf: Optional configuration dictionary with keys:- history_n: Number of history images to include (default: 3)- max_pixels: Maximum pixels for image processing- min_pixels: Minimum pixels for image processing- temperature: Sampling temperature (default: 0.0)- top_k: Top-k sampling parameter (default: -1)- top_p: Top-p sampling parameter (default: 1.0)- max_tokens: Maximum tokens in response (default: 2048)tools: Optional list of MCP tool definitions. Each tool should be a dictwith 'name', 'description', and 'parameters' keys."""super().__init__()# Store MCP toolsself.tools = tools or []# Set default configurationdefault_conf = {"history_n": 3,"temperature": 0.0,"top_k": -1,"top_p": 1.0,"max_tokens": 2048,}self.runtime_conf = {**default_conf, **(runtime_conf or {})}self.llm_base_url = llm_base_urlself.model_name = model_nameself.llm = OpenAI(base_url=self.llm_base_url,api_key="empty",)# Extract frequently used config valuesself.temperature = self.runtime_conf["temperature"]self.top_k = self.runtime_conf["top_k"]self.top_p = self.runtime_conf["top_p"]self.max_tokens = self.runtime_conf["max_tokens"]self.history_n = self.runtime_conf["history_n"]

4.3 构建图像

_prepare_images 函数被用来构建图像。

    def _prepare_images(self, screenshot_bytes: bytes) -> List[Image.Image]:"""Prepare image list including history and current screenshot.Args:screenshot_bytes: Current screenshot as bytes.Returns:List of PIL Images (history + current)."""# Calculate how many history images to includeif len(self.history_images) > 0:max_history = min(len(self.history_images), self.history_n - 1)recent_history = self.history_images[-max_history:] if max_history > 0 else []else:recent_history = []# Add current image bytesrecent_history.append(screenshot_bytes)# Normalize input typeif isinstance(recent_history, bytes):recent_history = [recent_history]elif isinstance(recent_history, np.ndarray):recent_history = list(recent_history)elif not isinstance(recent_history, list):raise TypeError(f"Unidentified images type: {type(recent_history)}")# Convert all images to PIL formatimages = []for image in recent_history:if isinstance(image, bytes):image = Image.open(BytesIO(image))elif isinstance(image, Image.Image):passelse:raise TypeError(f"Expected bytes or PIL Image, got {type(image)}")if image.mode != "RGB":image = image.convert("RGB")images.append(image)return images

4.4 构建文字

    def _build_messages(self,instruction: str,images: List[Image.Image],) -> List[Dict[str, Any]]:"""Build the message list for the LLM API call.Args:instruction: Task instruction from user.images: List of prepared images.Returns:List of message dictionaries for the API."""messages = [{"role": "system","content": [{"type": "text", "text": self.system_prompt}],},{"role": "user","content": [{"type": "text", "text": instruction}],},]image_num = 0history_responses = self.history_responsesif len(history_responses) > 0:for history_idx, history_response in enumerate(history_responses):# Only include images for recent history (last history_n responses)if history_idx + self.history_n >= len(history_responses):# Add image before the assistant responseif image_num < len(images) - 1:cur_image = images[image_num]encoded_string = pil_to_base64(cur_image)messages.append({"role": "user","content": [{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},}],})image_num += 1messages.append({"role": "assistant","content": [{"type": "text", "text": history_response}],})# Add current image (last one in images list)if image_num < len(images):cur_image = images[image_num]encoded_string = pil_to_base64(cur_image)messages.append({"role": "user","content": [{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},}],})else:# No history, just add the current imagecur_image = images[0]encoded_string = pil_to_base64(cur_image)messages.append({"role": "user","content": [{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},}],})return messages

4.5 流程

MAIUINaivigationAgent 多步循环流程图如下：

特殊动作：

ask_user(question) →暂停，把问题返还给用户（设备-云协同里的用户交互）
mcp_call(tool，args) →调用外部MCP工具（如高德地图导航）
finish() 任务结束

也参见如下：

4.6 推理

核心作用

predict 是 MAI-GUI 智能体的核心决策与动作生成模块，是 GUI Agent 的 “决策大脑”，核心解决 “根据任务指令和当前界面状态，生成下一步具体 GUI 动作” 的问题，区别于单纯的元素定位模块。

predict 的核心功能是接收任务指令（如 “完成 APP 登录”）和当前界面观测信息（截图 + 可选的无障碍树），通过调用大语言模型生成并解析出下一步要执行的结构化 GUI 动作（如点击、滑动、输入等），同时记录完整的任务轨迹（Trajectory），是 GUI Agent 实现 “根据界面状态决策操作” 的核心环节。

predict 的流程闭环是：输入处理→消息构建→LLM 调用→响应解析→轨迹记录→结果输出，全流程覆盖异常处理，确保动作生成的稳定性。

核心特色

特色维度	具体说明
任务轨迹全链路记录	内置 `traj_memory` 轨迹记忆模块，每一步操作都会存储截图、模型响应、解析后的动作、推理过程等全量信息，支持任务溯源、调试和复盘
多维度界面观测输入	同时接收截图（视觉信息）和无障碍树（结构化 UI 信息），相比纯视觉输入更精准理解界面结构，适配复杂 GUI 场景
鲁棒的 LLM 调用与解析	① 内置 3 次 API 重试机制，捕获并打印异常栈信息，提升调用稳定性；② 标准化解析模型响应为 `thinking`（推理过程）+ `action_json`（结构化动作），确保输出格式统一
任务目标持久化	首次调用时将任务指令存入轨迹记忆作为持久化目标，避免后续步骤丢失核心任务方向
日志可视化友好	对包含图片的消息做脱敏打印（`mask_image_urls_for_logging`），既保留日志完整性又避免 Base64 编码刷屏，便于调试

流程

predict 的流程如下

时序图

时序图：用户 ⇔ Agent ⇔ vLLM (Navigation 场景)如下：

要点：

每步都把历史 history_n张截图重新塞进 messages（无服务端会话状态，vLLM是无状态的 chat completions）；
ask_user/mcp_call 是模型直接吐出的tool_call，调度由外层环境完成，agent 本身不做副作用；
日志路径上的 base64 图片一定经过 mask_image_urls_for_logging 替换为 [IMAGE_DATA]。

代码

    def predict(self,instruction: str,obs: Dict[str, Any],**kwargs: Any,) -> Tuple[str, Dict[str, Any]]:"""Predict the next action based on the current observation.Args:instruction: Task instruction/goal.obs: Current observation containing:- screenshot: PIL Image or bytes of current screen- accessibility_tree: Optional accessibility tree data**kwargs: Additional arguments including:- extra_info: Optional extra context stringReturns:Tuple of (prediction_text, action_dict) where:- prediction_text: Raw model response or error message- action_dict: Parsed action dictionary"""# Set task goal if not already setif not self.traj_memory.task_goal:self.traj_memory.task_goal = instruction# Process screenshotscreenshot_pil = obs["screenshot"]screenshot_bytes = safe_pil_to_bytes(screenshot_pil)# Prepare imagesimages = self._prepare_images(screenshot_bytes)# Build messagesmessages = self._build_messages(instruction, images)# Make API call with retry logicmax_retries = 3prediction = Noneaction_json = Nonefor attempt in range(max_retries):try:messages_print = mask_image_urls_for_logging(messages)print(f"Messages (attempt {attempt + 1}):\n{messages_print}")response = self.llm.chat.completions.create(model=self.model_name,messages=messages,max_tokens=self.max_tokens,temperature=self.temperature,top_p=self.top_p,frequency_penalty=0.0,presence_penalty=0.0,extra_body={"repetition_penalty": 1.0, "top_k": self.top_k},seed=42,)prediction = response.choices[0].message.content.strip()print(f"Raw response:\n{prediction}")# Parse responseparsed_response = parse_action_to_structure_output(prediction)thinking = parsed_response["thinking"]action_json = parsed_response["action_json"]print(f"Parsed response:\n{parsed_response}")breakexcept Exception as e:print(f"Error on attempt {attempt + 1}: {e}")traceback.print_exc()prediction = Noneaction_json = None# Return error if all retries failedif prediction is None or action_json is None:print("Max retry attempts reached, returning error flag.")return "llm client error", {"action": None}# Create and store trajectory steptraj_step = TrajStep(screenshot=screenshot_pil,accessibility_tree=obs.get("accessibility_tree"),prediction=prediction,action=action_json,conclusion="",thought=thinking,step_index=len(self.traj_memory.steps),agent_type="MAIMobileAgent",model_name=self.model_name,screenshot_bytes=screenshot_bytes,structured_action={"action_json": action_json},)self.traj_memory.steps.append(traj_step)return prediction, action_json

4.7 轨迹

TrajMemory / TrajStep 数据结构图

派生视图

派生视图（BaseAgent上的@property，避免外部直接遍历steps）如下：

序列化路径

  BaseAgent.save_traj() → {"task_goal", "task_id","steps": [{ screenshot_bytes, accessibility_tree, prediction,action, conclusion, thought,step_index, agent_type, model_name }, ...]}△ 注意：save 时只输出 screenshot_bytes，丢弃 PIL.Image 对象△ structured_action 字段不在 save_traj 输出里（只在内存中使用）BaseAgent.load_traj(traj_memory) →直接覆盖self.traj_memory（需要外部自行从dict重建TrajMemory）

要点：

Screenshot（PIL）+screenshot_bytes（bytes）双份并存：渲染走PIL、序列化/网络走 bytes，不要只保留一个；
thought、action <tool_call>，是解析器parse_tagged_text的两端落点； prediction存原始未解析的字符串，便于回放与debug，不要用解析后的结果覆盖它；
save_traj与TrajStep 字段不完全同构（structured_action 不导出），新增字段时要同步两处，否则round-trip会丢失。

代码

@dataclass
class TrajStep:"""Represents a single step in an agent's trajectory.Attributes:screenshot: PIL Image of the screen at this step.accessibility_tree: Accessibility tree data for the screen.prediction: Raw model prediction/response.action: Parsed action dictionary.conclusion: Conclusion or summary of the step.thought: Model's reasoning/thinking process.step_index: Index of this step in the trajectory.agent_type: Type of agent that produced this step.model_name: Name of the model used.screenshot_bytes: Original screenshot as bytes (for compatibility).structured_action: Structured action with metadata."""screenshot: Image.Imageaccessibility_tree: Optional[Dict[str, Any]]prediction: straction: Dict[str, Any]conclusion: strthought: strstep_index: intagent_type: strmodel_name: strscreenshot_bytes: Optional[bytes] = Nonestructured_action: Optional[Dict[str, Any]] = None@dataclass
class TrajMemory:"""Container for a complete trajectory of agent steps.Attributes:task_goal: The goal/instruction for this trajectory.task_id: Unique identifier for the task.steps: List of trajectory steps."""task_goal: strtask_id: strsteps: List[TrajStep] = field(default_factory=list)

0x05 MAIGroundingAgent

MAIGroundingAgent 是一款基于视觉 - 语言模型（VLM）的 GUI 定位智能体（Grounding Agent），该代码是 GUI Agent 的 “视觉定位模块”，核心解决 “从自然语言 + 截图中精准找到 UI 元素坐标” 的问题，是 GUI Agent 实现界面理解的核心环节

MAIGroundingAgent 的核心功能是接收自然语言指令（如 “点击登录按钮”）和 GUI 界面截图，通过调用大语言模型 API 解析指令意图、识别目标 UI 元素，并输出该元素的标准化坐标（归一化到 [0,1] 范围），为 GUI Agent 的后续操作（如点击、输入）提供精准的元素定位能力 —— 这是 GUI Agent 实现 “看懂界面” 的核心模块。

MAIGroundingAgent 的流程闭环如下：输入预处理→消息构建→LLM 调用→响应解析→结果输出，全流程覆盖异常处理，确保可用性。

5.1 核心特色

特色维度	具体说明
多模态输入处理	同时接收自然语言指令（文本）和界面截图（图像），适配 GUI 交互的视觉 + 语言双输入场景
标准化解析逻辑	固定解析模型输出中的 `（推理过程）和` （坐标）标签，确保输出结构统一；坐标自动归一化（除以 SCALE_FACTOR），适配不同分辨率界面
鲁棒性设计	① 内置 3 次 API 重试机制，应对网络 / 模型临时异常；② 兼容图片格式（自动转换为 RGB）、输入类型（支持 PIL Image / 字节流）；③ 完善的异常捕获，失败时返回明确错误标识
可配置化推理	支持自定义 LLM 推理参数（temperature/top_k/top_p/max_tokens 等），可根据场景调整模型生成策略（如 temperature=0 保证输出确定性）
清晰的流程闭环	从 “输入处理→构建多模态消息→调用 LLM→解析响应→返回标准化结果” 形成完整闭环，输出同时包含模型推理过程和最终坐标，便于调试与溯源

5.2 定义

MAIGroundingAgent 如下。

class MAIGroundingAgent:"""GUI grounding agent using vision-language models.This agent processes a screenshot and natural language instruction tolocate a specific UI element and return its coordinates.Attributes:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use for predictions.runtime_conf: Configuration dictionary for runtime parameters."""def __init__(self,llm_base_url: str,model_name: str,runtime_conf: Optional[Dict[str, Any]] = None,):"""Initialize the MAIGroundingAgent.Args:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use.runtime_conf: Optional configuration dictionary with keys:- max_pixels: Maximum pixels for image processing- min_pixels: Minimum pixels for image processing- temperature: Sampling temperature (default: 0.0)- top_k: Top-k sampling parameter (default: -1)- top_p: Top-p sampling parameter (default: 1.0)- max_tokens: Maximum tokens in response (default: 2048)"""# Set default configurationdefault_conf = {"temperature": 0.0,"top_k": -1,"top_p": 1.0,"max_tokens": 2048,}self.runtime_conf = {**default_conf, **(runtime_conf or {})}self.llm_base_url = llm_base_urlself.model_name = model_nameself.llm = OpenAI(base_url=self.llm_base_url,api_key="empty",)# Extract frequently used config valuesself.temperature = self.runtime_conf["temperature"]self.top_k = self.runtime_conf["top_k"]self.top_p = self.runtime_conf["top_p"]self.max_tokens = self.runtime_conf["max_tokens"]

5.3 数据流

Grounding Agent单步流程图如下：

也可以参见如下：

5.4 推理

    @propertydef system_prompt(self) -> str:"""Return the system prompt for grounding tasks."""return MAI_MOBILE_SYS_PROMPT_GROUNDINGdef _build_messages(self,instruction: str,image: Image.Image,) -> list:"""Build the message list for the LLM API call.Args:instruction: Grounding instruction from user.image: PIL Image of the screenshot.magic_prompt: Whether to use the magic prompt format.Returns:List of message dictionaries for the API."""encoded_string = pil_to_base64(image)messages = [{"role": "system","content": [{"type": "text","text": self.system_prompt,}],}]messages.append({"role": "user","content": [{"type": "text","text": instruction + "\n",},{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},},],})return messagesdef predict(self,instruction: str,image: Union[Image.Image, bytes],**kwargs: Any,) -> Tuple[str, Dict[str, Any]]:"""Predict the coordinate of the UI element based on the instruction.Args:instruction: Grounding instruction describing the UI element to locate.image: PIL Image or bytes of the screenshot.**kwargs: Additional arguments (unused).Returns:Tuple of (prediction_text, result_dict) where:- prediction_text: Raw model response or error message- result_dict: Dictionary containing:- "thinking": Model's reasoning process- "coordinate": Normalized [x, y] coordinate"""# Convert bytes to PIL Image if necessaryif isinstance(image, bytes):image = Image.open(BytesIO(image))if image.mode != "RGB":image = image.convert("RGB")# Build messagesmessages = self._build_messages(instruction, image)# Make API call with retry logicmax_retries = 3prediction = Noneresult = Nonefor attempt in range(max_retries):try:response = self.llm.chat.completions.create(model=self.model_name,messages=messages,max_tokens=self.max_tokens,temperature=self.temperature,top_p=self.top_p,frequency_penalty=0.0,presence_penalty=0.0,extra_body={"repetition_penalty": 1.0, "top_k": self.top_k},seed=42,)prediction = response.choices[0].message.content.strip()print(f"Raw response:\n{prediction}")# Parse responseresult = parse_grounding_response(prediction)print(f"Parsed result:\n{result}")breakexcept Exception as e:print(f"Error on attempt {attempt + 1}: {e}")prediction = Noneresult = None# Return error if all retries failedif prediction is None or result is None:print("Max retry attempts reached, returning error flag.")return "llm client error", {"thinking": None, "coordinate": None}return prediction, result

5.5 解析

def parse_grounding_response(text: str) -> Dict[str, Any]:"""Parse model output text containing grounding_think and answer tags.Args:text: Raw model output containing <grounding_think> and <answer> tags.Returns:Dictionary with keys:- "thinking": The model's reasoning process- "coordinate": Normalized [x, y] coordinateRaises:ValueError: If parsing fails or JSON is invalid."""text = text.strip()result: Dict[str, Any] = {"thinking": None,"coordinate": None,}# Extract thinking contentthink_pattern = r"<grounding_think>(.*?)</grounding_think>"think_match = re.search(think_pattern, text, re.DOTALL)if think_match:result["thinking"] = think_match.group(1).strip()# Extract answer contentanswer_pattern = r"<answer>(.*?)</answer>"answer_match = re.search(answer_pattern, text, re.DOTALL)if answer_match:answer_text = answer_match.group(1).strip()try:answer_json = json.loads(answer_text)coordinates = answer_json.get("coordinate", [])if len(coordinates) == 2:# Normalize coordinates from SCALE_FACTOR range to [0, 1]point_x = coordinates[0] / SCALE_FACTORpoint_y = coordinates[1] / SCALE_FACTORresult["coordinate"] = [point_x, point_y]else:raise ValueError(f"Invalid coordinate format: expected 2 values, got {len(coordinates)}")except json.JSONDecodeError as e:raise ValueError(f"Invalid JSON in answer: {e}")return result