当前位置: 首页 > news >正文

【GUI-Agent】阿里通义MAI-UI 代码阅读(2)--- 实现

【GUI-Agent】阿里通义MAI-UI 代码阅读(2)--- 实现

目录
  • 【GUI-Agent】阿里通义MAI-UI 代码阅读(2)--- 实现
    • 0x00 摘要
    • 0x01 工程实现特色
      • 1.1 特色1
      • 1.2 特色2
      • 1.3 特色 3
      • 1.4 小结
    • 0x02 提示词
      • 2.1 提示词代码
        • MAI_MOBILE_SYS_PROMPT
        • MAI_MOBILE_SYS_PROMPT_NO_THINKING
        • MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP
        • MAI_MOBILE_SYS_PROMPT_GROUNDING
      • 2.2 移动系统提示词差异一览
      • 2.3 工具集成差异
    • 0x03 输出
      • 3.1 输出格式区别
        • 非 MCP 版本(MAI_MOBILE_SYS_PROMPT)
        • MCP 版本(MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP)
      • 3.2 功能范围区别
        • 非 MCP 版本
        • MCP 版本
      • 3.3 实际应用场景
        • 标准 GUI 操作
        • MCP 工具调用
        • 代码实现中的处理
    • 0x04 MAIUINaivigationAgent
      • 4.1 核心特色
      • 4.2 定义
      • 4.3 构建图像
      • 4.4 构建文字
      • 4.5 流程
      • 4.6 推理
        • 核心作用
        • 核心特色
        • 流程
          • predict 的流程如下
          • 时序图
        • 代码
      • 4.7 轨迹
        • TrajMemory / TrajStep 数据结构图
        • 派生视图
        • 序列化路径
        • 代码
    • 0x05 MAIGroundingAgent
      • 5.1 核心特色
      • 5.2 定义
      • 5.3 数据流
      • 5.4 推理
      • 5.5 解析
    • 0xEE 广告
    • 购买链接
    • 0xFF 参考

0x00 摘要

MAI-UI 是阿里通义实验室发布的一项重磅研究成果:是一个旨在 重塑人机交互方式 的“基础图形用户界面(GUI)智能体”,和阶跃星辰的思路非常类似,因此我们可以互相印证。

MAI-UI的信息如下:

https://arxiv.org/pdf/2512.22047

https://github.com/Tongyi-MAI/MAI-UI

MAI-UI 的两类核心Agent如下,本篇会介绍这两类Agent:

Agent 文件 任务 输出协议
MAIGroundingAgent src/mai_grounding_agent.py UI 元素定位(单步) <grounding_think>.</grounding_think>{"coordinate":[x,y]},坐标基于 SCALE_FACT0R=999 归一化
MAIUINavigationAgent src/mai_navigation_agent.py 多步移动端GUI导航,支持ask_user与mcp_call .<tool_call>{json}</tool_call>,多轮带历史截图

0x01 工程实现特色

MAI-UI 工程实现的三个特色如下。

1.1 特色1

特色1:三套系统提示词对应三种Agent形态:grounding / 纯导航 / ask_user + MCP 增强导航

src/prompt.py同时维护:

  • MAI_MOBILE_SYS_PROMPT_GROUNDING 一 单步元素定位
  • MAI_MOBILE_SYS_PROMPT 一 标准多步导航
  • MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 一 在导航动作集里叠加两个特殊工具:
    • ask_user(question):模型主动反问用户、把任务“打回去"
    • mcp_call(tool,args):调外部MCP工具(如高德导航)补全设备端做不到的能力

阿里5

意义:

  • 这是"Agent-User Interaction +MCP Augmentation" 范式在代码层面的真实落点一不是新API、就是同一个模型在不同system prompt下解锁不同动作集。
  • 新增交互类工具的正确姿势就是改prompt.py+parse_tagged_text的 schema,而不是另起一个Agent类。

1.2 特色2

特色2是:归一化坐标空间SCALE_FACTOR = 999 + XML标签输出协议(而非function-calling)。

  • src/mai_grounding_agent.py 与 src/mai_naivigation_agent.py 都硬编码为 SCALE_FACTOR=
    999;模型永远输出[0,999]区间整数,由客户端按当前截图(W,H)反归一化。
  • 输出不是OpenAI function-calling,而是裸文本里的 XML 标签:
    • Grounding:<grounding_think>...</grounding_think>{"coordinate":[x,y]}
    • Navigation:.<tool_call>{json}</tool_call>(兼容 thinking 模型的)
    • 解析器:parse_grounding_response、parse_tagged_text,错误统一抛 ValueError。

阿里6

意义:

  • 跨分辨率泛化:同一个模型同一个权重无缝服务任意手机分辨率,不需要在 prompt里写屏幕尺寸;
  • 协议无关于推理后端一VLLM0.11.0、HFtransformers 本地推理、DashScope都能用,因为只解析纯文本,不依赖任何后端的tool-call结构;
  • 代价:解析鲁棒性必须由客户端自己保证(所以两个parser都做了容错+显式异常)

1.3 特色 3

特色 3:无状态服务端 +客户端自管TrajMemory,每步把历史截图重塞回 messages:

  • BaseAgent 持有 traj_memory:TrajMemory,每个 TrajStep 同时存 screenshot: Image 和 screenshot_bytes:bytes(渲染vs序列化双用)
  • MAIUINaivigationAgent._build_messages() 按 runtime_conf["history_n"] 把最近 N 步的“截图+模型回复“重组成多轮user/assistant对话再发给vLLM一一一vLLM 端零会话状态。
  • save_traj()/load_traj()走bytes,可被序列化/回放/做评测离线分析。
  • stept的请求体(每步独立、无状态)如下:

阿里7

意义:

  • 可回放、可评测、可断点续跑—save_traj出dict、load_traj直接灌回,离线replay,不需要真机/模拟器;
  • 横向扩展友好一VLLM可以集群水平扩,因为没有会话粘性,这正契合 scaling parallel environments up to 512"的训l练形态在推理侧的对应做法;
  • 代价:每步N张图都要重传,带宽与 prefill 成本随 history_n线性增长,调小 history_n是常见的省 token 技巧。

1.4 小结

MAI-UI的工程独到之处不是模型本身,而是这套客户端契约:分辨率无关的999坐标空间 + XML标签协议(与后端解耦)+ 无状态多轮重放(与历史长度解耦)+ 三档 prompt解锁的grounding/导航/ask_user+MCP
三种形态一一一后续任何二次开发都沿着这四条线走,而不是去改模型契约。

0x02 提示词

2.1 提示词代码

以下是提示词代码。

MAI_MOBILE_SYS_PROMPT

MAI_MOBILE_SYS_PROMPT = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.## Output Format
For each function call, return the thinking process in <thinking> </thinking> tags, and a json object with function name and arguments within <tool_call></tool_call> XML tags:
```
<thinking>
...
</thinking>
<tool_call>
{"name": "mobile_use", "arguments": <args-json-object>}
</tool_call>
```## Action Space{"action": "click", "coordinate": [x, y]}
{"action": "long_press", "coordinate": [x, y]}
{"action": "type", "text": ""}
{"action": "swipe", "direction": "up or down or left or right", "coordinate": [x, y]} # "coordinate" is optional. Use the "coordinate" if you want to swipe a specific UI element.
{"action": "open", "text": "app_name"}
{"action": "drag", "start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
{"action": "system_button", "button": "button_name"} # Options: back, home, menu, enter
{"action": "wait"}
{"action": "terminate", "status": "success or fail"}
{"action": "answer", "text": "xxx"} # Use escape characters \\', \\", and \\n in text part to ensure we can parse the text in normal python string format.## Note
- Write a small plan and finally summarize your next action (with its target element) in one sentence in <thinking></thinking> part.
- Available Apps: `["Camera","Chrome","Clock","Contacts","Dialer","Files","Settings","Markor","Tasks","Simple Draw Pro","Simple Gallery Pro","Simple SMS Messenger","Audio Recorder","Pro Expense","Broccoli APP","OSMand","VLC","Joplin","Retro Music","OpenTracks","Simple Calendar Pro"]`.
You should use the `open` action to open the app as possible as you can, because it is the fast way to open the app.
- You must follow the Action Space strictly, and return the correct json object within <thinking> </thinking> and <tool_call></tool_call> XML tags.
""".strip()

MAI_MOBILE_SYS_PROMPT_NO_THINKING

MAI_MOBILE_SYS_PROMPT_NO_THINKING = """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.## Output Format
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
```
<tool_call>
{"name": "mobile_use", "arguments": <args-json-object>}
</tool_call>
```## Action Space{"action": "click", "coordinate": [x, y]}
{"action": "long_press", "coordinate": [x, y]}
{"action": "type", "text": ""}
{"action": "swipe", "direction": "up or down or left or right", "coordinate": [x, y]} # "coordinate" is optional. Use the "coordinate" if you want to swipe a specific UI element.
{"action": "open", "text": "app_name"}
{"action": "drag", "start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
{"action": "system_button", "button": "button_name"} # Options: back, home, menu, enter
{"action": "wait"}
{"action": "terminate", "status": "success or fail"}
{"action": "answer", "text": "xxx"} # Use escape characters \\', \\", and \\n in text part to ensure we can parse the text in normal python string format.## Note
- Available Apps: `["Camera","Chrome","Clock","Contacts","Dialer","Files","Settings","Markor","Tasks","Simple Draw Pro","Simple Gallery Pro","Simple SMS Messenger","Audio Recorder","Pro Expense","Broccoli APP","OSMand","VLC","Joplin","Retro Music","OpenTracks","Simple Calendar Pro"]`.
You should use the `open` action to open the app as possible as you can, because it is the fast way to open the app.
- You must follow the Action Space strictly, and return the correct json object within <thinking> </thinking> and <tool_call></tool_call> XML tags.
""".strip()

MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP

# Placeholder prompts for future features
MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP = Template("""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. ## Output Format
For each function call, return the thinking process in <thinking> </thinking> tags, and a json object with function name and arguments within <tool_call></tool_call> XML tags:
```
<thinking>
...
</thinking>
<tool_call>
{"name": "mobile_use", "arguments": <args-json-object>}
</tool_call>
```## Action Space{"action": "click", "coordinate": [x, y]}
{"action": "long_press", "coordinate": [x, y]}
{"action": "type", "text": ""}
{"action": "swipe", "direction": "up or down or left or right", "coordinate": [x, y]} # "coordinate" is optional. Use the "coordinate" if you want to swipe a specific UI element.
{"action": "open", "text": "app_name"}
{"action": "drag", "start_coordinate": [x1, y1], "end_coordinate": [x2, y2]}
{"action": "system_button", "button": "button_name"} # Options: back, home, menu, enter 
{"action": "wait"}
{"action": "terminate", "status": "success or fail"} 
{"action": "answer", "text": "xxx"} # Use escape characters \\', \\", and \\n in text part to ensure we can parse the text in normal python string format.
{"action": "ask_user", "text": "xxx"} # you can ask user for more information to complete the task.
{"action": "double_click", "coordinate": [x, y]}{% if tools -%}
## MCP Tools
You are also provided with MCP tools, you can use them to complete the task.
{{ tools }}If you want to use MCP tools, you must output as the following format:
```
<thinking>
...
</thinking>
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```
{% endif -%}## Note
- Available Apps: `["Contacts", "Settings", "Clock", "Maps", "Chrome", "Calendar", "files", "Gallery", "Taodian", "Mattermost", "Mastodon", "Mail", "SMS", "Camera"]`.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in <thinking></thinking> part.
""".strip()
)

MAI_MOBILE_SYS_PROMPT_GROUNDING

MAI_MOBILE_SYS_PROMPT_GROUNDING = """
You are a GUI grounding agent. 
## Task
Given a screenshot and the user's grounding instruction. Your task is to accurately locate a UI element based on the user's instructions.
First, you should carefully examine the screenshot and analyze the user's instructions,  translate the user's instruction into a effective reasoning process, and then provide the final coordinate.
## Output Format
Return a json object with a reasoning process in <grounding_think></grounding_think> tags, a [x,y] format coordinate within <answer></answer> XML tags:
<grounding_think>...</grounding_think>
<answer>
{"coordinate": [x,y]}
</answer>
""".strip()

2.2 移动系统提示词差异一览

只有 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板支持 MCP 工具集成,且通过 Jinja2 条件语法实现动态插入;其余提示词版本均不包含 MCP 功能。

提示词 ID 核心用途 思考标签 操作空间 特殊功能
MAI_MOBILE_SYS_PROMPT 标准 GUI 代理 `` 必须 点击/长按/输入/滑动等全功能
MAI_MOBILE_SYS_PROMPT_NO_THINKING 快速响应 无思考标签 同上 省略思考,直接返回 JSON
MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板化+用户询问 可选 同上 ask_user、double_click、Jinja2 模板、MCP 工具集成
MAI_MOBILE_SYS_PROMPT_GROUNDING 纯定位专用 `` 仅元素识别 输出 [x,y] 坐标,无操作命令

2.3 工具集成差异

MCP 功能只在 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板层集成,其余版本需外部桥接。

  • 集成位置

    • MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 内置 MCP 工具调用入口(通过 Jinja2 模板动态注入)。
    • 其余版本无 MCP 工具入口,需外部调用。
  • 提示词层差异

    • 标准版:无 MCP 占位符,纯 JSON 输出。
    • MCP 版:模板内预留 {{mcp_tools}} 变量,运行时注入具体工具描述。
  • 运行时差异

    • 标准版:LLM 输出传统动作 JSON,由外部框架手动转发至 MCP。
    • MCP 版:渲染后提示词包含完整 MCP 工具 JSON,LLM 可直接调用。
  • 条件性集成(仅 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP)

    • 使用 Jinja2 模板语法 {%if tools -%}...{%endif -%} 实现动态集成
    • 独立 ## MCP Tools 区域存放 MCP 工具描述
    • 通过 {{tools}} 变量动态插入可用工具信息
    • 输出格式与标准移动操作不同:`` 内直接嵌入 MCP 函数调用

0x03 输出

3.1 输出格式区别

非 MCP 版本(MAI_MOBILE_SYS_PROMPT)

  • 统一格式:所有操作通过 mobile_use 函数调用

  • 固定结构:GUI 操作封装在 arguments 字段

  • 示例

    <thinking>...</thinking>
    <tool_call>
    {"name":"mobile_use","arguments":<args-json-object>}
    </tool_call>
    

MCP 版本(MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP)

  • 双重格式:支持标准 GUI 操作和 MCP 工具调用

  • 工具特定格式:MCP 工具调用使用实际函数名作为 name

  • 示例

    <thinking>...</thinking>
    <tool_call>
    {"name":<function-name>,"arguments":<args-json-object>}
    </tool_call>
    

下面代码把LLM的输出转换为结构化输出

def parse_action_to_structure_output(text: str) -> Dict[str, Any]:"""Parse model output text into structured action format.Args:text: Raw model output containing thinking and tool_call tags.Returns:Dictionary with keys:- "thinking": The model's reasoning process- "action_json": Parsed action with normalized coordinatesNote:Coordinates are normalized to [0, 1] range by dividing by SCALE_FACTOR."""text = text.strip()results = parse_tagged_text(text)thinking = results["thinking"]tool_call = results["tool_call"]action = tool_call["arguments"]# Normalize coordinates from SCALE_FACTOR range to [0, 1]if "coordinate" in action:coordinates = action["coordinate"]if len(coordinates) == 2:point_x, point_y = coordinateselif len(coordinates) == 4:x1, y1, x2, y2 = coordinatespoint_x = (x1 + x2) / 2point_y = (y1 + y2) / 2else:raise ValueError(f"Invalid coordinate format: expected 2 or 4 values, got {len(coordinates)}")point_x = point_x / SCALE_FACTORpoint_y = point_y / SCALE_FACTORaction["coordinate"] = [point_x, point_y]return {"thinking": thinking,"action_json": action,}

3.2 功能范围区别

非 MCP 版本

  • 有限操作集:仅预定义 GUI 操作(点击、滑动、输入等)
  • 移动设备专属:专注触摸屏界面交互
  • 固定动作空间:无法扩展新操作类型

MCP 版本

  • 扩展操作集:除 GUI 操作外,支持 MCP 工具
  • 系统级功能:可通过 MCP 工具执行复杂系统操作
  • 动态功能:依据配置工具动态扩展功能范围

3.3 实际应用场景

标准 GUI 操作

  • MCP 版本中标准 GUI 操作仍使用 mobile_use 函数
  • 与非 MCP 版本行为基本相同

MCP 工具调用

  • 需执行 MCP 工具时,使用工具名称作为函数名
  • 可执行复杂任务(系统配置、数据处理等)

代码实现中的处理

在 MAIUIMobileAgent 类中:

  • 若 self.tools 非空,使用 MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP 模板;
  • 通过 render(tools=tools_str) 将工具列表注入提示词;
  • 未配置工具时,回退到标准 MAI_MOBILE_SYS_PROMPT。

代码如下:

    @propertydef system_prompt(self) -> str:"""Generate the system prompt based on available MCP tools.Returns:System prompt string, with MCP tools section if tools are configured."""if self.tools:tools_str = "\n".join([json.dumps(tool, ensure_ascii=False) for tool in self.tools])return MAI_MOBILE_SYS_PROMPT_ASK_USER_MCP.render(tools=tools_str)return MAI_MOBILE_SYS_PROMPT

MCP 版本提供更灵活的操作能力,允许智能体在标准 GUI 操作与 MCP 工具间切换,从而执行更复杂任务;非 MCP 版本则专注纯粹移动界面操作。

0x04 MAIUINaivigationAgent

MAIUINaivigationAgent(移动端 GUI 导航智能体) 是整个 MAI-GUI 智能体的 “底座模块”—— 它封装了 LLM 初始化、历史界面上下文管理、多模态消息构建等核心能力,专门为移动端 GUI 自动化场景设计,能基于任务指令和多步历史界面截图,构建标准化的多模态消息发送给 LLM,为后续动作生成提供统一的输入基础。

4.1 核心特色

MAIUINaivigationAgent 的核心逻辑如下:初始化(配置 + LLM 客户端)→ 图片预处理(历史 + 当前截图统一格式)→ 消息构建(按固定结构拼接多模态内容),全流程为 LLM 提供标准化、结构化的输入。

特色维度 具体说明
历史上下文智能管理 支持配置history_n参数(默认 3),自动截取最近 N 步的界面截图作为历史上下文,既保留关键操作轨迹,又避免上下文过长导致 LLM 推理效率下降;仅加载history_n-1条历史截图 + 当前截图,精准控制上下文长度
多格式图片兼容处理 _prepare_images方法支持字节流、PIL Image 等多种图片输入格式,自动转换为 RGB 格式的 PIL Image,解决不同来源截图的格式兼容问题,适配移动端截图的多样化场景
MCP 工具集成能力 初始化时支持传入 MCP 工具列表,为后续 LLM 调用 MCP 工具(如执行设备操作)预留扩展接口,兼容 MCP 协议生态
标准化多模态消息构建 _build_messages方法按 “系统提示词→用户指令→历史截图 + 历史响应→当前截图” 的固定逻辑构建消息,严格对齐 LLM 多模态输入格式,确保不同历史长度下消息结构统一
高度可配置化 支持自定义温度、top_k、top_p 等 LLM 推理参数,以及历史上下文长度(history_n),可根据不同移动端任务(如简单点击 / 复杂表单填写)调整配置

4.2 定义

MAIUINaivigationAgent 的定义如下。

class MAIUINaivigationAgent(BaseAgent):"""Mobile automation agent using vision-language models.This agent processes screenshots and natural language instructions togenerate GUI actions for mobile device automation.Attributes:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use for predictions.runtime_conf: Configuration dictionary for runtime parameters.history_n: Number of history steps to include in context."""def __init__(self,llm_base_url: str,model_name: str,runtime_conf: Optional[Dict[str, Any]] = None,tools: Optional[List[Dict[str, Any]]] = None,):"""Initialize the MAIMobileAgent.Args:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use.runtime_conf: Optional configuration dictionary with keys:- history_n: Number of history images to include (default: 3)- max_pixels: Maximum pixels for image processing- min_pixels: Minimum pixels for image processing- temperature: Sampling temperature (default: 0.0)- top_k: Top-k sampling parameter (default: -1)- top_p: Top-p sampling parameter (default: 1.0)- max_tokens: Maximum tokens in response (default: 2048)tools: Optional list of MCP tool definitions. Each tool should be a dictwith 'name', 'description', and 'parameters' keys."""super().__init__()# Store MCP toolsself.tools = tools or []# Set default configurationdefault_conf = {"history_n": 3,"temperature": 0.0,"top_k": -1,"top_p": 1.0,"max_tokens": 2048,}self.runtime_conf = {**default_conf, **(runtime_conf or {})}self.llm_base_url = llm_base_urlself.model_name = model_nameself.llm = OpenAI(base_url=self.llm_base_url,api_key="empty",)# Extract frequently used config valuesself.temperature = self.runtime_conf["temperature"]self.top_k = self.runtime_conf["top_k"]self.top_p = self.runtime_conf["top_p"]self.max_tokens = self.runtime_conf["max_tokens"]self.history_n = self.runtime_conf["history_n"]

4.3 构建图像

_prepare_images 函数被用来构建图像。

    def _prepare_images(self, screenshot_bytes: bytes) -> List[Image.Image]:"""Prepare image list including history and current screenshot.Args:screenshot_bytes: Current screenshot as bytes.Returns:List of PIL Images (history + current)."""# Calculate how many history images to includeif len(self.history_images) > 0:max_history = min(len(self.history_images), self.history_n - 1)recent_history = self.history_images[-max_history:] if max_history > 0 else []else:recent_history = []# Add current image bytesrecent_history.append(screenshot_bytes)# Normalize input typeif isinstance(recent_history, bytes):recent_history = [recent_history]elif isinstance(recent_history, np.ndarray):recent_history = list(recent_history)elif not isinstance(recent_history, list):raise TypeError(f"Unidentified images type: {type(recent_history)}")# Convert all images to PIL formatimages = []for image in recent_history:if isinstance(image, bytes):image = Image.open(BytesIO(image))elif isinstance(image, Image.Image):passelse:raise TypeError(f"Expected bytes or PIL Image, got {type(image)}")if image.mode != "RGB":image = image.convert("RGB")images.append(image)return images

4.4 构建文字

    def _build_messages(self,instruction: str,images: List[Image.Image],) -> List[Dict[str, Any]]:"""Build the message list for the LLM API call.Args:instruction: Task instruction from user.images: List of prepared images.Returns:List of message dictionaries for the API."""messages = [{"role": "system","content": [{"type": "text", "text": self.system_prompt}],},{"role": "user","content": [{"type": "text", "text": instruction}],},]image_num = 0history_responses = self.history_responsesif len(history_responses) > 0:for history_idx, history_response in enumerate(history_responses):# Only include images for recent history (last history_n responses)if history_idx + self.history_n >= len(history_responses):# Add image before the assistant responseif image_num < len(images) - 1:cur_image = images[image_num]encoded_string = pil_to_base64(cur_image)messages.append({"role": "user","content": [{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},}],})image_num += 1messages.append({"role": "assistant","content": [{"type": "text", "text": history_response}],})# Add current image (last one in images list)if image_num < len(images):cur_image = images[image_num]encoded_string = pil_to_base64(cur_image)messages.append({"role": "user","content": [{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},}],})else:# No history, just add the current imagecur_image = images[0]encoded_string = pil_to_base64(cur_image)messages.append({"role": "user","content": [{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},}],})return messages

4.5 流程

MAIUINaivigationAgent 多步循环流程图如下:

阿里8

特殊动作:

  • ask_user(question) →暂停,把问题返还给用户(设备-云协同里的用户交互)
  • mcp_call(tool,args) →调用外部MCP工具(如高德地图导航)
  • finish() 任务结束

也参见如下:MAI-UI-1

4.6 推理

核心作用

predict 是 MAI-GUI 智能体的核心决策与动作生成模块,是 GUI Agent 的 “决策大脑”,核心解决 “根据任务指令和当前界面状态,生成下一步具体 GUI 动作” 的问题,区别于单纯的元素定位模块。

predict 的核心功能是接收任务指令(如 “完成 APP 登录”)和当前界面观测信息(截图 + 可选的无障碍树),通过调用大语言模型生成并解析出下一步要执行的结构化 GUI 动作(如点击、滑动、输入等),同时记录完整的任务轨迹(Trajectory),是 GUI Agent 实现 “根据界面状态决策操作” 的核心环节。

predict 的流程闭环是:输入处理→消息构建→LLM 调用→响应解析→轨迹记录→结果输出,全流程覆盖异常处理,确保动作生成的稳定性。

核心特色

特色维度 具体说明
任务轨迹全链路记录 内置 traj_memory 轨迹记忆模块,每一步操作都会存储截图、模型响应、解析后的动作、推理过程等全量信息,支持任务溯源、调试和复盘
多维度界面观测输入 同时接收截图(视觉信息)和无障碍树(结构化 UI 信息),相比纯视觉输入更精准理解界面结构,适配复杂 GUI 场景
鲁棒的 LLM 调用与解析 ① 内置 3 次 API 重试机制,捕获并打印异常栈信息,提升调用稳定性;② 标准化解析模型响应为 thinking(推理过程)+ action_json(结构化动作),确保输出格式统一
任务目标持久化 首次调用时将任务指令存入轨迹记忆作为持久化目标,避免后续步骤丢失核心任务方向
日志可视化友好 对包含图片的消息做脱敏打印(mask_image_urls_for_logging),既保留日志完整性又避免 Base64 编码刷屏,便于调试

流程

predict 的流程如下

MAI-UI-2

时序图

时序图:用户 ⇔ Agent ⇔ vLLM (Navigation 场景)如下:

要点:

  • 每步都把历史 history_n张截图重新塞进 messages(无服务端会话状态,vLLM是无状态的 chat completions);
  • ask_user/mcp_call 是模型直接吐出的tool_call,调度由外层环境完成,agent 本身不做副作用;
  • 日志路径上的 base64 图片一定经过 mask_image_urls_for_logging 替换为 [IMAGE_DATA]。

阿里9

代码

    def predict(self,instruction: str,obs: Dict[str, Any],**kwargs: Any,) -> Tuple[str, Dict[str, Any]]:"""Predict the next action based on the current observation.Args:instruction: Task instruction/goal.obs: Current observation containing:- screenshot: PIL Image or bytes of current screen- accessibility_tree: Optional accessibility tree data**kwargs: Additional arguments including:- extra_info: Optional extra context stringReturns:Tuple of (prediction_text, action_dict) where:- prediction_text: Raw model response or error message- action_dict: Parsed action dictionary"""# Set task goal if not already setif not self.traj_memory.task_goal:self.traj_memory.task_goal = instruction# Process screenshotscreenshot_pil = obs["screenshot"]screenshot_bytes = safe_pil_to_bytes(screenshot_pil)# Prepare imagesimages = self._prepare_images(screenshot_bytes)# Build messagesmessages = self._build_messages(instruction, images)# Make API call with retry logicmax_retries = 3prediction = Noneaction_json = Nonefor attempt in range(max_retries):try:messages_print = mask_image_urls_for_logging(messages)print(f"Messages (attempt {attempt + 1}):\n{messages_print}")response = self.llm.chat.completions.create(model=self.model_name,messages=messages,max_tokens=self.max_tokens,temperature=self.temperature,top_p=self.top_p,frequency_penalty=0.0,presence_penalty=0.0,extra_body={"repetition_penalty": 1.0, "top_k": self.top_k},seed=42,)prediction = response.choices[0].message.content.strip()print(f"Raw response:\n{prediction}")# Parse responseparsed_response = parse_action_to_structure_output(prediction)thinking = parsed_response["thinking"]action_json = parsed_response["action_json"]print(f"Parsed response:\n{parsed_response}")breakexcept Exception as e:print(f"Error on attempt {attempt + 1}: {e}")traceback.print_exc()prediction = Noneaction_json = None# Return error if all retries failedif prediction is None or action_json is None:print("Max retry attempts reached, returning error flag.")return "llm client error", {"action": None}# Create and store trajectory steptraj_step = TrajStep(screenshot=screenshot_pil,accessibility_tree=obs.get("accessibility_tree"),prediction=prediction,action=action_json,conclusion="",thought=thinking,step_index=len(self.traj_memory.steps),agent_type="MAIMobileAgent",model_name=self.model_name,screenshot_bytes=screenshot_bytes,structured_action={"action_json": action_json},)self.traj_memory.steps.append(traj_step)return prediction, action_json

4.7 轨迹

TrajMemory / TrajStep 数据结构图

阿里10

派生视图

派生视图(BaseAgent上的@property,避免外部直接遍历steps)如下:

阿里11

序列化路径

  BaseAgent.save_traj() → {"task_goal", "task_id","steps": [{ screenshot_bytes, accessibility_tree, prediction,action, conclusion, thought,step_index, agent_type, model_name }, ...]}△ 注意:save 时只输出 screenshot_bytes,丢弃 PIL.Image 对象△ structured_action 字段不在 save_traj 输出里(只在内存中使用)BaseAgent.load_traj(traj_memory) →直接覆盖self.traj_memory(需要外部自行从dict重建TrajMemory)

要点:

  • Screenshot(PIL)+screenshot_bytes(bytes)双份并存:渲染走PIL、序列化/网络走 bytes,不要只保留一个;
  • thought、action <tool_call>,是解析器parse_tagged_text的两端落点; prediction存原始未解析的字符串,便于回放与debug,不要用解析后的结果覆盖它;
  • save_traj与TrajStep 字段不完全同构(structured_action 不导出),新增字段时要同步两处,否则round-trip会丢失。

代码

@dataclass
class TrajStep:"""Represents a single step in an agent's trajectory.Attributes:screenshot: PIL Image of the screen at this step.accessibility_tree: Accessibility tree data for the screen.prediction: Raw model prediction/response.action: Parsed action dictionary.conclusion: Conclusion or summary of the step.thought: Model's reasoning/thinking process.step_index: Index of this step in the trajectory.agent_type: Type of agent that produced this step.model_name: Name of the model used.screenshot_bytes: Original screenshot as bytes (for compatibility).structured_action: Structured action with metadata."""screenshot: Image.Imageaccessibility_tree: Optional[Dict[str, Any]]prediction: straction: Dict[str, Any]conclusion: strthought: strstep_index: intagent_type: strmodel_name: strscreenshot_bytes: Optional[bytes] = Nonestructured_action: Optional[Dict[str, Any]] = None@dataclass
class TrajMemory:"""Container for a complete trajectory of agent steps.Attributes:task_goal: The goal/instruction for this trajectory.task_id: Unique identifier for the task.steps: List of trajectory steps."""task_goal: strtask_id: strsteps: List[TrajStep] = field(default_factory=list)

0x05 MAIGroundingAgent

MAIGroundingAgent 是一款基于视觉 - 语言模型(VLM)的 GUI 定位智能体(Grounding Agent),该代码是 GUI Agent 的 “视觉定位模块”,核心解决 “从自然语言 + 截图中精准找到 UI 元素坐标” 的问题,是 GUI Agent 实现界面理解的核心环节

MAIGroundingAgent 的核心功能是接收自然语言指令(如 “点击登录按钮”)和 GUI 界面截图,通过调用大语言模型 API 解析指令意图、识别目标 UI 元素,并输出该元素的标准化坐标(归一化到 [0,1] 范围),为 GUI Agent 的后续操作(如点击、输入)提供精准的元素定位能力 —— 这是 GUI Agent 实现 “看懂界面” 的核心模块。

MAIGroundingAgent 的流程闭环如下:输入预处理→消息构建→LLM 调用→响应解析→结果输出,全流程覆盖异常处理,确保可用性。

5.1 核心特色

特色维度 具体说明
多模态输入处理 同时接收自然语言指令(文本)和界面截图(图像),适配 GUI 交互的视觉 + 语言双输入场景
标准化解析逻辑 固定解析模型输出中的 (推理过程)和 (坐标)标签,确保输出结构统一;坐标自动归一化(除以 SCALE_FACTOR),适配不同分辨率界面
鲁棒性设计 ① 内置 3 次 API 重试机制,应对网络 / 模型临时异常;② 兼容图片格式(自动转换为 RGB)、输入类型(支持 PIL Image / 字节流);③ 完善的异常捕获,失败时返回明确错误标识
可配置化推理 支持自定义 LLM 推理参数(temperature/top_k/top_p/max_tokens 等),可根据场景调整模型生成策略(如 temperature=0 保证输出确定性)
清晰的流程闭环 从 “输入处理→构建多模态消息→调用 LLM→解析响应→返回标准化结果” 形成完整闭环,输出同时包含模型推理过程和最终坐标,便于调试与溯源

5.2 定义

MAIGroundingAgent 如下。

class MAIGroundingAgent:"""GUI grounding agent using vision-language models.This agent processes a screenshot and natural language instruction tolocate a specific UI element and return its coordinates.Attributes:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use for predictions.runtime_conf: Configuration dictionary for runtime parameters."""def __init__(self,llm_base_url: str,model_name: str,runtime_conf: Optional[Dict[str, Any]] = None,):"""Initialize the MAIGroundingAgent.Args:llm_base_url: Base URL for the LLM API endpoint.model_name: Name of the model to use.runtime_conf: Optional configuration dictionary with keys:- max_pixels: Maximum pixels for image processing- min_pixels: Minimum pixels for image processing- temperature: Sampling temperature (default: 0.0)- top_k: Top-k sampling parameter (default: -1)- top_p: Top-p sampling parameter (default: 1.0)- max_tokens: Maximum tokens in response (default: 2048)"""# Set default configurationdefault_conf = {"temperature": 0.0,"top_k": -1,"top_p": 1.0,"max_tokens": 2048,}self.runtime_conf = {**default_conf, **(runtime_conf or {})}self.llm_base_url = llm_base_urlself.model_name = model_nameself.llm = OpenAI(base_url=self.llm_base_url,api_key="empty",)# Extract frequently used config valuesself.temperature = self.runtime_conf["temperature"]self.top_k = self.runtime_conf["top_k"]self.top_p = self.runtime_conf["top_p"]self.max_tokens = self.runtime_conf["max_tokens"]

5.3 数据流

Grounding Agent单步流程图如下:

阿里12

也可以参见如下:

MAI-UI-3

5.4 推理

    @propertydef system_prompt(self) -> str:"""Return the system prompt for grounding tasks."""return MAI_MOBILE_SYS_PROMPT_GROUNDINGdef _build_messages(self,instruction: str,image: Image.Image,) -> list:"""Build the message list for the LLM API call.Args:instruction: Grounding instruction from user.image: PIL Image of the screenshot.magic_prompt: Whether to use the magic prompt format.Returns:List of message dictionaries for the API."""encoded_string = pil_to_base64(image)messages = [{"role": "system","content": [{"type": "text","text": self.system_prompt,}],}]messages.append({"role": "user","content": [{"type": "text","text": instruction + "\n",},{"type": "image_url","image_url": {"url": f"data:image/png;base64,{encoded_string}"},},],})return messagesdef predict(self,instruction: str,image: Union[Image.Image, bytes],**kwargs: Any,) -> Tuple[str, Dict[str, Any]]:"""Predict the coordinate of the UI element based on the instruction.Args:instruction: Grounding instruction describing the UI element to locate.image: PIL Image or bytes of the screenshot.**kwargs: Additional arguments (unused).Returns:Tuple of (prediction_text, result_dict) where:- prediction_text: Raw model response or error message- result_dict: Dictionary containing:- "thinking": Model's reasoning process- "coordinate": Normalized [x, y] coordinate"""# Convert bytes to PIL Image if necessaryif isinstance(image, bytes):image = Image.open(BytesIO(image))if image.mode != "RGB":image = image.convert("RGB")# Build messagesmessages = self._build_messages(instruction, image)# Make API call with retry logicmax_retries = 3prediction = Noneresult = Nonefor attempt in range(max_retries):try:response = self.llm.chat.completions.create(model=self.model_name,messages=messages,max_tokens=self.max_tokens,temperature=self.temperature,top_p=self.top_p,frequency_penalty=0.0,presence_penalty=0.0,extra_body={"repetition_penalty": 1.0, "top_k": self.top_k},seed=42,)prediction = response.choices[0].message.content.strip()print(f"Raw response:\n{prediction}")# Parse responseresult = parse_grounding_response(prediction)print(f"Parsed result:\n{result}")breakexcept Exception as e:print(f"Error on attempt {attempt + 1}: {e}")prediction = Noneresult = None# Return error if all retries failedif prediction is None or result is None:print("Max retry attempts reached, returning error flag.")return "llm client error", {"thinking": None, "coordinate": None}return prediction, result

5.5 解析

def parse_grounding_response(text: str) -> Dict[str, Any]:"""Parse model output text containing grounding_think and answer tags.Args:text: Raw model output containing <grounding_think> and <answer> tags.Returns:Dictionary with keys:- "thinking": The model's reasoning process- "coordinate": Normalized [x, y] coordinateRaises:ValueError: If parsing fails or JSON is invalid."""text = text.strip()result: Dict[str, Any] = {"thinking": None,"coordinate": None,}# Extract thinking contentthink_pattern = r"<grounding_think>(.*?)</grounding_think>"think_match = re.search(think_pattern, text, re.DOTALL)if think_match:result["thinking"] = think_match.group(1).strip()# Extract answer contentanswer_pattern = r"<answer>(.*?)</answer>"answer_match = re.search(answer_pattern, text, re.DOTALL)if answer_match:answer_text = answer_match.group(1).strip()try:answer_json = json.loads(answer_text)coordinates = answer_json.get("coordinate", [])if len(coordinates) == 2:# Normalize coordinates from SCALE_FACTOR range to [0, 1]point_x = coordinates[0] / SCALE_FACTORpoint_y = coordinates[1] / SCALE_FACTORresult["coordinate"] = [point_x, point_y]else:raise ValueError(f"Invalid coordinate format: expected 2 values, got {len(coordinates)}")except json.JSONDecodeError as e:raise ValueError(f"Invalid JSON in answer: {e}")return result

0xEE 广告

继续给第二本书打广告。

TransFormer-封面


购买链接

0xFF 参考

从豆包手机谈起:端侧智能的愿景与路线图

阿里发布MAI-UI,一个“活”在屏幕里的全能AI助手!手机真能全自动了?

http://www.jsqmd.com/news/778743/

相关文章:

  • CSP-J2020直播获奖题解:用‘桶’代替排序,轻松搞定实时分数线(附完整C++代码)
  • CXL技术交流群精华:从Cachemem到MLD,那些协议细节与实战踩坑实录
  • 告别Trace导出烦恼:用CAPL的Logging功能搞定长时间压力测试日志(附分段存储技巧)
  • 猎聘发布2026新能源紧缺榜:主播比算法更缺人,这些城市逆袭 - 资讯焦点
  • 保姆级教程:从零到一搞定RV1106芯片的Linux SDK编译与烧录(避坑指南)
  • Palot:轻量级自动化工具,提升开发与运维效率
  • 我非常喜欢的linux终端提示符
  • Linux逆向分析入门:用objdump反编译一个C程序,从汇编看代码执行(附GCC调试选项)
  • AI Agent 爆破内存墙!Context Engineering 技术深度解析,让语言模型“过目不忘”!
  • Firefox 150.0.2 发布:修复多类问题,改进 3D 显示与搜索建议效果
  • 轻量级密钥管理工具aaas-vault:从.env到集中式安全管理的演进
  • Halcon三维点云匹配实战:用一枚硬币教会你工业无序抓取的核心步骤
  • ClawDen爬虫工具库:模块化设计与实战应用解析
  • STM32CubeMX DAC配置避坑指南:为什么你的输出电压不准?从Buffer、对齐方式到参考电压的深度解析
  • iNav GPS自动返航全攻略:从BN-880配置到RTH安全降落避坑指南
  • 机器人工程师必看:六轴机械臂末端姿态解算,为什么更推荐用ZYZ欧拉角而不是XYZ?
  • 山东青岛全品类文旅大盘点,十佳服务商旅游旅行研学团建接待一站式搞定# - 十大品牌榜
  • 别再只盯着Simulink了!用Modelica搞定多物理场仿真的5个实战理由
  • 2026年成都净化板厂家口碑推荐榜:成都净化板、中空玻镁净化板、岩棉净化板、洁净板、彩钢夹芯板选择指南 - 海棠依旧大
  • 宠物骨科医院推荐,宠物心脏病医院哪家靠谱 - 资讯焦点
  • 深入K210的KPU:从face_detect_320x240.kmodel入手,聊聊嵌入式端侧AI模型的部署与调优
  • AI Terminal:用自然语言驱动终端,提升开发运维效率
  • FPGA仿真避坑指南:Quartus调用ModelSim时,功能仿真和时序仿真结果对不上怎么办?
  • Fiscal CLI:用命令行和AI智能体自动化你的个人财务管理
  • 混合精度推理超快
  • CVPR2024论文复现平台:一站式集成代码与Demo,加速AI研究验证
  • 山海特色山东研学旅游榜单,青岛团建 + 研学双服务头部企业 - 十大品牌榜
  • 2026年苏州洁净棚厂家口碑推荐榜:苏州洁净棚、苏州模块化洁净棚、苏州 FFU 风机过滤单元、苏州洁净设备选择指南 - 海棠依旧大
  • STM32CubeIDE隐藏技巧:利用‘从.ioc创建’功能,轻松管理不同芯片固件库版本
  • Java/Go后端工程师的AI转型“捷径”:3-6个月掌握高薪AI应用开发,拒绝裸辞!