
[Parallel-R1 Code Implementation] SFT

1. Why is SFT needed?

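The Parallel-R1 paper reports that, on a hard dataset such as DAPO, large models fail to produce parallel chains of thought, whereas on a simple dataset such as GSM8K they can.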
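The inference drawn from this: for parallel reasoning tasks of this kind, models follow the instructions poorly. This holds for large models, and small models will only do worse. The special tokens therefore have to be added first, followed by SFT post-training.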

2. What is the basic setup of the SFT data?

  • User prompt (the system prompt is left empty, following Parallel-R1)
Solve the following problem step by step.
During the reasoning process, whenever you encounter a step that may benefit from multiple perspectives or independent reasoning, insert a parallel block at that point.
follow this format:
1. Start with <Parallel> to begin parallel reasoning
2. Generate multiple <Path> blocks, each with a different approach
3. Within each <Path>, you can use tools:
   - Call tools with: <tool_call>{{"name": "tool_name", "arguments": "..."}}</tool_call>
   - Tool responses will appear as: <tool_response>response content</tool_response>
4. After all paths, generate <Summary> to combine insights
5. Provide your final answer with the format: Final Answer: <answer>
Example:
<Parallel>
<Path>Approach 1: use Python<tool_call>{{"name": "python", "arguments": "123+456"}}</tool_call><tool_response>579</tool_response>Result: 579</Path>
<Path>Approach 2: direct calculation: 123+456=579</Path>
</Parallel>
<Summary>Both methods confirm the answer</Summary>
Final Answer: 579
Remind:
1. You can only use tools once in each path.
2. You can only use one time of parallel reasoning.
3. You can only use one time of tool call in each path.
4. You are encouraged to use tools in each path to solve the problem.
Problem: {problem}
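The doubled braces in the template suggest it is filled with Python's `str.format`, where `{{` and `}}` are escapes for literal braces and `{problem}` is the only substitution field. A minimal sketch under that assumption; the shortened template and the helper names `build_prompt` and `build_messages` are illustrative and not taken from the original code:

```python
# Shortened stand-in for the full template above; only the brace handling matters here.
PROMPT_TEMPLATE = (
    "Solve the following problem step by step.\n"
    'Call tools with: <tool_call>{{"name": "tool_name", "arguments": "..."}}</tool_call>\n'
    "Problem: {problem}"
)


def build_prompt(problem: str) -> str:
    """Fill the single {problem} field; '{{' and '}}' become literal braces."""
    return PROMPT_TEMPLATE.format(problem=problem)


def build_messages(problem: str) -> list:
    """One user message and no system message, since the system prompt is left empty."""
    return [{"role": "user", "content": build_prompt(problem)}]
```

After formatting, the tool-call example in the prompt contains ordinary single braces, which is what the model should imitate in its output.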
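  • SFT response format
    Because the tool calls here happen inside parallel paths, the response cannot be laid out as a multi-turn conversation the way the conventional tool-calling format is. The data format is therefore adjusted slightly: the response is a single turn, and <tool_call></tool_call><tool_response></tool_response> wrap each tool call and its result.
    An example response: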
Step 1: Find out how many seashells Mimi picked up.
Mimi picked up 2 dozen seashells. 1 dozen = 12, so 2 dozen = 24.
Step 2: Find out how many seashells Kyle found.
Kyle found twice as many shells as Mimi.
<Parallel>
<Path>Approach 1: Use Python to calculate Kyle's seashells.<tool_call>{"name": "python", "arguments": "2 * 24"}</tool_call><tool_response>48</tool_response>Kyle found 48 seashells.</Path>
<Path>Approach 2: Calculate directly.
Mimi: 24 shells.
Kyle: 2 × 24 = 48 shells.</Path>
</Parallel>
<Summary>Both approaches confirm that Kyle found 48 seashells.</Summary>
Step 3: Leigh grabbed one-third of the shells that Kyle found.
<Parallel>
<Path>Approach 1: Use Python to calculate Leigh's seashells.<tool_call>{"name": "python", "arguments": "48 / 3"}</tool_call><tool_response>16.0</tool_response>Leigh had 16 seashells.</Path>
<Path>Approach 2: Calculate directly.
Kyle: 48 shells.
Leigh: 48 ÷ 3 = 16 shells.
</Path>
</Parallel>
<Summary>Both approaches confirm that Leigh had 16 seashells.</Summary>
Final Answer: 16
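Since the whole response is a single turn, the tool calls have to be recovered from the text itself. The following is a rough sketch of how such a response could be parsed with regular expressions; `parse_response` and the returned dictionary layout are hypothetical, and the sketch assumes the tags are well formed and not nested:

```python
import json
import re

# Each <Parallel> block is expected to be followed by its <Summary>.
PARALLEL_RE = re.compile(
    r"<Parallel>(.*?)</Parallel>\s*<Summary>(.*?)</Summary>", re.DOTALL
)
PATH_RE = re.compile(r"<Path>(.*?)</Path>", re.DOTALL)
TOOL_RE = re.compile(
    r"<tool_call>(.*?)</tool_call>\s*<tool_response>(.*?)</tool_response>", re.DOTALL
)


def parse_response(text: str) -> list:
    """Return one dict per <Parallel> block: its paths (with tool calls) and summary."""
    blocks = []
    for body, summary in PARALLEL_RE.findall(text):
        paths = []
        for path_text in PATH_RE.findall(body):
            calls = [
                {"call": json.loads(call), "response": resp.strip()}
                for call, resp in TOOL_RE.findall(path_text)
            ]
            paths.append({"text": path_text.strip(), "tool_calls": calls})
        blocks.append({"paths": paths, "summary": summary.strip()})
    return blocks
```

Applied to the example response, this would yield two blocks with two paths each, where only the first path of each block carries a tool call.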
  • Dataset
    Following Parallel-R1, GSM8K is used as the cold-start SFT dataset. The main goal is to fix the format of the model's output.

3. How is the SFT data generated?

3.1 Attempt 1: build a multi-agent system that calls a real Python interface and generates data by imitating the attention mask in Parallel-R1.

The idea is as follows:

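  1. When <Parallel> is generated, stop and switch to parallel generation.
  2. Each path keeps its own independent message queue, which imitates the attention mask in Parallel-R1.
  3. Stop when </tool_call> or </Path> is generated. On </tool_call>, parse the tool call, run it in a Python sandbox, and continue reasoning; on </Path>, the current path is finished.
  4. After all paths have finished, append the <Summary> tag and generate the summary.
  5. When </Summary> is generated, stop and return to step 1 to continue generating.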
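The five steps above can be sketched as a small control loop. This is only an illustration under stated assumptions, not the author's system: `generate` is a hypothetical callable that takes the context and a tuple of stop strings and returns the new text together with the stop string it hit (the text is assumed to end with that stop string, or the second value is `None` on a natural end), and `run_tool` is a hypothetical sandbox function that receives the raw tool-call payload.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not a real API).
Generate = Callable[[str, Tuple[str, ...]], Tuple[str, Optional[str]]]
RunTool = Callable[[str], str]


def run_path(generate: Generate, run_tool: RunTool, prefix: str, max_tool_calls: int = 1) -> str:
    """Steps 2-3: generate one path from its own context, executing tool calls inline."""
    path, calls = "<Path>", 0
    while True:
        text, stop = generate(prefix + path, ("</tool_call>", "</Path>"))
        path += text
        if stop == "</tool_call>" and calls < max_tool_calls:
            payload = text.rpartition("<tool_call>")[2][: -len("</tool_call>")]
            path += f"<tool_response>{run_tool(payload)}</tool_response>"
            calls += 1
            continue
        if stop != "</Path>":
            path += "</Path>"  # force-close a path that ended without its tag
        return path


def run(generate: Generate, run_tool: RunTool, prompt: str, n_paths: int = 2, max_blocks: int = 4) -> str:
    out = ""
    for _ in range(max_blocks):
        text, stop = generate(prompt + out, ("<Parallel>",))  # step 1
        out += text
        if stop is None:
            return out  # the model finished its answer
        prefix = prompt + out  # every path sees only the shared prefix
        paths = [run_path(generate, run_tool, prefix) for _ in range(n_paths)]
        out += "".join(paths) + "</Parallel><Summary>"  # step 4
        text, _ = generate(prompt + out, ("</Summary>",))
        out += text  # step 5: loop back to step 1
    return out
```

Generating every path from the same prefix is what stands in for the attention mask here: no path can see the text of its siblings, while the summary sees all of them.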

The main problems are:

  1. Unlike Parallel-R1, the OpenAI API does not return the stop token as the last token when generation halts on a stop token.
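from openai import OpenAI

client = OpenAI()
end_token = "<Parallel>"  # assumed value; the original snippet does not define it
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "生成包含'<Parallel>'的文本"}],
    max_tokens=50,
    stop=[end_token],
)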

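For example, running the code above makes the model output 生成包含 ("Generate text containing"), with the stop sequence itself cut off. In the multi-agent system, this makes it hard to tell whether generation stopped because the model finished its answer or because it hit a specific token.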
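As far as I know, the Chat Completions API reports `finish_reason == "stop"` both for a natural end and for a matched stop sequence, so the response alone does not say which one happened. One possible workaround, which the original attempt does not implement, is to request the text without `stop` and truncate it on the client side, so the matched tag is known explicitly. A minimal sketch; `truncate_at_stop` is a hypothetical helper:

```python
from typing import Optional, Sequence, Tuple


def truncate_at_stop(text: str, stops: Sequence[str]) -> Tuple[str, Optional[str]]:
    """Cut `text` right after the earliest stop string and report which one matched.

    Returns (text, None) when no stop string occurs, i.e. the model ended on its own.
    """
    best_idx, matched = len(text), None
    for stop in stops:
        idx = text.find(stop)
        if idx != -1 and idx < best_idx:
            best_idx, matched = idx, stop
    if matched is None:
        return text, None
    return text[: best_idx + len(matched)], matched
```

The cost is that tokens generated after the tag are wasted; with a streaming response the request could instead be cancelled as soon as the tag appears in the accumulated text.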

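  2. When agents are used to simulate the model's behavior, the tags often get mixed up. For example, the model may emit a <Parallel> tag inside the summary, so the parallel blocks expand into a tree, which makes later RL infeasible. Tuning the prompt of each agent separately would take too much engineering effort.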

3.2 Attempt 2: generate with a single LLM call, then filter

The generation prompt for the LLM is:

Solve the following problem step by step.
During the reasoning process, whenever you encounter a step that may benefit from multiple perspectives or independent reasoning, insert a parallel block at that point.
follow this format:
1. Start with <Parallel> to begin parallel reasoning
2. Generate multiple <Path> blocks, each with a different approach
3. Within each <Path>, you can use Python tools:
   • Call tools with: <tool_call>{{"name": "python", "arguments": "<code>"}}</tool_call>
   • Tool responses will appear as: <tool_response>response content</tool_response>
4. After all paths, generate <Summary> to combine insights
5. Provide your final answer with the format: Final Answer: <answer>
Example:
<Parallel>
<Path>Approach 1: use Python<tool_call>{{"name": "python", "arguments": "123+456"}}</tool_call><tool_response>579</tool_response>Result: 579</Path>
<Path>Approach 2: direct calculation: 123+456=579</Path>
</Parallel>
<Summary>Both methods confirm the answer</Summary>
Final Answer: 579
Remind:
1. You can only use tools once in each path.
2. You can only use one time of parallel reasoning.
3. You can only use one time of tool call in each path.
4. You are encouraged to use tools in each path to solve the problem.
5. You can mock the Python tools response
Problem: {PROBLEM}

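The model generates everything in one pass; the intermediate tool-call results are mocked directly, without calling a real Python environment.
In practice, gpt-4.1 follows these instructions reasonably well. The example response shown earlier, for instance, was generated by a single gpt-4.1 call with this prompt. This is the approach adopted for now.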
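Because the tool responses are mocked by the model, they may be wrong. One way to check them after the fact, sketched here as an assumption rather than as part of the pipeline described, is to re-compute each tool call and compare it with the mocked response. The sketch only handles plain arithmetic such as `2 * 24` or `48 / 3` and treats everything else as unverifiable; `safe_eval` and `mocked_responses_ok` are hypothetical names.

```python
import ast
import json
import operator
import re

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv, ast.Mod: operator.mod,
}
TOOL_RE = re.compile(
    r"<tool_call>(.*?)</tool_call>\s*<tool_response>(.*?)</tool_response>", re.DOTALL
)


def safe_eval(expr: str):
    """Evaluate a pure arithmetic expression by walking its AST, without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


def mocked_responses_ok(response: str, tol: float = 1e-6) -> bool:
    """True if every mocked <tool_response> matches the re-computed value."""
    for call, mocked in TOOL_RE.findall(response):
        try:
            expected = safe_eval(json.loads(call)["arguments"])
            if abs(expected - float(mocked.strip())) > tol:
                return False
        except (ValueError, SyntaxError, KeyError, TypeError, ZeroDivisionError):
            return False  # not plain arithmetic or not numeric: cannot be verified
    return True
```

Tool calls that contain real code (for example a `print(...)` statement) would need an actual sandbox instead of this arithmetic-only evaluator.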

  • Data filtering: to be covered tomorrow
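The filtering rules are not specified yet. As a placeholder, here is one possible filter, purely an assumption about what such a step might check: balanced tags, no nested <Parallel> (the tree-shaped failure described under Attempt 1), at least two paths per block, and a final answer that agrees with the GSM8K reference, whose solutions end with `#### <number>`. All function names are hypothetical.

```python
import re

TAGS = ("Parallel", "Path", "Summary", "tool_call", "tool_response")


def gold_answer(gsm8k_answer: str) -> str:
    """GSM8K reference solutions end with '#### <number>'."""
    return gsm8k_answer.split("####")[-1].strip().replace(",", "")


def final_answer(response: str):
    """Text after the last 'Final Answer:' marker, or None if it is missing."""
    matches = re.findall(r"Final Answer:\s*([^\n]+)", response)
    return matches[-1].strip().replace(",", "").rstrip(".") if matches else None


def same_value(a: str, b: str) -> bool:
    try:
        return abs(float(a) - float(b)) < 1e-6
    except ValueError:
        return a == b


def keep_sample(response: str, gsm8k_answer: str) -> bool:
    # 1. Every opening tag has a matching closing tag.
    if any(response.count(f"<{t}>") != response.count(f"</{t}>") for t in TAGS):
        return False
    # 2. At least one parallel block, and as many summaries as blocks.
    blocks = re.findall(r"<Parallel>(.*?)</Parallel>", response, re.DOTALL)
    summaries = re.findall(r"<Summary>(.*?)</Summary>", response, re.DOTALL)
    if not blocks or len(blocks) != len(summaries):
        return False
    # 3. No tree-shaped nesting: no <Parallel> inside a block or a summary.
    if any("<Parallel>" in part for part in blocks + summaries):
        return False
    # 4. Each block contains at least two paths.
    if any(block.count("<Path>") < 2 for block in blocks):
        return False
    # 5. The final answer agrees with the reference answer.
    answer = final_answer(response)
    return answer is not None and same_value(answer, gold_answer(gsm8k_answer))
```

The sketch deliberately does not enforce the "only one parallel block" reminder from the prompt, since the example response shown earlier contains two blocks.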

4. How is the SFT training done?

5. How well does SFT work?
