
[Parallel-R1 Code Implementation] SFT

news 2026/3/26 19:11:57

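1. Why is SFT needed?

The Parallel-R1 paper notes that on a hard dataset such as DAPO, large models fail to generate parallel chains of thought, whereas on an easy dataset such as GSM8K they can.

The inference drawn from this: for parallel reasoning tasks, a model's instruction-following ability is poor. This holds for large models, and small models will only be worse. It is therefore necessary to insert special tokens and then run SFT as post-training.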

2. What is the basic setup of the SFT data?

User prompt (the system prompt is left empty, following Parallel-R1)

Solve the following problem step by step.
During the reasoning process, whenever you encounter a step that may benefit from multiple perspectives or independent reasoning, insert a parallel block at that point.

follow this format:
1. Start with <Parallel> to begin parallel reasoning
2. Generate multiple <Path> blocks, each with a different approach
3. Within each <Path>, you can use tools:
- Call tools with: <tool_call>{{"name": "tool_name", "arguments": "..."}}</tool_call>
- Tool responses will appear as: <tool_response>response content</tool_response>
4. After all paths, generate <Summary> to combine insights
5. Provide your final answer with the format: Final Answer: <answer>

Example:
<Parallel>
<Path>Approach 1: use Python<tool_call>{{"name": "python", "arguments": "123+456"}}</tool_call><tool_response>579</tool_response>Result: 579</Path>
<Path>Approach 2: direct calculation: 123+456=579</Path>
</Parallel>
<Summary>Both methods confirm the answer</Summary>
Final Answer: 579

Remind:
1. You can only use tools once in each path.
2. You can only use one time of parallel reasoning.
3. You can only use one time of tool call in each path.
4. You are encouraged to use tools in each path to solve the problem.

Problem: {problem}

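SFT response format
Because the tool calls here happen inside parallel paths, the response cannot be returned as a multi-turn conversation the way a conventional tool-calling format would. The data format is therefore modified slightly: the response is a single turn, and <tool_call></tool_call><tool_response></tool_response> wrap each tool call and its returned result.
An example response: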

Step 1: Find out how many seashells Mimi picked up.
Mimi picked up 2 dozen seashells. 1 dozen = 12, so 2 dozen = 24.
Step 2: Find out how many seashells Kyle found.
Kyle found twice as many shells as Mimi.
<Parallel>
<Path>Approach 1: Use Python to calculate Kyle's seashells.<tool_call>{"name": "python", "arguments": "2 * 24"}</tool_call><tool_response>48</tool_response>Kyle found 48 seashells.</Path>
<Path>Approach 2: Calculate directly.
Mimi: 24 shells.
Kyle: 2 × 24 = 48 shells.</Path>
</Parallel>
<Summary>Both approaches confirm that Kyle found 48 seashells.</Summary>
Step 3: Leigh grabbed one-third of the shells that Kyle found.
<Parallel>
<Path>Approach 1: Use Python to calculate Leigh's seashells.<tool_call>{"name": "python", "arguments": "48 / 3"}</tool_call><tool_response>16.0</tool_response>Leigh had 16 seashells.</Path>
<Path>Approach 2: Calculate directly.
Kyle: 48 shells.
Leigh: 48 ÷ 3 = 16 shells.</Path>
</Parallel>
<Summary>Both approaches confirm that Leigh had 16 seashells.</Summary>
Final Answer: 16
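To make the single-turn format concrete, here is a minimal sketch of how one training sample might be assembled. The doubled braces in the prompt suggest it is filled in with Python's `str.format`, but that is an inference. The `messages` schema, the shortened `PROMPT_TEMPLATE`, and the helper name `build_sft_record` are assumptions for illustration, not the author's actual code.

```python
# Shortened stand-in for the user prompt quoted in this section (assumption).
# "{{" and "}}" are str.format escapes that render as literal "{" and "}".
PROMPT_TEMPLATE = (
    "Solve the following problem step by step.\n"
    'Call tools with: <tool_call>{{"name": "tool_name", "arguments": "..."}}</tool_call>\n'
    "Problem: {problem}"
)


def build_sft_record(problem: str, response: str) -> dict:
    """Return one single-turn sample: one user turn and one assistant turn.

    The whole reasoning trace, including <tool_call>/<tool_response> pairs,
    lives inside the single assistant message.
    """
    return {
        "messages": [
            {"role": "user", "content": PROMPT_TEMPLATE.format(problem=problem)},
            {"role": "assistant", "content": response},
        ]
    }


record = build_sft_record(
    "Mimi picked up 2 dozen seashells. How many is that?",
    "<Parallel><Path>2 x 12 = 24</Path></Parallel>"
    "<Summary>24 seashells.</Summary>Final Answer: 24",
)
```

Since there is no system message in the list, the system prompt stays empty, as described above.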

Dataset
Following Parallel-R1, GSM8K is used as the cold-start dataset for SFT. The main goal is to fix the format of the model's output.

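3. How is the SFT data generated?

3.1 Attempt 1: build a multi-agent system that calls a real Python interface and generates data by imitating the attention mask used in Parallel-R1.

The idea is as follows: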

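When <Parallel> is generated, stop and switch to parallel generation
Each path keeps its own independent message queue, imitating the attention mask in Parallel-R1
Stop when </tool_call> or </Path> is generated. If it is </tool_call>, parse the tool call from it, execute it in a Python sandbox, and continue reasoning; if it is </Path>, the current path is finished
After all paths have finished, insert the <Summary> tag at the end and generate the summary
When </Summary> is generated, stop and return to the first step to continue generating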
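The steps above can be sketched as a small orchestration loop. This is only an illustration under stated assumptions: `generate` is a hypothetical backend callback that returns the new text together with the stop string it hit (or `None` if the model finished on its own), `run_tool` stands in for the Python sandbox, and closing the block with `</Parallel>` before `<Summary>` is my reading of the format, not something the original steps spell out.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical backend: continue `prefix` until one of `stop` appears.
# Returns (new_text_without_the_stop_string, stop_string_hit_or_None).
Generate = Callable[[str, Sequence[str]], Tuple[str, Optional[str]]]


def run_path(context: str, generate: Generate, run_tool: Callable[[str], str]) -> str:
    """Generate one <Path> that sees only the shared context, never its siblings."""
    body = ""
    while True:
        text, hit = generate(context + "<Path>" + body, ["</tool_call>", "</Path>"])
        body += text
        if hit != "</tool_call>":
            return "<Path>" + body + "</Path>"  # path finished
        # Text after the last <tool_call> is the call to execute.
        call = body[body.rindex("<tool_call>") + len("<tool_call>"):]
        body += "</tool_call><tool_response>" + run_tool(call) + "</tool_response>"


def solve(prompt: str, generate: Generate, run_tool: Callable[[str], str],
          num_paths: int = 2, max_blocks: int = 4) -> str:
    out = ""
    for _ in range(max_blocks):
        text, hit = generate(prompt + out, ["<Parallel>"])
        out += text
        if hit is None:  # the model ended its answer
            return out
        context = prompt + out + "<Parallel>"
        # Each path gets the same context, imitating the attention mask.
        paths = [run_path(context, generate, run_tool) for _ in range(num_paths)]
        out += "<Parallel>" + "".join(paths) + "</Parallel><Summary>"
        text, _ = generate(prompt + out, ["</Summary>"])
        out += text + "</Summary>"
    return out
```

The loop depends on `generate` reporting which stop string fired, and that is the piece of information the problems described next make hard to get.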

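The main problems are:

Unlike Parallel-R1, the OpenAI API does not leave the stop token as the last token of the output when generation halts on it: the stop sequence is removed from the returned text.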

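from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
end_token = "<Parallel>"  # assumed; not defined in the original snippet

response = client.chat.completions.create(
    model="gpt-4.1",
    # The prompt says: "Generate text that contains '<Parallel>'"
    messages=[{"role": "user", "content": "生成包含'<Parallel>'的文本"}],
    max_tokens=50,
    stop=[end_token],
)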

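For example, after running the code above, the model outputs only "生成包含" (roughly "Generate text containing"), that is, the text before the stop token with the stop token itself missing. In a multi-agent system, this makes it hard to tell whether generation stopped because the model finished its answer or because it hit a specific token.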
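The ambiguity can be reproduced without calling any API. The sketch below is a local simulation, with `truncate_at_stop` a hypothetical helper that mimics an endpoint which cuts the text before the earliest stop sequence and drops the sequence itself. Three different underlying generations then come back as the same string.

```python
from typing import Sequence


def truncate_at_stop(full_text: str, stops: Sequence[str]) -> str:
    """Cut `full_text` just before the earliest stop sequence, dropping the sequence."""
    positions = (full_text.find(s) for s in stops)
    cut = min((i for i in positions if i != -1), default=len(full_text))
    return full_text[:cut]


stops = ["</tool_call>", "</Path>"]
hit_tool_call = truncate_at_stop("x = 2 * 24</tool_call>more text", stops)
hit_path_end = truncate_at_stop("x = 2 * 24</Path>more text", stops)
finished = truncate_at_stop("x = 2 * 24", stops)

# All three cases are indistinguishable from the returned text alone.
print(hit_tool_call == hit_path_end == finished)
```

With the Chat Completions API, `finish_reason` is `"stop"` both for a natural end and for a matched stop sequence, so that field alone does not separate the cases either.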

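Using agents to simulate the model's behavior often produces confused tags. For example, the model may generate a parallel tag inside the summary, so the parallel blocks expand into a tree, which makes later RL infeasible. Tuning the prompt of each agent individually would be too much engineering work.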
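One way to catch this kind of tag confusion automatically is a small structural check. The sketch below assumes the flat layout described in this post: `<Parallel>` and `<Summary>` only at the top level, `<Path>` only inside `<Parallel>`, and none of these three tags inside a `<Path>` or `<Summary>`. The function name and the rule table are illustrative assumptions.

```python
import re

TAG_RE = re.compile(r"</?(Parallel|Path|Summary)>")

# Which tags may open inside which parent (None means top level).
# Assumed from the response format described in this post.
ALLOWED_CHILDREN = {
    None: {"Parallel", "Summary"},
    "Parallel": {"Path"},
    "Path": set(),
    "Summary": set(),
}


def has_flat_structure(response: str) -> bool:
    """Return True if the tags are balanced and never nest into a tree."""
    stack = []
    for match in TAG_RE.finditer(response):
        name = match.group(1)
        if match.group(0).startswith("</"):
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            parent = stack[-1] if stack else None
            if name not in ALLOWED_CHILDREN[parent]:
                return False
            stack.append(name)
    return not stack  # every opened tag was closed


good = "<Parallel><Path>a</Path><Path>b</Path></Parallel><Summary>ok</Summary>Final Answer: 1"
bad = "<Parallel><Path>a</Path></Parallel><Summary>x<Parallel><Path>b</Path></Parallel></Summary>"
print(has_flat_structure(good), has_flat_structure(bad))
```

The second string is the failure mode described above: a parallel block opened inside the summary.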

3.2 Attempt 2: generate once directly with an LLM, then filter

The prompt used for LLM generation is:

Solve the following problem step by step.
During the reasoning process, whenever you encounter a step that may benefit from multiple perspectives or independent reasoning, insert a parallel block at that point.

follow this format:
1. Start with <Parallel> to begin parallel reasoning
2. Generate multiple <Path> blocks, each with a different approach
3. Within each <Path>, you can use Python tools:
• Call tools with: <tool_call>{{"name": "python", "arguments": "<code>"}}</tool_call>
• Tool responses will appear as: <tool_response>response content</tool_response>
4. After all paths, generate <Summary> to combine insights
5. Provide your final answer with the format: Final Answer: <answer>

Example:
<Parallel>
<Path>Approach 1: use Python<tool_call>{{"name": "python", "arguments": "123+456"}}</tool_call><tool_response>579</tool_response>Result: 579</Path>
<Path>Approach 2: direct calculation: 123+456=579</Path>
</Parallel>
<Summary>Both methods confirm the answer</Summary>
Final Answer: 579

Remind:
1. You can only use tools once in each path.
2. You can only use one time of parallel reasoning.
3. You can only use one time of tool call in each path.
4. You are encouraged to use tools in each path to solve the problem.
5. You can mock the Python tools response

Problem: {PROBLEM}

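The model generates the whole response in a single pass. The intermediate tool-call results are mocked directly, and no real Python environment is called.
In practice, gpt-4.1 follows these instructions reasonably well. The example response shown earlier, for instance, was generated by a single call to gpt-4.1 with this prompt. This is the approach adopted for now.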
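The post does not spell out the filtering criteria, so the following is only one plausible filter, written as a sketch. It keeps a sample when the final answer matches the GSM8K reference (whose answer field ends with `#### <number>`) and no path contains more than one tool call, as the prompt requires. The helper names are hypothetical.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after 'Final Answer:' up to the end of that line."""
    match = re.search(r"Final Answer:\s*(.+)", response)
    return match.group(1).strip() if match else None


def gsm8k_gold(answer_field: str) -> str:
    """GSM8K reference answers end with '#### <number>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def keep_sample(response: str, answer_field: str) -> bool:
    # Rule from the prompt: at most one tool call in each path.
    paths = re.findall(r"<Path>(.*?)</Path>", response, flags=re.S)
    if any(path.count("<tool_call>") > 1 for path in paths):
        return False
    predicted = extract_final_answer(response)
    if predicted is None:
        return False
    try:
        return float(predicted.replace(",", "")) == float(gsm8k_gold(answer_field))
    except ValueError:
        return False  # non-numeric answer: drop the sample


response = (
    "<Parallel><Path>48 / 3 = 16</Path><Path>one third of 48 is 16</Path></Parallel>"
    "<Summary>Both approaches give 16.</Summary>Final Answer: 16"
)
print(keep_sample(response, "Leigh had 48 / 3 = 16 seashells.\n#### 16"))
```

Because the tool responses are mocked, a stricter filter could also re-execute each tool call and compare the result with the mocked `<tool_response>`. That is left out here.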