当前位置：首页 > news >正文

微调LLM提升工具调用能力的ShareGPT数据格式

news 2026/7/5 15:20:25

1. 引入

我们做LLM微调，最常见的是Alpaca格式的数据，Alpaca是最简单、最通用的指令微调格式。它没有对话历史，只有指令 (Instruction) + 输入 (Input) + 输出 (Output)，适合单轮指令学习。

如果我们想微调LLM更好的做工具调用等Agent功能，要怎么做呢？

2. ShareGPT格式

相比 Alpaca 格式的数据集，ShareGPT格式支持更多的角色种类，例如 human、gpt、observation、function 等等。它们构成一个对象列表呈现在 conversations 列中。注意其中 human 和 observation 必须出现在奇数位置，gpt 和 function 必须出现在偶数位置。默认所有的 gpt 和 function 会被用于学习（参考1）。

如下是一个ShareGPT格式的多轮对话数据样例：

[{"system":"你是贴心中文助手，回答简洁通俗","conversations":[{"from":"human","value":"推荐一部国产动画"},{"from":"gpt","value":"《长安三万里》画面唯美，诗词氛围感拉满。"},{"from":"human","value":"还有短篇吗？"},{"from":"gpt","value":"《中国奇谭》，每一集独立小故事。"}]}]

3. 带工具调用的ShareGPT格式

根据参考2，我们可以看到这样的数据格式：

[{"conversations":[{"from":"human","value":"我需要为John Doe生成一张发票。他购买了2个苹果，每个$1，以及3根香蕉，每根$0.5。"},{"from":"function_call","value":"{\"name\": \"generate_invoice\", \"arguments\": {\"customer_name\": \"约翰·多伊\", \"items\": [{\"name\": \"苹果\", \"quantity\": 2, \"price\": 1}, {\"name\": \"香蕉\", \"quantity\": 3, \"price\": 0.5}]}}"},{"from":"observation","value":"{\"invoice_id\": \"INV12345\", \"customer_name\": \"约翰·多伊\", \"items\": [{\"name\": \"苹果\", \"quantity\": 2, \"price\": 1, \"total\": 2}, {\"name\": \"香蕉\", \"quantity\": 3, \"price\": 0.5, \"total\": 1.5}], \"total\": 3.5, \"status\": \"生成\"}"},{"from":"gpt","value":"发票已成功生成。发票编号为INV12345。约翰·多伊的总金额为$3.5。发票包含2个苹果，总金额为$2，以及3根香蕉，总金额为$1.5。"}],"tools":"[{\"name\": \"generate_invoice\", \"description\": \"生成发票\", \"parameters\": {\"type\": \"object\", \"properties\": {\"customer_name\": {\"type\": \"string\", \"description\": \"客户名称\"}, \"items\": {\"type\": \"array\", \"items\": {\"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\", \"description\": \"The item name\"}, \"quantity\": {\"type\": \"integer\", \"description\": \"The quantity of the item\"}, \"price\": {\"type\": \"number\", \"description\": \"The price per unit\"}}, \"required\": [\"name\", \"quantity\", \"price\"]}}}, \"required\": [\"customer_name\", \"items\"]}}, {\"name\": \"generate_password\", \"description\": \"生成随机密码\", \"parameters\": {\"type\": \"object\", \"properties\": {\"length\": {\"type\": \"integer\", \"description\": \"密码的长度\"}}, \"required\": [\"length\"]}}]"}]

这是一个带工具调用（Function Calling）的 ShareGPT 对话格式，专门用来训练大模型学会调用工具 / 函数（比如生成发票、查天气、调用 API）。

上面这个数据样例，抽象出来，它的结构是这样的：

[{"conversations":[对话流程：用户提问 → 模型调用工具 → 工具返回结果 → 模型回答],"tools":[告诉模型：你可以用哪些函数]}]

4. 总结

本文介绍了三种用于LLM微调的数据格式，各有侧重：

（1）.Alpaca格式：最简单通用的指令微调格式，包含“指令+输入+输出”三元组，适合单轮指令学习，但不支持对话历史。

（2）.ShareGPT格式：支持多轮对话，包含多种角色（human、gpt、observation、function等），对话以对象列表形式呈现在conversations列中，其中human和observation必须出现在奇数位置，gpt和function必须出现在偶数位置。

（3）.带工具调用的ShareGPT格式：在ShareGPT格式基础上扩展，专门用于训练大模型进行工具调用（Function Calling）。数据包含conversations（对话流程：用户提问→模型调用工具→工具返回结果→模型回答）和tools（可用的函数列表）两部分，使模型能够学习如何调用外部工具/API。

这三种格式从简单到复杂，分别适用于不同的微调场景：基础指令学习、多轮对话能力训练、以及工具调用/Agent功能开发。

5. 参考

https://github.com/hiyouga/LlamaFactory/blob/main/data/README_zh.md
https://github.com/hiyouga/LlamaFactory/blob/main/data/glaive_toolcall_zh_demo.json

查看全文

http://www.jsqmd.com/news/1128954/