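Example: Extracting and Computing Complex Numerical Values with an LLM

An earlier post explored using an LLM to extract simple numerical values from a long text and compute with them.

https://blog.csdn.net/liliang199/article/details/159244753

This post goes a step further: extracting values that span two documents and computing metrics from them.

The materials and code are adapted from online resources.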

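1 Obtaining the documents

1.1 Downloading the data

The text of Apple's 10-K filings for fiscal years 2022 and 2023 is taken from SEC EDGAR.

The links are:

aapl-20220924

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

aapl-20230930

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm

To keep things simple, open each link in a browser, select all the text, copy it, and paste it into a local file.

Save the two files as aapl-20220924.txt and aapl-20230930.txt.

The two documents are then merged. If their combined token count fits within the 128K context window, they can simply be concatenated.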

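# Read the two 10-K text files saved above
with open("aapl-20230930.txt", "r", encoding="utf-8") as f:
    text_2023 = f.read()
with open("aapl-20220924.txt", "r", encoding="utf-8") as f:
    text_2022 = f.read()

print(f"2023 length: {len(text_2023)} characters")
print(f"2022 length: {len(text_2022)} characters")

The output is:

2023 length: 203704 characters
2022 length: 218592 characters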

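1.2 Estimating the token count

tiktoken is used to estimate the total number of tokens after the two documents are merged. The example program is shown below.

import tiktoken

def num_tokens(text):
    # cl100k_base is an OpenAI encoding; other models tokenize differently,
    # so treat the count as an estimate
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

tokens_2023 = num_tokens(text_2023)
tokens_2022 = num_tokens(text_2022)
total_tokens = tokens_2022 + tokens_2023
print(f"2023 tokens: {tokens_2023}, 2022 tokens: {tokens_2022}, total: {total_tokens}")

if total_tokens < 120000:
    # Small enough: concatenate, with a header marking each filing
    combined_text = (
        "=== Apple Inc. FY2022 10-K ===\n" + text_2022
        + "\n\n=== Apple Inc. FY2023 10-K ===\n" + text_2023
    )
else:
    # Over budget: truncate or use RAG instead.
    # Naive truncation by character count; it may drop needed information.
    combined_text = text_2022[:60000] + text_2023[:60000]

The output is shown below: about 92K tokens, which fits within the 128K window.

2023 tokens: 45185, 2022 tokens: 47670, total: 92855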
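
As a rough sanity check on the numbers above, the sketch below tests whether a set of documents still fits a context window once room is reserved for the instructions and the reply. The function name `fits_context`, the 128,000-token window, and the 2,000-token prompt overhead are illustrative assumptions; the 1,500-token reply reserve mirrors the max_tokens value used in the API call in section 2.2.

```python
def fits_context(doc_tokens, window=128_000, prompt_overhead=2_000, reply_reserve=1_500):
    """Return (fits, remaining) for a list of per-document token counts.

    window, prompt_overhead and reply_reserve are assumed values for illustration.
    """
    used = sum(doc_tokens) + prompt_overhead + reply_reserve
    return used <= window, window - used


# Token counts reported above for the 2023 and 2022 filings
fits, remaining = fits_context([45185, 47670])
print(f"fits: {fits}, remaining tokens: {remaining}")
```

Under these assumptions the two filings leave a little over 30K tokens of headroom. A model that tokenizes differently from cl100k_base may give a different margin.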

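2 Extraction and computation

This section first describes the data to extract and the metrics to compute.

The data comes from two different documents, for example total revenue for fiscal 2022 and total revenue for fiscal 2023.

Some metrics combine data from both documents, for example the revenue growth rate and the change in R&D expense as a share of revenue.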
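
For reference, the metrics used in this post can be written out as follows. These are the usual textbook definitions and agree with the verification code in section 2.3. The symbols are notation introduced here for illustration: R is total revenue, C cost of sales, N net income, D R&D expense, L total liabilities, A total assets, O operating cash flow, and X capital expenditure, with the subscript giving the fiscal year y.

```latex
\begin{aligned}
\text{revenue growth} &= \frac{R_{2023} - R_{2022}}{R_{2022}} \times 100\% \\
\text{gross margin}_{y} &= \frac{R_{y} - C_{y}}{R_{y}} \times 100\% \\
\text{net profit margin}_{y} &= \frac{N_{y}}{R_{y}} \times 100\% \\
\text{R\&D share change} &= \left( \frac{D_{2023}}{R_{2023}} - \frac{D_{2022}}{R_{2022}} \right) \times 100 \ \text{pp} \\
\text{debt-to-assets}_{2023} &= \frac{L_{2023}}{A_{2023}} \times 100\% \\
\text{free cash flow}_{2023} &= O_{2023} - X_{2023}
\end{aligned}
```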

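2.1 Prompt

A prompt specifies which data to extract and which metrics to compute.

The prompt should state the task clearly, give the calculation requirements, and ask for structured JSON output.

It also includes a chain-of-thought instruction so that the model reasons first and only then fills in the JSON fields.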

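prompt = f"""
You are an experienced financial analyst. Below are excerpts from Apple's 10-K annual reports for fiscal years 2022 and 2023.
Read them carefully, extract the required financial data, and complete the calculations below. Express all amounts in **millions of US dollars**.

**Raw data to extract (must be found in the text):**
- revenue_2023: total revenue for fiscal 2023
- revenue_2022: total revenue for fiscal 2022
- cogs_2023: cost of sales for fiscal 2023
- cogs_2022: cost of sales for fiscal 2022
- net_income_2023: net income for fiscal 2023
- net_income_2022: net income for fiscal 2022
- r_and_d_2023: R&D expense for fiscal 2023
- r_and_d_2022: R&D expense for fiscal 2022
- total_assets_2023: total assets at the end of fiscal 2023
- total_liabilities_2023: total liabilities at the end of fiscal 2023
- operating_cash_flow_2023: cash flow from operating activities for fiscal 2023
- capital_expenditure_2023: capital expenditure for fiscal 2023 (usually the cash outflow for "purchases of property, plant and equipment")

**Metrics to compute (calculate them from the data extracted above and include them in the JSON):**
- revenue_growth: revenue growth rate, formatted like "8.5%"
- gross_margin_2023: gross margin for 2023, formatted like "40.2%"
- gross_margin_2022: gross margin for 2022, formatted like "39.8%"
- net_profit_margin_2023: net profit margin for 2023, formatted like "25.0%"
- net_profit_margin_2022: net profit margin for 2022, formatted like "24.5%"
- r_and_d_pct_change: change in R&D expense as a share of revenue (percentage points), e.g. "+0.5pp"
- debt_to_assets_2023: debt-to-assets ratio for 2023, formatted like "80.1%"
- free_cash_flow_2023: free cash flow for 2023 (a number, in millions of US dollars)

**Output JSON containing all of the fields above.** The JSON object must contain every key listed above, with correct numeric or string values (percentages as strings, amounts as numbers).

Text:
{combined_text}

Reason step by step, then output the JSON.
"""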
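
Because the prompt fixes both the key names and the value types, the reply can be checked mechanically once it has been parsed, before any number is trusted. The sketch below is one way to do that. `validate_result`, `is_number`, and the two regular expressions are illustrative assumptions, not part of the original code.

```python
import re

RAW_KEYS = [
    "revenue_2023", "revenue_2022", "cogs_2023", "cogs_2022",
    "net_income_2023", "net_income_2022", "r_and_d_2023", "r_and_d_2022",
    "total_assets_2023", "total_liabilities_2023",
    "operating_cash_flow_2023", "capital_expenditure_2023",
]
PERCENT_KEYS = [
    "revenue_growth", "gross_margin_2023", "gross_margin_2022",
    "net_profit_margin_2023", "net_profit_margin_2022", "debt_to_assets_2023",
]
# "8.5%" or "-2.80%" for percentages; "+0.5pp" for the percentage-point field
PERCENT_RE = re.compile(r"^[+-]?\d+(\.\d+)?%$")
PP_RE = re.compile(r"^[+-]?\d+(\.\d+)?pp$")


def is_number(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_result(result):
    """Return a list of problems found in the parsed JSON (empty if none)."""
    problems = []
    for key in RAW_KEYS + ["free_cash_flow_2023"]:
        if key not in result:
            problems.append(f"missing: {key}")
        elif not is_number(result[key]):
            problems.append(f"not a number: {key}")
    for key in PERCENT_KEYS:
        if key not in result:
            problems.append(f"missing: {key}")
        elif not (isinstance(result[key], str) and PERCENT_RE.match(result[key])):
            problems.append(f"not a percentage string: {key}")
    pp = result.get("r_and_d_pct_change")
    if pp is None:
        problems.append("missing: r_and_d_pct_change")
    elif not (isinstance(pp, str) and PP_RE.match(pp)):
        problems.append("not a percentage-point string: r_and_d_pct_change")
    return problems
```

Calling `validate_result(result)` on the parsed reply returns an empty list when every field is present and well-formed. This checks shape only; whether the numbers are right still needs a comparison against the filings.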

2.2 Calling the LLM

With the prompt ready, the next step calls the API to obtain the LLM's output and reasoning trace, then parses the result.

import json

# `client` (an OpenAI-compatible client) and `model_name` are assumed to be defined beforehand
response = client.chat.completions.create(
    model=model_name,  # a model that supports JSON mode
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=1500,
    response_format={"type": "json_object"},  # force JSON output
)

# Extract the returned JSON string
message = response.choices[0].message
content = message.content
# reasoning_content is a provider-specific field that not every API returns, hence getattr
reasoning_content = getattr(message, "reasoning_content", None)
print("Raw model output:", content)
print("Model reasoning:", reasoning_content)

# Parse the JSON
try:
    result = json.loads(content)
    print("\nParsed result:")
    print(json.dumps(result, indent=2, ensure_ascii=False))
except json.JSONDecodeError as e:
    print("JSON parsing failed:", e)
    # A regex-based extraction could be tried here

An example output follows. In the reasoning trace (reasoning_content), the LLM first extracts the raw data, then lists the metrics to compute and the data each one needs.

It then gives the computed values and verifies them.

The trace spells out almost no formulas, yet the verification code in section 2.3 shows that the computed metrics match the reference values, except for net_profit_margin_2023, which differs in the last digit (25.30% versus 25.31%).

Presumably the LLM has internalized these formulas and does not need to state them explicitly.

Raw model output: {
"revenue_2023": 383285,
"revenue_2022": 394328,
"cogs_2023": 214137,
"cogs_2022": 223546,
"net_income_2023": 96995,
"net_income_2022": 99803,
"r_and_d_2023": 29915,
"r_and_d_2022": 26251,
"total_assets_2023": 352583,
"total_liabilities_2023": 290437,
"operating_cash_flow_2023": 110543,
"capital_expenditure_2023": 10959,
"revenue_growth": "-2.80%",
"gross_margin_2023": "44.13%",
"gross_margin_2022": "43.31%",
"net_profit_margin_2023": "25.30%",
"net_profit_margin_2022": "25.31%",
"r_and_d_pct_change": "+1.15pp",
"debt_to_assets_2023": "82.37%",
"free_cash_flow_2023": 99584
}
Model reasoning: The user wants me to extract financial data from Apple's 2022 and 2023 10-K reports and calculate specific financial metrics. I need to carefully read through the provided text to find the required data points.

Let me identify what I need to extract:

**Raw Data to Extract:**
1. revenue_2023: 2023 fiscal year total revenue
2. revenue_2022: 2022 fiscal year total revenue
3. cogs_2023: 2023 fiscal year cost of sales
4. cogs_2022: 2022 fiscal year cost of sales
5. net_income_2023: 2023 fiscal year net income
6. net_income_2022: 2022 fiscal year net income
7. r_and_d_2023: 2023 fiscal year R&D expense
8. r_and_d_2022: 2022 fiscal year R&D expense
9. total_assets_2023: 2023 fiscal year end total assets
10. total_liabilities_2023: 2023 fiscal year end total liabilities
11. operating_cash_flow_2023: 2023 fiscal year operating cash flow
12. capital_expenditure_2023: 2023 fiscal year capital expenditure (purchase of property, plant and equipment)

**Calculated Metrics:**
1. revenue_growth: Revenue growth rate
2. gross_margin_2023: 2023 gross margin percentage
3. gross_margin_2022: 2022 gross margin percentage
4. net_profit_margin_2023: 2023 net profit margin percentage
5. net_profit_margin_2022: 2022 net profit margin percentage
6. r_and_d_pct_change: R&D as % of revenue change (percentage points)
7. debt_to_assets_2023: 2023 debt-to-assets ratio (liabilities/assets)
8. free_cash_flow_2023: 2023 free cash flow (operating cash flow - capital expenditure)


I'll start by locating the 2023 fiscal year data in the 2023 10-K report. The Consolidated Statements of Operations shows total net sales of $383,285 million for the year ended September 30, 2023, with cost of sales at $214,137 million and net income of $96,995 million. Research and development expenses totaled $29,915 million. From the Consolidated Balance Sheets, I can see total assets were $352,583 million and total liabilities were $290,437 million as of September 30, 2023. The Consolidated Statements of Cash Flows shows operating cash flow of $110,543 million and capital expenditures (payments for acquisition of property, plant and equipment) of $10,959 million for 2023.

Now I need to find the corresponding 2022 data from the 2022 10-K report to complete my analysis.

From the 2022 10-K Consolidated Statements of Operations, total net sales were $394,328 million for the year ended September 24, 2022, with cost of sales at $223,546 million and net income of $99,803 million. Research and development expenses were $26,251 million. The Consolidated Balance Sheets show total assets of $352,755 million and total liabilities of $302,083 million as of September 24, 2022. Operating cash flow from the Consolidated Statements of Cash Flows was $122,151 million, and capital expenditures were $10,708 million for 2022.

Now I can calculate the key financial metrics. Revenue declined 2.81% year-over-year from $394,328 million to $383,285 million. Gross margin improved slightly from 43.31% in 2022 to 44.13% in 2023. Net profit margin decreased from 25.31% to 25.30%. R&D spending as a percentage of revenue increased from 6.66% to 7.81%, representing a 1.15 percentage point increase. The debt-to-assets ratio improved from 85.63% to 82.37%. Free cash flow for 2023 was $99,584 million after subtracting capital expenditures from operating cash flow.

I should verify these calculations against the source documents to ensure accuracy before finalizing the analysis.

From the 2023 10-K Consolidated Statements of Operations, I can confirm: Total net sales of $383,285 million, cost of sales at $214,137 million, net income of $96,995 million, and R&D expenses of $29,915 million. The Balance Sheet shows total assets of $352,583 million and total liabilities of $290,437 million. The Cash Flow Statement reports operating cash flow of $110,543 million and capital expenditures of $10,959 million.

For 2022, the 10-K shows: Total net sales of $394,328 million, cost of sales at $223,546 million, net income of $99,803 million, and R&D expenses of $26,251 million. The Balance Sheet indicates total assets of $352,755 million and total liabilities of $302,083 million. Operating cash flow was $122,151 million with capital expenditures of $10,708 million.

Now I'll calculate the key metrics. Revenue growth came in at -2.80%, reflecting the year-over-year decline. Gross margin improved to 44.13% in 2023 from 43.31% in 2022. Net profit margin remained relatively stable at 25.30% versus 25.31%. R&D intensity increased by 1.15 percentage points to 7.81% of revenue. The debt-to-assets ratio decreased to 82.37% from the prior year. Free cash flow for 2023 totaled $99,584 million after subtracting capital expenditures from operating cash flow.

Parsed result:
{
  "revenue_2023": 383285,
  "revenue_2022": 394328,
  "cogs_2023": 214137,
  "cogs_2022": 223546,
  "net_income_2023": 96995,
  "net_income_2022": 99803,
  "r_and_d_2023": 29915,
  "r_and_d_2022": 26251,
  "total_assets_2023": 352583,
  "total_liabilities_2023": 290437,
  "operating_cash_flow_2023": 110543,
  "capital_expenditure_2023": 10959,
  "revenue_growth": "-2.80%",
  "gross_margin_2023": "44.13%",
  "gross_margin_2022": "43.31%",
  "net_profit_margin_2023": "25.30%",
  "net_profit_margin_2022": "25.31%",
  "r_and_d_pct_change": "+1.15pp",
  "debt_to_assets_2023": "82.37%",
  "free_cash_flow_2023": 99584
}
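
The parsing code in this section leaves the fallback for a failed json.loads as a comment. A minimal sketch of one such fallback is shown below: it looks for the outermost {...} span in the reply and tries to parse that. The helper name `extract_json_object` is hypothetical, and the approach assumes the reply contains a single top-level JSON object, possibly surrounded by extra text.

```python
import json
import re


def extract_json_object(text):
    """Best-effort parse of a JSON object embedded in free-form text.

    Returns the parsed dict, or None if nothing parseable is found.
    """
    candidates = [text]
    # Greedy match from the first "{" to the last "}" (DOTALL lets "." span newlines)
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


reply = 'Here is the result: {"revenue_2023": 383285, "revenue_growth": "-2.80%"} Done.'
print(extract_json_object(reply))
# {'revenue_2023': 383285, 'revenue_growth': '-2.80%'}
```

The greedy pattern is deliberately simple. It fails if the surrounding text contains stray braces, in which case returning None and retrying the request is probably safer than guessing.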

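2.3 Comparison with the actual figures

The model's accuracy is evaluated by comparing its output with the actual figures from the filings.

Example code: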

# Actual data (in millions of US dollars)
real_data = {
    "revenue_2023": 383285,
    "revenue_2022": 394328,
    "cogs_2023": 214137,
    "cogs_2022": 223546,
    "net_income_2023": 96995,
    "net_income_2022": 99803,
    "r_and_d_2023": 29915,
    "r_and_d_2022": 26251,
    "total_assets_2023": 352583,
    "total_liabilities_2023": 290437,
    "operating_cash_flow_2023": 110543,
    "capital_expenditure_2023": 10959,
}

# Metrics computed from the actual data
real_metrics = {
    "revenue_growth": f"{(real_data['revenue_2023'] - real_data['revenue_2022']) / real_data['revenue_2022'] * 100:.2f}%",
    "gross_margin_2023": f"{(real_data['revenue_2023'] - real_data['cogs_2023']) / real_data['revenue_2023'] * 100:.2f}%",
    "gross_margin_2022": f"{(real_data['revenue_2022'] - real_data['cogs_2022']) / real_data['revenue_2022'] * 100:.2f}%",
    "net_profit_margin_2023": f"{real_data['net_income_2023'] / real_data['revenue_2023'] * 100:.2f}%",
    "net_profit_margin_2022": f"{real_data['net_income_2022'] / real_data['revenue_2022'] * 100:.2f}%",
    "r_and_d_pct_change": f"{(real_data['r_and_d_2023'] / real_data['revenue_2023'] - real_data['r_and_d_2022'] / real_data['revenue_2022']) * 100:+.2f}pp",
    "debt_to_assets_2023": f"{real_data['total_liabilities_2023'] / real_data['total_assets_2023'] * 100:.2f}%",
    "free_cash_flow_2023": real_data['operating_cash_flow_2023'] - real_data['capital_expenditure_2023'],
}

# Compare with the model output
for key in real_metrics:
    if key in result:
        pred = result[key]
        real = real_metrics[key]
        print(f"{key}: predicted {pred} vs actual {real}")
    else:
        print(f"Warning: model output is missing field {key}")

Example output is shown below. The LLM's results match the reference metrics, except for net_profit_margin_2023, which is off by 0.01 percentage point.

revenue_growth: predicted -2.80% vs actual -2.80%
gross_margin_2023: predicted 44.13% vs actual 44.13%
gross_margin_2022: predicted 43.31% vs actual 43.31%
net_profit_margin_2023: predicted 25.30% vs actual 25.31%
net_profit_margin_2022: predicted 25.31% vs actual 25.31%
r_and_d_pct_change: predicted +1.15pp vs actual +1.15pp
debt_to_assets_2023: predicted 82.37% vs actual 82.37%
free_cash_flow_2023: predicted 99584 vs actual 99584
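
The comparison above prints the values side by side and leaves the judgement to the reader. The sketch below turns it into an automatic check: it parses the percentage strings into numbers and allows a small tolerance, so that a last-digit difference such as 25.30% versus 25.31% is flagged as a near match instead of being overlooked. The helper names `to_number` and `compare` and the 0.01 tolerance are illustrative assumptions.

```python
def to_number(value):
    """Convert 99584, "25.30%" or "+1.15pp" to a float."""
    if isinstance(value, str):
        return float(value.strip().replace("pp", "").replace("%", ""))
    return float(value)


def compare(pred, real, tol=0.01):
    """Return 'exact', 'close' (within tol) or 'mismatch'."""
    diff = abs(to_number(pred) - to_number(real))
    if diff < 1e-9:
        return "exact"
    # The small epsilon guards against floating-point noise at the boundary
    return "close" if diff <= tol + 1e-9 else "mismatch"


# A few (predicted, actual) pairs taken from the output above
pairs = {
    "net_profit_margin_2023": ("25.30%", "25.31%"),
    "r_and_d_pct_change": ("+1.15pp", "+1.15pp"),
    "free_cash_flow_2023": (99584, 99584),
}
for key, (pred, real) in pairs.items():
    print(f"{key}: {compare(pred, real)}")
```

A fixed absolute tolerance suits the percentage fields here; for raw dollar amounts an exact match is the more natural requirement.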

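Apple's 10-K financial data for fiscal 2022 and 2023 is summarized below.

Metric                     2023          2022
Total revenue              $383,285 M    $394,328 M
Cost of sales              $214,137 M    $223,546 M
Net income                 $96,995 M     $99,803 M
R&D expense                $29,915 M     $26,251 M
Total assets               $352,583 M    $352,755 M (end of FY2022)
Total liabilities          $290,437 M    $302,083 M (end of FY2022)
Operating cash flow        $110,543 M    $122,151 M
Capital expenditure        $10,959 M     $10,708 M

Data sources: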

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm

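References

---

LLM numerical extraction and computation: example scenarios

https://blog.csdn.net/liliang199/article/details/159244753

Exploring the relationship between LLM long context and valid numerical output

https://blog.csdn.net/liliang199/article/details/159175752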
