
LLM Evaluation from Beginner to Mastery - A First Look at the Core Concepts

news 2026/6/23 5:18:38

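Table of Contents

- Core Concepts - Think Like an Expert
  - 2.1 Test Case: The Atomic Unit of Evaluation
    - Single-Turn vs Multi-Turn Test Cases
  - 2.2 Metric: Your Evaluation Yardstick
  - 2.3 Dataset: The Ammunition Depot for Testing
  - 2.4 Evaluation Methods: Three Ways to Run
    - Method 1: Using `assert_test` (Pytest Style)
    - Method 2: Using the `evaluate` Function
    - Method 3: Running a Metric Standalone
  - 2.5 Core Architecture Diagram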

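Core Concepts - Think Like an Expert

Posting this one late at night after finishing overtime. I hope everyone, as the title says, thinks like an expert and doesn't add to their teammates' burden.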

2.1 Test Case: The Atomic Unit of Evaluation

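In DeepEval, a test case is the smallest unit of evaluation. Think of it as an "exam paper" that contains:

Question (input): the user's question or instruction
Student's answer (actual_output): the output of your LLM application
Reference answer (expected_output): the ideal response you expect (optional)
Reference material (context): background knowledge supplied to the LLM (optional)
Retrieved results (retrieval_context): documents retrieved by the RAG system (optional)
Tool calls (tools_called): the list of tools the Agent called (optional)
Expected tools (expected_tools): the tools the Agent is expected to call (optional)

Key takeaway for beginners 🎯
Think of LLMTestCase as a dictionary/JSON object that holds all the information an evaluation needs.
DeepEval takes this information and "asks" the judge LLM: "Hey, how good is this answer?"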
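
The dictionary/JSON analogy can be made concrete with a small sketch. It uses a plain dataclass named SimpleTestCase, which is a hypothetical stand-in written for this article and not DeepEval's own LLMTestCase. It only illustrates that a test case is a bag of named fields, some of them optional.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class SimpleTestCase:
    """Toy stand-in for a test case; NOT DeepEval's LLMTestCase."""
    input: str                                      # the question
    actual_output: str                              # the student's answer
    expected_output: Optional[str] = None           # the reference answer
    context: Optional[List[str]] = None             # background knowledge
    retrieval_context: Optional[List[str]] = None   # documents retrieved by RAG


def filled_fields(case: SimpleTestCase) -> dict:
    """Return only the fields that were actually provided."""
    return {k: v for k, v in asdict(case).items() if v is not None}


case = SimpleTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
print(filled_fields(case))
```

Only the two required fields show up in the printed dictionary; the optional ones appear once you supply them.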

    from deepeval.test_case import LLMTestCase, ToolCall

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",  # user question
        expected_output="You're eligible for a 30 day refund.",  # expected answer
        actual_output="We offer a 30-day full refund at no extra cost.",  # actual answer
        context=["All customers are eligible for a 30 day full refund at no extra cost."],  # context
        retrieval_context=["Only shoes can be refunded."],  # retrieved documents
        tools_called=[ToolCall(name="WebSearch")],  # tools that were called
    )

Single-Turn vs Multi-Turn Test Cases

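    from deepeval.test_case import ConversationalTestCase, LLMTestCase, Turn

    # Single-turn test: one question, one answer
    single_turn = LLMTestCase(
        input="What's the weather today?",
        actual_output="It's sunny and 25°C.",
    )

    # Multi-turn test: a complete conversation.
    # Recent DeepEval releases model each message as Turn(role, content);
    # older releases accepted a list of LLMTestCase objects here instead.
    multi_turn = ConversationalTestCase(
        scenario="Customer asking about return policy",
        turns=[
            Turn(role="user", content="Hi, I want to return my shoes"),
            Turn(role="assistant", content="Sure, I can help with that..."),
            Turn(role="user", content="How long do I have?"),
            Turn(role="assistant", content="You have 30 days..."),
        ],
    )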

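LLMTestCase is the basic unit of evaluation and represents a single LLM interaction:

Parameter	Type	Required	Description
`input`	str	✅	Input sent to the LLM
`actual_output`	str	✅	Output generated by the LLM
`expected_output`	str	❌	Expected output (reference answer)
`retrieval_context`	List[str]	❌	Context retrieved by RAG
`context`	List[str]	❌	Additional background information
`tools_called`	List[ToolCall]	❌	Tools the Agent called
`expected_tools`	List[ToolCall]	❌	Tools the Agent is expected to call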

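    # The demo from the official site, included here for reference
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="Who is the current president of the United States?",
        actual_output="Joe Biden is the current president of the United States.",
        expected_output="Joe Biden",
        retrieval_context=["Joe Biden currently serves as president of the United States."],
    )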

2.2 Metric: Your Evaluation Yardstick

A metric is the standard an evaluation is measured against. DeepEval provides 50+ metrics, grouped into several families:

DeepEval metric families

Family	Metrics
📊 RAG metrics	Faithfulness, Answer Relevancy, Contextual Precision/Recall/Relevancy
🤖 Agent metrics	Task Completion, Tool Correctness, Step Efficiency, Plan Adherence
💬 Conversational metrics	Conversation Relevancy, Knowledge Retention, Role Adherence
🛡️ Safety metrics	Bias, Toxicity, PII Leakage, Misuse, Non-Advice
⚙️ General metrics	Hallucination, Summarization, JSON Correctness, Ragas
🎨 Custom metrics	G-Eval, DAGMetric

Each metric returns:

score: a score between 0 and 1
reason: an explanation of why the score was given
success: whether the score meets the threshold (score >= threshold)

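Key takeaway for beginners 🎯
What is score? It works like an exam grade: 0 is the worst and 1 is the best.
What is threshold? The pass mark. With threshold=0.7, for example, only a score >= 0.7 counts as a pass.
What is reason? The judge LLM's written comment explaining why it gave that score. It is extremely useful when debugging!
Sample output:
Score: 0.85
Reason: The response directly answers the user's question about return policy and provides accurate information consistent with the context.
Success: True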

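Common pitfalls ⚠️
Pitfall 1: Treating score as a percentage → it is actually a decimal between 0 and 1, so 0.85 = 85%
Pitfall 2: Setting threshold too high → start somewhere in the 0.5-0.7 range and adjust gradually
Pitfall 3: Ignoring reason → reason is your best debugging tool, so always read it!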
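
How score, threshold and success relate can be sketched in a few lines. ToyMetricResult below is a hypothetical class made up for illustration, not a DeepEval type; it assumes only the rule stated above, that a result passes when score >= threshold.

```python
from dataclasses import dataclass


@dataclass
class ToyMetricResult:
    """Illustrative container; NOT a DeepEval class."""
    score: float       # decimal between 0 and 1
    reason: str        # the judge's explanation
    threshold: float   # the pass mark

    @property
    def success(self) -> bool:
        # A result passes when the score reaches the threshold.
        return self.score >= self.threshold

    def as_percent(self) -> str:
        # Convert the 0-1 decimal into a human-friendly percentage.
        return f"{self.score * 100:.0f}%"


result = ToyMetricResult(
    score=0.85,
    reason="The response directly answers the user's question.",
    threshold=0.7,
)
print(result.success, result.as_percent())
```

Keeping the score as a 0-1 decimal and converting to a percentage only for display avoids Pitfall 1.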

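2.3 Dataset: The Ammunition Depot for Testing

A dataset is a collection of test cases, stored as goldens. You can build one in code, load it from a CSV or JSON file, or pull it from the Confident AI cloud: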

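    from deepeval.dataset import EvaluationDataset, Golden

    # Create a dataset
    dataset = EvaluationDataset(goldens=[
        Golden(input="What is DeepEval?", expected_output="An LLM evaluation framework."),
        Golden(input="How to install?", expected_output="pip install deepeval"),
    ])

    # Load from CSV (input_col_name names the column holding the input;
    # "input" is an assumption about your file's layout)
    dataset.add_goldens_from_csv_file(file_path="test_data.csv", input_col_name="input")

    # Load from JSON (input_key_name is likewise an assumption about your file)
    dataset.add_goldens_from_json_file(file_path="test_data.json", input_key_name="input")

    # Pull from the Confident AI cloud
    dataset.pull(alias="My Production Dataset")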
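
Note that the goldens above carry an input and an expected_output but no actual_output: that field gets filled in at evaluation time by calling your application. A minimal sketch of that step, using plain dictionaries rather than DeepEval objects, might look like the following. Both my_llm_app and goldens_to_cases are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List


def my_llm_app(prompt: str) -> str:
    """Hypothetical placeholder for your real LLM application."""
    return f"(model answer to: {prompt})"


def goldens_to_cases(goldens: List[Dict[str, str]],
                     app: Callable[[str], str]) -> List[Dict[str, str]]:
    """Turn goldens into test cases by generating actual_output for each one."""
    cases = []
    for golden in goldens:
        case = dict(golden)                            # copy input / expected_output
        case["actual_output"] = app(golden["input"])   # generated at evaluation time
        cases.append(case)
    return cases


goldens = [
    {"input": "What is DeepEval?", "expected_output": "An LLM evaluation framework."},
    {"input": "How to install?", "expected_output": "pip install deepeval"},
]
cases = goldens_to_cases(goldens, my_llm_app)
print(cases[0]["actual_output"])
```

Because the goldens themselves are never modified, the same dataset can be replayed against every new version of the application.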

2.4 Evaluation Methods: Three Ways to Run

Method 1: Using `assert_test` (Pytest Style)

    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import AnswerRelevancyMetric


    def test_answer_relevancy():
        metric = AnswerRelevancyMetric(threshold=0.7)
        test_case = LLMTestCase(
            input="What is DeepEval?",
            actual_output="DeepEval is an open-source LLM evaluation framework.",
        )
        assert_test(test_case, [metric])

Run it with: deepeval test run test_file.py

Method 2: Using the `evaluate` Function

    from deepeval import evaluate
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

    # FaithfulnessMetric also needs retrieval_context on every test case
    test_cases = [
        LLMTestCase(input="Q1", actual_output="A1", retrieval_context=["C1"]),
        LLMTestCase(input="Q2", actual_output="A2", retrieval_context=["C2"]),
    ]
    metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]

    results = evaluate(test_cases=test_cases, metrics=metrics)
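
Conceptually, a batch evaluation applies every metric to every test case and collects the results. The sketch below shows that loop in plain Python. It is not DeepEval's implementation: toy_evaluate and KeywordOverlapMetric are hypothetical names, and the word-overlap score is a deliberately crude stand-in for the judge LLM a real metric would call.

```python
from typing import Dict, List


class KeywordOverlapMetric:
    """Toy metric: fraction of input words that reappear in the output.

    A stand-in for illustration only; real metrics ask a judge LLM.
    """

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, case: Dict[str, str]) -> float:
        asked = set(case["input"].lower().split())
        answered = set(case["actual_output"].lower().split())
        return len(asked & answered) / len(asked) if asked else 0.0


def toy_evaluate(cases: List[Dict[str, str]], metrics: list) -> List[dict]:
    """Run every metric against every test case and collect one row per pair."""
    results = []
    for case in cases:
        for metric in metrics:
            score = metric.measure(case)
            results.append({
                "input": case["input"],
                "metric": type(metric).__name__,
                "score": score,
                "success": score >= metric.threshold,
            })
    return results


cases = [
    {"input": "refund policy", "actual_output": "Our refund policy lasts 30 days."},
    {"input": "shipping time", "actual_output": "Please contact support."},
]
results = toy_evaluate(cases, [KeywordOverlapMetric(threshold=0.5)])
for row in results:
    print(row)
```

With two test cases and one metric the loop yields two rows; adding a second metric would double that, which is why batch runs against a judge LLM can get slow and costly.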

Method 3: Running a Metric Standalone

    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    metric = AnswerRelevancyMetric()
    test_case = LLMTestCase(input="Q1", actual_output="A1")

    metric.measure(test_case)
    print(f"Score: {metric.score}")
    print(f"Reason: {metric.reason}")