
Systematically Evaluating Large Language Models with MLflow: A Beginner's Guide and Engineering Practice

news 2026/6/18 20:35:27

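1. Project overview: why evaluate large language models with MLflow instead of hand-written scripts or Excel spreadsheets?

"Evaluating LLMs with MLflow: A Practical Beginner's Guide": the title names two key actions up front, evaluating and integrating (with MLflow), and the subject is the LLM (large language model), today's most popular and hardest-to-control kind of model. This is not about training an LLM, nor about calling an API. It focuses on a step that many beginners overlook but that decides whether a project survives in production: how to verify, in a systematic, reproducible, and comparable way, whether an LLM is actually "good".

I have coached a dozen or so teams building LLM applications from scratch, and one pattern keeps showing up. Roughly 80% get stuck at "the output feels okay" without being able to say what "okay" means. About 60% write a for loop over 10 prompts, note down a few responses by hand, and paste them into Excel for scoring. Of the remaining 20%, about 15% piece together "evaluation results" from a pile of print() and time.time() calls in a Jupyter Notebook. These approaches barely work at the PoC stage. As soon as you need to ship, iterate, compare against a baseline, or report improvements to product and business stakeholders, they fall apart, because you cannot answer: "The last version had 72% accuracy, so why is it 74.3% now? Did the prompt change? Did the data change? Or did the fine-tuning parameters drift?"

MLflow is not a nice-to-have "advanced tool" here; it is the infrastructure-level answer to this mess. It pulls evaluation activity that used to be scattered across notebooks, terminals, txt files, and Excel sheets into five trackable, versionable, shareable concepts: Experiments, Runs, Parameters, Metrics, and Artifacts. For example, suppose you test 50 customer-service Q&A samples today with temperature=0.3. Once your script logs them, MLflow stores, under one experiment (experiment_id=7) and one run (run_id=abc123), which prompt template was used (saved as artifacts/prompt_v2.txt), which test set (test_set_v3.json), which hyperparameters (temperature=0.3, top_k=5), and the three core metrics (accuracy=0.743, avg_latency_ms=1240, token_cost_usd=0.0217). Everything is stored in structured form, can be traced back with a click, and can be exported as a standard report. This is not busywork; it is the dividing line between "experienced intuition" and "engineering fact". For someone new to LLM evaluation, the biggest value of MLflow is not its feature list. It is that, at a very low learning cost, it gives you a discipline for validating results that does not depend on personal memory, temporary files, or verbal agreements. You do not need to understand MLOps architecture. If you can write Python, every judgment you make about model quality can stand up to follow-up questions.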

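2. Core design rationale: why MLflow rather than Weights & Biases, ClearML, or a home-grown database?

2.1 The special nature of LLM evaluation drives the tool choice

When evaluating traditional machine learning models (classification, regression), we are used to a fixed dataset, fixed metrics (accuracy, F1, RMSE), and a fixed pipeline. LLM evaluation is different:

Highly unstructured input: the prompt is natural language and may contain variable interpolation (for example "Write a {tone} advertising slogan for {product}"), so input length and complexity vary widely from run to run.
Highly unpredictable output: for the same prompt, different temperature values can yield a rigorous definition, a poetic metaphor, or nonsense. Human annotation is expensive, and automatic metrics such as BLEU and ROUGE often diverge badly from human perception.
An explosion of evaluation dimensions: beyond the basic "is the answer right" (faithfulness), you also care about hallucination, refusal rate for inappropriate requests, conciseness, coherence across multi-turn dialogue, and even whether generated code compiles (code_executability). A real business scenario often needs to track 5 to 8 heterogeneous metrics at once.
Explicit environment dependence: LLM evaluation results are strongly affected by the model version (gpt-4-turbo vs gpt-4-0613), the client SDK version, network latency, retry policy, and so on. These must be recorded alongside the metrics, otherwise comparisons are meaningless.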
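To make the last point concrete, here is a minimal sketch of collecting environment facts into a flat dictionary of strings. The function name `collect_env_tags` and the `pkg.` key prefix are illustrative assumptions, not part of MLflow; the sketch only uses the standard library so it can run anywhere.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def collect_env_tags(packages: Iterable[str] = ("openai", "mlflow")) -> Dict[str, str]:
    """Return environment facts as string key/value pairs (hypothetical helper)."""
    tags = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Version of the installed distribution, e.g. "1.35.1"
            tags[f"pkg.{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing the evaluation
            tags[f"pkg.{name}"] = "not-installed"
    return tags


if __name__ == "__main__":
    for key, value in collect_env_tags().items():
        print(f"{key}={value}")
```

Inside an active run, the resulting dictionary could be attached with `mlflow.set_tags(collect_env_tags())`, so that two runs can be compared with their SDK versions side by side.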

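This means an ideal evaluation tool has to meet four hard requirements:

Lightweight start-up: it must not require you to set up a Kubernetes cluster or configure PostgreSQL first.
Flexible recording: it must store strings (prompts), JSON (test samples), floats (latency), booleans (is_hallucinated), and even binary files (screenshots, audio).
Native multi-dimensional comparison: when you compare 3 prompt templates, 2 models, and 4 temperature settings side by side, the UI must let you filter and pivot quickly.
Offline friendly: many corporate intranets cannot reach public SaaS services, so the tool must support a local SQLite or file backend without sacrificing core functionality.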

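2.2 MLflow's four capabilities map directly onto LLM evaluation needs

Let's go through the four requirements one by one and see how MLflow fits.

Layer 1: minimal deployment, works out of the box
MLflow Tracking uses the local file system (the mlruns/ directory) as its default backend, and a single command, `mlflow ui`, starts the web interface. You do not need to install a database, configure user permissions, or request cloud resources. In my own test on a MacBook with 8 GB of RAM, it took 92 seconds from `pip install mlflow` to seeing the first run in the UI. By comparison, Weights & Biases (W&B) is built around an account and syncing to a hosted service by default, a self-hosted ClearML server brings its own database stack (MongoDB among others), and a home-grown database means maintaining schema migrations and backups yourself. For a beginner who only wants to check one prompt quickly, all of that is cognitive overhead.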

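Layer 2: the artifact mechanism holds the "messy data" of LLM work
MLflow does not force data into table columns. It uses `log_artifact()` to store any file as an artifact. This means:

You can log the whole test set JSONL file (with prompt, reference_answer, domain_tag) directly with `log_artifact("data/test_v1.jsonl")`.
You can log the Jinja2 file of the prompt template with `log_artifact("templates/qa_prompt.j2")`.
You can even upload the Excel sheet of human scores (with reviewer_name, confidence_score, comment) as an artifact.
These files can be downloaded from the UI with one click, and each version is bound to its run. No more "which run used that prompt? I think I saved it somewhere on the Desktop...".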

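Layer 3: Tag + Param + Metric indexing on runs supports complex comparisons
MLflow lets you attach arbitrary tags (string key/value pairs) to each run, such as model_name="gpt-4-turbo" or eval_mode="human_reviewed"; set arbitrary params (stored as strings), such as temperature=0.5 or max_tokens=512; and record arbitrary metrics (floats), such as faithfulness_score=0.82 or avg_response_length=142.3. Together they let you filter runs in the UI with a SQL-like syntax. Each key needs a `tags.`, `params.`, or `metrics.` prefix, clauses are joined with AND, and numeric comparisons only work on metrics, so log temperature as a metric as well if you want range filters:

tags.model_name = "gpt-4-turbo" AND tags.eval_mode = "auto_eval" AND metrics.temperature >= 0.3 AND metrics.temperature <= 0.7
You can then sort the result list by faithfulness_score in descending order, with each run's latency, cost, and sample count shown alongside. An Excel sheet that forces metrics into fixed column names cannot match this flexibility.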
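When the same filter has to be reused from scripts, it can help to build the string programmatically. The helper below is a minimal sketch under stated assumptions: `build_run_filter` is a hypothetical name, values are assumed not to contain single quotes, and range filters assume the quantity was logged as a metric (params are stored as strings, so numeric comparison does not apply to them).

```python
from typing import Mapping, Optional, Tuple


def build_run_filter(
    tags: Optional[Mapping[str, str]] = None,
    params: Optional[Mapping[str, str]] = None,
    metric_ranges: Optional[Mapping[str, Tuple[float, float]]] = None,
) -> str:
    """Build an MLflow-style search filter string (hypothetical helper)."""
    clauses = []
    for key, value in (tags or {}).items():
        clauses.append(f"tags.{key} = '{value}'")
    for key, value in (params or {}).items():
        # Params are compared as strings, hence the quotes
        clauses.append(f"params.{key} = '{value}'")
    for key, (low, high) in (metric_ranges or {}).items():
        # Express an inclusive range as two comparisons joined by AND
        clauses.append(f"metrics.{key} >= {low}")
        clauses.append(f"metrics.{key} <= {high}")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_run_filter(
        tags={"model_name": "gpt-4-turbo", "eval_mode": "auto_eval"},
        metric_ranges={"temperature": (0.3, 0.7)},
    ))
```

The resulting string can be pasted into the search box of the UI or passed as `filter_string` to `mlflow.search_runs()`.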

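Layer 4: offline mode with full functionality, suitable for restricted production environments
MLflow's file-based backend (the default) and its `sqlite:///mlflow.db` backend do not depend on the network at all. You can run evaluation scripts on a server inside a customer's intranet, with all data written only to a local directory or SQLite file, and then start the UI with `mlflow server --backend-store-uri file:///path/to/mlruns` whenever you want to look at it. I worked with a financial customer whose compliance rules forbade any LLM evaluation data from leaving the intranet. They deployed MLflow with SQLite in an isolated zone, and product managers logged in to an intranet address every day to view the latest evaluation reports, with zero public-network traffic. That is a baseline requirement a hosted SaaS offering cannot meet unless you also run and maintain its self-hosted edition.

Tip: do not be put off by MLflow's "MLOps" label. At heart it is a structured logging framework, similar in spirit to Python's logging module. You do not need to understand Model Registry or Projects. Once you know the four APIs `mlflow.start_run()`, `log_param()`, `log_metric()`, and `log_artifact()`, you have covered the large majority of LLM evaluation scenarios.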

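3. Hands-on walkthrough: building a reusable LLM evaluation pipeline from scratch

3.1 Environment setup and minimal dependencies

We start from a clean environment. Assuming Python 3.9+ is installed, run the following commands (note: a global pip install is not recommended; always use a virtual environment):

```bash
# Create an isolated environment (avoids polluting the main one)
python -m venv llm-eval-env
source llm-eval-env/bin/activate   # macOS/Linux
# llm-eval-env\Scripts\activate    # Windows

# Install the core dependencies (only 4 packages)
pip install mlflow==2.14.3 openai==1.35.1 jinja2==3.1.4 pandas==2.2.2

# Verify the installation
python -c "import mlflow; print(mlflow.__version__)"
```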

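The version numbers are pinned strictly so that results stay comparable between runs and machines:

MLflow 2.14.x is the release line I found most stable for LLM work. I ran into breaking changes in some artifact-reading logic from 2.15 onward, so re-run your pipeline before upgrading.
OpenAI 1.35.1 is a 1.x SDK release. The code below uses the 1.x client interface, `openai.OpenAI(...)` with `client.chat.completions.create()`; the older module-level `openai.ChatCompletion.create()` call was removed in 1.0, and the synchronous client used here remains available alongside the async one.
Jinja2 3.1.4 keeps prompt rendering behaviour fixed. Whitespace handling is worth re-checking whenever you upgrade, because any change there alters the prompt text the model sees.
Pandas 2.2.2 provides convenient JSONL reading and writing (`pd.read_json(..., lines=True)`).

Note: do not install mlflow-skinny for this workflow. It is a slimmed-down client package that leaves out the UI and server components and the SQL-backed stores, so `mlflow ui` and the SQLite backend used in this guide will not work with it. Likewise, avoid installing cloud-provider extras you do not need; they pull in many extra dependencies and raise the risk of environment conflicts.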

3.2 构建可复用的评估骨架：`evaluator.py`

Instead of a one-off script, we build a modular class, `LLMEvaluator`, that any project can import and reuse. Here is the core skeleton (the full code is about 280 lines; only the key structure is shown):

```python
# evaluator.py
import json
import time
import logging
from pathlib import Path
from typing import List, Dict, Any, Optional, Callable

import mlflow
import openai
import pandas as pd
from jinja2 import Template


class LLMEvaluator:
    def __init__(
        self,
        model_name: str,
        api_key: str,
        base_url: Optional[str] = None,
        timeout: int = 60,
        max_retries: int = 3,
    ):
        self.model_name = model_name
        self.client = openai.OpenAI(
            api_key=api_key,
            base_url=base_url,
            timeout=timeout,
            max_retries=max_retries,
        )
        self.logger = logging.getLogger(__name__)

    def load_testset(self, filepath: str) -> List[Dict[str, Any]]:
        """Load a JSONL test set; log and re-raise any loading error."""
        try:
            return pd.read_json(filepath, lines=True).to_dict("records")
        except Exception as e:
            self.logger.error(f"Failed to load testset {filepath}: {e}")
            raise

    def render_prompt(self, template_str: str, context: Dict[str, Any]) -> str:
        """Render the prompt with Jinja2, mapping None values to empty strings."""
        template = Template(template_str.strip())
        # None would otherwise render as the literal text "None"
        safe_context = {k: (v if v is not None else "") for k, v in context.items()}
        return template.render(**safe_context)

    def call_llm(self, prompt: str) -> Dict[str, Any]:
        """Wrap the LLM call and record latency, token counts, and errors uniformly."""
        start_time = time.time()
        try:
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,  # minimise randomness; output may still vary slightly
                max_tokens=1024,
            )
            end_time = time.time()
            # content can be None (e.g. when a content filter triggers)
            content = response.choices[0].message.content or ""
            return {
                "response": content.strip(),
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
                "latency_sec": end_time - start_time,
                "error": None,
            }
        except Exception as e:
            end_time = time.time()
            return {
                "response": "",
                "input_tokens": 0,
                "output_tokens": 0,
                "latency_sec": end_time - start_time,
                "error": str(e),
            }

    def evaluate_single_sample(
        self,
        sample: Dict[str, Any],
        prompt_template: str,
        metrics_fn: Callable[[str, str], Dict[str, float]],
    ) -> Dict[str, Any]:
        """Evaluate one sample: render, call, compute metrics, return a structured result."""
        # 1. Render the prompt
        prompt = self.render_prompt(prompt_template, sample)
        # 2. Call the LLM
        llm_result = self.call_llm(prompt)
        # 3. Compute metrics (only when the sample has a reference answer)
        if "reference_answer" in sample:
            metrics = metrics_fn(llm_result["response"], sample.get("reference_answer", ""))
        else:
            metrics = {}
        # 4. Merge everything
        return {
            "sample_id": sample.get("id", "unknown"),
            "prompt": prompt,
            "response": llm_result["response"],
            "reference_answer": sample.get("reference_answer", ""),
            "metrics": metrics,
            "llm_call": llm_result,
        }

    def run_evaluation(
        self,
        testset_path: str,
        prompt_template: str,
        metrics_fn: Callable[[str, str], Dict[str, float]],
        experiment_name: str = "llm_eval_default",
        run_name: Optional[str] = None,
    ) -> str:
        """Main entry point: create an MLflow run and record the whole evaluation."""
        mlflow.set_experiment(experiment_name)
        with mlflow.start_run(run_name=run_name) as run:
            # Record basic parameters
            mlflow.log_param("model_name", self.model_name)
            mlflow.log_param("testset_path", testset_path)
            preview = (
                prompt_template[:100] + "..."
                if len(prompt_template) > 100
                else prompt_template
            )
            mlflow.log_param("prompt_template", preview)

            # Load the test set
            testset = self.load_testset(testset_path)
            mlflow.log_param("testset_size", len(testset))

            # Save the prompt template as an artifact
            template_path = Path("artifacts") / "prompt.j2"
            template_path.parent.mkdir(exist_ok=True)
            template_path.write_text(prompt_template, encoding="utf-8")
            mlflow.log_artifact(str(template_path))

            # Save the original test set
            mlflow.log_artifact(testset_path)

            # Evaluate sample by sample
            results = []
            for i, sample in enumerate(testset):
                self.logger.info(f"Evaluating sample {i+1}/{len(testset)}...")
                result = self.evaluate_single_sample(sample, prompt_template, metrics_fn)
                results.append(result)
                # Log per-sample metrics immediately (useful if the run is interrupted)
                for metric_name, metric_value in result["metrics"].items():
                    mlflow.log_metric(f"sample_{metric_name}", metric_value, step=i)

            # Aggregate statistics
            summary_metrics = self._calculate_summary_metrics(results)
            for metric_name, metric_value in summary_metrics.items():
                mlflow.log_metric(metric_name, metric_value)

            # Save the full results as a JSONL artifact
            results_path = Path("artifacts") / "full_results.jsonl"
            pd.DataFrame(results).to_json(
                results_path, orient="records", lines=True, force_ascii=False
            )
            mlflow.log_artifact(str(results_path))

            return run.info.run_id

    def _calculate_summary_metrics(self, results: List[Dict]) -> Dict[str, float]:
        """Compute summary metrics: skip errored samples, token-weighted latency."""
        valid_results = [r for r in results if not r["llm_call"]["error"]]
        if not valid_results:
            return {"error_rate": 1.0}

        # Basic statistics
        error_count = len(results) - len(valid_results)
        error_rate = error_count / len(results)

        # Per-metric means (computed separately so one metric cannot poison another)
        all_metrics: Dict[str, List[float]] = {}
        for r in valid_results:
            for k, v in r["metrics"].items():
                all_metrics.setdefault(k, []).append(v)

        summary = {"error_rate": error_rate}
        for metric_name, values in all_metrics.items():
            summary[f"avg_{metric_name}"] = sum(values) / len(values)

        # Latency weighted by input_tokens (closer to real load than a plain mean)
        total_tokens = sum(r["llm_call"]["input_tokens"] for r in valid_results)
        if total_tokens:
            weighted_latency = sum(
                r["llm_call"]["latency_sec"] * r["llm_call"]["input_tokens"]
                for r in valid_results
            ) / total_tokens
            summary["weighted_avg_latency_sec"] = weighted_latency
        return summary
```

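The design philosophy of this class is to hide all the "dirty work" and expose the cleanest possible interface. You only need to care about three things:

testset_path: where your test data lives;
prompt_template: which template you use to generate prompts;
metrics_fn: which function computes quality (explained below).
Everything else, including MLflow initialisation, artifact saving, retries (delegated to the OpenAI client's max_retries), and aggregation, is handled inside the class. In practice, a beginner can get a first evaluation running on top of this skeleton in about 15 minutes.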

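3.3 Writing your first metric function: from "is the answer right" to "how good is the answer"

The biggest trap in LLM evaluation is blind faith in a single metric. Here are three typical metrics_fn examples, covering different maturity stages:

Example 1: basic Exact Match (suited to closed-domain QA)

```python
from typing import Dict


def exact_match_metric(response: str, reference: str) -> Dict[str, float]:
    """Strict string match, for answers that are unique and fixed in format
    (e.g. maths problems, API documentation lookups)."""
    if not response or not reference:
        return {"exact_match": 0.0}
    # Trim, collapse runs of whitespace (including newlines), and lowercase
    clean_resp = " ".join(response.strip().lower().split())
    clean_ref = " ".join(reference.strip().lower().split())
    return {"exact_match": 1.0 if clean_resp == clean_ref else 0.0}
```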

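Example 2: semantic similarity (suited to open-ended generation)

```python
# Requires an extra package: pip install sentence-transformers
from typing import Dict

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the model once at import time (avoid reloading on every call)
sim_model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_similarity_metric(response: str, reference: str) -> Dict[str, float]:
    """Cosine similarity between the embeddings of response and reference."""
    if not response or not reference:
        return {"semantic_similarity": 0.0}
    try:
        # encode() returns NumPy arrays by default, which np.dot can use directly
        embeddings = sim_model.encode([response, reference])
        cos_sim = np.dot(embeddings[0], embeddings[1]) / (
            np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
        )
        return {"semantic_similarity": float(cos_sim)}
    except Exception:
        # Treat any encoding failure as "no similarity" rather than aborting the run
        return {"semantic_similarity": 0.0}
```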

Example 3: simulated multi-dimensional human review (suited to pre-launch checks)

```python
from typing import Dict


def multi_dimensional_metric(response: str, reference: str) -> Dict[str, float]:
    """Simulate 5 dimensions of human scoring:
    accuracy, completeness, conciseness, safety, fluency."""
    scores = {}

    # Accuracy: keyword overlap (simplified; an NLI model would be better).
    # split() tokenises on whitespace, so unsegmented Chinese text needs a tokenizer.
    ref_keywords = set(reference.lower().split())
    resp_keywords = set(response.lower().split())
    scores["accuracy"] = len(ref_keywords & resp_keywords) / max(len(ref_keywords), 1)

    # Completeness: response length / reference length (penalises answers that are too short)
    if len(reference) > 0:
        scores["completeness"] = min(len(response) / len(reference), 1.0)
    else:
        scores["completeness"] = 1.0 if len(response) < 50 else 0.0

    # Conciseness: penalise overly long responses (ideal length is 1.2x the reference)
    ideal_len = len(reference) * 1.2
    scores["conciseness"] = max(
        0.0, 1.0 - abs(len(response) - ideal_len) / max(ideal_len, 1)
    )

    # Safety: keyword check (replace with a proper content-safety API in real use)
    unsafe_words = ["hack", "crack", "bypass", "illegal"]
    scores["safety"] = (
        0.0 if any(word in response.lower() for word in unsafe_words) else 1.0
    )

    # Fluency: sentence-ending punctuation density (a rough heuristic)
    punct_count = sum(1 for c in response if c in ".!?。！？")
    scores["fluency"] = min(punct_count / max(len(response.split()), 1), 1.0)

    return scores
```

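Practical advice: do not chase a "perfect metric" from day one. I suggest beginners progress like this:
Week 1: run closed-domain tasks (such as an internal knowledge-base Q&A) with exact_match_metric to establish a baseline quickly.
Week 2: add semantic_similarity_metric and observe semantic consistency on open-ended generation tasks.
Week 3: use multi_dimensional_metric to mirror product requirements, quantifying dimensions the business cares about such as "concise" and "safe".
Change only one variable at a time (for example, change the prompt but not the model). Only then can you attribute a change in results to its cause.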
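As the progression above adds metrics week by week, it can be convenient to pass several metric functions to the evaluator as one. The following is a minimal sketch; `combine_metrics` is a hypothetical helper, and the two toy metrics exist only to keep the example self-contained.

```python
from typing import Callable, Dict

MetricsFn = Callable[[str, str], Dict[str, float]]


def combine_metrics(*metric_fns: MetricsFn) -> MetricsFn:
    """Merge several metric functions into one with the same signature."""

    def combined(response: str, reference: str) -> Dict[str, float]:
        merged: Dict[str, float] = {}
        for fn in metric_fns:
            for name, value in fn(response, reference).items():
                if name in merged:
                    # Two functions reporting the same key would silently overwrite
                    raise ValueError(f"duplicate metric name: {name}")
                merged[name] = value
        return merged

    return combined


# Toy metrics, for illustration only
def toy_exact(response: str, reference: str) -> Dict[str, float]:
    return {"exact_match": 1.0 if response.strip() == reference.strip() else 0.0}


def toy_length_ratio(response: str, reference: str) -> Dict[str, float]:
    return {"length_ratio": len(response) / max(len(reference), 1)}


if __name__ == "__main__":
    both = combine_metrics(toy_exact, toy_length_ratio)
    print(both("int('123')", "int('123')"))
```

The combined function has the same `(response, reference)` signature as a single metric, so it can be passed as `metrics_fn` without changing the evaluator.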

3.4 Running the first evaluation: full commands and how to read the results

Now let's run a real example. First prepare the test set test_qa.jsonl:

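```json
{"id": "q1", "question": "How do you convert a string to an integer in Python?", "reference_answer": "Use the int() function, for example int('123')"}
{"id": "q2", "question": "Explain what the GIL is in Python.", "reference_answer": "The GIL (Global Interpreter Lock) is a mutex in the CPython interpreter that ensures only one thread executes Python bytecode at a time, avoiding races in memory management."}
```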

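Next prepare the prompt template qa_prompt.j2:

```
You are a senior Python engineer. Answer the following question accurately in English. Keep the answer concise and professional, give the key points directly, and do not explain reasons or give examples.
Question: {{ question }}
```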

Finally, write the driver script run_eval.py:

```python
# run_eval.py
import os
from pathlib import Path

from evaluator import LLMEvaluator
from metrics import exact_match_metric  # assumes metrics.py holds the functions above

if __name__ == "__main__":
    # Initialise the evaluator (reads your OpenAI key from the environment)
    evaluator = LLMEvaluator(
        model_name="gpt-4-turbo",
        api_key=os.environ["OPENAI_API_KEY"],
    )

    # Run the evaluation
    run_id = evaluator.run_evaluation(
        testset_path="test_qa.jsonl",
        prompt_template=Path("qa_prompt.j2").read_text(encoding="utf-8"),
        metrics_fn=exact_match_metric,
        experiment_name="python_qa_eval",
        run_name="gpt4-turbo_v1_prompt",
    )
    print(f"Evaluation completed! Run ID: {run_id}")
    print("Start MLflow UI with: mlflow ui --backend-store-uri ./mlruns")
```

Run it:

```bash
python run_eval.py
# Wait for it to finish (usually 20 to 60 seconds)
mlflow ui --backend-store-uri ./mlruns
```

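Open http://localhost:5000 in a browser. The exact layout varies a little between MLflow versions, but you should find:

Left navigation: Experiments → python_qa_eval → click the run you just created.
Overview tab: lists values such as model_name=gpt-4-turbo, testset_size=2, error_rate=0.0, and avg_exact_match.
Artifacts tab: shows three files, prompt.j2, test_qa.jsonl, and full_results.jsonl, each downloadable with a click.
Metrics tab: the avg_exact_match chart (with a single point) and the two sample_exact_match points at steps 0 and 1 (for q1 and q2).
Params: the truncated preview of prompt_template, so you can confirm you used the right version.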

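Note: if you see avg_exact_match=0.0, do not rush to blame the model. First look at the prompt, response, and llm_call.error fields in full_results.jsonl. Common causes are a template variable such as {{ question }} that did not render as expected (for example a misspelled variable name, which Jinja2 silently renders as empty), an invalid API key that left the response empty with an error recorded, or simply that exact match is too strict for free-text answers. With the artifact in hand you can usually locate the root cause in seconds instead of grepping through 200 lines of logs.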
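The check described above can be scripted. Below is a minimal sketch that scans the lines of a results file and reports rows with an API error, an empty prompt, or an empty response. The helper name `find_suspect_rows` is hypothetical; the field names (`sample_id`, `prompt`, `response`, `llm_call.error`) follow the result structure produced by `evaluate_single_sample()`.

```python
import json
from typing import Dict, Iterable, List


def find_suspect_rows(jsonl_lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one entry per row that looks broken, with a short reason."""
    suspects = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        error = (row.get("llm_call") or {}).get("error")
        if error:
            reason = f"api error: {error}"
        elif not (row.get("prompt") or "").strip():
            reason = "empty prompt"
        elif not (row.get("response") or "").strip():
            reason = "empty response"
        else:
            continue  # row looks fine
        suspects.append(
            {"sample_id": str(row.get("sample_id", "unknown")), "reason": reason}
        )
    return suspects


if __name__ == "__main__":
    demo = [
        json.dumps({"sample_id": "q1", "prompt": "p", "response": "ok",
                    "llm_call": {"error": None}}),
        json.dumps({"sample_id": "q2", "prompt": "p", "response": "",
                    "llm_call": {"error": "401 Unauthorized"}}),
    ]
    for item in find_suspect_rows(demo):
        print(item)
```

To run it against a real file, open the downloaded artifact with `open(path, encoding="utf-8")` and pass the file object directly, since iterating a file yields its lines.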

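4. Common problems and pitfalls: practical details the docs do not spell out

4.1 Troubleshooting quick reference

| Symptom | Likely cause | How to check | Fix |
|---|---|---|---|
| MLflow UI will not start and reports `OSError: [Errno 98] Address already in use` | Port 5000 is taken by another process (another mlflow ui, Jupyter Lab, and so on) | `lsof -i :5000` (macOS/Linux) or `netstat -ano \| findstr :5000` (Windows) | `kill <PID>` (use `kill -9 <PID>` only if that fails), or switch ports: `mlflow ui --port 5001` |
| `log_metric()` raises an `MlflowException` about an invalid metric name | The name contains characters MLflow rejects, such as parentheses | Check the key, e.g. `log_metric("avg(exact_match)", value)` | Rename it, e.g. `log_metric("avg_exact_match", value)`; names may contain alphanumerics, underscores, dashes, periods, spaces, and slashes |
| `response` is empty in `full_results.jsonl` but `error` is `None` | The OpenAI API returned a response with empty content (common when a safety filter triggers) | Print the whole structure in `call_llm()`: `print(response.model_dump())` | Check `response.choices[0].finish_reason`; if it is `"content_filter"`, the content-safety policy was triggered, so adjust the prompt or contact OpenAI |
| Evaluation takes far longer than expected (e.g. 2 minutes per sample) | The timeout is not taking effect, or there are too many retries | Confirm `openai.OpenAI(timeout=60)` is applied; count occurrences of `Retrying request` in the logs | Construct the evaluator with `max_retries=1`, or add `if result["llm_call"]["error"]: break` inside the sample loop in `run_evaluation()` to stop at the first failure |
| `mlflow.log_artifact()` raises `FileNotFoundError` | A relative path was passed, but the current working directory is not the script's directory | Add `print("Current working dir:", os.getcwd())` at the start of `run_evaluation()` | Build absolute paths consistently with `Path(__file__).parent / "artifacts"` |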
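For the port conflict in the first row, a cross-platform alternative to `lsof` or `netstat` is to probe the port from Python before launching the UI. This is a minimal sketch using only the standard library; `is_port_free` and `first_free_port` are hypothetical helper names, and the check is inherently racy (another process can grab the port between the check and the launch).

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use"
            return False
    return True


def first_free_port(start: int = 5000, attempts: int = 20) -> int:
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"no free port in range {start}-{start + attempts - 1}")


if __name__ == "__main__":
    port = first_free_port(5000)
    print(f"mlflow ui --port {port}")
```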

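4.2 Three hidden pitfalls beginners hit

Pitfall 1: calling log_* methods without a `with mlflow.start_run():` block, so metrics end up in the wrong run
Symptom: the code raises no error, but the run you are looking at in the MLflow UI has no metrics.
Cause: MLflow logging is context-sensitive and always writes to the "active run". If no run is active, `log_metric()` implicitly starts a new one (in the Default experiment unless you called `set_experiment()`), so the values land in a run you are not looking at. The manual start/end pattern has a related problem:

```python
mlflow.start_run()         # ❌ no `with`: the run stays active if the next line raises
mlflow.log_metric("x", 1)
mlflow.end_run()           # ❌ skipped on an exception, leaving the run in RUNNING state
```

If anything between `start_run()` and `end_run()` raises, the run is never closed; later logging goes to that stale run, and the next `start_run()` fails because a run is already active. The safe form is always `with mlflow.start_run():`, which ends the run and sets its status correctly even when an exception occurs.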

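Pitfall 2: a malformed or stray line in the test set JSONL makes `pandas.read_json(lines=True)` fail
Symptom: `ValueError: Expected object or value`.
Cause: JSONL expects one valid JSON object per line. Files exported from Excel or edited by hand often pick up stray blank lines, trailing commas, or half-written rows, and depending on the pandas version and the exact content, the whole load fails without telling you which line is at fault.
Fix: harden `load_testset()` so that it skips blank lines and reports the line number of anything it cannot parse:

```python
def load_testset(self, filepath: str) -> List[Dict[str, Any]]:
    lines = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:  # skip blank lines
                continue
            try:
                lines.append(json.loads(line))
            except json.JSONDecodeError as e:
                self.logger.warning(
                    f"Invalid JSON at line {line_num}: {line[:50]}... Error: {e}"
                )
    return lines
```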

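Pitfall 3: the prompt.j2 template uses `{{ xxx }}` or `{% if xxx %}`, but the context does not provide xxx, and part of the prompt silently disappears
Symptom: the `prompt` field is missing the question or a whole conditional section, so the LLM receives an incomplete input and answers something unrelated.
Cause: by default Jinja2 renders an undefined variable as an empty string and treats it as false in `{% if %}`; it does not raise an error.
Fix: enable strict mode in `render_prompt()`:

```python
from jinja2 import Template, StrictUndefined

template = Template(template_str.strip(), undefined=StrictUndefined)
# Now a key missing from the context raises jinja2.UndefinedError at render time,
# so the problem surfaces immediately
```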

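4.3 Advanced techniques: letting evaluation actually drive iteration

MLflow's value lies not only in recording, but in using the data to drive decisions. Here are two techniques I have found effective again and again:

Technique 1: generate comparison reports automatically with `search_runs()`
After 10 evaluations with different prompts, comparing by hand is tedious. Query directly from Python:

```python
from mlflow import search_runs

# Find all gpt-4-turbo runs in the python_qa_eval experiment
df = search_runs(
    experiment_names=["python_qa_eval"],  # names go here; experiment_ids expects IDs
    filter_string="params.model_name = 'gpt-4-turbo'",
    output_format="pandas",
)

# Sort by avg_exact_match in descending order and keep the top 3
top3 = df.nlargest(3, "metrics.avg_exact_match")
# The run name is exposed as the tag column "tags.mlflow.runName"
print(top3[["tags.mlflow.runName", "params.prompt_template", "metrics.avg_exact_match"]])
```

The output is close to a ready-to-send email body, and a product manager can see at a glance which prompt version performs best.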

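Technique 2: inject the MLflow Run ID into the prompt for "traceable generation"
Append this to the end of the prompt template:

```
---
Evaluation Run ID: {{ mlflow_run_id }}
```

`evaluate_single_sample()` renders the template with the sample as its context, so inside the loop in `run_evaluation()` merge `run.info.run_id` into the sample before passing it on:

```python
# Inside the sample loop of run_evaluation()
sample_with_run_id = {**sample, "mlflow_run_id": run.info.run_id}  # ✅ add to the context
result = self.evaluate_single_sample(sample_with_run_id, prompt_template, metrics_fn)
```

Every rendered prompt, and therefore every `prompt` field in full_results.jsonl and any request log on the provider side, now carries Run ID: abc123 (the response contains it only if the model echoes it). When a stakeholder questions a particular output, you search for abc123 and immediately find the evaluation, the prompt, and the test sample involved, turning "black-magic tuning" into "auditable engineering". Keep in mind that the extra line is part of the input the model sees, so apply it consistently across the runs you compare.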

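5. Where to go next: from first steps to an enterprise-grade evaluation platform

Once you are comfortable evaluating a single model with MLflow, new challenges appear naturally. Here are three proven paths, ordered by implementation effort: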

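5.1 Path 1: evaluating multiple models in one go (about 1 day of work)

Goal: in a single invocation, evaluate gpt-4-turbo, claude-3-haiku, and llama-3-70b and compare them automatically.
Key changes:

Modify `LLMEvaluator.__init__()` to accept a client instance instead of an api_key.
Add a `MultiModelEvaluator` class that takes several `LLMEvaluator` instances.
Have it call each evaluator's `run_evaluation()` in a loop, so each model is recorded in its own run via `mlflow.start_run(run_name=f"{model_name}_eval")`.
Finally, use `search_runs()` to collect avg_exact_match across all models.
Benefit: you avoid writing three nearly identical scripts, and adding another model costs one line instead of another script.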
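The changes listed above can be sketched as follows. This is one possible shape under stated assumptions, not a definitive design: `MultiModelEvaluator`, its `run_all()` method, and the `FakeEvaluator` stand-in are all hypothetical names, and the loop runs the models one after another rather than concurrently. The only requirement on each evaluator is a `model_name` attribute and a `run_evaluation()` method with the signature shown in the skeleton earlier.

```python
from typing import Callable, Dict, List, Protocol


class Evaluator(Protocol):
    """Structural type: anything with model_name and run_evaluation() fits."""

    model_name: str

    def run_evaluation(
        self,
        testset_path: str,
        prompt_template: str,
        metrics_fn: Callable[[str, str], Dict[str, float]],
        experiment_name: str,
        run_name: str,
    ) -> str: ...


class MultiModelEvaluator:
    def __init__(self, evaluators: List[Evaluator]):
        self.evaluators = evaluators

    def run_all(
        self,
        testset_path: str,
        prompt_template: str,
        metrics_fn: Callable[[str, str], Dict[str, float]],
        experiment_name: str,
    ) -> Dict[str, str]:
        """Run every evaluator on the same inputs; return {model_name: run_id}."""
        run_ids: Dict[str, str] = {}
        for evaluator in self.evaluators:
            run_ids[evaluator.model_name] = evaluator.run_evaluation(
                testset_path=testset_path,
                prompt_template=prompt_template,
                metrics_fn=metrics_fn,
                experiment_name=experiment_name,
                run_name=f"{evaluator.model_name}_eval",
            )
        return run_ids


class FakeEvaluator:
    """Stand-in used only to exercise the loop without calling any API."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def run_evaluation(self, testset_path, prompt_template, metrics_fn,
                       experiment_name, run_name) -> str:
        return f"fake-run-for-{run_name}"


if __name__ == "__main__":
    multi = MultiModelEvaluator([FakeEvaluator("model-a"), FakeEvaluator("model-b")])
    print(multi.run_all("test_qa.jsonl", "Q: {{ question }}", lambda r, ref: {}, "demo"))
```

Because every model is logged under the same experiment name, the runs can afterwards be pulled into one table with `search_runs()`.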

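5.2 Path 2: integrating human review into the automated flow (about 3 days of work)

Goal: connect the "human scoring" part of multi_dimensional_metric to an internal annotation platform API.
Key changes:

In `evaluate_single_sample()`, when metrics_fn returns {"needs_human_review": True}, call `post_to_annotation_api()`.
`post_to_annotation_api()` sends the prompt, response, and reference_answer to the annotation platform and returns a task_id.
After `run_evaluation()` finishes, poll `get_annotation_result(task_id)` until all annotations are complete.
Finally, log_metric()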