
From Scratch: A Hands-On Guide to Running, Analyzing, and "Dissecting" a Large Language Model

news 2026/4/28 17:32:56

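This article lays out a practical path, starting from zero and moving from simple to deep, for systematically analyzing and learning from a large language model once you have downloaded it. First, you set up the environment and run sample code to get the model running, which also verifies your setup. Next comes a recommended three-step learning strategy: practice hands-on, connect the practice to theory, and extend your skills. Then you write a script that loads a local model and holds a simple conversation with it, and you manually extract and visualize the attention weights of a Transformer model to understand its core mechanism. Finally, you are encouraged to experiment like a scientist by modifying the code, exploring different attention heads and layers, and trying more complex sentences, so that you gradually "dissect" the model and truly understand how it works.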

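This is a question I kept turning over at home a while ago. The more I write and the more I tinker, the more I find myself looking back and asking: what exactly have I been trying to see all along?

Once we have downloaded a large model, how do we analyze and learn from it systematically? Many people feel lost when faced with a pile of model files. This article offers a practical path, starting from zero and going from simple to deep, that takes you step by step through running the model, understanding its core mechanisms, and finally "dissecting" it.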

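🚀 Step Zero: Quick Start - Get the Model Running First

Before diving into the technical details, the first task is to get the model running successfully in your local environment. This gives you the most direct feel for the model and also verifies that your environment is configured correctly.

1. Recommended Learning Path

For beginners, we recommend a three-step strategy:

Hands-on practice (get it running): set up the environment, load a local model, and complete one inference run.
Connect to theory (understand it): use the results you observed to review and study core concepts such as the Transformer and tokenization.
Extend your skills (use it well): explore more advanced topics such as prompt engineering, RAG, and model quantization.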

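2. Environment Setup and Running

Install the core dependencies

If they are not already installed in your environment, run the following command in a terminal. The accelerate package is included because the device_map="auto" argument used in the scripts below depends on it:

    pip install transformers torch sentencepiece accelerate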
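As a quick sanity check after installation, a small sketch like the following can report which packages are visible to the current Python interpreter. It uses only the standard library; the helper name installed_versions is made up for this illustration and is not part of any of the libraries above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["transformers", "torch", "sentencepiece"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package shows up as NOT INSTALLED, the most likely cause is that pip installed it into a different Python environment than the one running the script.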

3. Create and Run the Script

Create a file named run_model.py and paste in the following code. It loads the large model you have downloaded locally and holds a simple conversation with it.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1. Set the model path
    # ⚠️ Replace the path below with the local directory that holds your model files
    model_path = "/path/to/your/models/Qwen3-0.6B"  # e.g. "/Users/username/models/Qwen3-0.6B"

    # 2. Load the tokenizer and the model
    print("Loading tokenizer and model...")
    # trust_remote_code=True matters only when the model directory ships its own
    # Python code; architectures built into transformers load without it.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Hardware-specific settings:
    # - Apple Silicon (M1/M2/M3): use float16 if you hit a BFloat16 error
    # - CPU, or no special requirements: "auto" is fine
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,  # works on MPS (Apple Silicon); CPU users can switch to "auto"
        device_map="auto",          # place the model on GPU or CPU automatically
        trust_remote_code=True,
    )
    print("Model loaded!")

    # 3. Prepare the input
    prompt = "Hello, please introduce yourself."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # 4. Generate text
    print("Generating a response...")
    # **model_inputs forwards the attention mask together with the input ids
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # Drop the prompt tokens so that only the newly generated part is decoded
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # 5. Print the result
    print("\n" + "=" * 20 + " Model response " + "=" * 20)
    print(response)
    print("=" * 50)
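To make step 3 of the script less opaque: apply_chat_template turns the list of messages into a single prompt string, using a template stored with the tokenizer. The sketch below imitates a ChatML-style layout of the kind Qwen models use. It is a simplified illustration only: the function name render_chatml is hypothetical, and the real template shipped with your model may add more fields, so treat the output of tokenizer.apply_chat_template as the source of truth.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in a simplified ChatML-style layout (illustrative only)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An opened assistant turn signals that the model should answer next
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, please introduce yourself."},
]
text = render_chatml(messages)
print(text)
```

Printing the text returned by the real apply_chat_template call in run_model.py and comparing it with this output is a quick way to see what your model's template actually adds.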

Run the script

In a terminal, run:

    python run_model.py

Once you see the model's answer, you are ready to move on to a deeper analysis of the model.
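One detail of the script worth a closer look is step 4, which slices the prompt off each generated sequence before decoding. For decoder-only models, generate returns the prompt tokens followed by the new tokens, so decoding everything would repeat the prompt. The same slicing can be seen on plain Python lists; the helper name strip_prompt and the token ids below are invented for illustration.

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Keep only the tokens that come after each prompt in the generated output."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# One prompt of three tokens, and the full sequence returned after generation
prompt_ids = [[101, 7592, 2088]]
full_ids = [[101, 7592, 2088, 2023, 2003, 102]]

new_ids = strip_prompt(prompt_ids, full_ids)
print(new_ids)  # [[2023, 2003, 102]]
```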

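💡 Common Issues and Tips
Hugging Face authentication warning: if you see a warning about not being logged in and you only use local models, you can safely ignore it. If you need access to private models, log in at the top of your script with from huggingface_hub import login; login().
Tied weights warning: this is a common message and usually harmless; the model still loads normally. It appears because the configuration file and the model checkpoint differ slightly in how they handle weights.
BFloat16 error on MPS: if you see TypeError: BFloat16 is not supported on MPS on an Apple M-series chip, it is because the MPS backend does not fully support BFloat16 at the time of writing. The fix is the one shown in our code: explicitly set torch_dtype=torch.float16.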
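If you would rather not hard-code the dtype, one possible approach is to choose it from the hardware that is available. The sketch below is an illustration under stated assumptions, not part of the original scripts: the helper name pick_device_and_dtype and its float16/float32 choices are ours, and it simply reports CPU when torch is not installed.

```python
def pick_device_and_dtype():
    """Return a (device, dtype_name) pair based on the hardware torch can see."""
    try:
        import torch
    except ImportError:
        # Assumption: without torch there is nothing to accelerate, so report CPU
        return "cpu", "float32"

    if torch.cuda.is_available():
        return "cuda", "float16"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        # float16 sidesteps the BFloat16 limitation on Apple Silicon described above
        return "mps", "float16"
    return "cpu", "float32"


device, dtype_name = pick_device_and_dtype()
print(f"device={device}, dtype={dtype_name}")
```

In run_model.py you could then pass torch_dtype=getattr(torch, dtype_name) instead of the fixed torch.float16.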

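Manually Extracting and Visualizing Attention Weights

With the model running, let's go back to basics and extract and visualize, by hand, the most central component of a Transformer model: its attention weights. Few exercises give a deeper sense of how the model "thinks".

Core Idea

The attention mechanism is the heart of the Transformer. When a token is used to predict the next token or to interpret a sentence, it assigns a different degree of "focus" to every token in the input sequence; these are the attention weights. Plotting the weight matrix as a heatmap shows at a glance where the model's "gaze" lands when it processes a given token.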
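The idea can be sketched in a few lines of plain Python. The example computes scaled dot-product attention weights, softmax(QK^T / sqrt(d)), for three toy tokens. The vectors are made-up numbers, and for simplicity the same vectors serve as both queries and keys, whereas a real model first applies separate learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights, computed one query row at a time."""
    dim = len(keys[0])
    weights = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights.append(softmax(scores))
    return weights


# Toy 2-dimensional vectors for three tokens (made-up numbers)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors)
for row in weights:
    print([round(weight, 3) for weight in row])
```

Each printed row belongs to one query token and sums to 1, which is exactly what one row of the heatmap represents.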

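Steps

1. Install the plotting libraries

    pip install matplotlib seaborn

2. Create inspect_attention.py

Create a new Python file named inspect_attention.py and paste in the code below. It loads the classic bert-base-uncased model (BERT is well suited to learning about attention), extracts its internal attention weights, and visualizes them as a heatmap.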

    import torch
    import matplotlib.pyplot as plt
    import seaborn as sns
    from transformers import AutoTokenizer, AutoModel

    # --- 1. Load the model and tokenizer ---
    # BERT is a classic model that is well suited to studying attention
    model_name = "bert-base-uncased"
    print(f"Loading '{model_name}'...")
    # Key point: output_attentions=True tells the model to return attention weights.
    # attn_implementation="eager" requests the plain attention code path, because
    # fused kernels such as SDPA may not expose the weights.
    model = AutoModel.from_pretrained(
        model_name,
        output_attentions=True,
        attn_implementation="eager",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Loaded!")

    # --- 2. Prepare the input text ---
    sentence = "The cat sat on the mat"
    inputs = tokenizer(sentence, return_tensors="pt")
    # Tokens after tokenization, used to label the heatmap axes
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # --- 3. Run inference and collect the attention weights ---
    print("\nRunning inference and extracting attention weights...")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions is a tuple with one tensor per layer, each of shape
    # (batch_size, num_heads, sequence_length, sequence_length).
    # For this demo we only look at the first layer, attentions[0].
    attention = outputs.attentions[0]
    print("Attention shape (batch, heads, seq_len, seq_len):", attention.shape)

    # --- 4. Visualize one attention head ---
    # We pick the first attention head (head 0)
    head_to_visualize = 0
    attention_head = attention[0, head_to_visualize, :, :]
    # Convert to a float32 NumPy array on the CPU so that seaborn can plot it
    attention_head = attention_head.float().cpu().numpy()

    # Draw the heatmap with seaborn and matplotlib
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention_head, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
    plt.title(f"Attention Head #{head_to_visualize} for: '{sentence}'")
    plt.xlabel("Key Tokens (attended to)")
    plt.ylabel("Query Tokens (attending)")

    # Save the image
    output_filename = "attention_heatmap.png"
    plt.savefig(output_filename)
    print(f"\nDone! The attention heatmap was saved as '{output_filename}'.")
    print("Each row is a query token (vertical axis); each column is a key token "
          "(horizontal axis). The brighter the cell, the stronger the attention.")

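3. Run the script and view the result

    python inspect_attention.py

After the program finishes, it produces an image named attention_heatmap.png. Open it and you will see a heatmap that shows the distribution of attention inside the model.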
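A heatmap is easier to interpret when you can also read it as numbers. As a small sketch, the hypothetical helper below lists, for every query token, the key token that receives the largest weight. The tokens and weights here are made-up values; in inspect_attention.py you could pass attention_head.tolist() together with tokens instead.

```python
def strongest_targets(weights, tokens):
    """For each query token, find the key token that gets the largest weight."""
    result = []
    for query_token, row in zip(tokens, weights):
        best_index = max(range(len(row)), key=row.__getitem__)
        result.append((query_token, tokens[best_index], row[best_index]))
    return result


# Made-up 4x4 attention matrix: one row per query token, each row sums to 1
tokens = ["[CLS]", "the", "cat", "[SEP]"]
weights = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.1, 0.1, 0.5],
]

for query_token, key_token, weight in strongest_targets(weights, tokens):
    print(f"{query_token!r} attends most to {key_token!r} ({weight:.2f})")
```

Running the same helper over several heads and layers is one simple way to compare them without opening a separate image for each.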

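Advanced Tip: Analyzing Your Own Local Model

You can also replace model_name in inspect_attention.py with the path to a large model you have already downloaded, just as in run_model.py. This lets you analyze the attention mechanism of any Transformer model.

Below is a modified version of inspect_attention.py that shows how to load a local Qwen-like model and visualize its attention.

Notes:

Model loader: AutoModel is still sufficient, because it loads the core structure of the model, which is all we need to extract attention weights.
Required parameters: for models that ship their own modeling code, such as the first-generation Qwen releases, trust_remote_code=True is required. For hardware compatibility, keep the torch_dtype and device_map parameters as well.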

    import torch
    import matplotlib.pyplot as plt
    import seaborn as sns
    from transformers import AutoTokenizer, AutoModel

    # --- 1. Set the model path and load the model ---
    # ⚠️ Replace the path below with the local directory that holds your model files
    local_model_path = "/path/to/your/models/Qwen3-0.6B"  # e.g. "/Users/username/models/Qwen3-0.6B"
    print(f"Loading local model '{local_model_path}'...")

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        local_model_path,
        trust_remote_code=True,  # usually needed for custom or very new models
    )

    # Load the model and make sure it returns attention weights
    model = AutoModel.from_pretrained(
        local_model_path,
        output_attentions=True,       # key parameter: return attention weights
        attn_implementation="eager",  # fused kernels such as SDPA may not expose the weights
        torch_dtype=torch.float16,    # adjust to your hardware; recommended on Apple Silicon
        device_map="auto",            # place the model on GPU or CPU automatically
        trust_remote_code=True,       # usually needed for custom or very new models
    )
    print("Model loaded!")

    # --- 2. Prepare the input text ---
    sentence = "你好，请介绍一下你自己。"  # a Chinese sentence
    inputs = tokenizer(sentence, return_tensors="pt").to(model.device)
    # Decode each id on its own to get readable axis labels; convert_ids_to_tokens
    # would return raw byte-level BPE symbols for Chinese text
    tokens = [tokenizer.decode([token_id]) for token_id in inputs["input_ids"][0].tolist()]

    # --- 3. Run inference and collect the attention weights ---
    print("\nRunning inference and extracting attention weights...")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions is a tuple with one tensor per layer.
    # For this demo we only look at the first layer, attentions[0].
    attention = outputs.attentions[0]
    print("Attention shape (batch, heads, seq_len, seq_len):", attention.shape)

    # --- 4. Visualize one attention head ---
    # We pick the first attention head (head 0)
    head_to_visualize = 0
    attention_head = attention[0, head_to_visualize, :, :]
    # Move to the CPU and convert to a float32 NumPy array so that seaborn can
    # plot it even when the model runs on a GPU or on MPS
    attention_head = attention_head.float().cpu().numpy()

    # Configure a Chinese-capable font before drawing
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    # Draw the heatmap with seaborn and matplotlib
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention_head, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
    plt.title(f"Attention Head #{head_to_visualize} for: '{sentence}'")
    plt.xlabel("Key Tokens (attended to)")
    plt.ylabel("Query Tokens (attending)")

    # Save the image
    output_filename = "local_model_attention_heatmap.png"
    plt.savefig(output_filename)
    print(f"\nDone! The attention heatmap was saved as '{output_filename}'.")
    print("Each row is a query token (vertical axis); each column is a key token "
          "(horizontal axis). The brighter the cell, the stronger the attention.")

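Important notes:

Causal attention: if you analyze a generative model such as Qwen, its attention is causal, which means a token can attend only to itself and to the tokens before it. In the resulting heatmap, the region above and to the right of the diagonal is therefore dark (attention weights of zero).
Chinese fonts: the code above sets plt.rcParams['font.sans-serif'] = ['SimHei'] to keep matplotlib from rendering Chinese characters as empty boxes. Make sure SimHei (黑体) or another Chinese-capable font is installed on your system; if you use a different font, put its name in the list instead.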
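The causal-attention note can be reproduced without any model. The sketch below applies a causal mask to a toy score matrix before the softmax, so every position after the current one ends up with a weight of exactly zero. The function name causal_attention_weights and the all-zero scores are illustrative assumptions, not taken from Qwen's implementation.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only look at positions 0..i; later ones are masked out
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([value / total for value in exps])
    return weights


# Equal raw scores for four tokens, so only the mask shapes the result
scores = [[0.0] * 4 for _ in range(4)]
causal = causal_attention_weights(scores)
for row in causal:
    print([round(weight, 2) for weight in row])
```

The printed matrix is lower triangular: the first token can attend only to itself, while the last token spreads its attention over all four positions. This is the same pattern that shows up as the dark upper-right region in the heatmap.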