Deploying DeepSeek Models Locally: A Complete Guide from Deployment to Load Testing

news 2026/7/13 9:24:35

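1. Introduction

The DeepSeek family of models (DeepSeek-V2, DeepSeek-Coder, and others) has drawn wide attention for its strong performance and openly published technical reports. Running such a large language model locally keeps your data private and lets you tailor the inference service to your needs. This guide starts from scratch and walks through local deployment of a DeepSeek model, basic usage, the concepts behind load testing, and a hands-on load test. It is written for AI developers, operations engineers, and hobbyists alike.

What you will learn
Deploy a DeepSeek model on your own server or PC
Call the model for text generation in several ways
Understand the core metrics and workflow of stress testing
Load-test a model service with standard tools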

2. Deploying a DeepSeek Model Locally

2.1 Hardware and Environment

Taking DeepSeek-V2-Lite (16B parameters) as the example, the recommended setup is:

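Component	Minimum	Recommended
GPU memory	24GB (with INT8/INT4 quantization; unquantized 16-bit weights alone are roughly 32GB)	2×24GB, or 1×80GB (A100)
System RAM	32GB	64GB+
Disk space	50GB (model weights + dependencies)	100GB (headroom for caches)
Operating system	Ubuntu 20.04 / Windows 11	Ubuntu 22.04
Python version	3.10	3.10 / 3.11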

If you only have a consumer GPU (such as a 24GB RTX 3090/4090), use a quantized version (INT8/INT4) to reduce the memory requirement. This guide uses DeepSeek-V2-Lite-Chat with Hugging Face Transformers.
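To see why quantization matters on a 24GB card, it helps to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: `weight_memory_gib` is a hypothetical helper, the 16e9 parameter count is the rounded figure used in this section, and KV cache, activations, and framework overhead are all ignored.

```python
# Weight-only memory estimate. Real usage is higher because of the
# KV cache, activations, and framework overhead.

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to hold the model weights, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# 16e9 is the rounded parameter count quoted for DeepSeek-V2-Lite.
for label, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label:>9}: {weight_memory_gib(16e9, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights do not fit in 24GB, while the 8-bit and 4-bit variants leave room for the KV cache.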

2.2 Installing Dependencies

# Create a virtual environment (recommended)
python -m venv deepseek-env
source deepseek-env/bin/activate   # Linux/macOS
# deepseek-env\Scripts\activate    # Windows

# Install PyTorch (pick the build for your CUDA version; this one is CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Transformers and related libraries
pip install transformers accelerate sentencepiece bitsandbytes

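2.3 Downloading the Model Weights

Download from the Hugging Face Hub (depending on your network, you may need a proxy or a mirror):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",  # spread the model across the available GPUs automatically
)

Offline download: if the server has no internet access, download the weights on another machine and copy them over. With the huggingface_hub CLI:
huggingface-cli download deepseek-ai/DeepSeek-V2-Lite-Chat --local-dir ./deepseek-model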

2.4 Quick Inference Check

Write a simple script, test_inference.py:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "./deepseek-model"  # local path
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)

prompt = "Explain what stress testing is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Run python test_inference.py. If it prints a completion, the deployment works.

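3. Using the Locally Deployed Model

3.1 Interactive Command Line

The transformers pipeline gives you an interactive loop quickly:

from transformers import pipeline

# Reuses the `model` and `tokenizer` objects loaded in section 2.4
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

while True:
    user_input = input(">>> ")
    if user_input.lower() in ["exit", "quit"]:
        break
    # return_full_text=False returns only the completion, without the prompt
    out = pipe(
        user_input,
        max_new_tokens=256,
        do_sample=True,
        return_full_text=False,
    )[0]["generated_text"]
    print(out.strip())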

3.2 Wrapping the Model in an HTTP API (FastAPI)

Expose the model as a RESTful API so business systems can call it.

# serve.py
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.7


# Load the model once at startup (global singleton)
tokenizer = AutoTokenizer.from_pretrained("./deepseek-model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./deepseek-model",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",  # keep the checkpoint dtype instead of defaulting to FP32
)


@app.post("/generate")
async def generate(req: GenerateRequest):
    # model.generate() is a blocking call, so requests are handled one at a time
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=req.max_tokens,
        temperature=req.temperature,
        do_sample=True,
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"result": response}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Start the service (after pip install fastapi uvicorn): python serve.py
Example call: curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt":"Hello, please introduce yourself","max_tokens":100}'
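The same call can be made from Python with only the standard library. This is a minimal sketch that assumes the serve.py service above is running on localhost:8000; `build_generate_request` and `call_generate` are illustrative names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request

# Assumes serve.py from section 3.2 is running locally.
API_URL = "http://localhost:8000/generate"


def build_generate_request(prompt, max_tokens=100, temperature=0.7):
    """Build the POST request without sending it."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_generate(prompt, timeout_s=120.0, **kwargs):
    """Send the request and return the text stored in the "result" field."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))["result"]
```

With the server up, `call_generate("Hello, please introduce yourself", max_tokens=100)` returns the generated text as a string.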

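3.3 Using Ollama (Lighter Weight)

For quantized DeepSeek models (such as DeepSeek-Coder 6.7B INT4), Ollama offers one-command deployment:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run the DeepSeek-Coder model (community contributed)
ollama run deepseek-coder:6.7b-instruct-q4_K_M

Ollama manages model loading automatically and provides both an API and an interactive command line.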
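Ollama's local HTTP API can be called in much the same way. The sketch below assumes Ollama is running on its default port 11434 and that the model tag from the command above has already been pulled; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_TAG = "deepseek-coder:6.7b-instruct-q4_K_M"   # tag used in the command above


def build_ollama_request(prompt, model=MODEL_TAG):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_generate(prompt, timeout_s=120.0):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_ollama_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Setting `"stream": False` asks for a single JSON object rather than a stream of partial responses, which keeps the client simple.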

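4. What Is Stress Testing

Stress testing is a form of performance testing that simulates highly concurrent, heavy request load to examine a system's stability, throughput, response latency, and resource consumption at its limits. For a locally deployed LLM service, the goals are to:

Determine the maximum concurrency a single-GPU or multi-GPU setup can sustain
Measure average response time (RT) and requests per second (QPS/RPS)
Locate performance bottlenecks (GPU utilization, memory bandwidth, CPU preprocessing)
Verify reliability under load (whether the service runs out of memory or crashes)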

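Key metrics

Metric	Meaning
QPS	Requests completed successfully per second (Queries Per Second)
Latency	Time from sending a request to receiving the complete response (usually reported as percentiles: p50, p95, p99)
TPOT	Time taken to generate each output token (Time Per Output Token)
GPU memory	Peak GPU memory used during inference
Throughput	Total number of tokens generated per unit of time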
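These metrics can all be derived from per-request records. The following sketch uses the nearest-rank percentile method, which is one of several common definitions; `percentile` and `summarize` are hypothetical helpers, and the sample numbers are made up for illustration rather than measured.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, output_tokens, wall_time_s):
    """Summarize one test run from per-request latencies and output token counts."""
    return {
        "qps": len(latencies_s) / wall_time_s,
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
        "tokens_per_s": sum(output_tokens) / wall_time_s,
    }


# Illustrative input: ten requests finishing within a 20-second window.
sample_latencies = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
sample_tokens = [50] * 10
print(summarize(sample_latencies, sample_tokens, wall_time_s=20.0))
```

Percentiles are reported instead of a plain average because a few slow requests can dominate the user experience while barely moving the mean.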

5. Load-Testing the DeepSeek Model Service

5.1 Load-Testing Workflow (Flowchart)

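5.2 Comparison of Common Load-Testing Tools

Tool	Characteristics	Best for
Locust	Written in Python, supports distributed runs, live web UI	Flexible, custom request logic
JMeter	Feature rich, supports many protocols, produces graphical reports	Traditional web service load tests
wrk	Lightweight, high-performance HTTP load generator with limited scripting	Quick tests of simple APIs
Custom script	Hand-written with `asyncio` + `aiohttp`	Fine-grained control over request content

Because LLM APIs may respond in either non-streaming or streaming mode and the request logic often needs customizing, Locust is the recommended tool here.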

5.3 Load-Testing the DeepSeek API with Locust

5.3.1 Installing Locust

pip install locust

5.3.2 Writing the Load-Test Script `locustfile.py`

from locust import HttpUser, task, between


class DeepSeekUser(HttpUser):
    wait_time = between(0.5, 2)  # simulated user think time, in seconds

    @task
    def generate(self):
        payload = {
            "prompt": "Introduce deep learning in one sentence",
            "max_tokens": 50,
            "temperature": 0.8,
        }
        # The FastAPI service listens on port 8000 and the path is /generate.
        # json= serializes the payload and sets the Content-Type header.
        with self.client.post("/generate", json=payload, catch_response=True) as response:
            if response.status_code == 200:
                if "result" in response.json():
                    response.success()
                else:
                    response.failure("Missing result field")
            else:
                response.failure(f"Status code {response.status_code}")

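5.3.3 Starting the Test

# Start the Locust web UI (port 8089 by default)
locust -f locustfile.py --host=http://localhost:8000

Open http://localhost:8089 in a browser, set the number of concurrent users (for example 10, 50, 100) and the spawn rate, then start the test.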

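5.3.4 Monitoring the Server

During the test, open another terminal to monitor GPU usage:

# Live GPU monitoring
watch -n 1 nvidia-smi

# Or use gpustat (install it first)
gpustat -i 1

Also keep an eye on the service's CPU and memory usage (htop).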
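To line GPU readings up with the Locust charts afterwards, it can help to log them from a script rather than watch them by eye. The sketch below relies on nvidia-smi's `--query-gpu` CSV output; the helper names are illustrative, the sample string is made-up input for the parser rather than a real reading, and GPUs that report a field as unavailable would need extra handling.

```python
import subprocess

# Ask nvidia-smi for bare CSV values: one line per GPU, no header, no units.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse the CSV output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "gpu": int(idx),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Made-up sample line, used only to show the parsed shape.
print(parse_gpu_csv("0, 87, 20312, 24576\n"))
```

Calling `sample_gpus()` once per second during a run and keeping the maximum `mem_used_mib` gives the peak GPU memory figure from the metrics table.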

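5.3.5 Example Result Analysis

Locust draws live charts. The key metrics include:

Average response time: if a single inference takes 5 seconds, the average with 10 concurrent users may grow to around 15 seconds because requests queue behind one another.
RPS: as a rough reference point, single-GPU DeepSeek-V2-Lite on an A100 reaches about 2-3 RPS when generating 128 tokens; adjust your expectations to your hardware.
Failure rate: running out of GPU memory or hitting request timeouts causes failures.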
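The queueing effect in the first point can be estimated with the interactive response time law, a form of Little's law for closed-loop tests such as Locust: in steady state, users = throughput × (response time + think time). The sketch below rearranges it for response time; `expected_response_time_s` is a hypothetical helper and the numbers are illustrative, not measurements.

```python
def expected_response_time_s(users, throughput_rps, think_time_s):
    """Interactive response time law: R = N / X - Z.

    users (N): number of concurrent simulated users
    throughput_rps (X): measured completed requests per second
    think_time_s (Z): average pause between a user's requests
    """
    return users / throughput_rps - think_time_s


# Illustrative numbers: 10 users, a service completing 0.5 requests per second,
# and a 1.25 s mean think time (the midpoint of between(0.5, 2)).
print(f"{expected_response_time_s(10, 0.5, 1.25):.2f} s")
```

If the average response time reported by Locust is far from this estimate, the run has probably not reached steady state or requests are failing.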

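Optimization tips: if the test runs out of GPU memory, consider an inference framework such as vLLM or TGI to improve throughput; if latency is too high, enable acceleration techniques such as FlashAttention-2 or PagedAttention.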

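6. Advanced Scenarios: Streaming Responses and Long Texts

If the service supports streaming output (Server-Sent Events), a custom aiohttp script can simulate concurrent users and measure time to first token (TTFT). Here is a simple asyncio concurrency example:

import aiohttp
import asyncio
import time

# Assumes the service exposes a streaming endpoint at /generate_stream;
# serve.py in section 3.2 does not define one.
URL = "http://localhost:8000/generate_stream"


async def send_request(session, prompt, idx):
    start = time.perf_counter()
    async with session.post(URL, json={"prompt": prompt}) as resp:
        first_token_time = None
        async for line in resp.content:
            # Record the arrival time of the first streamed line only
            if first_token_time is None:
                first_token_time = time.perf_counter()
                print(f"[{idx}] First token delay: {first_token_time - start:.3f}s")
        total = time.perf_counter() - start
        print(f"[{idx}] Total time: {total:.3f}s")


async def main():
    prompts = ["Tell a short story"] * 50  # 50 concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [send_request(session, p, i) for i, p in enumerate(prompts)]
        await asyncio.gather(*tasks)


asyncio.run(main())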
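If the arrival time of every streamed chunk is recorded, not only the first, TTFT and TPOT can both be computed from the timestamps. This is a minimal sketch that assumes each chunk carries roughly one token, which depends on how the server streams; `stream_metrics` is a hypothetical helper and the timestamps are made up.

```python
def stream_metrics(start_s, chunk_times_s):
    """Compute TTFT, TPOT, and total time from chunk arrival timestamps.

    start_s: time the request was sent
    chunk_times_s: arrival time of each streamed chunk, in order
    """
    if not chunk_times_s:
        raise ValueError("no chunks received")
    ttft = chunk_times_s[0] - start_s
    total = chunk_times_s[-1] - start_s
    n = len(chunk_times_s)
    # TPOT averages the gaps between chunks after the first one.
    tpot = (chunk_times_s[-1] - chunk_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": total}


# Made-up timestamps: first chunk after 0.5 s, then one chunk every 0.25 s.
print(stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

Keeping TTFT separate from TPOT matters for long prompts, because prompt processing dominates the first and token-by-token decoding dominates the second.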

7. Summary and Caveats

7.1 Roadmap Recap

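7.2 Important Reminders

Model choice: with limited compute, prefer a quantized version (such as DeepSeek-Coder-1.3B-Int8) or use Ollama.
Memory optimization: load the model with bitsandbytes 4-bit quantization: load_in_4bit=True.
Load-test safety: start the first test at low concurrency (for example 1, 2, 5, 10) so the service is not overwhelmed all at once.
Production deployment: for production, use vLLM or Text Generation Inference (TGI), which can reach roughly 5-10x the throughput of plain Transformers, depending on the workload.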
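The advice to start at low concurrency can be sketched as a staged ramp that stops as soon as failures appear. This is only an outline under stated assumptions: `fake_request` is a stand-in that sleeps instead of calling the service, and `run_stage`, `ramp`, and the 5% failure threshold are illustrative choices. In a real test, `request_fn` would be replaced with an actual HTTP call.

```python
import asyncio
import time


async def fake_request(delay_s=0.01):
    """Stand-in for a real HTTP call; swap in an actual request to the service."""
    await asyncio.sleep(delay_s)
    return True


async def run_stage(concurrency, total_requests, request_fn):
    """Send total_requests requests with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one():
        async with sem:
            try:
                return await request_fn()
            except Exception:
                return False  # count any exception as a failed request

    start = time.perf_counter()
    results = await asyncio.gather(*(one() for _ in range(total_requests)))
    return {
        "concurrency": concurrency,
        "requests": total_requests,
        "failures": results.count(False),
        "elapsed_s": time.perf_counter() - start,
    }


async def ramp(stages=(1, 2, 5, 10), requests_per_stage=20,
               max_failure_rate=0.05, request_fn=fake_request):
    """Step through the concurrency levels and stop once failures exceed the threshold."""
    report = []
    for concurrency in stages:
        stage = await run_stage(concurrency, requests_per_stage, request_fn)
        report.append(stage)
        if stage["failures"] / stage["requests"] > max_failure_rate:
            break
    return report


if __name__ == "__main__":
    for stage in asyncio.run(ramp()):
        print(stage)
```

Stopping at the first failing stage keeps a misconfigured service from being pushed further into out-of-memory conditions while it is already struggling.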