
LLM Inference Engine vLLM (9): Basic Code Structure of vLLM

news 2026/7/5 10:02:42

Table of contents

1 Overall structure
- 1.1 Modules
- 1.2 Peripheral components
- 1.3 Optimizations
2 Modules
- 2.1 Entrypoint
- 2.2 engine
- 2.3 Scheduler
- 2.4 KV Cache manager
- 2.5 evictor
- 2.6 Worker
- 2.7 Model executor
- 2.8 Modelling
- 2.9 Attention backend
References

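These are brief notes I took while watching the tutorial video "[EP01][精剪版] vllm源码讲解，基本代码结构" (a walkthrough of the vLLM source code and its basic structure). If you are interested, go watch the original video directly.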

1 Overall structure

1.1 Modules

Entrypoint (LLM, API server)
Engine
Scheduler
KV cache manager
Worker
Model executor (Model runner)
Modelling
Attention backend

1.2 Peripheral components

Preprocessing / Postprocessing (tokenizer, detokenizer, sampler, multimodal processor)
Distributed
torch.compile
Observability
Config
Testing
CI / CD
Formatting

1.3 Optimizations

Speculative decoding
Disaggregated prefilling
Chunked prefill
Cascade inference

2 Modules

2.1 Entrypoint

For newcomers, reading the examples under the examples folder is a good place to start, for instance:
./vllm/examples/offline_inference/basic/basic.py

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
    "Hello, my name is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)


def main():
    # Create an LLM.
    llm = LLM(
        model="/volume/vllm_20260124/models/Qwen/Qwen3-8B/",
        tensor_parallel_size=1,
        dtype="float16",
        trust_remote_code=True,
        enforce_eager=True,
        block_size=16,
        enable_prefix_caching=False,
    )
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Output: {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
```

The LLM class is defined in vllm/vllm/entrypoints/llm.py.
The API server is in vllm/vllm/entrypoints/api_server.py.

2.2 engine

The engine is where the work gets done; its code lives in vllm/vllm/engine.
vllm/vllm/engine/llm_engine.py does the actual work, while vllm/vllm/engine/async_llm_engine.py wraps it in an asynchronous shell.
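To make the "synchronous core, asynchronous shell" idea concrete, here is a minimal sketch. It is not vLLM's code: ToyEngine, ToyAsyncEngine, and the fake "tok0, tok1, ..." tokens are hypothetical names invented for illustration. The only point is that the async class adds no generation logic of its own; it just drives the synchronous engine's step() from an event loop.

```python
import asyncio


class ToyEngine:
    """Hypothetical synchronous engine: each step() emits one fake token per request."""

    def __init__(self):
        self.pending = {}  # request_id -> number of tokens still to generate
        self.outputs = {}  # request_id -> tokens generated so far

    def add_request(self, request_id, max_tokens):
        self.pending[request_id] = max_tokens
        self.outputs[request_id] = []

    def step(self):
        # One step: every pending request receives exactly one token.
        for request_id in list(self.pending):
            tokens = self.outputs[request_id]
            tokens.append(f"tok{len(tokens)}")
            self.pending[request_id] -= 1
            if self.pending[request_id] == 0:
                del self.pending[request_id]


class ToyAsyncEngine:
    """Hypothetical async shell: the same engine, driven from coroutines."""

    def __init__(self):
        self.engine = ToyEngine()

    async def generate(self, request_id, max_tokens):
        self.engine.add_request(request_id, max_tokens)
        while request_id in self.engine.pending:
            self.engine.step()
            await asyncio.sleep(0)  # yield so other coroutines can make progress
        return self.engine.outputs[request_id]


async def demo():
    engine = ToyAsyncEngine()
    # Two concurrent callers share one underlying synchronous engine.
    return await asyncio.gather(engine.generate("a", 2), engine.generate("b", 3))


results = asyncio.run(demo())
print(results)
```

Both callers share one engine instance, so a step triggered by one coroutine also advances the other request. That is a simplification chosen for this sketch, not a description of how async_llm_engine.py schedules its loop.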

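2.3 Scheduler

The scheduler lives in vllm/vllm/core.
One model inference pass is called a step. A single request to a large language model goes through many inference passes, and each pass produces one token.
The scheduler's job is to decide which requests go into each step.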
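The step-by-step picture above can be sketched as a toy scheduler. Everything here is an assumption made for illustration: the names Request and ToyScheduler, the first-come-first-served admission rule, and the max_batch_size limit are not taken from vLLM's scheduler, which also has to account for KV cache space and other constraints.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    max_tokens: int
    generated: list = field(default_factory=list)


class ToyScheduler:
    """Hypothetical scheduler: decides which requests run in each step."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()  # requests not yet admitted
        self.running = []       # requests that take part in every step until done

    def add_request(self, request):
        self.waiting.append(request)

    def schedule(self):
        # Admit waiting requests first-come-first-served until the batch is full.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def finish_step(self, batch):
        # Pretend the model ran: each request in the batch gains one token.
        for request in batch:
            request.generated.append(f"tok{len(request.generated)}")
        # Requests that reached their token budget leave the running set.
        self.running = [r for r in self.running if len(r.generated) < r.max_tokens]


scheduler = ToyScheduler(max_batch_size=2)
for request_id, max_tokens in [("a", 1), ("b", 3), ("c", 2)]:
    scheduler.add_request(Request(request_id, max_tokens))

batches = []
while scheduler.waiting or scheduler.running:
    batch = scheduler.schedule()
    batches.append([r.request_id for r in batch])
    scheduler.finish_step(batch)

print(batches)
```

Request "c" only enters the batch once "a" has finished and freed a slot, which is the kind of per-step decision the scheduler is responsible for.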

2.4 KV Cache manager

The code is in vllm/vllm/core/block_manager.py.

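2.5 evictor

vllm/vllm/core/evictor.py
The eviction policy currently in use is LRU (least recently used): when something needs to be stored and there is not enough space, the least recently used entries are evicted.
This is used by prefix caching.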
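As a rough illustration of the LRU policy described above, the following sketch keeps cached blocks in an OrderedDict and evicts the least recently used entry when capacity is exceeded. ToyLRUCache, its capacity parameter, and the prefix keys are hypothetical; this is not the interface of vllm/vllm/core/evictor.py.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Hypothetical LRU cache of blocks, ordered from oldest to most recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # key (e.g. a prefix hash) -> block

    def get(self, key):
        if key not in self.blocks:
            return None
        # A hit marks the entry as most recently used.
        self.blocks.move_to_end(key)
        return self.blocks[key]

    def put(self, key, block):
        """Store a block; return the key that was evicted, or None."""
        evicted = None
        if key in self.blocks:
            self.blocks.move_to_end(key)
        elif len(self.blocks) >= self.capacity:
            # Out of space: drop the least recently used entry.
            evicted, _ = self.blocks.popitem(last=False)
        self.blocks[key] = block
        return evicted


cache = ToyLRUCache(capacity=2)
cache.put("prefix-1", "block-0")
cache.put("prefix-2", "block-1")
cache.get("prefix-1")                         # prefix-1 becomes most recently used
evicted = cache.put("prefix-3", "block-2")    # cache is full, so one entry must go
print(evicted, list(cache.blocks))
```

Because "prefix-1" was read just before the insertion, "prefix-2" is the entry that gets evicted.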

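2.6 Worker

If the scheduler is the academic advisor, the workers are the PhD students under it: each worker carries out the scheduler's commands and hands the results back to the scheduler.
The workers are the ones doing the grunt work.
vllm/vllm/worker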

2.7 Model executor

The worker sets up the environment, and the model executor is what actually runs the model.
vllm/vllm/model_executor
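The division of labor between workers and the model executor can be sketched as follows. This is a toy under stated assumptions: ToyWorker, ToyModelRunner, the "gpu:N" device strings, and the choice to hand every worker the same batch are all invented for illustration and do not reproduce vLLM's classes or its distributed execution.

```python
class ToyModelRunner:
    """Hypothetical stand-in for the component that actually runs the model."""

    def __init__(self, device):
        self.device = device

    def execute_model(self, batch):
        # Fake forward pass: one placeholder token per scheduled request.
        return {request_id: f"token-from-{self.device}" for request_id in batch}


class ToyWorker:
    """Hypothetical worker: prepares its environment, then delegates to the runner."""

    def __init__(self, rank):
        self.rank = rank
        self.model_runner = None

    def init_device(self):
        # Environment setup comes first (here it only picks a device name).
        device = f"gpu:{self.rank}"
        self.model_runner = ToyModelRunner(device)

    def execute_model(self, batch):
        if self.model_runner is None:
            raise RuntimeError("call init_device() before execute_model()")
        return self.model_runner.execute_model(batch)


workers = [ToyWorker(rank) for rank in range(2)]
for worker in workers:
    worker.init_device()

# The scheduler's decision for this step (hypothetical request IDs).
scheduled_batch = ["req-0", "req-1"]
step_results = [worker.execute_model(scheduled_batch) for worker in workers]
print(step_results)
```

The worker itself contains no model logic; it owns the setup and forwards the scheduled batch to the runner, which mirrors the split described in sections 2.6 and 2.7.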

2.8 Modelling

2.9 Attention backend

vllm/vllm/attention/backends
vllm/vllm/attention/backends/flash_attn.py

References

https://www.youtube.com/watch?v=uclfcBc8hPE

