
[AI-Native Development in Practice] 6.1 LLM Microservice Architecture Design

news 2026/6/17 23:53:37

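Learning Objectives

Understand the fundamental differences between AI-native microservices and traditional microservices
Learn the lifecycle of an LLM inference service and how to draw its service boundaries
Get to know the core components and architecture patterns of cloud-native LLM deployment
Understand the design philosophy behind Model as a Service (MaaS)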

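I. The Paradigm Shift to AI-Native Architecture

1.1 From "Microservices" to "AI-Native"

When traditional enterprises adopt cloud native, they usually containerize and orchestrate their existing applications. This "bolt-on containers" approach delivers cloud-native benefits such as elastic scaling, but it leaves the underlying architecture unchanged.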

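AI-native architecture is different: it treats AI capability as a first-class citizen from the very beginning. Taking an LLM inference service as an example, an AI-native architecture has to answer:

How is the model exposed as a service?
How are input preprocessing and output postprocessing decoupled from inference?
How do GPU resources cooperate with CPU-side services?
How are long contexts and streaming output handled?

These questions rarely arise in traditional microservices, but they are central to LLM services.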

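1.2 Core Design Principles

AI-native microservice architecture rests on three core design principles:

Model as a Service Unit (MaaSU): each microservice encapsulates a complete inference stack, from preprocessing and model loading through inference execution and postprocessing to observability instrumentation. This departs from the "single responsibility" principle of traditional microservices: the stages of an LLM service are kept tightly coupled for the sake of efficiency.

Lifecycle-driven service boundaries: LLM inference has five key lifecycle stages: request ingestion, context loading, KV cache management, decode scheduling, and streaming output. Service boundaries should follow these stages dynamically rather than being defined statically up front.

Adaptive traffic governance: traditional microservices rate-limit on QPS, whereas LLM services must account for several dimensions at once, such as token count, context length, and GPU memory usage.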
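
The third principle can be illustrated with a token bucket that charges each request by its token cost instead of counting requests. This is only a minimal sketch: the class name `TokenBudgetLimiter`, its parameters, and the cost model (prompt tokens plus requested output tokens) are assumptions made for illustration and do not come from any particular framework.

```python
import time


class TokenBudgetLimiter:
    """Token-bucket limiter whose budget is measured in model tokens.

    Hypothetical helper: capacity and refill rate are expressed in LLM
    tokens, not in requests per second.
    """

    def __init__(self, capacity_tokens, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity_tokens)
        self.refill_per_second = float(refill_per_second)
        self.available = float(capacity_tokens)
        self._clock = clock
        self._last = clock()

    def _refill(self):
        # Add back budget in proportion to the time elapsed since the last call.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.available = min(
            self.capacity, self.available + elapsed * self.refill_per_second
        )

    def try_admit(self, prompt_tokens, max_new_tokens):
        """Return True and deduct the cost if the request fits the budget."""
        cost = prompt_tokens + max_new_tokens  # assumed cost model
        self._refill()
        if cost > self.available:
            return False
        self.available -= cost
        return True
```

A production limiter would likely combine this with the other dimensions named above, for example rejecting requests whose context length exceeds what the free GPU memory can hold.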

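II. The LLM Inference Lifecycle and Service Boundaries

2.1 The Five-Stage Inference Model

LLM inference is not a simple "input in, output out" call; it passes through several distinct stages:

Stage 1: Request Ingestion

The input token sequence enters the system
Authentication and rate-limit checks are performed
The request joins the queue and waits to be scheduled

The key resources in this stage are network bandwidth and CPU preprocessing capacity. When the input is a long document, preprocessing (tokenization, encoding) can take a noticeable amount of time.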

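Stage 2: Context Loading

Input tokens are converted into the embedding vectors the model expects
The data is loaded into GPU memory
The initial KV cache is built

This stage, commonly called prefill, consumes a large amount of GPU memory and is one of the bottlenecks: the longer the input, the longer context loading takes.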

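Stage 3: KV Cache Management

During autoregressive generation, the Key and Value vectors of every layer are cached
Later tokens read them back instead of recomputing them

The KV cache is a key optimization in LLM inference because it avoids recomputing keys and values for earlier tokens at every step. vLLM's PagedAttention raises throughput substantially precisely by managing KV cache memory more efficiently.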
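
To get a feel for why KV cache management matters, its size can be estimated from the model shape. The sketch below uses the standard accounting (one Key and one Value tensor per layer); the concrete numbers (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are an assumed configuration roughly in the range of a 7B-parameter model, not figures from this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bytes_per_element=2):
    """Estimate KV cache size for one sequence.

    The leading factor 2 accounts for storing both Key and Value
    vectors in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_element * num_tokens)


# Assumed model shape, for illustration only.
per_token = kv_cache_bytes(32, 32, 128, num_tokens=1)
full_context = kv_cache_bytes(32, 32, 128, num_tokens=4096)

print(per_token / 2**20)     # 0.5 (MiB per token)
print(full_context / 2**30)  # 2.0 (GiB for a 4096-token sequence)
```

At these assumed sizes a few dozen concurrent long requests already fill a single GPU, which is why allocation strategy has such a large effect on throughput.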

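Stage 4: Decode Scheduling

The next batch to process is selected from the request queue
Attention and feed-forward computations are executed
New tokens are generated

This stage usually accounts for most of the inference time on long generations. Each decode step produces only one token per sequence, so it tends to be limited by memory bandwidth (reading model weights and the KV cache) rather than by raw compute, which is why batching matters so much here.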
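
The batch-selection step can be sketched as a FIFO scheduler with a token budget. This is a simplified illustration, not the scheduler of any specific framework; `PendingRequest`, `select_batch`, and both limits are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingRequest:
    request_id: str
    num_tokens: int  # tokens this request contributes to the next step


def select_batch(queue, max_batch_tokens, max_batch_size):
    """Pop requests in FIFO order until a token or size limit is reached."""
    batch = []
    used_tokens = 0
    while queue and len(batch) < max_batch_size:
        head = queue[0]
        if used_tokens + head.num_tokens > max_batch_tokens:
            # Preserve FIFO order: stop instead of skipping past the head.
            # A request larger than the whole budget would need separate
            # handling (rejection or chunking), which is omitted here.
            break
        batch.append(queue.popleft())
        used_tokens += head.num_tokens
    return batch


queue = deque([
    PendingRequest("a", 300),
    PendingRequest("b", 500),
    PendingRequest("c", 400),
    PendingRequest("d", 100),
])
batch = select_batch(queue, max_batch_tokens=1000, max_batch_size=8)
print([r.request_id for r in batch])  # ['a', 'b']
```

Stopping at the first request that does not fit keeps ordering fair at the cost of some unused budget; a scheduler that skips ahead would pack batches tighter but can starve large requests.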

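Stage 5: Streaming Output

Generated tokens are returned to the client as a stream
Server-Sent Events (SSE) or WebSocket can be used as the transport

Streaming output improves the user experience: users see partial results without waiting for the full generation to finish.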
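
For the SSE option, each generated token is wrapped in a `data:` line followed by a blank line. The generator below is a minimal sketch of that framing using only the standard library; the JSON payload shape and the `[DONE]` sentinel are assumptions (a convention some APIs follow), not something the SSE format itself requires.

```python
import json


def sse_events(token_stream):
    """Yield one Server-Sent Events frame per generated token."""
    for index, token in enumerate(token_stream):
        payload = json.dumps({"index": index, "token": token},
                             ensure_ascii=False)
        # An SSE event is one or more "field: value" lines ended by a blank line.
        yield f"data: {payload}\n\n"
    # Assumed end-of-stream marker so clients know generation has finished.
    yield "data: [DONE]\n\n"


frames = list(sse_events(["Hel", "lo"]))
print(frames[0], end="")
```

In a web framework, a generator like this would typically be returned as a streaming response with the `text/event-stream` content type.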

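2.2 How to Draw Service Boundaries

Traditional service boundaries follow business capabilities. LLM service boundaries should instead follow resource characteristics and performance requirements: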

┌──────────────────────────────────────────────┐
│                 API Gateway                  │
│   (rate limiting, authentication, routing)   │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────▼───────────────────────┐
│                 Preprocessor                 │
│ (tokenization, encoding, request conversion) │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────▼───────────────────────┐
│                 Model Server                 │
│    (model inference, KV cache management)    │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────▼───────────────────────┐
│                Postprocessor                 │
│ (detokenization, formatting, stream framing) │
└──────────────────────────────────────────────┘

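Advantages of this layered design:

Each layer can scale independently
Preprocessing and postprocessing can run on CPU, lowering GPU cost
Targeted optimization and troubleshooting become easier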

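III. Containerization and Kubernetes Deployment

3.1 Containerizing the Model Service

Containerizing an LLM inference service is more involved than containerizing an ordinary application. Points to consider:

Base image: typically a CUDA-enabled PyTorch or TensorFlow image.

FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install the inference framework and serving dependencies
RUN pip install --no-cache-dir vllm transformers fastapi uvicorn

# Copy the model and the application code
COPY ./model /app/model
COPY ./app /app/app

# Expose the service port
EXPOSE 8000

# Start command
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]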

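GPU support: Kubernetes needs GPU scheduling configured, including the NVIDIA device plugin, so that Pods can request nvidia.com/gpu.

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: llm-inference
      image: llm-server:v1
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "32Gi"
          cpu: "8"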

3.2 Elastic Scaling on Kubernetes

HPA (Horizontal Pod Autoscaler) supports scaling on custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Pods metrics must be served by a custom metrics adapter
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_ratio
        target:
          type: AverageValue
          averageValue: "0.8"

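For LLM services, VPA (Vertical Pod Autoscaler) can complement HPA: VPA adjusts the resource requests of an individual Pod, while HPA adjusts the number of Pods. The two should not act on the same CPU or memory metric, so when VPA manages CPU and memory, HPA is better driven by custom metrics such as GPU utilization.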
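
How HPA turns a metric reading into a replica count follows the formula in the Kubernetes documentation: desired = ceil(current replicas × current value / target value), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the configured bounds. The function below is a simplified sketch of that rule; it leaves out stabilization windows, missing-metric handling, and multi-metric selection.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified version of the HPA replica calculation."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Using the 0.8 GPU utilization target and the 2..10 replica range as an example:
print(desired_replicas(4, 0.95, 0.8, 2, 10))  # 5
print(desired_replicas(4, 0.82, 0.8, 2, 10))  # 4 (within tolerance)
print(desired_replicas(8, 1.60, 0.8, 2, 10))  # 10 (clamped to maxReplicas)
```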

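3.3 Model Version Management

Production environments usually run several model versions side by side to support A/B testing and canary releases:

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
    - port: 80
      targetPort: 8000
---
# An Endpoints object only takes effect alongside a selector-less Service of the same name.
apiVersion: v1
kind: Endpoints
metadata:
  name: llm-service-v1
subsets:
  - addresses:
      - ip: 10.0.0.1
    ports:
      - port: 8000
---
apiVersion: v1
kind: Endpoints
metadata:
  name: llm-service-v2
subsets:
  - addresses:
      - ip: 10.0.0.2
    ports:
      - port: 8000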

An Istio VirtualService can then control how traffic is split:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: llm-routing
spec:
  hosts:
    - llm-service
  http:
    - route:
        - destination:
            host: llm-service-v1
          weight: 90
        - destination:
            host: llm-service-v2
          weight: 10
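
Conceptually, the 90/10 weights amount to a weighted random choice made per request. The sketch below only illustrates that idea and makes no claim about how Istio or Envoy implement it internally; `pick_destination` is a hypothetical helper.

```python
import random
from collections import Counter


def pick_destination(weights, rng):
    """Choose a destination host with probability proportional to its weight."""
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


weights = {"llm-service-v1": 90, "llm-service-v2": 10}
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(pick_destination(weights, rng) for _ in range(10_000))
share_v1 = counts["llm-service-v1"] / 10_000
print(round(share_v1, 2))
```

Because the choice is made per request, a single user may hit both versions during a canary; pinning users to one version would need an additional rule, for example matching on a header.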

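IV. Choosing an Inference Framework

4.1 Comparison of Mainstream Frameworks

Framework	Best for	Key strengths	Weaknesses
vLLM	High-throughput serving	PagedAttention, continuous batching	Less aggressively optimized than TensorRT-LLM
TensorRT-LLM	Maximum performance	Deep optimization, TensorRT operator fusion	NVIDIA-only
TGI	Open-source friendly	Easy to use, full-featured	Moderate performance
SGLang	Structured output	RadixAttention, constrained decoding	Younger ecosystem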

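4.2 vLLM's Architecture

vLLM is one of the most widely used open-source inference frameworks, and its core innovation is PagedAttention:

The memory problem with conventional attention: in conventional implementations, each request's KV cache needs a contiguous region of GPU memory. With many users and many concurrent requests, memory becomes badly fragmented and utilization drops.

How PagedAttention addresses it: the KV cache is split into fixed-size "pages", much like virtual memory management in an operating system. Even when the KV cache of different requests is scattered across physical locations, it can still be accessed efficiently.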
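
The paging idea can be shown with a toy allocator that maps each request to a list of fixed-size blocks through a block table. This is an illustrative sketch of the concept only, not vLLM's implementation; `PagedKVAllocator` and its methods are hypothetical.

```python
class PagedKVAllocator:
    """Toy block allocator illustrating paged KV cache bookkeeping."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request_id -> list of physical block ids
        self.lengths = {}       # request_id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (block_id, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            # Every allocated block is full (or none exists yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
alloc.append_token("a")
alloc.append_token("b")
alloc.append_token("a")
alloc.append_token("a")  # request "a" spills into a second block
print(alloc.block_tables)  # {'a': [3, 1], 'b': [2]}
```

Request "a" ends up in blocks 3 and 1, which are not adjacent, yet its tokens stay addressable through the block table. Memory is wasted only in the last, partially filled block of each request.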

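4.3 A Decision Tree for Framework Selection

Decision tree for choosing an inference framework:

        ┌──────────────────────────────┐
        │ Which property matters most? │
        └───────────────┬──────────────┘
                        │
         ┌──────────────┴─────────────┐
         ▼                            ▼
┌──────────────────┐  ┌───────────────────────────────┐
│ Peak performance │  │ Fast deployment / flexibility │
└────────┬─────────┘  └───────────────┬───────────────┘
         │                            │
         ▼                            ▼
┌──────────────────┐  ┌───────────────────────────────┐
│   TensorRT-LLM   │  │          vLLM / TGI           │
└──────────────────┘  └───────────────┬───────────────┘
                                      │
                            ┌─────────┴─────────┐
                            ▼                   ▼
                   ┌─────────────────┐   ┌─────────────┐
                   │      vLLM       │   │     TGI     │
                   │ high throughput │   │ ease of use │
                   └─────────────────┘   └─────────────┘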

V. High-Availability Architecture

5.1 Multi-Replica Load Balancing

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  type: LoadBalancer
  selector:
    app: llm-inference
  ports:
    - name: http
      port: 80
      targetPort: 8000
  sessionAffinity: None

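For LLM services, client-level session affinity is recommended so that requests from the same user are routed to the same Pod, which reduces the cost of KV cache misses when the inference engine can reuse cached prefixes. The Service above sets sessionAffinity to None; the built-in alternative is ClientIP, and finer-grained affinity (per user or per session) has to be implemented at the gateway or in the client.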
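
One way to implement per-session affinity at the gateway is rendezvous (highest-random-weight) hashing: every session deterministically prefers one Pod, and removing a different Pod does not move it. The sketch below is one possible approach under that assumption, not a feature of the Kubernetes Service; `pick_pod` and the Pod names are hypothetical.

```python
import hashlib


def pick_pod(session_id, pods):
    """Rendezvous hashing: the Pod with the highest hash score wins."""
    def score(pod):
        digest = hashlib.sha256(f"{session_id}|{pod}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(pods, key=score)


pods = ["llm-pod-a", "llm-pod-b", "llm-pod-c"]
chosen = pick_pod("user-42", pods)

# Removing a Pod that was not chosen leaves the mapping unchanged.
other = next(p for p in pods if p != chosen)
remaining = [p for p in pods if p != other]
print(pick_pod("user-42", remaining) == chosen)  # True
```

Compared with a plain `hash(session) % len(pods)`, this keeps most sessions on their Pod during scale-out and scale-in, so fewer cached prefixes are lost.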

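5.2 Cross-Availability-Zone Deployment

apiVersion: v1
kind: Pod
metadata:
  name: llm-replica-1
  labels:
    app: llm-inference
spec:
  affinity:
    podAntiAffinity:
      # With zone as topologyKey, this allows at most one matching Pod per zone.
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - llm-inference
          topologyKey: topology.kubernetes.io/zone
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: llm-inference
  containers:
    - name: llm-server
      image: llm-server:v1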

5.3 Automatic Failure Recovery

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: llm-server
      image: llm-server:v1
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 5

VI. Monitoring and Observability

6.1 Core Metrics

Monitoring an LLM service means tracking the following key metrics:

from prometheus_client import Counter, Gauge, Histogram

# Prometheus metric definitions
llm_requests_total = Counter(
    'llm_requests_total',
    'Total number of LLM requests',
    ['model', 'status'],
)

llm_request_duration_seconds = Histogram(
    'llm_request_duration_seconds',
    'LLM request duration',
    ['stage'],  # prefill, decode, total
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
)

llm_tokens_total = Counter(
    'llm_tokens_total',
    'Total number of tokens processed',
    ['type'],  # input, output
)

gpu_utilization = Gauge(
    'gpu_utilization',
    'GPU utilization ratio',
    ['device'],
)

kv_cache_usage = Gauge(
    'kv_cache_usage_ratio',
    'KV cache memory usage ratio',
    ['device'],
)

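6.2 Logging and Tracing

Integrate OpenTelemetry for distributed tracing:

import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Register a tracer provider that exports spans to Jaeger in batches.
# JaegerExporter() without arguments uses the default local agent settings.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(JaegerExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


def traced_inference(model, prompt_tokens, max_new_tokens):
    # model.prefill and model.decode stand in for the serving engine's own calls.
    # Wrap the whole inference request in a span.
    with tracer.start_as_current_span("llm_inference") as span:
        span.set_attribute("prompt_length", len(prompt_tokens))
        span.set_attribute("max_tokens", max_new_tokens)

        # Prefill stage
        with tracer.start_as_current_span("prefill") as prefill_span:
            prefill_start = time.time()
            prefill_output = model.prefill(prompt_tokens)
            prefill_span.set_attribute("duration", time.time() - prefill_start)

        # Decode stage
        with tracer.start_as_current_span("decode") as decode_span:
            output_tokens = model.decode(prefill_output, max_new_tokens)
            decode_span.set_attribute("output_length", len(output_tokens))

        return output_tokens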