部署说明:
本示例使用SGlang作为大模型部署运行的框架,并且使用SGLang Router作为网关负载后端部署的大模型服务。示例使用模型为deepseek-v32,每台节点为8张H200GPU卡,一共三台实例。下面为具体内容
1、部署deepseek-v32的sts服务
apiVersion: apps/v1 kind: StatefulSet metadata:name: deepseek-v32-workernamespace: deepseek spec:serviceName: deepseek-v32-workerreplicas: 3selector:matchLabels:app: deepseek-v32-workermodel: deepseek-v32template:metadata:labels:app: deepseek-v32-workermodel: deepseek-v32spec:affinity:podAntiAffinity:requiredDuringSchedulingIgnoredDuringExecution:- labelSelector:matchLabels:app: deepseek-v32-workertopologyKey: kubernetes.io/hostnamecontainers:- name: sglangimage: lmsysorg/sglang:v0.4.1-cu124command: ["python3", "-m", "sglang.launch_server"]args:- "--model-path=/models/DeepSeek-V3.2"- "--tp=8"- "--dp=1"- "--quantization=fp8"- "--context-length=131072"- "--mem-fraction-static=0.88"- "--trust-remote-code"- "--port=8000"- "--enable-metrics"ports:- containerPort: 8000name: httpresources:limits:nvidia.com/gpu: "8"cpu: "128"memory: "1Ti"volumeMounts:- name: modelsmountPath: /models- name: dshmmountPath: /dev/shmvolumes:- name: modelspersistentVolumeClaim:claimName: deepseek-v32-pvc- name: dshmemptyDir:medium: MemorysizeLimit: "200Gi"tolerations:- key: nvidia.com/gpuoperator: Existseffect: NoSchedule --- apiVersion: v1 kind: Service metadata:name: deepseek-v32-workernamespace: deepseek spec:selector:app: deepseek-v32-workerports:- port: 8000name: http
2、部署Router Deploy(并通过K8S机制实现worker自动发下)
apiVersion: v1 kind: ServiceAccount metadata:name: sglang-routernamespace: deepseek --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata:name: sglang-routernamespace: deepseek rules: - apiGroups: [""]resources: ["pods"]verbs: ["get", "list", "watch"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata:name: sglang-routernamespace: deepseek subjects: - kind: ServiceAccountname: sglang-routernamespace: deepseek roleRef:kind: Rolename: sglang-routerapiGroup: rbac.authorization.k8s.io --- apiVersion: apps/v1 kind: Deployment metadata:name: deepseek-v32-routernamespace: deepseek spec:replicas: 2selector:matchLabels:app: deepseek-v32-routertemplate:metadata:labels:app: deepseek-v32-routerspec:serviceAccountName: sglang-routercontainers:- name: routerimage: lmsysorg/sglang:v0.4.1-cu124command: ["python3", "-m", "sglang_router.launch_router"]args:- "--service-discovery" # 启用 K8s 服务发现- "--selector=app=deepseek-v32-worker" # 匹配 Worker 标签- "--service-discovery-namespace=deepseek" # Namespace- "--service-discovery-port=8000" # Worker 端口- "--policy=cache_aware" # 缓存感知路由- "--cache-threshold=0.5"- "--port=8080"- "--host=0.0.0.0"ports:- containerPort: 8080name: routerresources:limits:cpu: "4"memory: "8Gi"livenessProbe:httpGet:path: /healthport: 8080initialDelaySeconds: 10periodSeconds: 5 --- apiVersion: v1 kind: Service metadata:name: deepseek-v32-routernamespace: deepseek spec:selector:app: deepseek-v32-routerports:- port: 8080targetPort: 8080type: ClusterIP --- apiVersion: networking.k8s.io/v1 kind: Ingress metadata:name: deepseek-v32namespace: deepseekannotations:nginx.ingress.kubernetes.io/proxy-read-timeout: "1200"nginx.ingress.kubernetes.io/proxy-send-timeout: "1200" spec:rules:- host: deepseek-v32.yourdomain.comhttp:paths:- path: /pathType: Prefixbackend:service:name: deepseek-v32-routerport:number: 8080
总结:
通过sglang-router并配合K8S的自动发现机制,实现扩容模型实例时,能够自动感知,无需人工干预。
