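Unified GPU Pool with Queueing and Scheduling Policies: Efficient Scheduling and Resource Pooling for Multi-Model Serving on Containerized Kubernetes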
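Introduction
In a cloud-native LLM platform, several model services of different sizes usually have to be deployed side by side, and each has its own GPU requirements. Giving every model a dedicated GPU allocation leads to low resource utilization. The key to raising overall utilization is a unified GPU resource pool combined with queueing and scheduling policies, so that multiple model services are scheduled efficiently against pooled resources.
This article examines how to build such a unified GPU pool in a containerized Kubernetes environment, and how queueing and scheduling policies achieve efficient multi-model scheduling and resource pooling on top of it.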
1. GPU Resource Pool Scheduling Strategy
1.1 GPU Pool Architecture
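```mermaid
flowchart TB
    subgraph Models [Model services]
        A[Model A - 7B]
        B[Model B - 13B]
        C[Model C - 70B]
    end
    subgraph ControlPlane [Control plane]
        D[Central scheduler]
        E[Queue manager]
        F[Priority controller]
    end
    subgraph NodePools [Node pools]
        G[Node pool - general purpose]
        H[Node pool - high performance]
        I[Node pool - inference dedicated]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    F --> I
```

1.2 Scheduling Policy and Resource Configuration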
| Model size | GPU memory required | Recommended node | Max throughput (QPS) | Scheduling priority |
|---|---|---|---|---|
| 7B | 13 GB | A10G / A100 | 200 | Medium |
| 13B | 26 GB | A100 / H100 | 100 | Medium-high |
| 70B | 140 GB | 2× A100 / 2× H100 | 20 | High |
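The GPU counts in the table follow from dividing a model's memory footprint by the memory of one GPU and rounding up. The sketch below illustrates that arithmetic only. The function name `gpusNeeded` is hypothetical, the per-GPU sizes (24 GB for an A10G, 80 GB for the larger A100/H100 variants) are assumptions about which cards are in the pool, and the calculation ignores KV cache and activation overhead, which a real deployment would need headroom for.

```go
package main

import "fmt"

// gpusNeeded returns how many GPUs of size perGPUGB are required to hold a
// model footprint of modelGB. It is a rough lower bound: KV cache and
// activation memory are deliberately ignored.
func gpusNeeded(modelGB, perGPUGB int) int {
	if modelGB <= 0 || perGPUGB <= 0 {
		return 0
	}
	// Integer ceiling division.
	return (modelGB + perGPUGB - 1) / perGPUGB
}

func main() {
	fmt.Println("7B on 24 GB GPUs:", gpusNeeded(13, 24))
	fmt.Println("13B on 24 GB GPUs:", gpusNeeded(26, 24))
	fmt.Println("13B on 80 GB GPUs:", gpusNeeded(26, 80))
	fmt.Println("70B on 80 GB GPUs:", gpusNeeded(140, 80))
}
```

Under these assumptions a 13B model no longer fits on a single 24 GB card, which is consistent with the table steering it to A100/H100 nodes, and a 70B model needs two 80 GB cards.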
2. Resource Requirements by Model Size
2.1 Resource Requirements of Different Model Sizes
```yaml
# Illustrative custom resource (example.com API group), not a built-in Kubernetes kind.
apiVersion: inference.example.com/v1
kind: ModelServingSpec
metadata:
  name: llama-7b
spec:
  modelFamily: "llama"
  modelSize: "7b"
  resources:
    gpuMemory: "13Gi"
    cpu: "2"
    memory: "16Gi"
  autoscaling:
    minReplicas: 2
    maxReplicas: 20
    targetConcurrency: 100
  placementConstraints:
    nodeTypes: ["a10g", "a100"]
    toleratePreemptible: true
  schedulingClass: "medium-priority"
---
apiVersion: inference.example.com/v1
kind: ModelServingSpec
metadata:
  name: llama-70b
spec:
  modelFamily: "llama"
  modelSize: "70b"
  resources:
    gpuMemory: "140Gi"
    cpu: "8"
    memory: "64Gi"
  autoscaling:
    minReplicas: 1
    maxReplicas: 5
    targetConcurrency: 20
  placementConstraints:
    nodeTypes: ["a100-80gb", "h100"]
    toleratePreemptible: false
  schedulingClass: "high-priority"
```

2.2 Resource Pool Configuration
```yaml
# Illustrative custom resource (example.com API group), not a built-in Kubernetes kind.
apiVersion: gpu.example.com/v1
kind: ResourcePool
metadata:
  name: general-purpose
spec:
  nodeSelector:
    gpu.type: "a10g"
  capacity:
    totalGPUs: 32
    availableGPUs: 28
  taints:
    - key: "gpu-pool"
      value: "general-purpose"
      effect: "NoSchedule"
  tolerations:
    - key: "gpu-pool"
      value: "general-purpose"
  schedulingPolicy:
    binPack: true
    overcommitRatio: 1.2
---
apiVersion: gpu.example.com/v1
kind: ResourcePool
metadata:
  name: high-performance
spec:
  nodeSelector:
    gpu.type: "a100"
  capacity:
    totalGPUs: 16
    availableGPUs: 12
  schedulingPolicy:
    binPack: false
    spread: true
    overcommitRatio: 1.0
```

3. Queueing and Scheduling
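Before looking at a single queue, it may help to see how several queues could share dispatch slots in proportion to their weights. The following is a minimal sketch of smooth weighted round-robin, one common technique for this and not necessarily what a production scheduler would use. The `weightedQueue` type, the `pickQueue` function, and the 50/30/20 weights are illustrative assumptions.

```go
package main

import "fmt"

// weightedQueue is a hypothetical stand-in for a named scheduling queue.
type weightedQueue struct {
	name    string
	weight  int // configured share of dispatch slots
	current int // running counter used by the algorithm
}

// pickQueue performs one round of smooth weighted round-robin: every queue's
// counter grows by its weight, the queue with the largest counter is chosen,
// and the chosen queue's counter is reduced by the sum of all weights.
func pickQueue(queues []*weightedQueue) *weightedQueue {
	if len(queues) == 0 {
		return nil
	}
	total := 0
	var best *weightedQueue
	for _, q := range queues {
		q.current += q.weight
		total += q.weight
		if best == nil || q.current > best.current {
			best = q
		}
	}
	best.current -= total
	return best
}

func main() {
	queues := []*weightedQueue{
		{name: "high-priority", weight: 50},
		{name: "medium-priority", weight: 30},
		{name: "low-priority", weight: 20},
	}
	counts := map[string]int{}
	for i := 0; i < 10; i++ {
		counts[pickQueue(queues).name]++
	}
	fmt.Println(counts["high-priority"], counts["medium-priority"], counts["low-priority"])
}
```

Over ten picks the three queues are chosen 5, 3, and 2 times, and the picks are interleaved rather than served in long bursts, so the low-priority queue is not starved. The sketch does not skip empty queues, which a real dispatcher would have to handle.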
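3.1 Scheduling Queue Implementation
```go
package scheduler

import (
	"container/heap"
	"fmt"
	"sync"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
)

// QueueItem wraps a pending Pod with its scheduling priority.
type QueueItem struct {
	Pod       *corev1.Pod
	Priority  int
	EnqueueAt metav1.Time
	index     int // position in the heap, maintained by Swap and Push
}

// itemHeap implements heap.Interface, which container/heap requires.
type itemHeap []*QueueItem

func (h itemHeap) Len() int { return len(h) }

// Less puts higher priority first; ties are broken by earlier enqueue time.
func (h itemHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].EnqueueAt.Time.Before(h[j].EnqueueAt.Time)
}

func (h itemHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index = i
	h[j].index = j
}

func (h *itemHeap) Push(x any) {
	item := x.(*QueueItem)
	item.index = len(*h)
	*h = append(*h, item)
}

func (h *itemHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	old[n-1] = nil // drop the reference so the item can be collected
	*h = old[:n-1]
	return item
}

// PriorityQueue is a mutex-guarded priority queue of pending Pods.
type PriorityQueue struct {
	mu      sync.Mutex
	items   itemHeap
	itemMap map[string]*QueueItem // keyed by "namespace/name"
}

func NewPriorityQueue() *PriorityQueue {
	return &PriorityQueue{
		items:   make(itemHeap, 0),
		itemMap: make(map[string]*QueueItem),
	}
}

// Enqueue adds a Pod with the given priority.
func (pq *PriorityQueue) Enqueue(pod *corev1.Pod, priority int) {
	pq.mu.Lock()
	defer pq.mu.Unlock()

	key := fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)
	item := &QueueItem{
		Pod:       pod,
		Priority:  priority,
		EnqueueAt: metav1.Now(),
	}
	heap.Push(&pq.items, item)
	pq.itemMap[key] = item
	klog.Infof("Pod %s enqueued with priority %d", key, priority)
}

// Dequeue removes and returns the highest-priority Pod, or false if empty.
func (pq *PriorityQueue) Dequeue() (*corev1.Pod, bool) {
	pq.mu.Lock()
	defer pq.mu.Unlock()

	if len(pq.items) == 0 {
		return nil, false
	}
	item := heap.Pop(&pq.items).(*QueueItem)
	key := fmt.Sprintf("%s/%s", item.Pod.Namespace, item.Pod.Name)
	delete(pq.itemMap, key)
	return item.Pod, true
}
```

3.2 Scheduling Policy Configuration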
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    queues:
      - name: "high-priority"
        priority: 100
        weight: 50
        maxPending: 100
      - name: "medium-priority"
        priority: 50
        weight: 30
        maxPending: 200
      - name: "low-priority"
        priority: 10
        weight: 20
        maxPending: 500
    schedulingPolicy:
      binPack: true
      topologyAware: true
      gangScheduling: true
      preemptionEnabled: true
```

4. Bin Packing and Topology Awareness
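Topology-aware placement, in the sense of preferring GPUs that share an NVLink domain, can be sketched as follows. This is an illustration under stated assumptions, not an actual device-plugin or scheduler API: the `pickGPUs` function, the representation of NVLink domains as lists of GPU indices, and the best-fit tie-breaking rule are all hypothetical.

```go
package main

import "fmt"

// pickGPUs chooses n free GPUs on one node. groups lists the GPU indices in
// each NVLink domain (a hypothetical representation); free marks idle GPUs.
// It prefers the domain with the fewest free GPUs that still fits the request,
// and falls back to spanning domains only when no single domain is enough.
// It returns nil when the node does not have n free GPUs at all.
func pickGPUs(groups [][]int, free map[int]bool, n int) []int {
	if n <= 0 {
		return nil
	}
	var best []int // free GPUs of the tightest domain that fits
	var pool []int // every free GPU, in domain order
	for _, g := range groups {
		var avail []int
		for _, id := range g {
			if free[id] {
				avail = append(avail, id)
			}
		}
		pool = append(pool, avail...)
		if len(avail) >= n && (best == nil || len(avail) < len(best)) {
			best = avail
		}
	}
	if best != nil {
		return best[:n]
	}
	if len(pool) >= n {
		return pool[:n] // crosses NVLink domains, so traffic goes over PCIe
	}
	return nil
}

func main() {
	groups := [][]int{{0, 1, 2, 3}, {4, 5, 6, 7}}
	free := map[int]bool{0: true, 3: true, 4: true, 5: true, 6: true}
	fmt.Println(pickGPUs(groups, free, 2)) // fits inside the first domain
	fmt.Println(pickGPUs(groups, free, 4)) // no single domain has 4 free GPUs
}
```

Choosing the tightest domain that fits keeps larger contiguous domains available for later multi-GPU requests, which matters most for models that need tensor parallelism across cards.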
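4.1 Bin Packing Strategy Implementation
```go
package binpacking

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// The helpers getTotalGPUMemory, getUsedGPUMemory, and canFit are not shown in
// this listing; they are assumed to be defined elsewhere in the package, with
// the two memory helpers returning a numeric type.

// NodeScore pairs a candidate node with its bin-packing score.
type NodeScore struct {
	Node  *corev1.Node
	Score float64
}

// CalculateBinPackScore returns the node's GPU memory fill ratio in [0, 1].
// Fuller nodes score higher, so new Pods are packed onto already-busy nodes.
func CalculateBinPackScore(node *corev1.Node, pod *corev1.Pod) float64 {
	total := float64(getTotalGPUMemory(node))
	used := float64(getUsedGPUMemory(node))
	if total <= 0 {
		return 0 // no GPU memory reported; avoid dividing by zero
	}
	// The original weighted sum, fillRatio*0.7 + (1 - remaining/total)*0.3,
	// reduces to the fill ratio itself because 1 - remaining/total == used/total.
	return used / total
}

// RankNodesByBinPack drops nodes the Pod cannot fit on and returns the rest,
// ordered from highest to lowest bin-packing score.
func RankNodesByBinPack(nodes []*corev1.Node, pod *corev1.Pod) []*corev1.Node {
	scores := make([]NodeScore, 0, len(nodes))
	for _, node := range nodes {
		if !canFit(node, pod) {
			continue
		}
		score := CalculateBinPackScore(node, pod)
		scores = append(scores, NodeScore{Node: node, Score: score})
	}

	sort.Slice(scores, func(i, j int) bool {
		return scores[i].Score > scores[j].Score
	})

	ranked := make([]*corev1.Node, 0, len(scores))
	for _, s := range scores {
		ranked = append(ranked, s.Node)
	}
	return ranked
}
```

5. Best Practices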
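- Resource overcommit: overcommit moderately, and only as far as stability allows.
- Priority scheduling: guarantee resources for critical models first.
- Queue isolation: use separate queues for different priority levels.
- Topology awareness: take the NVLink/PCIe topology into account when placing workloads.
- Preemption: allow high-priority tasks to preempt low-priority ones.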
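The preemption practice can be made concrete with a small victim-selection sketch. It is a simplified illustration, not the behavior of the Kubernetes scheduler or of any specific plugin: the `Workload` type, the `selectVictims` function, and the rule of evicting the lowest priority first are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Workload is a hypothetical, simplified view of a running job.
type Workload struct {
	Name     string
	Priority int
	GPUs     int
}

// selectVictims returns workloads to evict so that at least `needed` GPUs are
// freed for an incoming job of priority `incoming`. Only workloads with a
// strictly lower priority are eligible. It returns nil when eligible
// workloads cannot free enough GPUs, in which case nothing should be evicted.
func selectVictims(running []Workload, incoming, needed int) []Workload {
	candidates := make([]Workload, 0, len(running))
	for _, w := range running {
		if w.Priority < incoming {
			candidates = append(candidates, w)
		}
	}
	// Lowest priority first; among equals, larger jobs first so that fewer
	// workloads are disturbed.
	sort.SliceStable(candidates, func(i, j int) bool {
		if candidates[i].Priority != candidates[j].Priority {
			return candidates[i].Priority < candidates[j].Priority
		}
		return candidates[i].GPUs > candidates[j].GPUs
	})

	var victims []Workload
	freed := 0
	for _, w := range candidates {
		if freed >= needed {
			break
		}
		victims = append(victims, w)
		freed += w.GPUs
	}
	if freed < needed {
		return nil
	}
	return victims
}

func main() {
	running := []Workload{
		{Name: "llama-7b-a", Priority: 50, GPUs: 1},
		{Name: "batch-eval", Priority: 10, GPUs: 2},
		{Name: "llama-70b", Priority: 100, GPUs: 2},
	}
	for _, v := range selectVictims(running, 100, 2) {
		fmt.Println("evict:", v.Name)
	}
}
```

In this run only `batch-eval` is evicted: it has the lowest priority and frees the two GPUs on its own, so the medium-priority 7B replica keeps running. Returning nil when the request cannot be satisfied avoids evicting workloads for no benefit.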
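Summary
The core of unified GPU pool scheduling is to map model sizes (7B/13B/70B) onto different GPU node pools, use bin packing to maximize per-GPU utilization, and use topology awareness to reduce cross-GPU communication. Queues receive capacity in proportion to priority-based weights, and 70B models, which carry the highest priority, go to the head of the queue. With this scheduling strategy, overall GPU utilization can be raised above 70%.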
