# Device Scheduling - Metax GPU

## Background

Koordinator supports using Metax GPUs on Kubernetes. The koord-device-daemon and koordlet components report the heterogeneous GPU resources and aggregate the card information into the Device CR, which the scheduler uses for topology-aware scheduling.
## Usage

### Prerequisites

Using Metax GPU virtualization requires installing and configuring the following components in advance:

- Metax Driver
- Metax metax-container-runtime
- Metax metax-gpu-device (with the sGPU parameter configuration enabled), for example:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metax-device-config
data:
  version: v1
  cluster-config: |
    mode: "native" # sgpu/shared/vgpu, cluster scope
  nodes-config: |
    - nodeName: "sample-node1"
      mode: "sgpu" # sgpu/shared/vgpu, node scope
```
- Koordinator components (koordinator >= v1.8.0)
  - Add `DevicePluginAdaption=true` to the chart's `scheduler.featureGates` parameter to enable the feature gate required by this feature.
Note: sGPU is Metax's software-based compute-slicing solution. It can create up to 16 virtual GPU instances on one physical GPU, and mainly targets container-based cloud inference and small-model training scenarios.
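Assuming the Koordinator chart is installed with Helm, the feature gate above can be set through a values override. The fragment below is illustrative; check your chart version for the exact values layout:

```yaml
# values.yaml override (illustrative): enables the feature gate required
# for Metax sGPU scheduling, as described in the prerequisites above.
scheduler:
  featureGates: "DevicePluginAdaption=true"
```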
### Using Metax GPUs

- Confirm that the Metax GPUs have been recognized in the Device CR. Check the Device resource with `kubectl get device <node-name> -o yaml`; an example is shown below.

  Note:

  - The value of the `node.koordinator.sh/gpu-vendor` label is `metax`.
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  labels:
    node.koordinator.sh/gpu-model: C500 # The model of the Metax card.
    node.koordinator.sh/gpu-vendor: metax # The vendor of the card.
  name: sample-node1
spec:
  devices:
  - health: true
    id: 89033010-2354-0000-0000-000000000000 # The UUID of the Metax card (mocked value).
    minor: 0 # The minor number of the Metax card.
    resources:
      koordinator.sh/gpu-core: "100" # The total compute capacity of the Metax card in percentage.
      koordinator.sh/gpu-memory: "64Gi" # The total GPU memory of the Metax card.
      koordinator.sh/gpu-memory-ratio: "100" # The total GPU memory ratio in percentage.
    topology:
      busID: 0000:38:00.0 # The PCI bus ID.
      nodeID: 3 # The NUMA node ID.
      pcieID: pci0000:3a # The PCI root ID.
      socketID: -1
    type: gpu
    conditions:
    - lastTransitionTime: "2025-03-15T10:00:00Z"
      message: device is healthy
      reason: DeviceHealthy
      status: "True"
      type: Healthy
status: {}
```
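Quantities such as `64Gi` in the Device spec use the Kubernetes resource-quantity format. The sketch below shows how to interpret the binary-suffix subset used in this document; `parse_quantity` is a hypothetical helper for illustration, not part of Koordinator:

```python
# Minimal parser for the binary-suffix Kubernetes quantities used above
# (e.g. "64Gi"); plain integer strings pass through unchanged.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q: str) -> int:
    """Convert a quantity string like '64Gi' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)

# Resources reported for the single GPU in the example Device above.
device_resources = {
    "koordinator.sh/gpu-core": "100",
    "koordinator.sh/gpu-memory": "64Gi",
    "koordinator.sh/gpu-memory-ratio": "100",
}

memory_bytes = parse_quantity(device_resources["koordinator.sh/gpu-memory"])
print(memory_bytes)  # 68719476736
```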
- Confirm that the Metax GPU resources are registered on the node. Check `node.status.allocatable` and `node.status.capacity` with `kubectl get node <node-name> -o yaml`; an example is shown below.
```yaml
apiVersion: v1
kind: Node
metadata:
  name: sample-node1
  ...
status:
  allocatable:
    ...
    koordinator.sh/gpu.shared: "100" # Reported by koordinator
    koordinator.sh/gpu-core: "100" # Reported by koordinator
    koordinator.sh/gpu-memory: 64Gi # Reported by koordinator
    koordinator.sh/gpu-memory-ratio: "100" # Reported by koordinator
    metax-tech.com/sgpu: "16" # Reported by metax-gpu-device
  capacity:
    ...
    koordinator.sh/gpu.shared: "100" # Reported by koordinator
    koordinator.sh/gpu-core: "100" # Reported by koordinator
    koordinator.sh/gpu-memory: 64Gi # Reported by koordinator
    koordinator.sh/gpu-memory-ratio: "100" # Reported by koordinator
    metax-tech.com/sgpu: "16" # Reported by metax-gpu-device
```
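A scheduler-side fit check against these allocatable values can be sketched as follows. This is a simplified illustration, not Koordinator's actual algorithm; quantities are pre-converted to plain integers, with GPU memory expressed in GiB:

```python
# Simplified fit check: a request fits if every requested resource is
# available on the node. Illustrative only, not Koordinator's implementation.
node_allocatable = {
    "koordinator.sh/gpu.shared": 100,
    "koordinator.sh/gpu-core": 100,
    "koordinator.sh/gpu-memory-gib": 64,
    "metax-tech.com/sgpu": 16,
}

def fits(request: dict, allocatable: dict) -> bool:
    return all(allocatable.get(name, 0) >= qty for name, qty in request.items())

# The sGPU Pod request from the example below:
# 1 shared slot, 10% compute, 1 GiB GPU memory, 1 virtual GPU.
pod_request = {
    "koordinator.sh/gpu.shared": 1,
    "koordinator.sh/gpu-core": 10,
    "koordinator.sh/gpu-memory-gib": 1,
    "metax-tech.com/sgpu": 1,
}
print(fits(pod_request, node_allocatable))  # True
```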
- Create a Pod that requests a Metax GPU; a virtual GPU (sGPU) example is shown below. Metax does not support virtualization across multiple cards.

  Note:

  - The example uses an Ubuntu base image; choose an image that fits your actual needs.
  - The extra resources must be configured in both `resources.limits` and `resources.requests`:
    - Virtual GPU scenario (taking a virtualized Metax C500 as an example):
      - `koordinator.sh/gpu.shared`: set to 1 for a virtual GPU.
      - `koordinator.sh/gpu-core`: the compute capacity to use, as a percentage.
      - `koordinator.sh/gpu-memory`: the amount of GPU memory to use.
      - `metax-tech.com/sgpu`: the number of virtual GPUs to use.
  - A QoS policy must be configured in `metadata.annotations`:

    | Policy | Behavior |
    | --- | --- |
    | `fixed-share` | The sGPU has a fixed compute quota and cannot exceed it. |
    | `best-effort` | The sGPU's compute capacity is not limited. |
    | `burst-share` | The sGPU has a fixed compute quota, but may also use idle compute capacity on the GPU. |
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: demo-sleep
  name: test-metax-sgpu
  namespace: default
  annotations:
    metax-tech.com/sgpu-qos-policy: "fixed-share" # fixed-share/best-effort/burst-share
spec:
  containers:
  - command:
    - sleep
    - infinity
    image: ubuntu:22.04
    imagePullPolicy: IfNotPresent
    name: demo-sleep
    resources:
      limits:
        cpu: "32"
        memory: 64Gi
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        metax-tech.com/sgpu: "1"
      requests:
        cpu: "32"
        memory: 64Gi
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        metax-tech.com/sgpu: "1"
```
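The effect of the three `metax-tech.com/sgpu-qos-policy` values on usable compute can be modeled with a small sketch. This mirrors the documented semantics only; the actual enforcement happens inside the Metax sGPU stack:

```python
# Illustrative model of the compute an sGPU may use under each QoS policy,
# given its quota, the GPU's total capacity, and currently idle capacity
# (all in percent). Not Metax's implementation.
def compute_bounds(policy: str, quota: int, total: int, idle: int):
    """Return (guaranteed, max_usable) percent of the physical GPU."""
    if policy == "fixed-share":
        return quota, quota        # fixed quota, cannot be exceeded
    if policy == "best-effort":
        return 0, total            # no guarantee, compute not limited
    if policy == "burst-share":
        return quota, quota + idle # fixed quota plus any idle capacity
    raise ValueError(f"unknown policy: {policy}")

print(compute_bounds("fixed-share", quota=10, total=100, idle=30))  # (10, 10)
print(compute_bounds("burst-share", quota=10, total=100, idle=30))  # (10, 40)
```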
- Enter the container (`kubectl exec -it {pod-name} -- bash`) and run `ls /dev/mx*` to check the device mounts. If the command produces output, the GPU has been successfully allocated to the Pod.