Version: v1.9 🚧

Resource Reservation

Resource reservation is a capability of koord-scheduler for reserving node resources for specific pods or workloads.

Introduction

Pods are the fundamental carriers of node resource allocation in Kubernetes; they bind resource requirements according to the business logic. However, we may want to allocate resources for specific pods or workloads that have not been created yet, for example:

  1. Preemption: Existing preemption rules do not guarantee that only the preempting pod can allocate the preempted resources. We expect the scheduler to lock those resources, preventing them from being preempted by other pods with the same or higher priority.
  2. Descheduling: In descheduling scenarios, it is best to reserve enough resources before a pod is rescheduled. Otherwise, the descheduled pod may never run again, and the corresponding application could break.
  3. Horizontal scaling: To scale out more precisely, we want to allocate node resources for the replicas to be added.
  4. Resource pre-allocation: We may want to reserve node resources for future demands in advance, even if those resources are not currently usable.

To enhance the resource scheduling capabilities of Kubernetes, koord-scheduler provides a scheduling API named Reservation, which allows us to reserve node resources in advance for specific pods and workloads that have not been created yet.


For more information, please refer to Design: Resource Reservation.

Setup

Prerequisites

  • Kubernetes >= 1.18
  • Koordinator >= 0.6

Installation

Please make sure the Koordinator components are correctly installed in your cluster. If they are not, please refer to the installation guide.

Configuration

The resource reservation feature is enabled by default; you can use it without modifying the koord-scheduler configuration.

Usage Guide

Quick Start

  1. Reserve resources with the following YAML file: reservation-demo
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  template: # set resource requirements
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 500m cpu and 800Mi memory
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler # use koord-scheduler
  owners: # set the owner specifications
    - object: # owner pods whose name is `default/pod-demo-0`
        name: pod-demo-0
        namespace: default
  ttl: 1h # set the TTL, the reservation will get expired 1 hour later

$ kubectl create -f reservation-demo.yaml
reservation.scheduling.koordinator.sh/reservation-demo created
  2. Watch the status of reservation-demo until it becomes available.
$ kubectl get reservation reservation-demo -o wide
NAME               PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo   Available   88s   node-0   1h
  3. Deploy a pod with the following YAML file: pod-demo-0
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo-0 # match the owner spec of `reservation-demo`
spec:
  containers:
    - args:
        - '-c'
        - '1'
      command:
        - stress
      image: polinux/stress
      imagePullPolicy: Always
      name: stress
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 400Mi
  restartPolicy: Always
  schedulerName: koord-scheduler # use koord-scheduler

$ kubectl create -f pod-demo-0.yaml
pod/pod-demo-0 created
  4. Check the scheduling status of pod-demo-0.
$ kubectl get pod pod-demo-0 -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
pod-demo-0   1/1     Running   0          32s   10.17.0.123   node-0   <none>           <none>

pod-demo-0 will be scheduled on the same node as reservation-demo.

  5. Check the status of reservation-demo.
$ kubectl get reservation reservation-demo -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
  creationTimestamp: "YYYY-MM-DDT05:24:58Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
    - object:
        name: pod-demo-0
        namespace: default
  template:
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable: # total reserved
    cpu: 500m
    memory: 800Mi
  allocated: # current allocated
    cpu: 200m
    memory: 400Mi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: pod-demo-0
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
  nodeName: node-0
  phase: Available

Now we can see that reservation-demo has reserved 500m CPU and 800Mi memory, and pod-demo-0 has allocated 200m CPU and 400Mi memory from the reserved resources.

  6. Clean up the reserved resources of reservation-demo.
$ kubectl delete reservation reservation-demo
reservation.scheduling.koordinator.sh "reservation-demo" deleted
$ kubectl get pod pod-demo-0
NAME         READY   STATUS    RESTARTS   AGE
pod-demo-0   1/1     Running   0          110s

pod-demo-0 keeps running normally after the reservation is deleted.

Advanced Features

The latest API can be found here: reservation_types

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # pod template (required): Reserve resources and play pod/node affinities according to the template.
  # The resource requirements of the pod indicate the resource requirements of the reservation.
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      # scheduler name (required): use koord-scheduler to schedule the reservation
      schedulerName: koord-scheduler
  # owner spec (required): Specify what kinds of pods can allocate resources of this reservation.
  # Currently supports three kinds of owner specifications:
  # - object: specify the name, namespace, uid of the owner pods
  # - controller: specify the owner reference of the owner pods, e.g. name, namespace (extended by koordinator), uid, kind
  # - labelSelector: specify the matching labels and matching expressions of the owner pods
  owners:
    - object:
        name: pod-demo-0
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  # TTL (optional): Time-To-Live duration of the reservation. The reservation will get expired after the TTL period.
  # If not set, use `24h` as default.
  ttl: 1h
  # Expires (optional): Expired timestamp when the reservation is expected to expire.
  # If both `expires` and `ttl` are set, `expires` is checked first.
  expires: "YYYY-MM-DDTHH:MM:SSZ"

Field: allocateOnce

  • Type: *bool
  • Default: true
  • Description: When set to true, the reserved resources are only available to the first owner that successfully allocates them, and are no longer allocatable by other owners afterwards. When set to false, the reserved resources can be allocated by multiple owners as long as enough resources remain.
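As an illustration, a spec fragment (a sketch showing only the relevant fields; the owner label is made up) that keeps a reservation allocatable to multiple owners:

```yaml
# Sketch: with allocateOnce disabled, every pod matching `owners`
# may allocate from the remaining reserved resources.
spec:
  allocateOnce: false
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo # illustrative owner label
```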

Field: allocatePolicy

  • Type: ReservationAllocatePolicy
  • Options: Aligned, Restricted
  • Description: Specifies the allocation policy of the reserved resources.
    • Aligned: A pod allocates from the Reservation first. If the remaining resources of the Reservation are insufficient, it can allocate from the node, but it must strictly follow the pod's resource specification. This avoids a pod using multiple Reservations at the same time.
    • Restricted: For the resources that overlap between what the pod requests and what the Reservation reserves, allocation must come from the Reservation. Resources declared by the pod but not reserved by the Reservation can be allocated from the node. Restricted includes the semantics of Aligned.
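For example, a hedged sketch of a Reservation using the Restricted policy (the name and owner label are illustrative; the field layout follows the API shown above):

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-restricted # illustrative name
spec:
  allocatePolicy: Restricted # overlapping resources must come from this reservation
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 2 # only CPU is reserved here
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo # illustrative owner label
  ttl: 1h
```

Under Restricted, an owner pod's CPU request would be served from this reservation, while memory (declared by the pod but not reserved here) would be allocated from the node.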

Field: preAllocation

  • Type: bool
  • Default: false
  • Description: When preAllocation is set to true, the reservation can be bound to pods that are already scheduled on a node. The reservation pre-allocates the resources of these running pods. Once the bound pods exit, the reservation automatically turns from the bound state into a normal reservation that holds the released resources.

This is useful for scenarios such as resource migration and graceful pod rescheduling, where resource continuity must be ensured before the pods exit.

Field: preAllocationPolicy

  • Type: PreAllocationPolicy
  • Description: Defines the pre-allocation policy when preAllocation is set to true. This field allows fine-grained control over how pre-allocatable pods are selected and whether multiple pods can be pre-allocated.

The PreAllocationPolicy struct contains the following fields:

  • mode (type: PreAllocationMode, default: Default):

    • Default: Use the owner matchers in the Reservation spec to select pre-allocatable pods. This is the default behavior.
    • Cluster: Use cluster-level label/annotation selectors to identify pre-allocatable pods. This mode suits multi-tenant clusters where pre-allocatable pods may belong to different owners and need to be managed centrally.
  • enableMultiple (type: bool, default: false):

    • When false, only one pod can be pre-allocated for a single Reservation.
    • When true, multiple pods can be pre-allocated to satisfy the resource requirements of the Reservation. This is useful when no single pod can provide all the required resources (due to resource fragmentation).

Labels and annotations for Cluster mode:

When using Cluster mode, the scheduler uses the following label and annotation to identify pre-allocatable pods:

  • pod.koordinator.sh/is-pre-allocatable: Label that marks a pod as pre-allocatable. Set it to "true" to indicate that the pod can be pre-allocated.
  • pod.koordinator.sh/pre-allocatable-priority: Annotation that sets the pre-allocation priority. A higher value means a higher priority. The value should be a numeric string.

Field: unschedulable

  • Type: bool
  • Default: false
  • Description: Controls whether new pods can allocate the reserved resources. By default, the reservation is schedulable. When set to true, no new pod can allocate this reservation's resources.
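For example, to drain an existing reservation of new allocations, the field can be flipped on the live object; a sketch (the reservation name below is illustrative):

```yaml
# Sketch: mark the reservation as unschedulable so no new pod can allocate it;
# pods that have already allocated from it are unaffected.
spec:
  unschedulable: true
```

On a live object this could be applied with, e.g., `kubectl patch reservation reservation-demo --type=merge -p '{"spec":{"unschedulable":true}}'`.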

Field: taints

  • Type: []corev1.Taint
  • Description: Specifies taints on the reserved resources. Pods must tolerate these taints to allocate the reserved resources.
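A sketch of how taints on a Reservation pair with tolerations on the owner pod (the taint key and values are illustrative, not part of the Koordinator API):

```yaml
# Reservation side (fragment): only pods tolerating this taint can allocate it.
spec:
  taints:
    - key: example.com/dedicated # illustrative taint key
      value: team-a
      effect: NoSchedule
---
# Owner pod side (fragment): carries a matching toleration.
spec:
  tolerations:
    - key: example.com/dedicated
      operator: Equal
      value: team-a
      effect: NoSchedule
```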

Case: Reserve resources for multiple owners on the same node

  1. Check the allocatable resources of each node.
$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
NAME     CPU     MEMORY
node-0   7800m   28625036Ki
node-1   7800m   28629692Ki
...
$ kubectl describe node node-1 | grep -A 8 "Allocated resources"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                780m (10%)    7722m (99%)
  memory             1216Mi (4%)   14044Mi (50%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)

As shown above, node-1 still has about 7.0 CPU and 26Gi memory unallocated.

  2. Reserve resources with the following YAML file: reservation-demo-big
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
spec:
  allocateOnce: false # allow the reservation to be allocated by multiple owners
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 6 cpu and 20Gi memory
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1 # set the expected node name to schedule at
      schedulerName: koord-scheduler
  owners: # set multiple owners
    - object: # owner pods whose name is `default/pod-demo-1`
        name: pod-demo-1
        namespace: default
    - labelSelector: # owner pods who have label `app=app-demo` can allocate the reserved resources
        matchLabels:
          app: app-demo
  ttl: 1h

$ kubectl create -f reservation-demo-big.yaml
reservation.scheduling.koordinator.sh/reservation-demo-big created
  3. Watch the status of reservation-demo-big until it becomes available.
$ kubectl get reservation reservation-demo-big -o wide
NAME                   PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo-big   Available   37s   node-1   1h

reservation-demo-big will be scheduled to the node specified by the nodeName field in its pod template: node-1.

  4. Create a deployment with the following YAML file: app-demo
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-demo
  template:
    metadata:
      name: stress
      labels:
        app: app-demo # match the owner spec of `reservation-demo-big`
    spec:
      schedulerName: koord-scheduler # use koord-scheduler
      containers:
        - name: stress
          image: polinux/stress
          args:
            - '-c'
            - '1'
          command:
            - stress
          resources:
            requests:
              cpu: 2
              memory: 10Gi
            limits:
              cpu: 4
              memory: 20Gi

$ kubectl create -f app-demo.yaml
deployment.apps/app-demo created
  5. Check the scheduling result of the app-demo pods.
$ kubectl get pod -l app=app-demo -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
app-demo-798c66db46-ctnbr   1/1     Running   0          2m    10.17.0.124   node-1   <none>           <none>
app-demo-798c66db46-pzphc   1/1     Running   0          2m    10.17.0.125   node-1   <none>           <none>

The app-demo pods will be scheduled on the same node as reservation-demo-big.

  6. Check the status of reservation-demo-big.
$ kubectl get reservation reservation-demo-big -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
  creationTimestamp: "YYYY-MM-DDT06:28:16Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  allocateOnce: false # allow the reservation to be allocated by multiple owners
  owners:
    - object:
        name: pod-demo-1
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable:
    cpu: 6
    memory: 20Gi
  allocated:
    cpu: 4
    memory: 20Gi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: app-demo-798c66db46-ctnbr
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
    - name: app-demo-798c66db46-pzphc
      namespace: default
      uid: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
  nodeName: node-1
  phase: Available

Now we can see that reservation-demo-big has reserved 6 CPU and 20Gi memory, and app-demo has allocated 4 CPU and 20Gi memory from the reserved resources. The allocation of reserved resources does not increase the requested capacity of the node; otherwise the total requested resources of node-1 would exceed its allocatable capacity. Moreover, a reservation can be allocated by multiple owners at the same time as long as there are enough unallocated reserved resources.

Case: PreAllocation in Default mode

This case demonstrates PreAllocation in Default mode, where the Reservation is bound to an already-scheduled pod that matches the owner spec.

  1. Deploy a Reservation with preAllocation enabled:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-prealloc
spec:
  preAllocation: true
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: my-app
  ttl: 2h
  2. The scheduler finds a running pod that matches the owner spec and binds the Reservation to it. When the pod terminates, the Reservation transitions to holding the released resources for subsequent pods.

Case: PreAllocation in Cluster mode

This case demonstrates PreAllocation in Cluster mode, which uses cluster-level selectors to identify pre-allocatable pods.

  1. First, label the pre-allocatable pod:
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-1
  labels:
    pod.koordinator.sh/is-pre-allocatable: "true"
  annotations:
    pod.koordinator.sh/pre-allocatable-priority: "100"
spec:
  containers:
    - name: batch-job
      image: busybox
      resources:
        requests:
          cpu: 2
          memory: 4Gi
  2. Create a Reservation in Cluster mode:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-cluster-mode
spec:
  preAllocation: true
  preAllocationPolicy:
    mode: Cluster
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 2
              memory: 4Gi
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: critical-app
  ttl: 4h
  3. The scheduler selects the pre-allocatable pod with the highest priority (based on the pre-allocatable-priority annotation) and binds the Reservation to it.

Case: PreAllocation with multiple pods

This case demonstrates PreAllocation with enableMultiple, which accumulates resources from multiple pods when no single pod can satisfy the Reservation's requirements.

  1. Mark multiple pods as pre-allocatable:
# Pod 1 - 1 CPU, 2Gi memory
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-1
  labels:
    pod.koordinator.sh/is-pre-allocatable: "true"
  annotations:
    pod.koordinator.sh/pre-allocatable-priority: "100"
spec:
  containers:
    - name: batch-job
      image: busybox
      resources:
        requests:
          cpu: 1
          memory: 2Gi
---
# Pod 2 - 1 CPU, 2Gi memory
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-2
  labels:
    pod.koordinator.sh/is-pre-allocatable: "true"
  annotations:
    pod.koordinator.sh/pre-allocatable-priority: "90"
spec:
  containers:
    - name: batch-job
      image: busybox
      resources:
        requests:
          cpu: 1
          memory: 2Gi
  2. Create a Reservation that requests 2 CPU and 4Gi memory with enableMultiple enabled:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-multi-pods
spec:
  preAllocation: true
  preAllocationPolicy:
    mode: Cluster
    enableMultiple: true
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 2
              memory: 4Gi
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: high-priority-app
  ttl: 4h
  3. The scheduler pre-allocates batch-job-1 and batch-job-2 (prioritized by the pre-allocatable-priority annotation) to satisfy the Reservation's resource requirements. When these pods terminate, the Reservation transitions to holding the released resources.

Scheduler configuration for PreAllocation

The scheduler offers additional options for PreAllocation behavior. Add the following to the scheduler configuration:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    plugins:
      reservation:
        enabled:
          - name: Reservation
    pluginConfig:
      - name: Reservation
        args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: ReservationArgs
          preAllocationConfig:
            # Enable the cluster-wide pre-allocation mode
            enableClusterMode: true
            # Custom label key that marks pre-allocatable pods (optional)
            clusterLabelKey: pod.koordinator.sh/is-pre-allocatable
            # Custom annotation key for pod priority (optional)
            clusterPriorityAnnotationKey: pod.koordinator.sh/pre-allocatable-priority
            # Prefer placing the Reservation without using pre-allocatable pods when possible
            preferNoPreAllocatedPods: true

Configuration options:

  • enableClusterMode: Enables the cluster-wide pre-allocation mode.
  • clusterLabelKey: Custom label key that marks pre-allocatable candidates.
  • clusterPriorityAnnotationKey: Custom annotation key for the priority of pre-allocatable pods.
  • preferNoPreAllocatedPods: When enabled, prefer placing the Reservation without using pre-allocatable pods if the node has enough free resources.