Version: v1.9 🚧

Resource Reservation

Resource reservation is a capability of koord-scheduler for reserving node resources for specific pods or workloads.

Introduction

Pods are the fundamental carriers of node resource allocation in Kubernetes; they bind resource requirements according to the business logic. However, we may want to allocate resources for specific pods or workloads that have not been created yet, for example:

  1. Preemption: Existing preemption rules do not guarantee that only the preempting pod can allocate the preempted resources. We expect the scheduler to lock those resources, preventing them from being preempted by other pods with the same or higher priority.
  2. Descheduling: In descheduling scenarios, it is best to reserve enough resources before a pod is rescheduled. Otherwise, the descheduled pod may never run again, and the corresponding application could break.
  3. Horizontal scaling: To scale out more precisely, we want to allocate node resources for the replicas to be added.
  4. Resource pre-allocation: We may want to reserve node resources for future demands in advance, even if those resources are not currently usable.

To enhance the resource scheduling capabilities of Kubernetes, koord-scheduler provides a scheduling API named Reservation, which allows us to reserve node resources in advance for specific pods and workloads that have not been created yet.


For more information, please refer to Design: Resource Reservation.

Setup

Prerequisites

  • Kubernetes >= 1.18
  • Koordinator >= 0.6

Installation

Please make sure the Koordinator components are correctly installed in your cluster. If they are not, please refer to the installation guide.

Configuration

The resource reservation feature is enabled by default; you can use it without modifying the koord-scheduler configuration.

Usage Guide

Quick Start

  1. Reserve resources with the following YAML file: reservation-demo
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  template: # set resource requirements
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 500m cpu and 800Mi memory
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler # use koord-scheduler
  owners: # set the owner specifications
    - object: # owner pods whose name is `default/pod-demo-0`
        name: pod-demo-0
        namespace: default
  ttl: 1h # set the TTL, the reservation will get expired 1 hour later

$ kubectl create -f reservation-demo.yaml
reservation.scheduling.koordinator.sh/reservation-demo created
  2. Watch the status of reservation-demo until it becomes available.
$ kubectl get reservation reservation-demo -o wide
NAME               PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo   Available   88s   node-0   1h
  3. Deploy a pod with the following YAML file: pod-demo-0
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo-0 # match the owner spec of `reservation-demo`
spec:
  containers:
    - args:
        - '-c'
        - '1'
      command:
        - stress
      image: polinux/stress
      imagePullPolicy: Always
      name: stress
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 400Mi
  restartPolicy: Always
  schedulerName: koord-scheduler # use koord-scheduler

$ kubectl create -f pod-demo-0.yaml
pod/pod-demo-0 created
  4. Check the scheduling status of pod-demo-0.
$ kubectl get pod pod-demo-0 -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
pod-demo-0   1/1     Running   0          32s   10.17.0.123   node-0   <none>           <none>

pod-demo-0 will be scheduled on the same node as reservation-demo.

  5. Check the status of reservation-demo.
$ kubectl get reservation reservation-demo -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
  creationTimestamp: "YYYY-MM-DDT05:24:58Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
    - object:
        name: pod-demo-0
        namespace: default
  template:
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable: # total reserved
    cpu: 500m
    memory: 800Mi
  allocated: # current allocated
    cpu: 200m
    memory: 400Mi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: pod-demo-0
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
  nodeName: node-0
  phase: Available

Now we can see that reservation-demo has reserved 500m CPU and 800Mi memory, and pod-demo-0 has allocated 200m CPU and 400Mi memory from the reserved resources.

  6. Clean up the reserved resources of reservation-demo.
$ kubectl delete reservation reservation-demo
reservation.scheduling.koordinator.sh "reservation-demo" deleted
$ kubectl get pod pod-demo-0
NAME         READY   STATUS    RESTARTS   AGE
pod-demo-0   1/1     Running   0          110s

pod-demo-0 keeps running normally after the reservation is deleted.

Advanced Features

The latest API can be found here: reservation_types

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # pod template (required): Reserve resources and play pod/node affinities according to the template.
  # The resource requirements of the pod indicate the resource requirements of the reservation.
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      # scheduler name (required): use koord-scheduler to schedule the reservation
      schedulerName: koord-scheduler
  # owner spec (required): Specify what kinds of pods can allocate resources of this reservation.
  # Currently supports three kinds of owner specifications:
  # - object: specify the name, namespace, uid of the owner pods
  # - controller: specify the owner reference of the owner pods, e.g. name, namespace (extended by koordinator), uid, kind
  # - labelSelector: specify the matching labels and matching expressions of the owner pods
  owners:
    - object:
        name: pod-demo-0
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  # TTL (optional): Time-To-Live duration of the reservation. The reservation will get expired after the TTL period.
  # If not set, use `24h` as default.
  ttl: 1h
  # Expires (optional): Expired timestamp when the reservation is expected to expire.
  # If both `expires` and `ttl` are set, `expires` is checked first.
  expires: "YYYY-MM-DDTHH:MM:SSZ"

Field: allocateOnce

  • Type: *bool
  • Default: true
  • Description: When set to true, the reserved resources are only available to the first owner that successfully allocates them, and are no longer allocatable by other owners afterwards. When set to false, the reserved resources can be allocated by multiple owners as long as enough resources remain.
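As an illustration, a spec fragment (a sketch showing only the relevant fields; the owner label is made up) that keeps a reservation allocatable to multiple owners:

```yaml
# Sketch: with allocateOnce disabled, every pod matching `owners`
# may allocate from the remaining reserved resources.
spec:
  allocateOnce: false
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo # illustrative owner label
```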

Field: allocatePolicy

  • Type: ReservationAllocatePolicy
  • Options: Aligned, Restricted
  • Description: Specifies the allocation policy of the reserved resources.
    • Aligned: A pod allocates from the Reservation first. If the remaining resources of the Reservation are insufficient, it can allocate from the node, but it must strictly follow the pod's resource specification. This avoids a pod using multiple Reservations at the same time.
    • Restricted: For the resources that overlap between what the pod requests and what the Reservation reserves, allocation must come from the Reservation. Resources declared by the pod but not reserved by the Reservation can be allocated from the node. Restricted includes the semantics of Aligned.
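For example, a hedged sketch of a Reservation using the Restricted policy (the name and owner label are illustrative; the field layout follows the API shown above):

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-restricted # illustrative name
spec:
  allocatePolicy: Restricted # overlapping resources must come from this reservation
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 2 # only CPU is reserved here
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo # illustrative owner label
  ttl: 1h
```

Under Restricted, an owner pod's CPU request would be served from this reservation, while memory (declared by the pod but not reserved here) would be allocated from the node.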

Field: preAllocation

  • Type: bool
  • Default: false
  • Description: When preAllocation is set to true, the reservation can be bound to pods that are already scheduled on a node. The reservation pre-allocates the resources of these running pods. Once the bound pods exit, the reservation automatically turns from the bound state into a normal reservation that holds the released resources.

This is useful for scenarios such as resource migration and graceful pod rescheduling, where resource continuity must be ensured before the pods exit.

Field: preAllocationPolicy

  • Type: PreAllocationPolicy
  • Description: Defines the pre-allocation policy when preAllocation is set to true. This field allows fine-grained control over how pre-allocatable pods are selected and whether multiple pods can be pre-allocated.

The PreAllocationPolicy struct contains the following fields:

  • mode (type: PreAllocationMode, default: Default):

    • Default: Use the owner matchers in the Reservation spec to select pre-allocatable pods. This is the default behavior.
    • Cluster: Use cluster-level label/annotation selectors to identify pre-allocatable pods. This mode suits multi-tenant clusters where pre-allocatable pods may belong to different owners and need to be managed centrally.
  • enableMultiple (type: bool, default: false):

    • When false, only one pod can be pre-allocated for a single Reservation.
    • When true, multiple pods can be pre-allocated to satisfy the resource requirements of the Reservation. This is useful when no single pod can provide all the required resources (due to resource fragmentation).

Labels and annotations for Cluster mode:

When using Cluster mode, the scheduler uses the following label and annotation to identify pre-allocatable pods:

  • pod.koordinator.sh/is-pre-allocatable: Label that marks a pod as pre-allocatable. Set it to "true" to indicate that the pod can be pre-allocated.
  • pod.koordinator.sh/pre-allocatable-priority: Annotation that sets the pre-allocation priority. A higher value means a higher priority. The value should be a numeric string.

Field: unschedulable

  • Type: bool
  • Default: false
  • Description: Controls whether new pods can allocate the reserved resources. By default, the reservation is schedulable. When set to true, no new pod can allocate this reservation's resources.
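For example, to drain an existing reservation of new allocations, the field can be flipped on the live object; a sketch (the reservation name below is illustrative):

```yaml
# Sketch: mark the reservation as unschedulable so no new pod can allocate it;
# pods that have already allocated from it are unaffected.
spec:
  unschedulable: true
```

On a live object this could be applied with, e.g., `kubectl patch reservation reservation-demo --type=merge -p '{"spec":{"unschedulable":true}}'`.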

Field: taints

  • Type: []corev1.Taint
  • Description: Specifies taints on the reserved resources. Pods must tolerate these taints to allocate the reserved resources.
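A sketch of how taints on a Reservation pair with tolerations on the owner pod (the taint key and values are illustrative, not part of the Koordinator API):

```yaml
# Reservation side (fragment): only pods tolerating this taint can allocate it.
spec:
  taints:
    - key: example.com/dedicated # illustrative taint key
      value: team-a
      effect: NoSchedule
---
# Owner pod side (fragment): carries a matching toleration.
spec:
  tolerations:
    - key: example.com/dedicated
      operator: Equal
      value: team-a
      effect: NoSchedule
```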

Case: Reserve resources for multiple owners on the same node

  1. Check the allocatable resources of each node.
$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
NAME     CPU     MEMORY
node-0   7800m   28625036Ki
node-1   7800m   28629692Ki
...
$ kubectl describe node node-1 | grep -A 8 "Allocated resources"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                780m (10%)    7722m (99%)
  memory             1216Mi (4%)   14044Mi (50%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)

As shown above, node-1 still has about 7.0 CPU and 26Gi memory unallocated.

  2. Reserve resources with the following YAML file: reservation-demo-big
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
spec:
  allocateOnce: false # allow the reservation to be allocated by multiple owners
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 6 cpu and 20Gi memory
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1 # set the expected node name to schedule at
      schedulerName: koord-scheduler
  owners: # set multiple owners
    - object: # owner pods whose name is `default/pod-demo-1`
        name: pod-demo-1
        namespace: default
    - labelSelector: # owner pods who have label `app=app-demo` can allocate the reserved resources
        matchLabels:
          app: app-demo
  ttl: 1h

$ kubectl create -f reservation-demo-big.yaml
reservation.scheduling.koordinator.sh/reservation-demo-big created
  3. Watch the status of reservation-demo-big until it becomes available.
$ kubectl get reservation reservation-demo-big -o wide
NAME                   PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo-big   Available   37s   node-1   1h

reservation-demo-big will be scheduled to the node specified by the nodeName field in its pod template: node-1.

  4. Create a deployment with the following YAML file: app-demo
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-demo
  template:
    metadata:
      name: stress
      labels:
        app: app-demo # match the owner spec of `reservation-demo-big`
    spec:
      schedulerName: koord-scheduler # use koord-scheduler
      containers:
        - name: stress
          image: polinux/stress
          args:
            - '-c'
            - '1'
          command:
            - stress
          resources:
            requests:
              cpu: 2
              memory: 10Gi
            limits:
              cpu: 4
              memory: 20Gi

$ kubectl create -f app-demo.yaml
deployment.apps/app-demo created
  5. Check the scheduling result of the app-demo pods.
$ kubectl get pod -l app=app-demo -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
app-demo-798c66db46-ctnbr   1/1     Running   0          2m    10.17.0.124   node-1   <none>           <none>
app-demo-798c66db46-pzphc   1/1     Running   0          2m    10.17.0.125   node-1   <none>           <none>

The app-demo pods will be scheduled on the same node as reservation-demo-big.

  6. Check the status of reservation-demo-big.
$ kubectl get reservation reservation-demo-big -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
  creationTimestamp: "YYYY-MM-DDT06:28:16Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  allocateOnce: false # allow the reservation to be allocated by multiple owners
  owners:
    - object:
        name: pod-demo-1
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable:
    cpu: 6
    memory: 20Gi
  allocated:
    cpu: 4
    memory: 20Gi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: app-demo-798c66db46-ctnbr
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
    - name: app-demo-798c66db46-pzphc
      namespace: default
      uid: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
  nodeName: node-1
  phase: Available

Now we can see that reservation-demo-big has reserved 6 CPU and 20Gi memory, and app-demo has allocated 4 CPU and 20Gi memory from the reserved resources. The allocation of reserved resources does not increase the requested capacity of the node; otherwise the total requested resources of node-1 would exceed its allocatable capacity. Moreover, a reservation can be allocated by multiple owners at the same time as long as there are enough unallocated reserved resources.

Case: PreAllocation in Default mode

This case demonstrates PreAllocation in Default mode, where the Reservation is bound to an already-scheduled pod that matches the owner spec.

  1. Deploy a Reservation with preAllocation enabled:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-prealloc
spec:
  preAllocation: true
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: my-app
  ttl: 2h
  2. The scheduler finds a running pod that matches the owner spec and binds the Reservation to it. When the pod terminates, the Reservation transitions to holding the released resources for subsequent pods.

Case: PreAllocation in Cluster mode

This case demonstrates PreAllocation in Cluster mode, which uses cluster-level selectors to identify pre-allocatable pods.

  1. First, label the pre-allocatable pod:
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-1
  labels:
    pod.koordinator.sh/is-pre-allocatable: "true"
  annotations:
    pod.koordinator.sh/pre-allocatable-priority: "100"
spec:
  containers:
    - name: batch-job
      image: busybox
      resources:
        requests:
          cpu: 2
          memory: 4Gi
  2. Create a Reservation in Cluster mode:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-cluster-mode
spec:
  preAllocation: true
  preAllocationPolicy:
    mode: Cluster
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 2
              memory: 4Gi
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: critical-app
  ttl: 4h
  3. The scheduler selects the pre-allocatable pod with the highest priority (based on the pre-allocatable-priority annotation) and binds the Reservation to it.

Case: PreAllocation with multiple pods

This case demonstrates PreAllocation with enableMultiple, which accumulates resources from multiple pods when no single pod can satisfy the Reservation's requirements.

  1. Mark multiple pods as pre-allocatable:
# Pod 1 - 1 CPU, 2Gi memory
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-1
  labels:
    pod.koordinator.sh/is-pre-allocatable: "true"
  annotations:
    pod.koordinator.sh/pre-allocatable-priority: "100"
spec:
  containers:
    - name: batch-job
      image: busybox
      resources:
        requests:
          cpu: 1
          memory: 2Gi
---
# Pod 2 - 1 CPU, 2Gi memory
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-2
  labels:
    pod.koordinator.sh/is-pre-allocatable: "true"
  annotations:
    pod.koordinator.sh/pre-allocatable-priority: "90"
spec:
  containers:
    - name: batch-job
      image: busybox
      resources:
        requests:
          cpu: 1
          memory: 2Gi
  2. Create a Reservation that requests 2 CPU and 4Gi memory with enableMultiple enabled:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-multi-pods
spec:
  preAllocation: true
  preAllocationPolicy:
    mode: Cluster
    enableMultiple: true
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: 2
              memory: 4Gi
      schedulerName: koord-scheduler
  owners:
    - labelSelector:
        matchLabels:
          app: high-priority-app
  ttl: 4h
  3. The scheduler pre-allocates batch-job-1 and batch-job-2 (prioritized by the pre-allocatable-priority annotation) to satisfy the Reservation's resource requirements. When these pods terminate, the Reservation transitions to holding the released resources.

Scheduler configuration for PreAllocation

The scheduler offers additional options for PreAllocation behavior. Add the following to the scheduler configuration:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    plugins:
      reservation:
        enabled:
          - name: Reservation
    pluginConfig:
      - name: Reservation
        args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: ReservationArgs
          preAllocationConfig:
            # Enable the cluster-wide pre-allocation mode
            enableClusterMode: true
            # Custom label key that marks pre-allocatable pods (optional)
            clusterLabelKey: pod.koordinator.sh/is-pre-allocatable
            # Custom annotation key for pod priority (optional)
            clusterPriorityAnnotationKey: pod.koordinator.sh/pre-allocatable-priority
            # Prefer placing the Reservation without using pre-allocatable pods when possible
            preferNoPreAllocatedPods: true

Configuration options:

  • enableClusterMode: Enables the cluster-wide pre-allocation mode.
  • clusterLabelKey: Custom label key that marks pre-allocatable candidates.
  • clusterPriorityAnnotationKey: Custom annotation key for the priority of pre-allocatable pods.
  • preferNoPreAllocatedPods: When enabled, prefer placing the Reservation without using pre-allocatable pods if the node has enough free resources.