# Resource Reservation
Resource Reservation is a koord-scheduler capability that reserves node resources for specific pods or workloads.
## Introduction
Pods are the fundamental unit for allocating node resources in Kubernetes; they bind resource requirements to business logic. However, we may want to allocate resources for pods or workloads that have not been created yet, as in the scenarios below:
- Preemption: Existing preemption does not guarantee that only the preempting pod can allocate the preempted resources. We expect the scheduler to "lock" resources, preventing other pods from allocating them even if those pods have the same or higher priority.
- De-scheduling: For the descheduler, it is better to ensure sufficient resources before pods get rescheduled. Otherwise, rescheduled pods may no longer be runnable, disrupting the application they belong to.
- Horizontal scaling: To achieve more deterministic horizontal scaling, we expect to allocate node resources for the replicas before they are scaled out.
- Resource Pre-allocation: We may want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable.
To enhance the resource scheduling of Kubernetes, koord-scheduler provides a scheduling API named `Reservation`, which allows us to reserve node resources for specified pods or workloads even if they haven't been created yet.
For more information, please see Design: Resource Reservation.
## Setup
### Prerequisite
- Kubernetes >= 1.18
- Koordinator >= 0.6
### Installation
Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to Installation.
### Configurations
Resource Reservation is enabled by default. You can use it without any modification to the koord-scheduler config.
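To verify the scheduler configuration, you can inspect the ConfigMap shipped with the installation. A minimal sketch, assuming the default Helm setup, which places a `koord-scheduler-config` ConfigMap in the `koordinator-system` namespace (names may differ in your deployment):

```bash
# Inspect the koord-scheduler configuration; Resource Reservation
# needs no extra plugin arguments to work.
$ kubectl get configmap koord-scheduler-config -n koordinator-system -o yaml
```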
## Use Resource Reservation
### Quick Start
- Deploy a reservation `reservation-demo` with the YAML file below.
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  template: # set resource requirements
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 500m cpu and 800Mi memory
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler # use koord-scheduler
  owners: # set the owner specifications
    - object: # owner pods whose name is `default/pod-demo-0`
        name: pod-demo-0
        namespace: default
  ttl: 1h # set the TTL, the reservation will get expired 1 hour later
```
```bash
$ kubectl create -f reservation-demo.yaml
reservation.scheduling.koordinator.sh/reservation-demo created
```
- Watch the reservation status until it becomes available.
```bash
$ kubectl get reservation reservation-demo -o wide
NAME               PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo   Available   88s   node-0   1h
```
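Instead of polling, you can block until the reservation becomes schedulable. A minimal sketch, assuming kubectl v1.23 or newer (the first release to support JSONPath conditions in `kubectl wait`):

```bash
# Block until .status.phase of the reservation turns Available, up to 2 minutes.
$ kubectl wait --for=jsonpath='{.status.phase}'=Available reservation/reservation-demo --timeout=120s
```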
- Deploy a pod `pod-demo-0` with the YAML file below.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo-0 # match the owner spec of `reservation-demo`
spec:
  containers:
    - args:
        - '-c'
        - '1'
      command:
        - stress
      image: polinux/stress
      imagePullPolicy: Always
      name: stress
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 400Mi
  restartPolicy: Always
  schedulerName: koord-scheduler # use koord-scheduler
```
```bash
$ kubectl create -f pod-demo-0.yaml
pod/pod-demo-0 created
```
- Check the scheduled result of the pod `pod-demo-0`.
```bash
$ kubectl get pod pod-demo-0 -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
pod-demo-0   1/1     Running   0          32s   10.17.0.123   node-0   <none>           <none>
```
`pod-demo-0` is scheduled on the same node as `reservation-demo`.
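To confirm that the pod was actually bound through the reservation and not merely co-located, inspect the pod annotations. A sketch, assuming your Koordinator version records allocations under the `scheduling.koordinator.sh/reservation-allocated` annotation key (this key may vary across versions):

```bash
# Print which reservation the scheduler allocated this pod from.
# Dots inside the annotation key must be escaped as `\.` in JSONPath.
$ kubectl get pod pod-demo-0 -o jsonpath='{.metadata.annotations.scheduling\.koordinator\.sh/reservation-allocated}'
```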
- Check the status of the reservation `reservation-demo`.
```bash
$ kubectl get reservation reservation-demo -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
  creationTimestamp: "YYYY-MM-DDT05:24:58Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
    - object:
        name: pod-demo-0
        namespace: default
  template:
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable: # total reserved
    cpu: 500m
    memory: 800Mi
  allocated: # current allocated
    cpu: 200m
    memory: 400Mi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: pod-demo-0
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
  nodeName: node-0
  phase: Available
```
Now we can see the reservation `reservation-demo` has reserved 500m cpu and 800Mi memory, and the pod `pod-demo-0` allocates 200m cpu and 400Mi memory from the reserved resources.
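To read just these quantities without dumping the whole object, query the status fields directly:

```bash
# First line: total reserved; second line: currently allocated.
$ kubectl get reservation reservation-demo -o jsonpath='{.status.allocatable}{"\n"}{.status.allocated}{"\n"}'
```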
- Clean up the reservation `reservation-demo`.
```bash
$ kubectl delete reservation reservation-demo
reservation.scheduling.koordinator.sh "reservation-demo" deleted
$ kubectl get pod pod-demo-0
NAME         READY   STATUS    RESTARTS   AGE
pod-demo-0   1/1     Running   0          110s
```
After the reservation is deleted, the pod `pod-demo-0` is still running.
### Advanced Configurations
The latest API can be found in `reservation_types`.
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # pod template (required): Reserve resources and apply pod/node affinities according to the template.
  # The resource requirements of the pod indicate the resource requirements of the reservation.
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      # scheduler name (required): use koord-scheduler to schedule the reservation
      schedulerName: koord-scheduler
  # owner spec (required): Specify what kinds of pods can allocate resources of this reservation.
  # Currently supports three kinds of owner specifications:
  # - object: specify the name, namespace and uid of the owner pods
  # - controller: specify the owner reference of the owner pods, e.g. name, namespace (extended by koordinator), uid, kind
  # - labelSelector: specify the matching labels and matching expressions of the owner pods
  owners:
    - object:
        name: pod-demo-0
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  # TTL (optional): Time-To-Live duration of the reservation. The reservation will expire after the TTL period.
  # If not set, `24h` is used as the default.
  ttl: 1h
  # Expires (optional): Expired timestamp when the reservation is expected to expire.
  # If both `expires` and `ttl` are set, `expires` is checked first.
  expires: "YYYY-MM-DDTHH:MM:SSZ"
```
### Example: Reserve on Specified Node, with Multiple Owners
- Check the allocatable resources of each node.
```bash
$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
NAME     CPU     MEMORY
node-0   7800m   28625036Ki
node-1   7800m   28629692Ki
...

$ kubectl describe node node-1 | grep -A 8 "Allocated resources"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                780m (10%)    7722m (99%)
  memory             1216Mi (4%)   14044Mi (50%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
```
As shown above, the node `node-1` has about 7.0 cpu (7800m allocatable minus 780m requested) and about 26Gi memory (roughly 27.3Gi allocatable minus 1.2Gi requested) unallocated.
- Deploy a reservation `reservation-demo-big` with the YAML file below.
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
spec:
  allocateOnce: false # allow the reservation to be allocated by multiple owners
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 6 cpu and 20Gi memory
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1 # set the expected node name to schedule at
      schedulerName: koord-scheduler
  owners: # set multiple owners
    - object: # owner pods whose name is `default/pod-demo-1`
        name: pod-demo-1
        namespace: default
    - labelSelector: # owner pods with the label `app=app-demo` can allocate the reserved resources
        matchLabels:
          app: app-demo
  ttl: 1h
```
```bash
$ kubectl create -f reservation-demo-big.yaml
reservation.scheduling.koordinator.sh/reservation-demo-big created
```
- Watch the reservation status until it becomes available.
```bash
$ kubectl get reservation reservation-demo-big -o wide
NAME                   PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo-big   Available   37s   node-1   1h
```
The reservation `reservation-demo-big` is scheduled on the node `node-1`, which matches the `nodeName` set in the pod template.
- Deploy a deployment `app-demo` with the YAML file below.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-demo
  template:
    metadata:
      name: stress
      labels:
        app: app-demo # match the owner spec of `reservation-demo-big`
    spec:
      schedulerName: koord-scheduler # use koord-scheduler
      containers:
        - name: stress
          image: polinux/stress
          args:
            - '-c'
            - '1'
          command:
            - stress
          resources:
            requests:
              cpu: 2
              memory: 10Gi
            limits:
              cpu: 4
              memory: 20Gi
```
```bash
$ kubectl create -f app-demo.yaml
deployment.apps/app-demo created
```
- Check the scheduled result of the pods of deployment `app-demo`.
```bash
$ kubectl get pod -l app=app-demo -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
app-demo-798c66db46-ctnbr   1/1     Running   0          2m    10.17.0.124   node-1   <none>           <none>
app-demo-798c66db46-pzphc   1/1     Running   0          2m    10.17.0.125   node-1   <none>           <none>
```
Pods of deployment `app-demo` are scheduled on the same node as `reservation-demo-big`.
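As in the quick start, the reservation-allocated annotation (assuming your Koordinator version sets it) shows which reservation each pod drew from:

```bash
# List each owner pod together with the reservation it allocated from.
$ kubectl get pod -l app=app-demo -o custom-columns='NAME:.metadata.name,RESERVATION:.metadata.annotations.scheduling\.koordinator\.sh/reservation-allocated'
```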
- Check the status of the reservation `reservation-demo-big`.
```bash
$ kubectl get reservation reservation-demo-big -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
  creationTimestamp: "YYYY-MM-DDT06:28:16Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  allocateOnce: false # allow the reservation to be allocated by multiple owners
  owners:
    - object:
        name: pod-demo-1
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  template:
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable:
    cpu: 6
    memory: 20Gi
  allocated:
    cpu: 4
    memory: 20Gi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: app-demo-798c66db46-ctnbr
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
    - name: app-demo-798c66db46-pzphc
      namespace: default
      uid: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
  nodeName: node-1
  phase: Available
```
Now we can see the reservation `reservation-demo-big` has reserved 6 cpu and 20Gi memory, and the pods of deployment `app-demo` have allocated 4 cpu and 20Gi memory from the reserved resources. The allocation of reserved resources does not increase the requests of the node; otherwise, the total requests of `node-1` would exceed the node allocatable. Moreover, a reservation can be allocated by multiple owners as long as enough reserved resources remain unallocated.
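The remaining headroom is visible straight from the status: 6 cpu reserved minus 4 cpu allocated leaves 2 cpu for further owner pods, while the 20Gi of memory is fully allocated. A quick query:

```bash
# Print reserved total, current allocation, and the current owner pods.
$ kubectl get reservation reservation-demo-big -o jsonpath='{.status.allocatable}{"\n"}{.status.allocated}{"\n"}{range .status.currentOwners[*]}{.namespace}/{.name}{"\n"}{end}'
```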