Job Level Preemption
Introduction
In large-scale cluster environments, high-priority jobs (e.g., critical AI training tasks) often need to preempt resources from lower-priority workloads when sufficient resources are not available. However, traditional pod-level preemption in Kubernetes cannot guarantee that all member pods of a distributed job acquire resources together, so victims may be evicted while the job as a whole still cannot run, resulting in ineffective preemption.
To solve this, Koordinator provides job-level preemption, which ensures that:
- Preemption is triggered at the job (GangGroup) level.
- Only when all member pods can be co-scheduled after eviction will preemption occur.
- Resources are reserved via nominatedNode for all members to maintain scheduling consistency.
This capability works seamlessly with PodGroup/GangGroup semantics.
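In the simplest case a gang is a single PodGroup, as shown in the usage example below. Multiple gangs can also be linked into one GangGroup so that they are scheduled, and preempt, as a unit. The following is only a minimal sketch assuming the gang.scheduling.koordinator.sh/groups annotation described in the Gang Scheduling guide; the names gang-a and gang-b are placeholders and the exact annotation keys may vary across versions:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-a
  namespace: default
  annotations:
    # schedule (and preempt) gang-a and gang-b together as one GangGroup
    gang.scheduling.koordinator.sh/groups: '["default/gang-a", "default/gang-b"]'
spec:
  minMember: 2
  scheduleTimeoutSeconds: 300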
Prerequisites
- Kubernetes >= 1.18
- Koordinator >= 1.7.0
Verify Preemption is Enabled
Although job-level preemption is enabled by default as of Koordinator >= 1.7.0, it is recommended to confirm the Coscheduling plugin configuration.
Check Scheduler Configuration
Retrieve the current koord-scheduler-config:
kubectl -n koordinator-system get cm koord-scheduler-config -o yaml
Ensure the Coscheduling plugin has enablePreemption: true:
pluginConfig:
  - name: Coscheduling
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: CoschedulingArgs
      enablePreemption: true
If changes are made, restart the koord-scheduler pod to apply them.
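For example, assuming the default Helm installation where the scheduler runs as the koord-scheduler Deployment in the koordinator-system namespace:
kubectl -n koordinator-system rollout restart deployment/koord-scheduler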
Usage Example
Environment Setup
To demonstrate job-level preemption, we will simulate a resource-constrained environment and trigger preemption from a high-priority job. Assume the cluster has 2 worker nodes, each with:
- CPU: 4 cores
- Memory: 16 GiB
- No other running workloads initially
Our procedure is:
- Fill both nodes with low-priority pods consuming all CPU.
- Submit a high-priority gang job that cannot fit.
- Observe how Koordinator evicts low-priority pods to make space.
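Before starting, you can confirm each node's allocatable capacity (node names are examples and will differ in your cluster):
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory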
Define PriorityClasses
- You must define priority classes to enable preemption logic.
# High-Priority Class (for preemptors)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Used for critical AI training jobs that can preempt others."
---
# Low-Priority Class (for victims)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Used for non-critical jobs that can be preempted."
- Apply them
kubectl apply -f priorityclasses.yaml
- Verify
kubectl get priorityclass
NAME            VALUE     GLOBAL-DEFAULT   AGE
high-priority   1000000   false            1m
low-priority    1000      false            1m
Deploy Low-Priority Pods to Consume Resources
- Create 2 low-priority pods (one per node), each requesting 4 CPU cores, so that both nodes are fully occupied.
apiVersion: v1
kind: Pod
metadata:
  name: lp-pod-1
  namespace: default
spec:
  schedulerName: koord-scheduler
  priorityClassName: low-priority
  containers:
    - command:
        - sleep
        - 365d
      image: busybox
      imagePullPolicy: IfNotPresent
      name: curlimage
      resources:
        limits:
          cpu: 4
          memory: 40Mi
        requests:
          cpu: 4
          memory: 40Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
  restartPolicy: Always
---
apiVersion: v1
kind: Pod
metadata:
  name: lp-pod-2
  namespace: default
spec:
  schedulerName: koord-scheduler
  priorityClassName: low-priority
  containers:
    - command:
        - sleep
        - 365d
      image: busybox
      imagePullPolicy: IfNotPresent
      name: curlimage
      resources:
        limits:
          cpu: 4
          memory: 40Mi
        requests:
          cpu: 4
          memory: 40Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
  restartPolicy: Always
- Apply them
kubectl apply -f low-priority-pods.yaml
- Check
kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
lp-pod-1   1/1     Running   0          2m    10.244.1.10   cn-beijing.1   <none>           <none>
lp-pod-2   1/1     Running   0          2m    10.244.1.11   cn-beijing.2   <none>           <none>
At this point, no CPU remains available on either node.
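You can verify this from each node's allocated resources, for example (replace the node name with one from your cluster):
kubectl describe node cn-beijing.1 | grep -A 5 "Allocated resources"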
Create a High-Priority Gang Job to Trigger Preemption
- Now submit a 2-pod high-priority gang job that requests 3 CPU cores per pod. With both nodes fully occupied, neither member can be scheduled without preemption.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: hp-training-job
  namespace: default
spec:
  minMember: 2
  scheduleTimeoutSeconds: 300
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
    - command:
        - sleep
        - 365d
      image: busybox
      imagePullPolicy: IfNotPresent
      name: curlimage
      resources:
        limits:
          cpu: 3
          memory: 40Mi
        requests:
          cpu: 3
          memory: 40Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
  restartPolicy: Always
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
    - command:
        - sleep
        - 365d
      image: busybox
      imagePullPolicy: IfNotPresent
      name: curlimage
      resources:
        limits:
          cpu: 3
          memory: 40Mi
        requests:
          cpu: 3
          memory: 40Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
  restartPolicy: Always
- Apply them
kubectl apply -f high-priority-job.yaml
After a few seconds, Koordinator will evict one pod per node to free up resources.
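To watch the eviction and re-scheduling as they happen, you can run:
kubectl get pods -o wide -w
The Events section of kubectl describe pod hp-worker-1 also typically records the scheduling attempts and the preemption decision.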
Verify Preemption Outcome
- Check Victim Pods Were Evicted
kubectl get pods -o wide
NAME          READY   STATUS        RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
hp-worker-1   0/1     Pending       0          90s   <none>        <none>         cn-beijing.1     <none>
hp-worker-2   0/1     Pending       0          90s   <none>        <none>         cn-beijing.2     <none>
lp-pod-1      0/1     Terminating   0          5m    10.244.1.10   cn-beijing.1   <none>           <none>
lp-pod-2      1/1     Terminating   0          5m    10.244.1.11   cn-beijing.2   <none>           <none>
Pods lp-pod-1 and lp-pod-2 are being terminated to make room, and the high-priority pods have been nominated to the freed nodes.
- Inspect one victim:
kubectl get pod lp-pod-1 -o yaml
status:
  conditions:
    - type: DisruptionTarget
      status: "True"
      lastTransitionTime: "2025-10-12T11:23:45Z"
      reason: PreemptionByScheduler
      message: >-
        koord-scheduler: preempting to accommodate higher priority pods, preemptor:
        default/hp-training-job, triggerpod: default/hp-worker-1
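You can also confirm that a node was reserved for each pending preemptor by reading the nominated node directly from pod status:
kubectl get pod hp-worker-1 -o jsonpath='{.status.nominatedNodeName}'
cn-beijing.1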
- Confirm Binding After Termination
kubectl get pods -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
hp-worker-1   1/1     Running   0          3m    10.244.1.14   cn-beijing.1   <none>           <none>
hp-worker-2   1/1     Running   0          3m    10.244.2.15   cn-beijing.2   <none>           <none>
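Optionally, since the PodGroup CRD ships with Koordinator, you can check that the gang as a whole has progressed; the status.phase field is expected to report Running once all members are bound:
kubectl get podgroup hp-training-job -o jsonpath='{.status.phase}'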