Job Level Preemption
Introduction
In large-scale clusters, high-priority jobs (e.g., critical AI training tasks) often need to preempt resources from lower-priority workloads when the cluster lacks free capacity. However, traditional pod-level preemption in Kubernetes cannot guarantee that all member pods of a distributed job obtain resources together, so it can evict victims without the job being able to start (invalid preemption).
To solve this, Koordinator provides job-level preemption, which ensures that:
- Preemption is triggered at the job (GangGroup) level.
- Preemption occurs only if all member pods can be co-scheduled after eviction.
- Resources are reserved via nominatedNode for all members to maintain scheduling consistency.
This capability works seamlessly with PodGroup/GangGroup semantics.
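The all-or-nothing rule above can be illustrated with a toy shell model. This is only a sketch of the idea, not Koordinator's actual algorithm. The function name `gang_can_preempt` and its inputs are hypothetical. It assumes that all member pods are identical and sized by a single CPU request, and that each node is described only by the whole CPU cores it would have free after evicting every lower-priority pod.

```shell
# Toy model of job-level (all-or-nothing) preemption; NOT Koordinator's implementation.
# Usage: gang_can_preempt MIN_MEMBER CPU_REQUEST RECLAIMABLE_CPU_PER_NODE...
# RECLAIMABLE_CPU_PER_NODE: whole CPU cores a node would have free after
# evicting all of its lower-priority pods (hypothetical input).
gang_can_preempt() {
  min_member=$1
  cpu_request=$2
  shift 2
  fit=0
  for reclaimable in "$@"; do
    # Integer division: how many identical members fit on this node.
    fit=$(( fit + reclaimable / cpu_request ))
  done
  # Preempt only if the whole gang fits; otherwise no pod should be evicted.
  [ "$fit" -ge "$min_member" ]
}

# Two members of 3 CPU each, two nodes with 4 reclaimable CPU each.
if gang_can_preempt 2 3 4 4; then echo "preempt"; else echo "do not preempt"; fi  # prints "preempt"
# Same gang, but the second node has nothing to reclaim.
if gang_can_preempt 2 3 4 0; then echo "preempt"; else echo "do not preempt"; fi  # prints "do not preempt"
```

In the second call, pod-level preemption could still evict a victim on the first node for one member. The job-level rule declines to evict anything, because the gang as a whole would remain unschedulable.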
Prerequisites
- Kubernetes >= 1.18
- Koordinator >= 1.7.0
Verify Preemption is Enabled
Although job-level preemption is enabled by default as of Koordinator >= 1.7.0, it is recommended to confirm the Coscheduling plugin configuration.
Check Scheduler Configuration
Retrieve the current koord-scheduler-config:
kubectl -n koordinator-system get cm koord-scheduler-config -o yaml
Ensure the Coscheduling plugin has enablePreemption: true:
pluginConfig:
  - name: Coscheduling
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: CoschedulingArgs
      enablePreemption: true
If changes are made, restart the koord-scheduler pod to apply them.
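The check and restart can be scripted roughly as follows. This is a sketch under two assumptions that are not stated above: a plain-text match on the ConfigMap is sufficient for a quick check, and the scheduler runs as a Deployment named `koord-scheduler` in `koordinator-system`. The helper names are illustrative; adjust them to your installation.

```shell
# Succeeds if the scheduler ConfigMap text contains "enablePreemption: true".
# Assumption: a plain-text match is good enough for a quick check; it does not
# verify that the setting belongs to the Coscheduling plugin.
preemption_enabled() {
  kubectl -n koordinator-system get cm koord-scheduler-config -o yaml \
    | grep -q 'enablePreemption: true'
}

# Assumption: the scheduler is deployed as deployment/koord-scheduler.
restart_koord_scheduler() {
  kubectl -n koordinator-system rollout restart deployment/koord-scheduler
  kubectl -n koordinator-system rollout status deployment/koord-scheduler
}

# Example:
#   preemption_enabled && echo "job-level preemption is enabled"
#   restart_koord_scheduler   # only needed after editing the ConfigMap
```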
Usage Example
Environment Setup
To demonstrate job-level preemption, we will simulate a resource-constrained environment and trigger preemption from a high-priority job. Assume the cluster has 2 worker nodes, each with:
- CPU: 4 cores
- Memory: 16 GiB
- No other running workloads initially
Our procedure is:
- Fill both nodes with low-priority pods consuming all CPU.
- Submit a high-priority gang job that cannot fit.
- Observe how Koordinator evicts low-priority pods to make space.
Define PriorityClasses
- You must define priority classes to enable preemption logic.
# High-Priority Class (for preemptors)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Used for critical AI training jobs that can preempt others."
---
# Low-Priority Class (for victims)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Used for non-critical jobs that can be preempted."
- Apply them
kubectl apply -f priorityclasses.yaml
- Verify
kubectl get priorityclass
NAME              VALUE        GLOBAL-DEFAULT   AGE
high-priority     1000000      false            1m
low-priority      1000         false            1m
Deploy Low-Priority Pods to Consume Resources
- Create 2 low-priority pods (1 per node), each requesting 4 CPU cores → fully occupying both nodes.
apiVersion: v1
kind: Pod
metadata:
  name: lp-pod-1
  namespace: default
spec:
  schedulerName: koord-scheduler
  priorityClassName: low-priority
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 4
        memory: 40Mi
      requests:
        cpu: 4
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
---
apiVersion: v1
kind: Pod
metadata:
  name: lp-pod-2
  namespace: default
spec:
  schedulerName: koord-scheduler
  priorityClassName: low-priority
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 4
        memory: 40Mi
      requests:
        cpu: 4
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
- Apply them
kubectl apply -f low-priority-pods.yaml
- Check
kubectl get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
lp-pod-1    1/1     Running   0          2m      10.244.1.10   cn-beijing.1   <none>           <none>
lp-pod-2    1/1     Running   0          2m      10.244.1.11   cn-beijing.2   <none>           <none>
At this point, no CPU remains available on either node.
Create a High-Priority Gang Job to Trigger Preemption
- Now submit a 2-pod high-priority job that requires 3 CPU per pod. Both nodes are fully occupied, so neither pod can be scheduled without evicting a low-priority pod.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: hp-training-job
  namespace: default
spec:
  minMember: 2
  scheduleTimeoutSeconds: 300
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 3
        memory: 40Mi
      requests:
        cpu: 3
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 3
        memory: 40Mi
      requests:
        cpu: 3
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
- Apply them
kubectl apply -f high-priority-job.yaml
After a few seconds, Koordinator will evict one pod per node to free up resources.
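The arithmetic behind "one pod per node" can be written out as a small shell sketch using the numbers from this example (4 CPU per node, 4 CPU per low-priority pod, 3 CPU per high-priority worker). The variable names are illustrative only.

```shell
node_cpu=4      # CPU cores per worker node
lp_request=4    # CPU requested by each low-priority pod
hp_request=3    # CPU requested by each high-priority worker

free_before=$(( node_cpu - lp_request ))   # free CPU per node before preemption
free_after=$node_cpu                       # free CPU per node once its low-priority pod is evicted
left_over=$(( free_after - hp_request ))   # CPU left on a node after placing one worker

echo "free before eviction: ${free_before} CPU (a worker needs ${hp_request})"
echo "free after eviction: ${free_after} CPU, ${left_over} CPU left after one worker"
```

Since only 1 CPU is left on a node after it hosts one worker, the second worker cannot share that node. It needs the other node, which is why a victim is evicted on each of the two nodes.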
Verify Preemption Outcome
- Check Victim Pods Were Evicted
kubectl get pods -o wide
NAME           READY   STATUS        RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
hp-worker-1    0/1     Pending       0          90s     <none>        <none>         cn-beijing.1     <none>
hp-worker-2    0/1     Pending       0          90s     <none>        <none>         cn-beijing.2     <none>
lp-pod-1       0/1     Terminating   0          5m      10.244.1.10   cn-beijing.1   <none>           <none>
lp-pod-2       1/1     Terminating   0          5m      10.244.1.11   cn-beijing.2   <none>           <none>
Pods lp-pod-1 and lp-pod-2 are being terminated to make room and high-priority pods are nominated.
- Inspect one victim:
kubectl get pod lp-pod-1 -o yaml
status:
  conditions:
    - type: DisruptionTarget
      status: "True"
      lastTransitionTime: "2025-10-12T11:23:45Z"
      reason: PreemptionByScheduler
      message: >-
        koord-scheduler: preempting to accommodate higher priority pods, preemptor:
        default/hp-training-job, triggerpod: default/hp-worker-1
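To pull out only the relevant fields instead of reading the full YAML, the following helpers may be handy. They are a sketch built on the standard `kubectl` JSONPath and custom-columns output options. The function names are hypothetical, and the label selector assumes the pods carry the PodGroup label used in this example.

```shell
# Prints the reason of the DisruptionTarget condition of a pod (empty if absent).
disruption_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="DisruptionTarget")].reason}'
}

# Prints each member of the example gang with the node nominated for it.
nominated_nodes() {
  kubectl get pods -l pod-group.scheduling.sigs.k8s.io=hp-training-job \
    -o custom-columns=NAME:.metadata.name,NOMINATED:.status.nominatedNodeName
}

# Example:
#   disruption_reason lp-pod-1
#   nominated_nodes
```

For the victim shown above, `disruption_reason lp-pod-1` would be expected to print the `reason` value from the status excerpt, PreemptionByScheduler.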
- Confirm Binding After Termination
kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
hp-worker-1    1/1     Running   0          3m      10.244.1.14   cn-beijing.1   <none>           <none>
hp-worker-2    1/1     Running   0          3m      10.244.2.15   cn-beijing.2   <none>           <none>
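Rather than polling `kubectl get pods` by hand, you can block until the gang members are ready. This is a minimal sketch: the helper name and the default timeout of 120s are arbitrary choices, and the label selector assumes the PodGroup label from this example.

```shell
# Waits until every pod of the example gang reports the Ready condition.
# Optional first argument: timeout (defaults to 120s, an arbitrary choice).
wait_for_gang_ready() {
  kubectl wait --for=condition=Ready pod \
    -l pod-group.scheduling.sigs.k8s.io=hp-training-job \
    --timeout="${1:-120s}"
}

# Example:
#   wait_for_gang_ready 300s && kubectl get pods -o wide
```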