Job Scheduling
A batch of pods that must be scheduled together is called a Job.
PodGroup
Sometimes the batch of pods is completely homogeneous and only needs to reach a specified minimum number of pods before scheduling succeeds. In this case, we can declare the minMember in a separate PodGroup and associate its member pods through a pod label. Here is a PodGroup with a minimum member count of 2, together with its member pods.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
spec:
  schedulerName: koord-scheduler
  ...
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
spec:
  schedulerName: koord-scheduler
  ...
GangGroup
In other cases, the pods that must be scheduled together may not be homogeneous, and each subset must reach its own minimum number. In this case, Koordinator supports associating different PodGroups into a GangGroup through a PodGroup annotation. Here is a GangGroup with two PodGroups:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example1
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
spec:
  minMember: 1
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example2
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
spec:
  minMember: 2
Job-Level Preemption
When a pod cannot be scheduled due to insufficient resources, Kube-Scheduler attempts to evict lower-priority pods to make room for it. This is traditional pod-level preemption. However, when a Job cannot be scheduled due to insufficient resources, the scheduler must make enough space for the entire Job to be scheduled. This type of preemption is called Job-level preemption.
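Preemption decisions are driven by pod priority. The following is a minimal sketch, using the standard Kubernetes PriorityClass API, of how priorities might be set up so that a gang-scheduled job can preempt lower-priority pods; the class names and values below are illustrative and not part of Koordinator's API.

# Illustrative PriorityClass objects (names and values are examples only).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 100000            # higher value = higher priority
globalDefault: false
description: "High priority for gang-scheduled training jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
description: "Low priority for best-effort batch pods"
---
# Member pods of the preemptor job reference the high-priority class.
apiVersion: v1
kind: Pod
metadata:
  name: hello-job-pod-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hello-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: training-high
  ...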
Preemption Algorithm
The job that initiates preemption is called the preemptor, and the preempted pod is called the victim. The overall workflow of job-level preemption is as follows:
- Unschedulable pod → enters the PostFilter phase
- Is it a Job? → Yes → fetch all member pods
- Check job preemption eligibility (see the PriorityClass example after this list):
  - The member pods' spec.preemptionPolicy ≠ Never
  - No terminating victims on the currently nominated nodes of all member pods (to prevent redundant preemption)
- Find candidate nodes where preemption may help
- Perform a dry run to simulate removal of potential victims (lower-priority pods)
- Select the optimal node + minimal-cost victim set (job-aware cost model)
- Execute preemption:
  - Delete victims (by setting the DisruptionTarget condition and invoking the deletion API)
  - Clear status.nominatedNodeName of other lower-priority nominated pods on the target nodes
  - Set status.nominatedNodeName for all member pods
- Preemption successful → the pods re-enter the scheduling queue, waiting for the victims to terminate
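As a hedged illustration of the eligibility check above: Kubernetes copies the priority class's preemptionPolicy into pod.spec.preemptionPolicy at admission time, so member pods that reference a non-preempting class like the illustrative one below will not trigger job-level preemption.

# Illustrative non-preempting priority class: pods that use it keep a high
# scheduling priority but never evict victims, so the eligibility check
# (spec.preemptionPolicy ≠ Never) fails and no preemption is attempted.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high-no-preempt
value: 100000
preemptionPolicy: Never
description: "High priority without preemption (example only)"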
Preemption Reason for Victim
When a victim is preempted, Koord-Scheduler adds an entry to victim.status.conditions to indicate which job preempted it and triggers graceful termination.
apiVersion: v1
kind: Pod
metadata:
  name: victim-1
  namespace: default
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-09-17T08:41:35Z"
    message: 'koord-scheduler: preempting to accommodate higher priority pods, preemptor:
      default/hello-job, triggerpod: default/preemptor-pod-2'
    reason: PreemptionByScheduler
    status: "True"
    type: DisruptionTarget
The above shows that default/victim-1 was preempted by the high-priority job hello-job. Member Pods of hello-job can be retrieved via the following command:
$ kubectl get po -n default -l pod-group.scheduling.sigs.k8s.io=hello-job
hello-job-pod-1 0/1 Pending 0 5m
hello-job-pod-2 0/1 Pending 0 5m
Nominated Node for Preemptor
After a Job preemption succeeds, in addition to evicting the victim pods, the scheduler must also reserve the reclaimed resources in its internal cache. In Kubernetes, this is achieved using pod.status.nominatedNodeName. In Koordinator, koord-scheduler sets the .status.nominatedNodeName field for all member pods of the preempting job to reflect this resource reservation.
apiVersion: v1
kind: Pod
metadata:
  name: preemptor-pod-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hello-job
status:
  nominatedNodeName: example-node
  phase: Pending
---
apiVersion: v1
kind: Pod
metadata:
  name: preemptor-pod-2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hello-job
status:
  nominatedNodeName: example-node
  phase: Pending
The above shows that the two pods of hello-job have successfully completed preemption and are nominated for scheduling to example-node.
Network-Topology Aware
In large-scale AI training scenarios, especially for large language models (LLMs), efficient inter-pod communication is critical to training performance. Model parallelism techniques such as Tensor Parallelism (TP), Pipeline Parallelism (PP), and Data Parallelism (DP) require frequent and high-bandwidth data exchange across GPUsโoften spanning multiple nodes. Under such workloads, network topology becomes a key performance bottleneck, where communication latency and bandwidth are heavily influenced by the physical network hierarchy (e.g., NVLink, block, spine).

To optimize training efficiency, pods within a GangGroup are required or preferred to be scheduled to nodes that reside in the same or nearby high-performance network domains, minimizing inter-node hops and maximizing throughput. For example, in a spine-block architecture, scheduling all member pods under the same block or spine switch significantly reduces communication latency compared to distributing them across different spines.
Topology-Aware Scheduling Requirements
While Kubernetes' native scheduler supports basic topology constraints via PodAffinity, it operates on a per-Pod basis and lacks gang scheduling semantics, making it ineffective for coordinated placement of tightly coupled workloads. Koord-Scheduler abstracts the PodGroup and GangGroup concepts to provide all-or-nothing semantics, enabling collective scheduling of interdependent pods. Moreover, to meet the demands of modern AI training, we extend them with Network-Topology Aware Scheduling, a capability that intelligently selects optimal nodes based on the network hierarchy.
This feature ensures:
- When cluster resources are sufficient, pods with network topology scheduling requirements will be scheduled to a topology domain with better performance (e.g., lower latency, higher bandwidth) according to user-specified strategies.
- When cluster resources are insufficient, the scheduler will reclaim resources for the GangGroup based on network topology constraints through job-level preemption, and record the resource nominations in the .status.nominatedNodeName field to ensure consistent placement.
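For contrast, the native per-Pod approach mentioned above can only express an affinity rule such as the sketch below, which pulls pods toward the same block label but schedules each pod independently and therefore cannot guarantee all-or-nothing placement (the app label and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: affinity-only-example
  namespace: default
  labels:
    app: llm-training
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: llm-training
        # Co-locate with peers that share the same block node label,
        # but with no gang semantics across the group.
        topologyKey: network.topology.nvidia.com/block
  containers:
  - name: trainer
    image: example-training-image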
Cluster Network Topology
Nodes are labeled with their network topology positions using tools like NVIDIA's topograph:
apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    network.topology.nvidia.com/accelerator: nvl1
    network.topology.nvidia.com/block: s1
    network.topology.nvidia.com/spine: s2
    network.topology.nvidia.com/datacenter: s3
Administrators define the topology hierarchy via a ClusterNetworkTopology CR named default:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  name: default
spec:
  networkTopologySpec:
  - labelKey:
    - network.topology.nvidia.com/spine
    topologyLayer: SpineLayer
  - labelKey:
    - network.topology.nvidia.com/block
    parentTopologyLayer: SpineLayer
    topologyLayer: BlockLayer
  - parentTopologyLayer: BlockLayer
    topologyLayer: NodeTopologyLayer
The topology forms a tree structure, where each layer represents a level of aggregation in the network (e.g., Node → Block → Spine).
The status.detailStatus field of ClusterNetworkTopology is automatically maintained by Koordinator, reflecting the actual network topology structure and node distribution in the cluster. It presents a hierarchical view from the top-level (cluster) down to individual nodes. Each entry in detailStatus represents an instance of a specific topology layer, with key fields:
- topologyInfo: the current layer's type and name (e.g., SpineLayer, s1).
- parentTopologyInfo: the parent layer's information.
- childTopologyNames: the list of child domains in the next lower layer.
- nodeNum: the number of nodes within this topology domain.
The following is an example of ClusterNetworkTopology.status.detailStatus:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  name: default
spec:
  networkTopologySpec:
  - labelKey:
    - network.topology.nvidia.com/spine
    topologyLayer: SpineLayer
  - labelKey:
    - network.topology.nvidia.com/block
    parentTopologyLayer: SpineLayer
    topologyLayer: BlockLayer
  - parentTopologyLayer: BlockLayer
    topologyLayer: NodeTopologyLayer
status:
  detailStatus:
  - childTopologyLayer: SpineLayer
    childTopologyNames:
    - s1
    - s2
    nodeNum: 8
    topologyInfo:
      topologyLayer: ClusterTopologyLayer
      topologyName: ""
  - childTopologyLayer: BlockLayer
    childTopologyNames:
    - b2
    - b1
    nodeNum: 4
    parentTopologyInfo:
      topologyLayer: ClusterTopologyLayer
      topologyName: ""
    topologyInfo:
      topologyLayer: SpineLayer
      topologyName: s1
  - childTopologyLayer: NodeTopologyLayer
    nodeNum: 2
    parentTopologyInfo:
      topologyLayer: SpineLayer
      topologyName: s1
    topologyInfo:
      topologyLayer: BlockLayer
      topologyName: b2
  - childTopologyLayer: NodeTopologyLayer
    nodeNum: 2
    parentTopologyInfo:
      topologyLayer: SpineLayer
      topologyName: s1
    topologyInfo:
      topologyLayer: BlockLayer
      topologyName: b1
  - childTopologyLayer: BlockLayer
    childTopologyNames:
    - b3
    - b4
    nodeNum: 4
    parentTopologyInfo:
      topologyLayer: ClusterTopologyLayer
      topologyName: ""
    topologyInfo:
      topologyLayer: SpineLayer
      topologyName: s2
  - childTopologyLayer: NodeTopologyLayer
    nodeNum: 2
    parentTopologyInfo:
      topologyLayer: SpineLayer
      topologyName: s2
    topologyInfo:
      topologyLayer: BlockLayer
      topologyName: b3
  - childTopologyLayer: NodeTopologyLayer
    nodeNum: 2
    parentTopologyInfo:
      topologyLayer: SpineLayer
      topologyName: s2
    topologyInfo:
      topologyLayer: BlockLayer
      topologyName: b4
Based on the above status, the cluster has a two-tier spine-block architecture:
ClusterTopologyLayer
├── SpineLayer: s1
│   ├── BlockLayer: b1
│   │   └── NodeTopologyLayer: 2 nodes
│   └── BlockLayer: b2
│       └── NodeTopologyLayer: 2 nodes
└── SpineLayer: s2
    ├── BlockLayer: b3
    │   └── NodeTopologyLayer: 2 nodes
    └── BlockLayer: b4
        └── NodeTopologyLayer: 2 nodes
Network Topology Spec
When users want to configure the network topology gather strategy for a GangGroup, its PodGroups can be annotated as follows:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-master
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: "[\"default/gang-master\", \"default/gang-worker\"]"
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {
            "layer": "SpineLayer",
            "strategy": "PreferGather"
          },
          {
            "layer": "BlockLayer",
            "strategy": "PreferGather"
          },
          {
            "layer": "AcceleratorLayer",
            "strategy": "PreferGather"
          }
        ]
      }
spec:
  minMember: 1
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-worker
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: "[\"default/gang-master\", \"default/gang-worker\"]"
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {
            "layer": "SpineLayer",
            "strategy": "PreferGather"
          },
          {
            "layer": "BlockLayer",
            "strategy": "PreferGather"
          },
          {
            "layer": "AcceleratorLayer",
            "strategy": "PreferGather"
          }
        ]
      }
spec:
  minMember: 2
The above PodGroups indicate that their member Pods first try to be placed within the same accelerator interconnect domain, then within the same Block, and finally within the same Spine network.
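Note that AcceleratorLayer is not declared in the ClusterNetworkTopology example shown earlier. If it is also registered, the hierarchy could be extended as in the following sketch, which assumes the accelerator interconnect domain (node label network.topology.nvidia.com/accelerator) sits between BlockLayer and the node layer; the actual placement should match your physical fabric.

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  name: default
spec:
  networkTopologySpec:
  - labelKey:
    - network.topology.nvidia.com/spine
    topologyLayer: SpineLayer
  - labelKey:
    - network.topology.nvidia.com/block
    parentTopologyLayer: SpineLayer
    topologyLayer: BlockLayer
  # Assumption: the accelerator (e.g., NVLink) domain is registered between
  # BlockLayer and the node layer.
  - labelKey:
    - network.topology.nvidia.com/accelerator
    parentTopologyLayer: BlockLayer
    topologyLayer: AcceleratorLayer
  - parentTopologyLayer: AcceleratorLayer
    topologyLayer: NodeTopologyLayer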
Sometimes, due to the strict demand for communication bandwidth, users may want to place all member Pods of a GangGroup under the same Spine. In this case, you can modify the PodGroup as follows:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-master
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: "[\"default/gang-master\", \"default/gang-worker\"]"
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {
            "layer": "SpineLayer",
            "strategy": "MustGather"
          }
        ]
      }
spec:
  minMember: 1
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-worker
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: "[\"default/gang-master\", \"default/gang-worker\"]"
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {
            "layer": "SpineLayer",
            "strategy": "MustGather"
          }
        ]
      }
spec:
  minMember: 2
Network Topology Pod Index
In distributed training, assigning an index to each Pod is essential for establishing communication patterns in data-parallel (DP) jobs. The index determines the logical order of Pods in collective operations. For example, for a GangGroup with DP=2, the member pods can be annotated as:
apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
  annotations:
    gang.scheduling.koordinator.sh/network-topology-index: "1"
spec:
  schedulerName: koord-scheduler
  ...
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
  annotations:
    gang.scheduling.koordinator.sh/network-topology-index: "2"
spec:
  schedulerName: koord-scheduler
  ...
Topology Gather Algorithm
Given the M member Pods of a GangGroup with network topology requirements, the set of Nodes that can place those Pods, and the network topology location of each node, the topology gather algorithm finds the best nodes for the M Pods. The overall computation proceeds step by step as follows:
1. The member Pods of the training task's GangGroup are generally homogeneous, so we randomly select one of them as the representative Pod.
2. From the bottom to the top of the network topology hierarchy, recursively calculate the number of member Pods that each topology node can accommodate, called its offer slots. The offer slots of a Node are computed by iteratively calling NodeInfo.AddPod, fwk.RunPreFilterExtensionsAddPod, and fwk.RunFilterWithNominatedNode.
3. Among all the topology nodes that can accommodate all the member Pods of the GangGroup, select those at the lowest layer as the candidate topology nodes.
4. Among the candidates selected in step 3, following the binpack principle, choose the candidate whose offer slots are closest to the number of slots required by the job as the final topology node solution. For example, Node5-Node8 may be selected as the final scheduling result of the job.
What's Next
- Gang Scheduling: Learn how to enable gang scheduling for your application.
- Network Topology Aware Scheduling: Learn how to enable network topology aware scheduling for a gang.
- Job Level Preemption: Learn how to use Job Level Preemption.