Koord-Queue
Introduction
Koord-Queue is a native Kubernetes job queuing system designed for the Koordinator ecosystem. It manages job admission and ordering across multiple queues, integrating deeply with Koordinator's ElasticQuota for resource fairness and multi-tenant isolation. Key capabilities include:
- Multi-queue management with Priority, Block, and Intelligent queuing policies.
- Deep ElasticQuota integration to avoid duplicate quota configurations and enable elastic resource sharing.
- Pre-scheduling to reduce scheduler pressure by queuing jobs before they create pods.
- Multi-framework support including TFJob, PyTorchJob, Spark, Argo Workflow, Ray, and native Kubernetes Jobs.
- Admission check framework compatible with Kueue's AdmissionCheck API.
Setup
Prerequisites
- Kubernetes >= 1.22
- Koordinator >= 1.5 (for ElasticQuota integration)
Installation
Install Koord-Queue using Helm:
# Install from the Helm repository
helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \
--namespace koord-queue \
--create-namespace
Verify the installation:
$ kubectl get deployment -n koord-queue
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
koord-queue-controllers   1/1     1            1           30s
koord-queue               1/1     1            1           30s
$ kubectl get crd | grep scheduling.x-k8s.io
queues.scheduling.x-k8s.io 2024-01-01T00:00:00Z
queueunits.scheduling.x-k8s.io 2024-01-01T00:00:00Z
Configurations
Koord-Queue uses ElasticQuotaV2 mode by default.
Default Configuration
# Image registry (default: Aliyun Beijing)
global:
imagePrefix: registry.cn-beijing.aliyuncs.com
controller:
image:
repository: koordinator-sh/koord-queue
tag: v1.8.0
extension:
koord-queue-controllers:
repository: koordinator-sh/koord-queue-controllers
tag: v1.8.0
batchjob:
enable: true # Native Kubernetes Job support
tf:
enable: false
pytorch:
enable: false
argo:
enable: false
spark:
enable: false
ray:
enable: false
mpi:
enable: false
pluginConfigs:
apiVersion: scheduling.k8s.io/v1
kind: KoordQueueConfiguration
plugins:
- name: Priority
- name: ElasticQuotaV2
ElasticQuotaV2 Mode (Default)
Uses individual ElasticQuota CRs (scheduling.sigs.k8s.io/v1alpha1). This is the recommended mode for Koordinator users.
Use Koord-Queue
Quick Start with ElasticQuota
This example uses Koordinator's ElasticQuota for elastic resource management.
Create an ElasticQuota
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
name: team-a
namespace: default
labels:
quota.scheduling.koordinator.sh/parent: ""
quota.scheduling.koordinator.sh/is-parent: "false"
spec:
max:
cpu: "4"
memory: 8Gi
min:
cpu: "4"
memory: 8Gi
$ kubectl apply -f elastic-quota.yaml
Queue Auto-creation
When using ElasticQuotaV2, the plugin automatically creates a Queue CR in the koord-queue namespace for each ElasticQuota resource. The auto-created Queue has the same name as the ElasticQuota (e.g., team-a), with a default priority: 1000 and queuePolicy: Priority. You do not need to manually create a Queue for each ElasticQuota.
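Based on those defaults, the auto-created Queue for the team-a ElasticQuota would look roughly like this (a sketch; the exact generated fields may differ in your release):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: team-a            # same name as the ElasticQuota
  namespace: koord-queue  # Queues are auto-created in this namespace
spec:
  queuePolicy: Priority   # default policy
  priority: 1000          # default queue priority
```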
If you want to customize the Queue policy, you can set the koord-queue/queue-policy label on the ElasticQuota:
metadata:
labels:
koord-queue/queue-policy: Priority # Options: Priority, Block, Intelligent
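You can also set the label on an existing ElasticQuota with kubectl (using the team-a quota from the quick start as an example):

# Switch team-a to the Block policy
$ kubectl label elasticquota team-a -n default koord-queue/queue-policy=Block --overwrite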
Submit Jobs and Verify Queuing
Koord-Queue's Job Extensions automatically create QueueUnit resources for submitted jobs. To submit a Kubernetes Job managed by Koord-Queue, set spec.suspend: true and add the quota label. Save the following two-document YAML as jobs.yaml and apply it at once:
apiVersion: batch/v1
kind: Job
metadata:
name: my-job
namespace: default
labels:
quota.scheduling.koordinator.sh/name: team-a
spec:
suspend: true
template:
spec:
containers:
- name: test
image: busybox:stable
command: ["/bin/sh", "-c", "sleep 30"]
resources:
requests:
cpu: "4"
memory: 8Gi
limits:
cpu: "4"
memory: 8Gi
restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
name: my-job-blocked
namespace: default
labels:
quota.scheduling.koordinator.sh/name: team-a
spec:
suspend: true
template:
spec:
containers:
- name: test
image: busybox:stable
command: ["/bin/sh", "-c", "sleep 30"]
resources:
requests:
cpu: "4"
memory: 8Gi
limits:
cpu: "4"
memory: 8Gi
restartPolicy: Never
$ kubectl apply -f jobs.yaml
The team-a quota has max.cpu: "4" and max.memory: "8Gi", which is exactly enough for one job. The ElasticQuotaV2 plugin tracks quota usage based on running Pod resource consumption. Once my-job's Pod is Running and consuming the full quota, my-job-blocked will be held in the queue:
# Wait for my-job's pod to reach Running state first
$ kubectl wait --for=condition=Ready pod -l job-name=my-job -n default --timeout=120s
$ kubectl get queueunit my-job-blocked -n default
NAME             PHASE      PRIORITY   ADMISSIONS   JOBTYPE
my-job-blocked   Enqueued                           Job
The QueueUnit stays in Enqueued phase because team-a has already reached its max quota. Once my-job completes and resources are released, my-job-blocked will be dequeued automatically.
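To observe the hand-off, wait for the first Job to complete and then re-check the blocked QueueUnit (timings may vary in your cluster):

# Wait for my-job to finish and release its quota
$ kubectl wait --for=condition=Complete job/my-job -n default --timeout=180s
# my-job-blocked should now be dequeued and its pod created
$ kubectl get queueunit my-job-blocked -n default
$ kubectl get pods -l job-name=my-job-blocked -n default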
For other job types (TFJob, PyTorchJob, etc.), use the scheduling.x-k8s.io/suspend: "true" annotation instead of spec.suspend.
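For example, a TFJob would carry the annotation like this (a sketch assuming the Kubeflow Training Operator's kubeflow.org/v1 API is installed; not a verified manifest):

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: my-tfjob
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: team-a
  annotations:
    scheduling.x-k8s.io/suspend: "true"  # annotation used instead of spec.suspend
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
          restartPolicy: Never
```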
Use Queue
Queue Spec
| Field | Type | Description |
|---|---|---|
| queuePolicy | string | Queuing policy: Priority, Block, or Intelligent. |
| priority | *int32 | Queue priority for multi-queue ordering. |
| priorityClassName | string | Kubernetes PriorityClass name. |
| admissionChecks | []AdmissionCheckWithSelector | List of admission checks required. |
Queue Priority
By default, Koord-Queue's Job Extensions automatically derive the QueueUnit priority from the job's pod template: it reads spec.template.spec.priorityClassName and spec.template.spec.priority. If a PriorityClass object is found, its .value is used as the QueueUnit priority; otherwise the raw integer in spec.template.spec.priority is used.
You can also manually patch a QueueUnit's spec.priority after it is created to override this default and influence dequeue ordering.
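For example, to raise the priority of an existing QueueUnit (using the my-job QueueUnit from the quick start as an illustration):

# Override the derived priority to influence dequeue ordering
$ kubectl patch queueunit my-job -n default --type=merge -p '{"spec":{"priority":200}}'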
Priority Queue
Jobs with higher priority values are dequeued first. Among jobs with the same priority, earlier-created jobs are dequeued first.
To set the priority of a QueueUnit:
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: QueueUnit
metadata:
name: high-priority-job
namespace: default
spec:
queue: priority-queue
priority: 200
consumerRef:
apiVersion: batch/v1
kind: Job
name: important-job
namespace: default
resource:
cpu: "2"
memory: 4Gi
Queue Policies
Koord-Queue supports three queue policies to control how jobs are dequeued and scheduled.
Priority Policy
Ordering: Queue units are ordered by priority value (descending), then by creation timestamp (ascending). Higher priority jobs are always dequeued first. Jobs with the same priority are processed in FIFO order.
Key Features:
- Jobs with higher spec.priority values are dequeued first
- When multiple jobs have the same priority, earlier-created jobs are scheduled first
- Failed jobs will be re-added to the queue and can be retried
- Supports preemption: lower-priority jobs can be preempted to make room for higher-priority jobs
Scheduling Behavior: Priority policy is not strict priority scheduling. When high-priority jobs are blocked (e.g., quota exhausted), the scheduler skips them and continues scanning. Lower-priority jobs that are schedulable can dequeue before blocked high-priority jobs. This improves throughput and prevents scheduler stall.
Key Difference from Block Policy:
- Priority: Optimistic scheduling - continues scheduling when quota is near limit, blocked jobs are skipped
- Block: Conservative scheduling - strictly blocks jobs when quota reaches limit
Use Cases:
- Multi-tenant environments with different priority levels
- Production jobs that should preempt development jobs
Configuration Example:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
name: priority-queue
labels:
koord-queue/queue-policy: Priority
spec:
max:
cpu: "10"
memory: 20Gi
Block Policy
Ordering: Same as Priority policy - queue units are ordered by priority (descending) then timestamp (ascending).
Key Features:
- Strict resource blocking: When quota reaches the limit, subsequent jobs using that quota are blocked
- Unlike Priority policy (which skips blocked high-priority jobs and allows lower-priority schedulable jobs to dequeue first), Block policy strictly enforces priority order
- Prevents resource over-allocation
- Blocked queue units are skipped during scheduling until resources become available
Key Differences:
- Priority policy: Not strict priority scheduling - allows lower-priority jobs to dequeue before blocked high-priority jobs
- Block policy: Strict priority scheduling - blocked high-priority jobs must wait, preventing lower-priority jobs from bypassing them
Use Cases:
- Resource-constrained environments
- Production workloads requiring guaranteed resource availability
- Multi-tenant isolation where resource limits must be strictly enforced
Configuration Example:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
name: block-queue
labels:
koord-queue/queue-policy: Block
spec:
max:
cpu: "10"
memory: 20Gi
Intelligent Policy
Ordering: Uses a dual-queue mechanism with a configurable priority threshold (default: 4):
- High-priority queue: jobs with priority >= threshold
  - Ordered by priority (descending), then timestamp (ascending)
  - Retry behavior: on failure, retries the same job (FIFO mode)
- Low-priority queue: jobs with priority < threshold
  - Ordered by priority (descending), then timestamp (ascending)
  - Retry behavior: on failure, moves to the next job (Round-Robin mode)
Key Features:
- Prioritizes high-priority jobs: high-priority queue is always checked first
- Retry guarantee for high-priority jobs: Failed high-priority jobs are immediately retried
- Round-robin for low-priority jobs: Failed low-priority jobs yield to the next job
- Threshold configurable via the koord-queue/priority-threshold annotation
Use Cases:
- Mixed workloads with both critical and batch jobs
- Environments requiring both priority enforcement and fair scheduling
Configuration Example:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
name: intelligent-queue
labels:
koord-queue/queue-policy: Intelligent
annotations:
koord-queue/priority-threshold: "5"
spec:
max:
cpu: "10"
memory: 20Gi
Policy Comparison
| Feature | Priority | Block | Intelligent |
|---|---|---|---|
| Ordering | Priority + Timestamp | Priority + Timestamp | Dual-queue: High (FIFO) + Low (Round-Robin) |
| Retry Behavior | Retry failed job | Retry failed job | High: retry same job; Low: move to next job |
| Resource Blocking | Optimistic | Strict/Conservative | Balanced |
| Preemption Support | Yes | No | Yes (for high-priority jobs) |
| Use Cases | Priority scheduling | Strict resource isolation | Mixed critical + batch workloads |
Configuring Queue Policy
Queue policy can be set in two ways:
- Via ElasticQuota label (recommended):
metadata:
labels:
koord-queue/queue-policy: Priority # Options: Priority, Block, Intelligent
- Via Queue CR (advanced configuration):
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: my-queue
  namespace: koord-queue
  annotations:
    koord-queue/priority-threshold: "5"
spec:
  queuePolicy: Intelligent
  priority: 1000
Advanced Tuning Annotations:
- koord-queue/priority-threshold: Set the threshold for the Intelligent policy (default: 4)
- koord-queue/max-depth: Limit the max number of jobs considered during scheduling
- koord-queue/wait-for-pods-running: Wait for pods to enter the Running state before dequeuing the next job
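Applied to an ElasticQuota, these tuning annotations look like this (the annotation values are illustrative assumptions):

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: tuned-queue
  labels:
    koord-queue/queue-policy: Intelligent
  annotations:
    koord-queue/priority-threshold: "5"       # Intelligent policy threshold (default: 4)
    koord-queue/max-depth: "50"               # consider at most 50 jobs per scheduling pass
    koord-queue/wait-for-pods-running: "true" # wait for Running pods before the next dequeue
spec:
  max:
    cpu: "10"
    memory: 20Gi
```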
Use QueueUnit
QueueUnit Spec
| Field | Type | Description |
|---|---|---|
| consumerRef | ObjectReference | Reference to the original job CR. |
| priority | *int32 | Priority within the queue. |
| queue | string | Name of the target queue. |
| resource | ResourceList | Total resource requirements. |
| podSet | []PodSet | Pod group definitions (max 8). |
| priorityClassName | string | Kubernetes PriorityClass name. |
| request | ResourceList | Actual resource requests parsed from the job. |
QueueUnit Status
| Field | Type | Description |
|---|---|---|
| phase | QueueUnitPhase | Current lifecycle phase. |
| attempts | int64 | Number of scheduling attempts. |
| message | string | Human-readable status message. |
| lastUpdateTime | Time | Last status update timestamp. |
| admissionChecks | []AdmissionCheckState | Status of each admission check. |
| podState | PodState | Running/Pending pod counts. |
| admissions | []Admission | Resource allocation and state per PodSet admission. |
Use AdmissionCheck
Admission Checks (Work in Progress)
Note: The Admission Check controller is not yet included in this release. This section describes the planned API for future use.
Queues can require admission checks that must pass before a QueueUnit is released. This is useful for integrating with external resource provisioning systems.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
name: checked-queue
namespace: koord-queue
spec:
queuePolicy: Priority
admissionChecks:
- name: prov-req-check
labelSelector:
matchLabels:
requires-provisioning: "true"
When a QueueUnit is reserved, the admission check controller processes each configured check. The QueueUnit transitions to Dequeued only when all checks report Ready status.
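Because the framework is stated to be compatible with Kueue's AdmissionCheck API, the referenced check would be defined by a separate AdmissionCheck resource. The following is a sketch under the assumption that the Kueue-style kueue.x-k8s.io/v1beta1 schema applies; the final Koord-Queue API may differ:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: prov-req-check
spec:
  # controllerName identifies the external controller that reconciles this check
  controllerName: kueue.x-k8s.io/provisioning-request
```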
Observability
Monitoring
Koord-Queue exposes Prometheus metrics for monitoring:
# Port-forward to the controller
$ kubectl port-forward -n koord-queue svc/koord-queue 10259:10259
# Fetch metrics
$ curl http://localhost:10259/metrics
If the visibility server is enabled (enableVisibilityServer: true), you can query queue status via REST API:
$ curl http://koord-queue-visibility:8090/api/queues
Debugging
Check the controller logs for scheduling decisions:
$ kubectl logs -n koord-queue deployment/koord-queue-controllers -f --tail=100
Inspect QueueUnit status for scheduling details:
$ kubectl describe queueunit <name> -n <namespace>
Check Kubernetes events for scheduling-related messages:
$ kubectl get events -n <namespace> --field-selector reason=Scheduling