Version: v1.9 🚧

Koord-Queue

Introduction

Koord-Queue is a native Kubernetes job queuing system designed for the Koordinator ecosystem. It manages job admission and ordering across multiple queues, integrating deeply with Koordinator's ElasticQuota for resource fairness and multi-tenant isolation. Key capabilities include:

  • Multi-queue management with Priority, Block, and Intelligent queuing policies.
  • Deep ElasticQuota integration to avoid duplicate quota configurations and enable elastic resource sharing.
  • Pre-scheduling to reduce scheduler pressure by queuing jobs before they create pods.
  • Multi-framework support including TFJob, PyTorchJob, Spark, Argo Workflow, Ray, and native Kubernetes Jobs.
  • Admission check framework compatible with Kueue's AdmissionCheck API.

Setup

Prerequisite

  • Kubernetes >= 1.22
  • Koordinator >= 1.5 (for ElasticQuota integration)

Installation

Install Koord-Queue using Helm:

# Option 1: Install from Helm repository
helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \
  --namespace koord-queue \
  --create-namespace


Verify the installation:

$ kubectl get deployment -n koord-queue
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
koord-queue-controllers   1/1     1            1           30s
koord-queue               1/1     1            1           30s

$ kubectl get crd | grep scheduling.x-k8s.io
queues.scheduling.x-k8s.io       2024-01-01T00:00:00Z
queueunits.scheduling.x-k8s.io   2024-01-01T00:00:00Z

Configurations

Koord-Queue uses ElasticQuotaV2 mode by default.

Default Configuration

# Image registry (default: Aliyun Beijing)
global:
  imagePrefix: registry.cn-beijing.aliyuncs.com

controller:
  image:
    repository: koordinator-sh/koord-queue
    tag: v1.8.0

extension:
  koord-queue-controllers:
    repository: koordinator-sh/koord-queue-controllers
    tag: v1.8.0
  batchjob:
    enable: true # Native Kubernetes Job support
  tf:
    enable: false
  pytorch:
    enable: false
  argo:
    enable: false
  spark:
    enable: false
  ray:
    enable: false
  mpi:
    enable: false

pluginConfigs:
  apiVersion: scheduling.k8s.io/v1
  kind: KoordQueueConfiguration
  plugins:
  - name: Priority
  - name: ElasticQuotaV2

ElasticQuotaV2 Mode (Default)

Uses individual ElasticQuota CRs (scheduling.sigs.k8s.io/v1alpha1). This is the recommended mode for Koordinator users.

Use Koord-Queue

Quick Start with ElasticQuota

This example uses Koordinator's ElasticQuota for elastic resource management.

Create an ElasticQuota

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: team-a
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/parent: ""
    quota.scheduling.koordinator.sh/is-parent: "false"
spec:
  max:
    cpu: "4"
    memory: 8Gi
  min:
    cpu: "4"
    memory: 8Gi

$ kubectl apply -f elastic-quota.yaml

Queue Auto-creation

When using ElasticQuotaV2, the plugin automatically creates a Queue CR in the koord-queue namespace for each ElasticQuota resource. The auto-created Queue has the same name as the ElasticQuota (e.g., team-a), with a default priority: 1000 and queuePolicy: Priority. You do not need to manually create a Queue for each ElasticQuota.
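As an illustration, the auto-created Queue for team-a would look roughly like the sketch below. This is an assumption based on the defaults described above; the exact fields generated by the controller may differ:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: team-a            # same name as the ElasticQuota
  namespace: koord-queue  # always created in the koord-queue namespace
spec:
  queuePolicy: Priority   # default policy
  priority: 1000          # default queue priority
```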

If you want to customize the Queue policy, you can set the koord-queue/queue-policy label on the ElasticQuota:

metadata:
  labels:
    koord-queue/queue-policy: Priority # Options: Priority, Block, Intelligent

Submit Jobs and verify queuing

Koord-Queue's Job Extensions automatically create QueueUnit resources for submitted jobs. To submit a Kubernetes Job managed by Koord-Queue, set spec.suspend: true and add the quota label. Save the following two-document YAML as jobs.yaml and apply it at once:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: team-a
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: test
        image: busybox:stable
        command: ["/bin/sh", "-c", "sleep 30"]
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
          limits:
            cpu: "4"
            memory: 8Gi
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-blocked
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: team-a
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: test
        image: busybox:stable
        command: ["/bin/sh", "-c", "sleep 30"]
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
          limits:
            cpu: "4"
            memory: 8Gi
      restartPolicy: Never

$ kubectl apply -f jobs.yaml

The team-a quota has max.cpu: "4" and max.memory: "8Gi", which is exactly enough for one job. The ElasticQuotaV2 plugin tracks quota usage based on running Pod resource consumption. Once my-job's Pod is Running and consuming the full quota, my-job-blocked will be held in the queue:

# Wait for my-job's pod to reach Running state first
$ kubectl wait --for=condition=Ready pod -l job-name=my-job -n default --timeout=120s

$ kubectl get queueunit my-job-blocked -n default
NAME             PHASE   PRIORITY   ADMISSIONS   JOBTYPE
my-job-blocked                                   Job

The QueueUnit stays in Enqueued phase because team-a has already reached its max quota. Once my-job completes and resources are released, my-job-blocked will be dequeued automatically.
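The admission decision described above can be sketched as a simple quota check. This is a simplified model for illustration, not Koord-Queue's actual implementation:

```python
def can_admit(used: dict, request: dict, quota_max: dict) -> bool:
    """Return True if admitting `request` keeps every resource within max."""
    return all(used.get(r, 0) + request.get(r, 0) <= quota_max[r]
               for r in quota_max)

team_a_max = {"cpu": 4, "memory_gi": 8}
job_request = {"cpu": 4, "memory_gi": 8}

# While my-job is Running and consuming the full quota,
# my-job-blocked does not fit and stays Enqueued:
print(can_admit({"cpu": 4, "memory_gi": 8}, job_request, team_a_max))  # False

# Once my-job completes and its resources are released,
# my-job-blocked fits and is dequeued:
print(can_admit({"cpu": 0, "memory_gi": 0}, job_request, team_a_max))  # True
```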

For other job types (TFJob, PyTorchJob, etc.), use the scheduling.x-k8s.io/suspend: "true" annotation instead of spec.suspend.
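For example, a TFJob could carry the annotation like this. This is a sketch that assumes the Kubeflow kubeflow.org/v1 TFJob API; the replica spec is a minimal placeholder:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: my-tfjob
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: team-a
  annotations:
    scheduling.x-k8s.io/suspend: "true" # annotation instead of spec.suspend
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: busybox:stable
            command: ["/bin/sh", "-c", "sleep 30"]
          restartPolicy: Never
```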

Use Queue

Queue Spec

| Field | Type | Description |
| --- | --- | --- |
| queuePolicy | string | Queuing policy: Priority, Block, or Intelligent. |
| priority | *int32 | Queue priority for multi-queue ordering. |
| priorityClassName | string | Kubernetes PriorityClass name. |
| admissionChecks | []AdmissionCheckWithSelector | List of admission checks required. |

Queue Priority

By default, Koord-Queue's Job Extensions automatically derive the QueueUnit priority from the job's pod template: it reads spec.template.spec.priorityClassName and spec.template.spec.priority. If a PriorityClass object is found, its .value is used as the QueueUnit priority; otherwise the raw integer in spec.template.spec.priority is used.

You can also manually patch a QueueUnit's spec.priority after it is created to override this default and influence dequeue ordering.

Priority Queue

Jobs with higher priority values are dequeued first. Among jobs with the same priority, earlier-created jobs are dequeued first.

To set the priority of a QueueUnit:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: QueueUnit
metadata:
  name: high-priority-job
  namespace: default
spec:
  queue: priority-queue
  priority: 200
  consumerRef:
    apiVersion: batch/v1
    kind: Job
    name: important-job
    namespace: default
  resource:
    cpu: "2"
    memory: 4Gi
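The dequeue ordering described above can be sketched as a sort key: higher priority first, then earlier creation time. A simplified model for illustration:

```python
from datetime import datetime, timedelta

def dequeue_order(units):
    """Order queue units by priority (descending), then creation time (ascending)."""
    return sorted(units, key=lambda u: (-u["priority"], u["created"]))

t0 = datetime(2024, 1, 1)
units = [
    {"name": "low-old",  "priority": 100, "created": t0},
    {"name": "high-new", "priority": 200, "created": t0 + timedelta(minutes=5)},
    {"name": "high-old", "priority": 200, "created": t0},
]
print([u["name"] for u in dequeue_order(units)])
# ['high-old', 'high-new', 'low-old']
```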

Queue Policies

Koord-Queue supports three queue policies to control how jobs are dequeued and scheduled.

Priority Policy

Ordering: Queue units are ordered by priority value (descending), then by creation timestamp (ascending). Higher priority jobs are always dequeued first. Jobs with the same priority are processed in FIFO order.

Key Features:

  • Jobs with higher spec.priority values are dequeued first
  • When multiple jobs have the same priority, earlier-created jobs are scheduled first
  • Failed jobs will be re-added to the queue and can be retried
  • Supports preemption: lower-priority jobs can be preempted to make room for higher-priority jobs

Scheduling Behavior: Priority policy is not strict priority scheduling. When high-priority jobs are blocked (e.g., quota exhausted), the scheduler skips them and continues scanning. Lower-priority jobs that are schedulable can dequeue before blocked high-priority jobs. This improves throughput and prevents scheduler stall.

Key Difference from Block Policy:

  • Priority: Optimistic scheduling - continues scheduling when quota is near limit, blocked jobs are skipped
  • Block: Conservative scheduling - strictly blocks jobs when quota reaches limit
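The two scan strategies can be sketched over the same ordered queue. This is a simplified model for illustration; `fits` stands in for whatever quota check the plugin performs:

```python
def scan_priority(queue, fits):
    """Optimistic (Priority): skip blocked units and keep scanning."""
    return [u for u in queue if fits(u)]

def scan_block(queue, fits):
    """Conservative (Block): stop at the first unit that does not fit."""
    admitted = []
    for u in queue:
        if not fits(u):
            break  # a blocked head holds back everything behind it
        admitted.append(u)
    return admitted

# Ordered queue: a large high-priority job first, smaller jobs behind it.
queue = [{"name": "big", "cpu": 8},
         {"name": "small-1", "cpu": 2},
         {"name": "small-2", "cpu": 2}]
fits = lambda u: u["cpu"] <= 4  # 4 CPUs of quota remaining

print([u["name"] for u in scan_priority(queue, fits)])  # ['small-1', 'small-2']
print([u["name"] for u in scan_block(queue, fits)])     # []
```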

Use Cases:

  • Multi-tenant environments with different priority levels
  • Production jobs that should preempt development jobs

Configuration Example:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: priority-queue
  labels:
    koord-queue/queue-policy: Priority
spec:
  max:
    cpu: "10"
    memory: 20Gi

Block Policy

Ordering: Same as Priority policy - queue units are ordered by priority (descending) then timestamp (ascending).

Key Features:

  • Strict resource blocking: When quota reaches the limit, subsequent jobs using that quota are blocked
  • Unlike Priority policy (which skips blocked high-priority jobs and allows lower-priority schedulable jobs to dequeue first), Block policy strictly enforces priority order
  • Prevents resource over-allocation
  • Blocked queue units are skipped during scheduling until resources become available

Key Differences:

  • Priority policy: Not strict priority scheduling - allows lower-priority jobs to dequeue before blocked high-priority jobs
  • Block policy: Strict priority scheduling - blocked high-priority jobs must wait, preventing lower-priority jobs from bypassing them

Use Cases:

  • Resource-constrained environments
  • Production workloads requiring guaranteed resource availability
  • Multi-tenant isolation where resource limits must be strictly enforced

Configuration Example:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: block-queue
  labels:
    koord-queue/queue-policy: Block
spec:
  max:
    cpu: "10"
    memory: 20Gi

Intelligent Policy

Ordering: Uses a dual-queue mechanism with configurable priority threshold (default: 4):

  • High-priority queue: Jobs with priority >= threshold

    • Ordered by priority (descending) then timestamp (ascending)
    • Retry behavior: On failure, retries the same job (FIFO mode)
  • Low-priority queue: Jobs with priority < threshold

    • Ordered by priority (descending) then timestamp (ascending)
    • Retry behavior: On failure, moves to the next job (Round-Robin mode)
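The dual-queue routing can be sketched as splitting units at the threshold and ordering each queue independently. A simplified model for illustration:

```python
def route(units, threshold=4):
    """Split units into high/low queues by priority threshold (default 4)."""
    order = lambda u: (-u["priority"], u["created"])
    high = sorted((u for u in units if u["priority"] >= threshold), key=order)
    low = sorted((u for u in units if u["priority"] < threshold), key=order)
    return high, low

units = [
    {"name": "critical", "priority": 9, "created": 2},
    {"name": "batch-a",  "priority": 2, "created": 1},
    {"name": "batch-b",  "priority": 2, "created": 3},
    {"name": "urgent",   "priority": 9, "created": 1},
]
high, low = route(units)
print([u["name"] for u in high])  # ['urgent', 'critical'] -- retried FIFO on failure
print([u["name"] for u in low])   # ['batch-a', 'batch-b'] -- rotated round-robin on failure
```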

Key Features:

  • Prioritizes high-priority jobs: high-priority queue is always checked first
  • Retry guarantee for high-priority jobs: Failed high-priority jobs are immediately retried
  • Round-robin for low-priority jobs: Failed low-priority jobs yield to the next job
  • Threshold configurable via annotation: koord-queue/priority-threshold

Use Cases:

  • Mixed workloads with both critical and batch jobs
  • Environments requiring both priority enforcement and fair scheduling

Configuration Example:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: intelligent-queue
  labels:
    koord-queue/queue-policy: Intelligent
  annotations:
    koord-queue/priority-threshold: "5"
spec:
  max:
    cpu: "10"
    memory: 20Gi

Policy Comparison

| Feature | Priority | Block | Intelligent |
| --- | --- | --- | --- |
| Ordering | Priority + Timestamp | Priority + Timestamp | Dual-queue: High (FIFO) + Low (Round-Robin) |
| Retry Behavior | Retry failed job | Retry failed job | High: retry same job; Low: move to next job |
| Resource Blocking | Optimistic | Strict/Conservative | Balanced |
| Preemption Support | Yes | No | Yes (for high-priority jobs) |
| Use Cases | Priority scheduling | Strict resource isolation | Mixed critical + batch workloads |

Configuring Queue Policy

Queue policy can be set in two ways:

  1. Via ElasticQuota label (recommended):

metadata:
  labels:
    koord-queue/queue-policy: Priority # Options: Priority, Block, Intelligent

  2. Via Queue CR (advanced configuration):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: my-queue
  namespace: koord-queue
  annotations:
    koord-queue/priority-threshold: "5"
spec:
  queuePolicy: Intelligent
  priority: 1000

Advanced Tuning Annotations:

  • koord-queue/priority-threshold: Set threshold for Intelligent policy (default: 4)
  • koord-queue/max-depth: Limit max number of jobs considered during scheduling
  • koord-queue/wait-for-pods-running: Wait for pods to enter Running state before dequeuing next job
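As an illustration, these tuning annotations could be combined on a single ElasticQuota. The values below are examples, and the value formats for max-depth and wait-for-pods-running are assumptions:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: tuned-queue
  labels:
    koord-queue/queue-policy: Intelligent
  annotations:
    koord-queue/priority-threshold: "5"
    koord-queue/max-depth: "100"              # assumed format: integer as string
    koord-queue/wait-for-pods-running: "true" # assumed format: boolean as string
spec:
  max:
    cpu: "10"
    memory: 20Gi
```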

Use QueueUnit

QueueUnit Spec

| Field | Type | Description |
| --- | --- | --- |
| consumerRef | ObjectReference | Reference to the original job CR. |
| priority | *int32 | Priority within the queue. |
| queue | string | Name of the target queue. |
| resource | ResourceList | Total resource requirements. |
| podSet | []PodSet | Pod group definitions (max 8). |
| priorityClassName | string | Kubernetes PriorityClass name. |
| request | ResourceList | Actual resource requests parsed from the job. |

QueueUnit Status

| Field | Type | Description |
| --- | --- | --- |
| phase | QueueUnitPhase | Current lifecycle phase. |
| attempts | int64 | Number of scheduling attempts. |
| message | string | Human-readable status message. |
| lastUpdateTime | Time | Last status update timestamp. |
| admissionChecks | []AdmissionCheckState | Status of each admission check. |
| podState | PodState | Running/Pending pod counts. |
| admissions | []Admission | Resource allocation and state per PodSet admission. |

Use AdmissionCheck

Admission Checks (Work In Progress)

Note: The Admission Check controller is not yet included in this release. This section describes the planned API for future use.

Queues can require admission checks that must pass before a QueueUnit is released. This is useful for integrating with external resource provisioning systems.

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: Queue
metadata:
  name: checked-queue
  namespace: koord-queue
spec:
  queuePolicy: Priority
  admissionChecks:
  - name: prov-req-check
    labelSelector:
      matchLabels:
        requires-provisioning: "true"

When a QueueUnit is reserved, the admission check controller processes each configured check. The QueueUnit transitions to Dequeued only when all checks report Ready status.

Observability

Monitoring

Koord-Queue exposes Prometheus metrics for monitoring:

# Port-forward to the controller
$ kubectl port-forward -n koord-queue svc/koord-queue 10259:10259

# Fetch metrics
$ curl http://localhost:10259/metrics

If the visibility server is enabled (enableVisibilityServer: true), you can query queue status via REST API:

$ curl http://koord-queue-visibility:8090/api/queues

Debugging

Check the controller logs for scheduling decisions:

$ kubectl logs -n koord-queue deployment/koord-queue-controllers -f --tail=100

Inspect QueueUnit status for scheduling details:

$ kubectl describe queueunit <name> -n <namespace>

Check Kubernetes events for scheduling-related messages:

$ kubectl get events -n <namespace> --field-selector reason=Scheduling