
· 11 min read
Jianyu Wang
KunWuLuan
Rougang Han
Ziqiu Zhu
Zhe Zhu
Tao Song

Background

AI and batch workloads continue to drive the evolution of Kubernetes scheduling. As clusters grow larger and workloads become more diverse, users demand richer queueing semantics, more accurate resource reservation, deeper observability, and unified management across increasingly heterogeneous hardware.

Since its open-source release in April 2022, Koordinator has delivered 16 major versions, providing an end-to-end solution for workload orchestration, co-location, fine-grained scheduling, isolation, and performance optimization. We sincerely thank engineers from Alibaba, Ant Group, Intel, XiaoHongShu, Xiaomi, iQIYI, 360, YouZan, PITS Global Data Recovery Services, Quwan, meiyapico, dewu, Asiainfo, CaoCao Mobility, i-Tudou, NVIDIA, NIO, Mammotion, Zhongrui Group, Heshan Dehao, and many other organizations for their continuous contributions.

Today, we are excited to announce the release of Koordinator v1.8.0. This release introduces Koord-Queue, a native Kubernetes job queueing system built for the Koordinator ecosystem; enhances Resource Reservation with Pre-Allocation (cluster mode and multiple pre-allocated pods); adds the Scheduling Hint internal protocol to enable cooperative scheduling decisions; expands heterogeneous device support to MetaX GPU/sGPU and Huawei Ascend 300I Duo; ships new Grafana Dashboards for scheduler and descheduler; and upgrades the platform baseline to Kubernetes 1.35.

Key Features

1. Koord-Queue: Native Job-Level Queueing for Kubernetes

Multi-tenant AI/ML and batch clusters require job-level queueing, admission control, and resource fairness on top of Pod-level scheduling. Koordinator v1.8.0 introduces Koord-Queue, a new component purpose-built for these scenarios.

Koord-Queue Architecture

Koord-Queue provides:

  • Job-level queueing: Manages queue units representing whole jobs (TFJob, PyTorchJob, MPIJob, Spark, Argo Workflow, Ray, native Kubernetes Jobs) rather than individual pods.
  • Deep ElasticQuota integration: Integrates with Koordinator's ElasticQuota CRD (scheduling.sigs.k8s.io/v1alpha1) for elastic borrowing, min/max guarantees, and hierarchical fair-sharing.
  • Pluggable queueing policies: Supports both Priority (priority + creation-time ordering) and Block (strict blocking) policies per queue.
  • Pre-scheduling admission: Reduces scheduler pressure by gating jobs before they hit the Pod scheduler, through a plugin framework with MultiQueueSort, QueueSort, QueueUnitMapping, Filter, and Reserve extension points.
  • Admission check framework (WIP): API-compatible with Kueue's AdmissionCheck, enabling custom gates such as quota validation, capacity checks, or external approvals.

Example Queue with ElasticQuota integration:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Queue
metadata:
  name: team-a-queue
  namespace: koord-queue
spec:
  queuePolicy: Priority
  priority: 100
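The Priority policy is described above as "priority + creation-time ordering". A minimal sketch of that ordering follows, assuming that a higher priority goes first and that ties go to the older job. The `QueueUnit` class and the job names are hypothetical stand-ins for illustration, not Koord-Queue types.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueueUnit:
    """Hypothetical stand-in for a queued job; not the real Koord-Queue type."""
    name: str
    priority: int
    created: datetime


def priority_order(units):
    """Higher priority first; ties are broken by earlier creation time (assumption)."""
    return sorted(units, key=lambda u: (-u.priority, u.created))


def ts(minute):
    """Helper that builds a fixed UTC timestamp for the example."""
    return datetime(2025, 1, 1, 0, minute, tzinfo=timezone.utc)


units = [
    QueueUnit("nightly-etl", priority=10, created=ts(0)),
    QueueUnit("llm-finetune", priority=100, created=ts(5)),
    QueueUnit("eval-sweep", priority=100, created=ts(2)),
]
ordered = [u.name for u in priority_order(units)]
print(ordered)
```

Here the two priority-100 jobs come before the priority-10 job, and the earlier of the two is dequeued first.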

Koord-Queue is deployed separately via Helm:

helm install koord-queue koordinator-sh/koord-queue --version 1.8.0 \
  --namespace koord-queue

For details, please see the Koord-Queue design document and the Queue Management user manual.

2. Resource Reservation: Pre-Allocation with Cluster Mode and Multiple Pods

In v1.8.0, Koordinator's Reservation CRD is extended with Pre-Allocation, enabling users to pre-allocate node resources for future demands even when the resources are not currently allocatable. This is particularly useful for inference orchestration, rolling upgrades, and priority-based capacity planning.

Key enhancements:

  • Cluster pre-allocation mode via preAllocationPolicy.mode: Cluster: instead of matching pre-allocatable pods through the Reservation's Owner matchers (the Default mode), the scheduler identifies candidate pods by cluster-wide label/annotation selectors. Pods marked with the label pod.koordinator.sh/is-pre-allocatable: "true" become pre-allocatable, which is especially useful in multi-tenant clusters where pre-allocatable pods may belong to different owners and should be managed centrally.
  • Multiple pre-allocated pods via preAllocationPolicy.enableMultiple: true: when disabled, only a single pod can be pre-allocated against the Reservation; when enabled, multiple pods can jointly satisfy the reservation's resource requirements — useful when no single pod can consume all the reserved resources due to resource fragmentation.
  • Pre-allocation priority through the pod.koordinator.sh/pre-allocatable-priority annotation (numeric string, higher = higher priority), giving fine-grained control over which candidate pods are picked first.
  • Integration with NodeNUMAResource and DeviceShare, so pre-allocation reserves CPUs, NUMA nodes, and GPU devices in a consistent way with regular scheduling.

Example snippet enabling Cluster mode and multi-pod pre-allocation:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: pre-alloc-cluster
spec:
  preAllocation: true
  preAllocationPolicy:
    mode: Cluster
    enableMultiple: true
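To make the interplay of the label, the priority annotation, and `enableMultiple` more concrete, here is a rough sketch of how candidate pods might be picked. It is not the scheduler's actual algorithm. The pod dictionaries, the CPU-only accounting, and treating a missing priority annotation as 0 are all assumptions made for illustration.

```python
PRE_ALLOCATABLE_LABEL = "pod.koordinator.sh/is-pre-allocatable"
PRIORITY_ANNOTATION = "pod.koordinator.sh/pre-allocatable-priority"


def pick_pre_allocated(pods, required_cpu, enable_multiple):
    """Pick candidate pods whose CPU covers required_cpu.

    pods: list of dicts with 'name', 'labels', 'annotations', 'cpu' (hypothetical shape).
    Returns the chosen pod names, or [] if the requirement cannot be covered.
    """
    candidates = [p for p in pods if p["labels"].get(PRE_ALLOCATABLE_LABEL) == "true"]
    # Higher annotation value is picked first; a missing annotation counts as 0 (assumption).
    candidates.sort(key=lambda p: int(p["annotations"].get(PRIORITY_ANNOTATION, "0")),
                    reverse=True)

    if not enable_multiple:
        # Single-pod mode: one pod alone must cover the whole requirement.
        for p in candidates:
            if p["cpu"] >= required_cpu:
                return [p["name"]]
        return []

    # Multi-pod mode: accumulate pods in priority order until the requirement is covered.
    chosen, covered = [], 0
    for p in candidates:
        if covered >= required_cpu:
            break
        chosen.append(p["name"])
        covered += p["cpu"]
    return chosen if covered >= required_cpu else []


pods = [
    {"name": "a", "labels": {PRE_ALLOCATABLE_LABEL: "true"},
     "annotations": {PRIORITY_ANNOTATION: "10"}, "cpu": 2},
    {"name": "b", "labels": {PRE_ALLOCATABLE_LABEL: "true"},
     "annotations": {PRIORITY_ANNOTATION: "50"}, "cpu": 3},
    {"name": "c", "labels": {}, "annotations": {}, "cpu": 8},
    {"name": "d", "labels": {PRE_ALLOCATABLE_LABEL: "true"}, "annotations": {}, "cpu": 4},
]
single = pick_pre_allocated(pods, required_cpu=4, enable_multiple=False)
multiple = pick_pre_allocated(pods, required_cpu=4, enable_multiple=True)
print(single, multiple)
```

With `enable_multiple=False` only pod `d` is large enough on its own, while with `enable_multiple=True` the two highest-priority pods `b` and `a` jointly cover the request. Pod `c` is never considered because it lacks the label.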

For more information, please see Resource Reservation.

3. Scheduling Hint: Cooperative Scheduling Between Components

Koordinator v1.8.0 introduces Scheduling Hint, an internal protocol that allows scheduling-related components (for example, Koord-Queue or network-topology-aware pre-schedulers) to pass hints to koord-scheduler for more efficient decisions without overriding its authority.

The first supported hint is preferredNodeNames, a list of candidate nodes the scheduler tries first before falling back to normal scheduling:

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-hint
  annotations:
    internal.scheduling.koordinator.sh/scheduling-hint: '{"preferredNodeNames": ["node-1", "node-2"]}'
spec:
  schedulerName: koord-scheduler
  containers:
  - name: app
    image: nginx

Unlike status.nominatedNodeName, Scheduling Hint:

  • Accepts a list of nodes instead of a single node, providing natural fallback options.
  • Does not consume node capacity in the Assume phase, leaving other pods unaffected.
  • Falls back gracefully to normal scheduling when preferred nodes don't work.
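The "try preferred nodes first, then fall back" behavior can be sketched as follows. This is an illustration of the idea only: `choose_node` and its `feasible_nodes` argument are hypothetical, and ignoring a malformed hint is an assumption rather than documented koord-scheduler behavior.

```python
import json

HINT_ANNOTATION = "internal.scheduling.koordinator.sh/scheduling-hint"


def choose_node(annotations, feasible_nodes):
    """Return (node, used_hint).

    feasible_nodes is the ordered result of normal filtering and scoring
    (hypothetical input; the real scheduler works on its own data structures).
    """
    raw = annotations.get(HINT_ANNOTATION)
    preferred = []
    if raw:
        try:
            preferred = json.loads(raw).get("preferredNodeNames") or []
        except (ValueError, AttributeError):
            preferred = []  # malformed hint: ignore it rather than fail the pod (assumption)
    feasible = set(feasible_nodes)
    for name in preferred:
        if name in feasible:
            return name, True
    # No preferred node is usable: fall back to the normal scheduling result.
    return (feasible_nodes[0], False) if feasible_nodes else (None, False)


hint = {HINT_ANNOTATION: '{"preferredNodeNames": ["node-1", "node-2"]}'}
print(choose_node(hint, ["node-3", "node-2"]))
print(choose_node(hint, ["node-3"]))
```

In the first call `node-1` is not feasible, so the second entry in the list, `node-2`, is used. In the second call neither preferred node is feasible and the pod lands on `node-3` through the normal path.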

For more information, please see Scheduling Hint.

4. Expanded Heterogeneous Device Support: MetaX GPU/sGPU and Huawei Ascend 300I Duo

Building on the Ascend NPU and Cambricon MLU support added in v1.7.0, Koordinator v1.8.0 further extends the unified device scheduling framework:

  • MetaX GPU/sGPU support through koord-device-daemon and fine-grained device scheduling. MetaX full cards and sGPU virtual slices are reported as Device CRs with standard koordinator.sh/gpu-* resources, allowing scheduling policies (partition-aware, topology-aware, GPU-Share) to work consistently across vendors.
  • Huawei Ascend 300I Duo adaptation in the device-scheduling DP adapter, complementing the existing 910B family and providing inference-optimized scheduling for Ascend 300I Duo cards.
  • NVIDIA GPU health condition reporting, giving upstream systems richer signals about node-level GPU health.

Example Pod requesting a MetaX virtual GPU (sGPU) with a specified compute percentage, GPU memory, and QoS policy:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: demo-sleep
  name: test-metax-sgpu
  namespace: default
  annotations:
    metax-tech.com/sgpu-qos-policy: "fixed-share" # fixed-share/best-effort/burst-share
spec:
  containers:
  - command:
    - sleep
    - infinity
    image: ubuntu:22.04
    imagePullPolicy: IfNotPresent
    name: demo-sleep
    resources:
      limits:
        cpu: "32"
        memory: 64Gi
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        metax-tech.com/sgpu: "1"
      requests:
        cpu: "32"
        memory: 64Gi
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        metax-tech.com/sgpu: "1"

For more information, please see Device Scheduling – Metax GPU and Fine-Grained Device Scheduling.

5. Descheduling with Custom Priority

v1.8.0 introduces a new CustomPriority balance plugin in koord-descheduler that deschedules Pods according to a user-defined node priority order. Nodes are split into multiple priority tiers based on business semantics — for example pay-as-you-go vs. pay-by-year-or-month, shared vs. dedicated pools, or spot vs. on-demand instances. When a lower-priority tier has enough capacity to accommodate Pods running on a higher-priority tier, the descheduler proactively evicts those Pods so they can be rescheduled onto the cheaper (or more reclaimable) pool.

Typical use cases:

  • Cost optimization: migrate workloads from pay-as-you-go nodes onto pay-by-year-or-month nodes.
  • Resource consolidation: gradually shift load from one type of node to another so that the source nodes can be safely scaled down, maintained, or returned.
  • Tiered pools: enforce a strict ordering between multiple node pools and let workloads “sink” toward the lower tiers over time.

Each descheduling cycle the plugin executes the following steps:

  1. Group all nodes in the cluster according to the order defined in evictionOrder. A node is assigned to the first matching priority tier only.
  2. Starting from the highest-priority tier (the one listed first), use it as the source pool and treat all subsequent lower-priority tiers together as the target pool candidates.
  3. For every Pod on a source node, apply the namespace / podSelector / Evictor filters and sort the candidate Pods to be evicted by ascending CPU and Memory request.
  4. Run the eviction strategy according to mode: BestEffort (default — evict any individual Pod as soon as a single target node can accommodate it) or DrainNode (only evict Pods on a source node when all candidate Pods on that node can be placed onto target-pool nodes, optionally cordoning it via autoCordon).
  5. Actual Pod eviction is performed asynchronously by the Evictor, which honors all rate-limit and safety mechanisms.
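The steps above can be sketched roughly as follows. This is a simplified illustration, not the plugin's implementation: it tracks CPU only, uses first-fit placement, skips the namespace / podSelector / Evictor filters, and the `billing` labels and node dictionaries are hypothetical.

```python
def assign_tiers(nodes, eviction_order):
    """Map each node to the first tier whose selector matches its labels."""
    tiers = {name: [] for name, _ in eviction_order}
    for node in nodes:
        for name, selector in eviction_order:
            if all(node["labels"].get(k) == v for k, v in selector.items()):
                tiers[name].append(node)
                break  # first matching tier only
    return tiers


def place(pods, free):
    """First-fit pods (ascending by CPU request) onto target free capacity.

    Returns (placed_pod_names, remaining_free) without mutating the input.
    """
    free = dict(free)
    placed = []
    for pod in sorted(pods, key=lambda p: p["cpu"]):
        for target, cap in free.items():
            if cap >= pod["cpu"]:
                free[target] = cap - pod["cpu"]
                placed.append(pod["name"])
                break
    return placed, free


def plan_evictions(source_nodes, target_free, mode="BestEffort"):
    """Return the pod names to evict from the source tier."""
    evictions, free = [], dict(target_free)
    for node in source_nodes:
        placed, new_free = place(node["pods"], free)
        if mode == "DrainNode" and len(placed) != len(node["pods"]):
            continue  # node cannot be fully drained: leave it untouched
        evictions.extend(placed)
        free = new_free
    return evictions


# Hypothetical tiers: the tier listed first is the source pool.
eviction_order = [("pay-as-you-go", {"billing": "hourly"}),
                  ("subscription", {"billing": "yearly"})]
nodes = [
    {"name": "n1", "labels": {"billing": "hourly"}, "free": 0,
     "pods": [{"name": "a", "cpu": 2}, {"name": "b", "cpu": 6}]},
    {"name": "n2", "labels": {"billing": "yearly"}, "free": 4, "pods": []},
    {"name": "n3", "labels": {"billing": "yearly"}, "free": 3, "pods": []},
]
tiers = assign_tiers(nodes, eviction_order)
names = [name for name, _ in eviction_order]
source = tiers[names[0]]
target_free = {n["name"]: n["free"] for name in names[1:] for n in tiers[name]}

best_effort = plan_evictions(source, target_free, mode="BestEffort")
drain_node = plan_evictions(source, target_free, mode="DrainNode")
print(best_effort, drain_node)
```

In this example BestEffort evicts only pod `a`, because pod `b` fits on no single target node. DrainNode evicts nothing, since `n1` cannot be emptied completely.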

Custom Priority Descheduling

For more information, please see Descheduling with Custom Priority.

6. Observability: Grafana Dashboards for Scheduler and Descheduler

v1.8.0 ships a set of curated Grafana dashboards for koord-scheduler and koord-descheduler, covering scheduling throughput, queue latency, plugin latency, preemption activity, and descheduler evictions. Combined with the PodMonitor parameters introduced in the Helm chart, users can now enable production-grade monitoring with one Helm flag per component:

helm install koordinator koordinator-sh/koordinator --version 1.8.0 \
  --set scheduler.monitorEnabled=true \
  --set descheduler.monitorEnabled=true

Example dashboards:

Scheduler Basic Summary — queue growth, pending pods, scheduling latency, enqueue/dequeue QPS, and scheduler process resource usage, providing an at-a-glance view of koord-scheduler health and throughput.

Scheduler Basic Summary

Descheduler Eviction Overview — cumulative and real-time eviction counts, success/failure rates, and current eviction rate, giving a quick snapshot of koord-descheduler activity.

Descheduler Eviction Overview

For more information, please see Scheduling Monitoring and Descheduling Monitoring.

7. Platform and Compatibility

v1.8.0 brings a number of platform-wide improvements:

  • Upgrade to Kubernetes 1.35.2, including controller-gen v0.20.0 and k8s.io/utils/ptr migration. v1.8 formally supports Kubernetes 1.24, 1.28, and 1.35. Kubernetes 1.22 and 1.20 remain only partially supported: certain Koordinator components have moved to newer Kubernetes APIs that do not exist on those older clusters, so they no longer work there, while core co-location, QoS, and scheduling capabilities continue to function. For details, please see Kubernetes Compatibility.
  • Multi-scheduler / multi-profile hardening: extensive refinement of reservation, coscheduling, PreBind, PreBindReservation, ForgetPod, and framework-extender flows so that koord-scheduler runs reliably alongside other schedulers or in backup-scheduler setups.
  • Protobuf for native resources: kubeclients now uses protobuf for core resources, reducing API Server CPU footprint.
  • NRI upgrade to 0.11.0 and refined NRI server in koordlet.
  • Koordlet improvements: static reserved mode for mid resource, allocatable-based eviction, BE CPU-suppress fix when BE pods exist, container-level cfs_quota unbinding fix, cpuset share-pool metric, GPU init-failure handling when a GPU is lost.
  • Descheduler improvements: skip eviction gates support, anomaly condition fixes, nodePool inheritance of top-level defaults, raw-allocatable based thresholds in LowNodeLoad (see section 5 above for the new CustomPriority plugin).

Contributors

Koordinator is an open-source community. Thanks to all long-time maintainers and first-time contributors. We welcome more developers to join the Koordinator community.

New Contributors

  • @IULeen made their first contribution in #2595
  • @lujinda made their first contribution in #2679
  • @hunshcn made their first contribution in #2707
  • @summingyu made their first contribution in #2711
  • @ikukaku made their first contribution in #2684
  • @AutuSnow made their first contribution in #2767
  • @PixelPixel00 made their first contribution in #2802
  • @106umao made their first contribution in #2819
  • @manukasvi made their first contribution in #2815
  • @aviralgarg05 made their first contribution in #2838

Future Plan

Koordinator tracks its roadmap via GitHub Milestones. The following items are planned for the upcoming v1.8.1 patch and the longer-term aspirational-26 milestone.

Near-term (v1.8.1)

  • Scheduler – Inference Orchestration: Inference Orchestration Enhancement with Grove Integration (#2821).
  • Scheduler – Reservation: Support reservation scale update by spec (#2859).
  • Scheduler – Diagnosis & Audit: Diagnosis API (#2607); customizable preemption diagnosis (#2632); tooling for schedule diagnosis (#2669); optimize schedule audit with queue (#2676); optimize failedDetail/alreadyWaitForBound and add TTL for explanations (#2792); workload auditor (#2872); schedule suggestions on job/pod scheduling failure (#2873).
  • Scheduler – Platform: Refactor ForceSyncFromInformer to align with vanilla kube-scheduler behavior (#2875); honor -stderrthreshold when -logtostderr=true (#2874).
  • Descheduler: Lambda-G scoring function for resource imbalance detection and balanced rescheduling (#2837).
  • Koordlet: Add cpuset share-pool CPU info to metrics (#2800); resolve CPUBurst triggering cfsScaleDown on CgroupV2 nodes (#2801); Memory NUMA Topology Alignment (#2826).

Longer-term (aspirational-26, by the end of 2026)

  • Align with Kubernetes 1.35 capabilities (#2851 – umbrella): SchedulerQueueingHints (#2852), Non-blocking API Calls (#2853), Opportunistic Batching (#2854), Gang Scheduling enhancements (#2856), Asynchronous Preemption (#2857), and NominatedNodeName for expectation (#2858).
  • Dynamic Resource Allocation (DRA): End-to-end DRA support across koord-scheduler, koord-manager, koord-device-daemon, and koordlet (#2855).
  • Multi-scheduler architecture: Support shared states between multiple profiles in a single scheduler (#2749); provide documentation for multi-master scheduler deployment (#2758).
  • Queueing & Job scheduling: JobNomination mechanism (#2803); optimize Kueue AdmissionCheck with Koordinator Reservation (#2754); resource estimation strategy for DAG-type workflows (e.g., Argo) in koord-queue (#2786).
  • Rescheduling & Balancing: Rescheduling to address the imbalance of different resource types on a single node (#2332); shrink binpack strategy (#2790).
  • QoS & Koordlet: PSI-based QoS reconciler (#2463); pod-level CPU burst strategies for fine-grained control (#2557); memory NUMA topology alignment proposal (#2590); ensure NRI Hooks that Pods depend on work as expected (#2579); support evicting YARN containers (#2464).
  • Scheduling Diagnosis: Continue enhancing the scheduler's ability to investigate abnormal Pod scheduling (#2348).

We encourage user feedback on usage experiences and welcome more developers to participate in the Koordinator project, jointly driving its development!

Acknowledgement

Since the project was open-sourced, Koordinator has released more than 16 versions with 120+ contributors. The community continues to grow, and we thank all community members for their active participation and valuable feedback. We also thank the CNCF and related community members for their support.

We welcome more developers and end users to join us! Whether you are a beginner or an expert in Cloud Native communities, we look forward to hearing your voice!

For the full change log, please see v1.8.0 Release.

· 11 min read
Jianyu Wang
Rougang Han
Ziqiu Zhu
Zhe Zhu

Background

As artificial intelligence continues to evolve, the scale and complexity of AI model training are growing exponentially. Large language models (LLMs) and distributed AI training scenarios place unprecedented demands on cluster resource scheduling. Efficient inter-pod communication, intelligent resource preemption, and unified heterogeneous device management have become critical challenges that production environments must address.

Since its official open-source release in April 2022, Koordinator has iterated through 15 major versions, consistently delivering comprehensive solutions for workload orchestration, resource scheduling, isolation, and performance optimization. The Koordinator community is grateful for the contributions from outstanding engineers at Alibaba, Ant Technology Group, Intel, XiaoHongShu, Xiaomi, iQiyi, 360, YouZan, and other organizations, who have provided invaluable ideas, code, and real-world scenarios.

Today, we are excited to announce the release of Koordinator v1.7.0. This version introduces groundbreaking capabilities tailored for large-scale AI training scenarios, including Network-Topology Aware Scheduling and Job-Level Preemption. Additionally, v1.7.0 enhances heterogeneous device scheduling with support for Huawei Ascend NPU and Cambricon MLU, providing end-to-end device management solutions. The release also includes comprehensive API Reference Documentation and a complete Developer Guide to improve the developer experience.

In the v1.7.0 release, 14 new developers actively contributed to the Koordinator community: @ditingdapeng, @Rouzip, @ClanEver, @zheng-weihao, @cntigers, @LennonChin, @ZhuZhezz, @dabaooline, @bobsongplus, @yccharles, @qingyuanz, @yyrdl, @hwenwur, and @hkttty2009. We sincerely thank all community members for their active participation and ongoing support!

Key Features

1. Network-Topology Aware Scheduling: Accelerating Communication in Distributed AI Training

In large-scale AI training scenarios, especially for large language models (LLMs), efficient inter-pod communication is critical to training performance. Model parallelism techniques such as Tensor Parallelism (TP), Pipeline Parallelism (PP), and Data Parallelism (DP) require frequent and high-bandwidth data exchange across GPUs—often spanning multiple nodes. Under such workloads, network topology becomes a key performance bottleneck, where communication latency and bandwidth are heavily influenced by the physical network hierarchy (e.g., NVLink, block, spine).


To optimize training efficiency, Koordinator v1.7.0 provides Network-Topology Aware Scheduling capability, which ensures:

  • When cluster resources are sufficient, pods with network topology requirements will be scheduled to topology domains with better performance (e.g., lower latency, higher bandwidth) according to user-specified strategies.
  • When cluster resources are insufficient, the scheduler will preempt resources for the GangGroup based on network topology constraints through job-level preemption, and record the resource nominations in the .status.nominatedNodeName field to ensure consistent placement.

Cluster Network Topology Configuration

Administrators first label nodes with their network topology positions using tools like NVIDIA's topograph:

apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    network.topology.nvidia.com/block: b1
    network.topology.nvidia.com/spine: s1

Then define the topology hierarchy via a ClusterNetworkTopology CR:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  name: default
spec:
  networkTopologySpec:
  - labelKey:
    - network.topology.nvidia.com/spine
    topologyLayer: SpineLayer
  - labelKey:
    - network.topology.nvidia.com/block
    parentTopologyLayer: SpineLayer
    topologyLayer: BlockLayer
  - parentTopologyLayer: BlockLayer
    topologyLayer: NodeTopologyLayer

Configuring Topology-Aware Gang Scheduling

To leverage network topology awareness, create a PodGroup and annotate it with topology requirements:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-job
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {
            "layer": "BlockLayer",
            "strategy": "PreferGather"
          }
        ]
      }
spec:
  minMember: 8
  scheduleTimeoutSeconds: 300

When scheduling pods belonging to this PodGroup, the scheduler will attempt to place all member pods within the same BlockLayer topology domain to minimize inter-node communication latency.
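The PreferGather idea can be illustrated with a small sketch: try to fit the whole gang into a single block, and only span blocks when no single block is large enough. The tightest-fit tie-breaking, the per-node `free_slots` counts, and the function name are assumptions made for this example and do not reflect the scheduler's real scoring.

```python
from collections import defaultdict

BLOCK_LABEL = "network.topology.nvidia.com/block"


def prefer_gather(nodes, min_member):
    """nodes: list of (name, labels, free_slots). Returns one node name per pod, or None."""
    by_block = defaultdict(list)
    for name, labels, free in nodes:
        by_block[labels.get(BLOCK_LABEL, "")].append((name, free))

    def fill(candidates, need):
        """Greedily take free slots from candidates until `need` pods are placed."""
        picked = []
        for name, free in candidates:
            take = min(free, need - len(picked))
            picked.extend([name] * take)
        return picked if len(picked) == need else None

    # Try to fit the whole gang inside one block, tightest fit first (assumption).
    fitting = [b for b, ns in by_block.items() if sum(f for _, f in ns) >= min_member]
    if fitting:
        best = min(fitting, key=lambda b: (sum(f for _, f in by_block[b]), b))
        return fill(by_block[best], min_member)
    # "Prefer" rather than "must": fall back to spanning several blocks.
    return fill([(n, f) for n, _, f in nodes], min_member)


nodes = [
    ("node-0", {BLOCK_LABEL: "b1"}, 2),
    ("node-1", {BLOCK_LABEL: "b1"}, 2),
    ("node-2", {BLOCK_LABEL: "b2"}, 4),
    ("node-3", {BLOCK_LABEL: "b2"}, 4),
]
placement = prefer_gather(nodes, 8)
print(placement)
```

For `minMember: 8`, only block `b2` has enough free slots, so all eight pods land on `node-2` and `node-3`. A gang of ten would not fit in any single block and would spill across both.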

For more information about Network-Topology Aware Scheduling, please see Network Topology Aware Scheduling.

2. Job-Level Preemption: Ensuring All-or-Nothing Resource Acquisition

In large-scale cluster environments, high-priority jobs (e.g., critical AI training tasks) often need to preempt resources from lower-priority workloads when sufficient resources are not available. However, traditional pod-level preemption in Kubernetes cannot guarantee that all member pods of a distributed job acquire resources together. Victims may be evicted even though the job still cannot start, which makes the preemption ineffective and wastes resources.

To solve this, Koordinator v1.7.0 provides Job-Level Preemption, which ensures that:

  • Preemption is triggered at the job (GangGroup) level.
  • Only when all member pods can be co-scheduled after eviction will preemption occur.
  • Resources are reserved via status.nominatedNodeName for all members to maintain scheduling consistency.

Preemption Algorithm

The job-level preemption workflow consists of the following steps:

  1. Unschedulable Pod Detection: When a pod cannot be scheduled, it enters the PostFilter phase.
  2. Job Identification: The scheduler checks if the pod belongs to a PodGroup/GangGroup and fetches all member pods.
  3. Preemption Eligibility Check: Verifies that the pod's spec.preemptionPolicy is not Never and ensures no terminating victims exist on currently nominated nodes.
  4. Candidate Node Selection: Finds nodes where preemption may help by simulating the removal of potential victims (lower-priority pods).
  5. Job-Aware Cost Model: Selects the optimal node and minimal-cost victim set based on a job-aware cost model.
  6. Preemption Execution: Deletes victims and sets status.nominatedNodeName for all member pods.
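The all-or-nothing property of this workflow can be sketched as a dry run: victims are chosen on a copy of the cluster state, and the plan is discarded unless every member finds a node. The greedy lowest-priority-first victim selection below is a placeholder for the job-aware cost model, and the data shapes and names are hypothetical.

```python
def plan_job_preemption(members, nodes, preemptor_priority):
    """members: list of (pod_name, cpu).
    nodes: dict name -> {"free": cpu, "pods": [(name, priority, cpu)]}.

    Returns (nominations, victims) if every member can be placed, else None.
    """
    # Work on a copy so a failed simulation leaves the cluster view untouched.
    state = {n: {"free": v["free"], "pods": sorted(v["pods"], key=lambda p: p[1])}
             for n, v in nodes.items()}
    nominations, victims = {}, []
    for pod, cpu in sorted(members, key=lambda m: -m[1]):
        for name, node in state.items():
            evict, free = [], node["free"]
            for victim in node["pods"]:
                if free >= cpu:
                    break
                if victim[1] < preemptor_priority:  # only lower-priority pods are victims
                    evict.append(victim)
                    free += victim[2]
            if free >= cpu:
                node["free"] = free - cpu
                node["pods"] = [p for p in node["pods"] if p not in evict]
                victims.extend(v[0] for v in evict)
                nominations[pod] = name
                break
        else:
            return None  # one member cannot fit: preempt nothing at all
    return nominations, victims


members = [("hp-worker-1", 3), ("hp-worker-2", 3)]
cluster = {
    "node-a": {"free": 1, "pods": [("lp-1", 1000, 2)]},
    "node-b": {"free": 0, "pods": [("lp-2", 1000, 3)]},
}
plan = plan_job_preemption(members, cluster, preemptor_priority=1000000)
print(plan)

# With only node-a available, one worker would fit but the job as a whole would not.
partial = plan_job_preemption(members, {"node-a": cluster["node-a"]}, 1000000)
print(partial)
```

The second call returns `None` even though a single worker could have been placed, which is the difference from pod-level preemption: no victim is evicted unless the whole job can run.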

Usage Example

Define priority classes for preemptors and victims:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Used for critical AI training jobs that can preempt others."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "Used for non-critical jobs that can be preempted."

Create a high-priority gang job:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: hp-training-job
  namespace: default
spec:
  minMember: 2
  scheduleTimeoutSeconds: 300
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
  - name: worker
    resources:
      limits:
        cpu: 3
        memory: 4Gi
      requests:
        cpu: 3
        memory: 4Gi

When the high-priority job cannot be scheduled, the scheduler will preempt low-priority pods across multiple nodes to make room for all member pods of the job.

For more information about Job-Level Preemption, please see Job Level Preemption.

3. Heterogeneous Device Scheduling: Support for Huawei Ascend NPU and Cambricon MLU

Building on the strong foundation of GPU scheduling in v1.6, Koordinator v1.7.0 extends heterogeneous device scheduling to support Huawei Ascend NPU and Cambricon MLU, providing unified device management and scheduling capabilities across multiple vendors.

Device Scheduling Architecture

Huawei Ascend NPU Support

Koordinator v1.7.0 supports both Ascend virtualization templates and full cards through the koord-device-daemon and koordlet components. Key features include:

  • Device Reporting: Automatically detects and reports Ascend NPU information to the Device CR.
  • Partition-Aware Scheduling: Respects predefined GPU partition rules for HCCS affinity.
  • Topology Scheduling: Allocates NPUs based on PCIe and NUMA topology.

Example Device CR for Ascend NPU:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  labels:
    node.koordinator.sh/gpu-model: Ascend-910B3
    node.koordinator.sh/gpu-vendor: huawei
  annotations:
    scheduling.koordinator.sh/gpu-partitions: |
      {
        "4": [
          {
            "minors": [0,1,2,3],
            "gpuLinkType": "HCCS",
            "allocationScore": "1"
          }
        ]
      }
  name: node-1
spec:
  devices:
  - health: true
    id: GPU-fd971b33-4891-fd2e-ed42-ce6adf324615
    minor: 0
    resources:
      huawei.com/npu-core: "20"
      huawei.com/npu-cpu: "7"
      huawei.com/npu-dvpp: "100"
      koordinator.sh/gpu-memory: 64Gi
      koordinator.sh/gpu-memory-ratio: "100"
    topology:
      busID: 0000:3b:00.0
      nodeID: 0
      pcieID: pci0000:3a
    type: gpu
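To show how the gpu-partitions annotation in the Device CR above could be consumed, here is a small sketch that looks up the partitions for a requested card count and keeps those whose minors are all free. Preferring the highest `allocationScore` is an assumption made for illustration, and `pick_partition` is a hypothetical helper, not a Koordinator function.

```python
import json

# Same structure as the scheduling.koordinator.sh/gpu-partitions annotation above.
partitions_annotation = """
{
  "4": [
    {"minors": [0, 1, 2, 3], "gpuLinkType": "HCCS", "allocationScore": "1"}
  ]
}
"""


def pick_partition(annotation, requested, free_minors):
    """Return the minors of a usable partition for `requested` cards, or None."""
    table = json.loads(annotation)
    free = set(free_minors)
    # Partitions are keyed by card count; keep those whose minors are all still free.
    candidates = [p for p in table.get(str(requested), [])
                  if set(p["minors"]) <= free]
    if not candidates:
        return None
    # Prefer the highest allocationScore (assumption made for this sketch).
    best = max(candidates, key=lambda p: int(p["allocationScore"]))
    return best["minors"]


chosen = pick_partition(partitions_annotation, 4, [0, 1, 2, 3, 4])
print(chosen)
```

A request for four cards succeeds only when minors 0 to 3 are all free. A request for a card count with no entry in the table, such as two, returns `None` in this sketch.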

Cambricon MLU Support

Koordinator v1.7.0 supports Cambricon MLU cards in both full-card and virtualization (dynamic-smlu) modes. Key features include:

  • Device Reporting: Detects and reports Cambricon MLU information.
  • Virtualization Support: Enables GPU sharing through dynamic-smlu mode.
  • Unified Resource Naming: Uses koordinator.sh/gpu-* resources for consistent scheduling.

Example Pod requesting Cambricon virtual card:

apiVersion: v1
kind: Pod
metadata:
  name: test-cambricon-partial
  namespace: default
spec:
  schedulerName: koord-scheduler
  containers:
  - name: demo-sleep
    image: ubuntu:18.04
    resources:
      limits:
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        cambricon.com/mlu.smlu.vcore: "10"
        cambricon.com/mlu.smlu.vmemory: "4"
      requests:
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        cambricon.com/mlu.smlu.vcore: "10"
        cambricon.com/mlu.smlu.vmemory: "4"

For more information, please see Device Scheduling - Ascend NPU and Device Scheduling - Cambricon MLU.

4. Other Enhancements and Improvements

Koordinator v1.7.0 also includes the following key enhancements:

  1. GPU Share with HAMi Enhancements:

    • Upgraded to HAMi v2.6.0 with support for NVIDIA driver versions above 570.
    • Introduced Helm-based installation via hami-daemon chart (version 0.1.0) replacing manual DaemonSet deployment for easier management.
    • Added vGPUmonitor component for comprehensive GPU monitoring with Prometheus metrics including HostGPUMemoryUsage, HostCoreUtilization, vGPU_device_memory_usage_in_bytes, vGPU_device_memory_limit_in_bytes, and container-level device metrics.
  2. Load-Aware Scheduling Optimization:

    • Added PreFilter extension point for caching calculation results to significantly improve scheduling performance.
    • Introduced new configuration options including dominantResourceWeight for dominant resource fairness, prodUsageIncludeSys for comprehensive prod usage calculation, enableScheduleWhenNodeMetricsExpired for expired metrics handling, estimatedSecondsAfterPodScheduled and estimatedSecondsAfterInitialized for precise resource estimation timing, allowCustomizeEstimation for pod-level estimation customization, and supportedResources for extended resource type support.
  3. Enhanced ElasticQuota with Quota Hook Plugin framework:

    • Allows custom quota validation and enforcement logic.
    • Supports hook plugins in ReplaceQuotas and OnQuotaUpdate methods.
    • Enhanced pod update hook that runs regardless of whether used resources have changed.

For a complete list of changes, please see v1.7.0 Release.

5. Comprehensive API Reference and Developer Guide

To improve the developer experience and facilitate community contributions, Koordinator v1.7.0 introduces comprehensive API Reference Documentation and a complete Developer Guide.

API Reference

The new API Reference provides detailed documentation for:

  • Custom Resource Definitions (CRDs): Comprehensive schema definitions, field descriptions, validation rules, and usage patterns for all Koordinator CRDs, including Recommendation, ClusterColocationProfile, ElasticQuota, Reservation, Device, NodeMetric, and more.
  • Client Libraries: Guidelines for using Koordinator's client libraries in Go, Python, and other languages.
  • Metrics Endpoints: Complete documentation of Prometheus metrics exposed by Koordinator components.
  • Webhook Endpoints: Detailed specifications of webhook endpoints for extending Koordinator's functionality.

Example from the Custom Resource Definitions documentation:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  name: worker01
  labels:
    node.koordinator.sh/gpu-model: NVIDIA-H20
    node.koordinator.sh/gpu-vendor: nvidia
spec:
  devices:
  - health: true
    id: GPU-a43e0de9-28a0-1e87-32f8-f5c4994b3e69
    minor: 0
    resources:
      koordinator.sh/gpu-core: "100"
      koordinator.sh/gpu-memory: 97871Mi
      koordinator.sh/gpu-memory-ratio: "100"
    topology:
      busID: 0000:0e:00.0
      nodeID: 0
      pcieID: pci0000:0b
    type: gpu

Developer Guide

The Developer Guide provides comprehensive resources for contributors, including:

  • Component Guide: Architecture and design of Koordinator components.
  • Metrics Collection: How to add and expose new metrics.
  • Extensibility: Extension points and plugin development patterns.
  • Plugin Development: Step-by-step guide to developing custom plugins.
  • Custom Scheduling Policies: How to implement custom scheduling policies.
  • Webhook Extensions: Developing webhook extensions for validation and mutation.
  • Custom Descheduler Plugins: Building custom descheduler plugins.

These resources significantly lower the barrier to entry for new contributors and enable developers to extend Koordinator's capabilities more easily.

For more information, please see API Reference and Developer Guide.

6. Best Practices: Batch Colocation Quick Start

To help users quickly get started with Koordinator's colocation capabilities, v1.7.0 introduces a new best practice guide: Batch Colocation Quick Start. This guide provides step-by-step instructions for:

  • Deploying Koordinator in a Kubernetes cluster.
  • Configuring colocation profiles for online and batch workloads.
  • Observing resource utilization improvements through batch resource overcommitment.
  • Monitoring and troubleshooting colocation scenarios.

This guide complements the existing best practices for Spark job colocation, Hadoop YARN colocation, and fine-grained CPU orchestration, providing a comprehensive resource library for production deployments.

For more information, please see Batch Colocation Quick Start.

Contributors

Koordinator is an open source community. In v1.7.0, there are 14 new developers who contributed to the Koordinator main repo:

@ditingdapeng made their first contribution in #2353
@Rouzip made their first contribution in #2005
@ClanEver made their first contribution in #2405
@zheng-weihao made their first contribution in #2409
@cntigers made their first contribution in #2434
@LennonChin made their first contribution in #2449
@ZhuZhezz made their first contribution in #2423
@dabaooline made their first contribution in #2483
@bobsongplus made their first contribution in #2524
@yccharles made their first contribution in #2474
@qingyuanz made their first contribution in #2584
@yyrdl made their first contribution in #2597
@hwenwur made their first contribution in #2621
@hkttty2009 made their first contribution in #2641

Thanks to our long-time contributors for their consistent efforts and to the newcomers for their active contributions. We welcome more contributors to join the Koordinator community.

Future Plan

In upcoming versions, Koordinator plans the following work:

  • Queue and Quota Management: Integrate Kube-Queue with Koordinator for comprehensive queue scheduling support (#2662)
  • Queue and Quota Management: Support PreEnqueue and QueueHint in Quota plugin (#2581)
  • Queue and Quota Management: Enhance Quota resource reclamation with PDB awareness (#2651)
  • Task Scheduling: Work with upstream developers on Coscheduling support and find a more elegant way to solve the following problems:
    • PreEnqueue interception of Gang Pods prevents Pod events from being generated until the Gang MinMember requirement is met (#2480)
    • GatedMetric Negative issues (kubernetes#133464)
  • Heterogeneous Scheduling Strategy: Consider GPU allocation in rescheduling for cluster resource consolidation (#2332)
  • Heterogeneous Resources Scheduling: Introduce Dynamic Resource Allocation (DRA) framework support
  • Heterogeneous Resources Scheduling: Expand support for more types of heterogeneous resources
  • Infrastructure and Compatibility: Upgrade to Kubernetes 1.33
  • Utils: Support PreAllocation in Reservation (#2150)
  • Utils: Implement Pod scheduling audit for pods in schedulingQueue (#2552)
  • Utils: Provide tooling for Pod scheduling audit analysis

We encourage user feedback on usage experiences and welcome more developers to participate in the Koordinator project, jointly driving its development!

Acknowledgement

Since the project was open-sourced, Koordinator has shipped more than 15 releases, with 110+ contributors involved. The community continues to grow and improve. We thank all community members for their active participation and valuable feedback. We also want to thank the CNCF organization and related community members for supporting the project.

We welcome more developers and end users to join us! Your participation and feedback are what keep Koordinator improving. Whether you are a beginner or an expert in the Cloud Native communities, we look forward to hearing your voice!

· 21 min read
Jianyu Wang
Rougang Han
Tao Song

Background

With the explosive popularity of large models like DeepSeek, the demand for heterogeneous device resource scheduling in AI and high-performance computing fields has grown rapidly, whether it's for GPUs, NPUs, or RDMA devices. Efficiently managing and scheduling these resources has become a core concern in the industry. In response to this demand, Koordinator actively addresses community requests and continues to deepen its capabilities in heterogeneous device scheduling. In the latest v1.6 release, a series of innovative features have been introduced to help customers solve complex heterogeneous resource scheduling challenges.

In v1.6, we have enhanced device topology scheduling capabilities, supporting awareness of more machine types' GPU topologies, significantly accelerating GPU interconnect performance within AI applications. Collaborating with the open-source project HAMi, we have introduced end-to-end GPU & RDMA joint allocation capabilities as well as strong GPU isolation, effectively improving cross-machine interconnect efficiency for typical AI training tasks and increasing deployment density for inference tasks. This ensures better application performance and higher cluster resource utilization. Additionally, enhancements were made to the Kubernetes community’s resource plugins, enabling different resource configurations to apply distinct node scoring strategies. This feature significantly reduces GPU fragmentation when GPU and CPU tasks coexist in a single cluster.

Since its official open-source release in April 2022, Koordinator has iterated through 14 major versions, attracting contributions from outstanding engineers at companies such as Alibaba, Ant Group, Intel, Xiaohongshu, Xiaomi, iQIYI, 360, Youzan, and more. Their rich ideas, code contributions, and real-world application scenarios have greatly propelled the project's development. Notably, in the v1.6.0 release, ten new developers actively contributed to the Koordinator community: @LY-today, @AdrianMachao, @TaoYang526, @dongjiang1989, @chengjoey, @JBinin, @clay-wangzhi, @ferris-cx, @nce3xin, and @lijunxin559. We sincerely thank them for their contributions and all community members for their ongoing dedication and support!

Key Features

1. GPU Topology-Aware Scheduling: Accelerating GPU Interconnects Within AI Applications

With the rapid development of deep learning and high-performance computing (HPC), GPUs have become a core resource for many compute-intensive workloads. Efficient GPU utilization is crucial for enhancing application performance in Kubernetes clusters. However, GPU performance is not uniform and is influenced by hardware topology and resource allocation. For example:

  1. In multi-NUMA node systems, physical connections between GPUs, CPUs, and memory can affect data transfer speeds and computational efficiency.
  2. For NVIDIA cards like L20 and L40S, GPU communication efficiency depends on whether the GPUs sit under the same PCIe or the same NUMA node.
  3. For Huawei’s Ascend NPU and virtualized environments using SharedNVSwitch mode with NVIDIA H-series machines, GPU allocation must adhere to predefined Partition rules.


To address these device scenarios, Koordinator provides rich device topology scheduling APIs to meet Pods’ GPU topology requirements. Below are examples of how to use these APIs:

  1. Allocating GPUs, CPUs, and memory within the same NUMA Node:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        scheduling.koordinator.sh/numa-topology-spec: '{"numaTopologyPolicy":"Restricted", "singleNUMANodeExclusive":"Preferred"}'
    spec:
      containers:
      - resources:
          limits:
            koordinator.sh/gpu: 200
            cpu: 64
            memory: 500Gi
          requests:
            koordinator.sh/gpu: 200
            cpu: 64
            memory: 500Gi
  2. Allocating GPUs within the same PCIe:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        scheduling.koordinator.sh/device-allocate-hint: |-
          {
            "gpu": {
              "requiredTopologyScope": "PCIe"
            }
          }
    spec:
      containers:
      - resources:
          limits:
            koordinator.sh/gpu: 200
  3. Allocating GPUs within the same NUMA Node:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        scheduling.koordinator.sh/device-allocate-hint: |-
          {
            "gpu": {
              "requiredTopologyScope": "NUMANode"
            }
          }
    spec:
      containers:
      - resources:
          limits:
            koordinator.sh/gpu: 400
  4. Allocating GPUs according to predefined Partitions:

Predefined GPU Partition rules are typically determined by the GPU model or system configuration and may also depend on the GPU setup of individual nodes. The scheduler cannot discern hardware models or GPU types by itself; instead, it relies on node-level components to report these predefined rules in the Device custom resource (CR), as shown below:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  annotations:
    scheduling.koordinator.sh/gpu-partitions: |
      {
        "1": [
          "NVLINK": {
            {
              # Which GPUs are included
              "minors": [
                0
              ],
              # GPU Interconnect Type
              "gpuLinkType": "NVLink",
              # Here we take the bottleneck bandwidth between GPUs in the Ring algorithm. BusBandwidth can be referenced from https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
              "ringBusBandwidth": "400Gi",
              # Indicate the overall allocation quality for the node after the partition has been assigned away.
              "allocationScore": "1"
            },
            ...
          }
          ...
        ],
        "2": [
          ...
        ],
        "4": [
          ...
        ],
        "8": [
          ...
        ]
      }
  labels:
    # Indicates whether the Partition rule must be followed
    node.koordinator.sh/gpu-partition-policy: "Honor"
  name: node-1

When multiple Partition options are available, Koordinator allows users to decide whether to allocate based on the optimal Partition:

apiVersion: v1
kind: Pod
metadata:
  name: hello-gpu
  annotations:
    # allocatePolicy accepts BestEffort or Restricted
    scheduling.koordinator.sh/gpu-partition-spec: |
      {
        "allocatePolicy": "Restricted"
      }
spec:
  containers:
  - name: main
    resources:
      limits:
        koordinator.sh/gpu: 100

If users do not need to allocate based on the optimal Partition, the scheduler will allocate resources in a Binpack manner as much as possible.
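
The partition table and the allocation policy described above can be pictured with a small sketch. This is not Koordinator's implementation: the `Partition` class, the `pick_partition` helper, the reading of a higher `allocationScore` as "better", and the exact behaviour of the two policies are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Partition:
    minors: Tuple[int, ...]   # GPU minor numbers that form this partition
    allocation_score: int     # assumed: higher means a better node state afterwards


def pick_partition(
    table: Dict[int, List[Partition]],
    free_minors: Set[int],
    gpu_count: int,
    policy: str = "Restricted",
) -> Optional[Tuple[int, ...]]:
    """Return the GPU minors to allocate, or None if the request cannot be met."""
    # Keep only predefined partitions of the right size whose GPUs are all free.
    candidates = [p for p in table.get(gpu_count, []) if set(p.minors) <= free_minors]
    if candidates:
        return max(candidates, key=lambda p: p.allocation_score).minors
    if policy == "Restricted":
        return None  # assumed: never step outside the predefined partitions
    # Assumed BestEffort fallback: pack onto the lowest-numbered free GPUs.
    free = sorted(free_minors)
    return tuple(free[:gpu_count]) if len(free) >= gpu_count else None


table = {2: [Partition((0, 1), 2), Partition((2, 3), 1)]}
print(pick_partition(table, {0, 1, 2, 3}, 2))          # (0, 1)
print(pick_partition(table, {1, 2}, 2, "Restricted"))  # None
print(pick_partition(table, {1, 2}, 2, "BestEffort"))  # (1, 2)
```

The table is keyed by the number of requested GPUs, mirroring the "1", "2", "4", "8" keys of the gpu-partitions annotation.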

For more details on GPU topology-aware scheduling, please refer to the corresponding design documents.

Special thanks to community developer @eahydra for contributing to this feature!

2. End-to-End GDR Support: Enhancing Cross-Machine Task Interconnect Performance

In AI model training scenarios, GPUs frequently require collective communication to synchronize updated weights during training iterations. GDR (GPUDirect RDMA) aims to solve the efficiency problem of exchanging data between GPUs on different machines. With GDR, GPUs exchange data directly without passing through the CPU or host memory, which significantly reduces CPU and memory overhead and lowers latency. To achieve this goal, Koordinator v1.6.0 introduces joint scheduling of GPU and RDMA devices, with the overall architecture outlined below:


  1. Koordlet detects GPUs and RDMA devices on nodes and reports relevant information to the Device CR.
  2. Koord-Manager synchronizes resources from the Device CR to node.status.allocatable.
  3. Koord-Scheduler allocates GPUs and RDMA based on device topology and annotates allocation results onto Pods.
  4. Multus-CNI accesses Koordlet PodResources Proxy to obtain RDMA devices allocated to Pods and attaches corresponding NICs to the Pods' network namespaces.
  5. Koordlet provides an NRI plugin to mount devices into containers.

Due to the involvement of numerous components and complex environments, Koordinator v1.6.0 provides best practices showcasing step-by-step deployments of Koordinator, Multus-CNI, and SRIOV-CNI. After deploying the necessary components, users can simply adopt the following Pod configuration to request joint GPU and RDMA allocations from the scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: pod-vf01
  namespace: kubeflow
  annotations:
    scheduling.koordinator.sh/device-joint-allocate: |-
      {
        "deviceTypes": ["gpu","rdma"]
      }
    # The vfSelector entry applies for a VF
    scheduling.koordinator.sh/device-allocate-hint: |-
      {
        "rdma": {
          "vfSelector": {}
        }
      }
spec:
  schedulerName: koord-scheduler
  containers:
  - name: container-vf
    resources:
      requests:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100
      limits:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100

For further end-to-end testing of GDR tasks using Koordinator, you can refer to the sample steps in the best practices. Special thanks to community developer @ferris-cx for contributing to this feature!

3. Strong GPU Sharing Isolation: Improving Resource Utilization for AI Inference Tasks

In AI applications, GPUs are indispensable core devices for large model training and inference, providing powerful computational capabilities for compute-intensive tasks. However, this powerful computing capability often comes with high costs. In production environments, we frequently encounter situations where small models or lightweight inference tasks only require a fraction of GPU resources (e.g., 20% of compute power or GPU memory), yet a high-performance GPU card must be exclusively occupied to run these tasks. This resource usage method not only wastes valuable GPU computing power but also significantly increases enterprise costs.

This situation is particularly common in the following scenarios:

  1. Online Inference Services: Many online inference tasks have low computational demands but require low-latency responses, so they are often deployed on high-performance GPUs to meet real-time requirements.
  2. Development and Testing Environments: Developers debugging models usually only need a small amount of GPU resources, but traditional scheduling methods lead to low resource utilization.
  3. Multi-Tenant Shared Clusters: In multi-user or multi-team shared GPU clusters, each task monopolizing a GPU leads to uneven resource distribution, making it difficult to fully utilize hardware capabilities.

To address this issue, Koordinator, combined with HAMi, provides GPU sharing and isolation capabilities, allowing multiple Pods to share a single GPU card. This approach not only significantly improves GPU resource utilization but also reduces enterprise costs while meeting flexible resource demands for different tasks. For example, under Koordinator’s GPU sharing mode, users can precisely allocate GPU cores or memory ratios, ensuring each task receives the required resources while avoiding interference.


HAMi is a CNCF Sandbox project aimed at providing a device management middleware for Kubernetes. HAMi-Core, its core module, hijacks API calls between CUDA-Runtime (libcudart.so) and CUDA-Driver (libcuda.so) to provide GPU sharing and isolation capabilities. In v1.6.0, Koordinator leverages HAMi-Core’s GPU isolation features to offer an end-to-end GPU sharing solution.

You can install HAMi-Core on the corresponding nodes by deploying the DaemonSet below:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-core-distribute
  namespace: default
spec:
  selector:
    matchLabels:
      koord-app: hami-core-distribute
  template:
    metadata:
      labels:
        koord-app: hami-core-distribute
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - "gpu"
      containers:
      - command:
        - /bin/sh
        - -c
        - |
          cp -f /k8s-vgpu/lib/nvidia/libvgpu.so /usl/local/vgpu && sleep 3600000
        image: docker.m.daocloud.io/projecthami/hami:v2.4.0
        imagePullPolicy: Always
        name: name
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: "0"
            memory: "0"
        volumeMounts:
        - mountPath: /usl/local/vgpu
          name: vgpu-hook
        - mountPath: /tmp/vgpulock
          name: vgpu-lock
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /usl/local/vgpu
          type: DirectoryOrCreate
        name: vgpu-hook
      # https://github.com/Project-HAMi/HAMi/issues/696
      - hostPath:
          path: /tmp/vgpulock
          type: DirectoryOrCreate
        name: vgpu-lock

The Koordinator scheduler's GPU Binpack capability is enabled by default. After installing Koordinator and HAMi-Core, users can request a shared GPU and enable HAMi-Core isolation as follows:

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    koordinator.sh/gpu-isolation-provider: hami-core
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu-shared: 1
        koordinator.sh/gpu-core: 50
        koordinator.sh/gpu-memory-ratio: 50
      requests:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu-shared: 1
        koordinator.sh/gpu-core: 50
        koordinator.sh/gpu-memory-ratio: 50
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always

For guidance on enabling HAMi GPU sharing isolation capabilities in Koordinator, please refer to the corresponding user guide.

Special thanks to HAMi community maintainer @wawa0210 for supporting this feature!

4. Differentiated GPU Scheduling Policies: Effectively Reducing GPU Fragmentation

In modern Kubernetes clusters, various types of resources (such as CPU, memory, and GPU) are typically managed on a unified platform. However, the usage patterns and demands for different resources often vary significantly, leading to differing needs for stacking (Packing) and spreading (Spreading) strategies. For example:

  • GPU Resources: In AI model training or inference tasks, to maximize GPU utilization and reduce fragmentation, users generally prefer to schedule GPU tasks onto nodes that already have GPUs allocated ("stacking" strategy). This prevents resource waste caused by overly dispersed GPU distributions.
  • CPU and Memory Resources: In contrast, CPU and memory resource demands are more diverse. For some online services or batch processing tasks, users tend to distribute tasks across multiple nodes ("spreading" strategy) to avoid hotspots on individual nodes, thereby improving overall cluster stability and performance.

Additionally, in mixed workload scenarios, different tasks’ resource demands can influence each other. For instance:

  • In a cluster running both GPU training tasks and regular CPU-intensive tasks, if CPU-intensive tasks are scheduled onto GPU nodes and consume significant CPU and memory resources, subsequent GPU tasks may fail to start due to insufficient non-GPU resources, remaining in a Pending state.
  • In multi-tenant environments, some users may only request CPU and memory resources, while others need GPU resources. If the scheduler cannot distinguish these needs, it may lead to resource contention and unfair resource allocation.

The native Kubernetes NodeResourcesFit plugin currently supports only one scoring strategy that is shared by all resources, as shown below:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesFit
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesFitArgs
      scoringStrategy:
        type: LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        - name: nvidia.com/gpu
          weight: 1

However, in practical production settings, this design may not always be suitable. For example, in AI scenarios, GPU-requesting jobs prefer to occupy entire GPU machines to prevent GPU fragmentation, whereas CPU and memory jobs prefer spreading to reduce CPU hotspots. In v1.6.0, Koordinator introduces the NodeResourcesFitPlus plugin to support differentiated scoring strategies for different resources. Users can configure it when installing the Koordinator scheduler as follows:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesFitPlusArgs
      resources:
        nvidia.com/gpu:
          type: MostAllocated
          weight: 2
        cpu:
          type: LeastAllocated
          weight: 1
        memory:
          type: LeastAllocated
          weight: 1
    name: NodeResourcesFitPlus
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
  schedulerName: koord-scheduler
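
To see how per-resource strategies change node ranking, the sketch below scores two nodes with the policies from the configuration above. It is a simplified model, not the plugin's source: the function names, the 0 to 100 score range, the weighted average, and the rule that only resources requested by the Pod contribute are assumptions.

```python
def resource_score(strategy: str, requested: float, allocatable: float) -> float:
    """Score one resource on a 0-100 scale after placing the Pod on the node."""
    if allocatable <= 0:
        return 0.0
    utilization = min(requested / allocatable, 1.0)
    if strategy == "MostAllocated":    # packing: fuller nodes score higher
        return 100.0 * utilization
    if strategy == "LeastAllocated":   # spreading: emptier nodes score higher
        return 100.0 * (1.0 - utilization)
    raise ValueError(f"unknown strategy: {strategy}")


def node_score(policies, pod_request, node_allocated, node_allocatable) -> float:
    """Weighted average over the resources that the Pod actually requests."""
    total, weight_sum = 0.0, 0
    for name, (strategy, weight) in policies.items():
        if pod_request.get(name, 0) == 0:
            continue
        requested = node_allocated.get(name, 0) + pod_request[name]
        total += weight * resource_score(strategy, requested, node_allocatable.get(name, 0))
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


policies = {
    "nvidia.com/gpu": ("MostAllocated", 2),
    "cpu": ("LeastAllocated", 1),
    "memory": ("LeastAllocated", 1),
}
capacity = {"nvidia.com/gpu": 8, "cpu": 64}
busy_node = {"nvidia.com/gpu": 6, "cpu": 48}   # hypothetical, mostly allocated
empty_node = {"nvidia.com/gpu": 0, "cpu": 0}   # hypothetical, nothing allocated

gpu_pod = {"nvidia.com/gpu": 1, "cpu": 8}
cpu_pod = {"cpu": 8}
print(node_score(policies, gpu_pod, busy_node, capacity))   # 62.5
print(node_score(policies, gpu_pod, empty_node, capacity))  # 37.5
print(node_score(policies, cpu_pod, busy_node, capacity))   # 12.5
print(node_score(policies, cpu_pod, empty_node, capacity))  # 87.5
```

Under these assumptions the GPU Pod is drawn to the node that already hosts GPU workloads, while the CPU-only Pod is pushed toward the empty node.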

Moreover, CPU and memory jobs should preferably spread to non-GPU machines so that they do not consume excessive CPU and memory on GPU machines, which could leave actual GPU tasks Pending due to insufficient non-GPU resources. In v1.6.0, Koordinator introduces the ScarceResourceAvoidance plugin to support this requirement. Users can configure the scheduler as follows, declaring nvidia.com/gpu a scarce resource: Pods that do not request this resource should avoid nodes that possess it.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: ScarceResourceAvoidanceArgs
      resources:
      - nvidia.com/gpu
    name: ScarceResourceAvoidance
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
      - name: ScarceResourceAvoidance
        weight: 2
      disabled:
      - name: "*"
  schedulerName: koord-scheduler
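
The avoidance idea can be expressed as a scoring rule. The sketch below is a hypothetical model rather than the plugin's actual formula: `scarce_avoidance_score` and its proportional scoring are assumptions chosen to show the intended effect.

```python
def scarce_avoidance_score(scarce_resources, pod_request, node_allocatable, max_score=100):
    """Score a node lower when it offers scarce resources the Pod does not request."""
    # Scarce resources that this node actually provides.
    on_node = [r for r in scarce_resources if node_allocatable.get(r, 0) > 0]
    if not on_node:
        return max_score  # nothing scarce here, so any Pod is welcome
    # Of those, how many does the Pod ask for?
    requested = sum(1 for r in on_node if pod_request.get(r, 0) > 0)
    return max_score * requested // len(on_node)


scarce = ["nvidia.com/gpu"]
gpu_node = {"cpu": 64, "nvidia.com/gpu": 8}
cpu_node = {"cpu": 64}

print(scarce_avoidance_score(scarce, {"cpu": 4}, gpu_node))               # 0
print(scarce_avoidance_score(scarce, {"cpu": 4}, cpu_node))               # 100
print(scarce_avoidance_score(scarce, {"nvidia.com/gpu": 1}, gpu_node))    # 100
```

Because this is a score rather than a filter, a CPU-only Pod can still land on a GPU node when no other node fits; it is only ranked last.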

For detailed designs and user guides on GPU resource differentiated scheduling policies, please refer to the corresponding design documents and user manuals.

Special thanks to community developer @LY-today for contributing to this feature.

5. Fine-Grained Resource Reservation: Meeting Efficient Operation Needs for AI Tasks

Efficient utilization of heterogeneous resources often relies on precise alignment with closely coupled CPU and NUMA resources. For example:

  1. GPU-Accelerated Tasks: In multi-NUMA node servers, if the physical connection between GPU and CPU or memory spans NUMA boundaries, it may increase data transmission latency, significantly reducing task performance. Therefore, such tasks typically require GPU, CPU, and memory to be allocated on the same NUMA node.
  2. AI Inference Services: Online inference tasks are highly sensitive to latency and need to ensure GPU and CPU resource allocations are as close as possible to minimize cross-NUMA node communication overhead.
  3. Scientific Computing Tasks: Some high-performance computing tasks (e.g., molecular dynamics simulations or weather forecasting) require high-bandwidth, low-latency memory access, necessitating strict alignment of CPU cores and local memory.

These requirements extend beyond task scheduling to resource reservation scenarios. In production environments, resource reservation is an important mechanism for locking resources in advance for critical tasks, ensuring smooth operation at a future point in time. However, simple resource reservation mechanisms often fail to meet fine-grained orchestration needs in heterogeneous resource scenarios. For example:

  1. Certain tasks may need to reserve specific NUMA node CPU and GPU resources to guarantee optimal performance upon task startup.
  2. In multi-tenant clusters, different users may need to reserve different combinations of resources (e.g., GPU + CPU + memory) and expect these resources to be strictly aligned.
  3. When reserved resources are not fully utilized, how to flexibly allocate remaining resources to other tasks without affecting reserved task resource guarantees is another important challenge.

To address these complex scenarios, Koordinator comprehensively enhances resource reservation functionality in v1.6, providing more refined and flexible resource orchestration capabilities. Specific improvements include:

  1. Fine-grained reservation and preemption of CPU and GPU resources.
  2. Exact matching between a Pod's requests and the reserved resource quantities.
  3. Reservation affinity that can specify reservation names and taint tolerations.
  4. A limit on the number of Pods a reservation can serve.
  5. Preemption of lower-priority Pods by reservations.

Changes to plugin extension interfaces:

  1. The reservation validation interface ReservationFilterPlugin is moved from the PreScore phase to the Filter phase to ensure more accurate filtering results.
  2. The reservation ledger return interface ReservationRestorePlugin deprecates unnecessary methods to improve scheduling efficiency.

Below are examples of new functionalities:

  1. Exact-Match Reservation. Require a Pod to exactly match reserved resource quantities. This narrows the matching between a group of Pods and a group of reservations and makes reservation allocation more controllable.
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        # Resource names for which the Pod must exactly match the reservation. The Pod can only match
        # Reservation objects whose reserved quantities equal the Pod's requests for all of these
        # resources, e.g. "cpu", "memory", "nvidia.com/gpu".
        scheduling.koordinator.sh/exact-match-reservation: '{"resourceNames":["cpu","memory","nvidia.com/gpu"]}'
  2. Ignore Resource Reservations (reservation-ignored). Let a Pod ignore resource reservations so that it can use resources that are reserved on a node but not yet allocated. This complements preemption in reducing resource fragmentation.
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        # Specify that the Pod's scheduling can ignore resource reservations
        scheduling.koordinator.sh/reservation-ignored: "true"
  3. Specify Reservation Name Affinity (ReservationAffinity)
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        # Specify the name of the resource reservation matched by the Pod
        scheduling.koordinator.sh/reservation-affinity: '{"name":"test-reservation"}'
  4. Specify Taints and Tolerations for Resource Reservations
    apiVersion: scheduling.koordinator.sh/v1alpha1
    kind: Reservation
    metadata:
      name: test-reservation
    spec:
      # Specify Taints for the Reservation; its reserved resources can only be allocated to Pods tolerating this taint
      taints:
      - effect: NoSchedule
        key: test-taint-key
        value: test-taint-value
      # ...
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        # Specify the Pod's toleration for resource reservation taints
        scheduling.koordinator.sh/reservation-affinity: '{"tolerations":[{"key":"test-taint-key","operator":"Equal","value":"test-taint-value","effect":"NoSchedule"}]}'
  5. Enable Reservation Preemption

    Note: Currently, high-priority Pods preempting low-priority Reservations is not supported.

    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - pluginConfig:
      - name: Reservation
        args:
          apiVersion: kubescheduler.config.k8s.io/v1beta3
          kind: ReservationArgs
          enablePreemption: true
      # ...
      plugins:
        postFilter:
          # Disable the DefaultPreemption plugin's preemption and enable the Reservation plugin's preemption
          disabled:
          - name: DefaultPreemption
          # ...
          enabled:
          - name: Reservation

Special thanks to community developer @saintube for contributing to this feature!

6. Co-location: Mid-tier Supports Idle Resource Reallocation, Enhances Pod-Level QoS Configuration

In modern data centers, co-location technology has become an important means to improve resource utilization. By mixing latency-sensitive tasks (e.g., online services) with resource-intensive tasks (e.g., offline batch processing) on the same cluster, enterprises can significantly reduce hardware costs and improve resource efficiency. However, as resource utilization in co-located clusters continues to rise, ensuring resource isolation between different types of tasks becomes a key challenge.

In co-location scenarios, the core objectives of resource isolation capabilities are:

  • Guaranteeing High-Priority Task Performance: For example, online services require stable CPU, memory, and I/O resources to meet low-latency requirements.
  • Fully Utilizing Idle Resources: Offline tasks should utilize as much unused resource from high-priority tasks as possible without interfering with them.
  • Dynamically Adjusting Resource Allocation: Real-time adjustment of resource allocation strategies based on node load changes to avoid resource contention or waste.

To achieve these goals, Koordinator continuously builds and refines resource isolation capabilities. In v1.6, we focused on optimizing resource oversubscription and co-location QoS with a series of functional optimizations and bug fixes, specifically including:

  • Optimizing calculation logic for Mid resource oversubscription and node profiling features, supporting oversubscription of unallocated node resources to avoid double oversubscription of node resources.
  • Optimizing metric degradation logic for load-aware scheduling.
  • Supporting Pod-level configuration for CPU QoS and Resctrl QoS.
  • Supplementing Prometheus metrics for out-of-band load management to enhance observability.
  • Bugfixes for Blkio QoS, resource amplification, and other features.

Mid resource oversubscription, introduced in Koordinator v1.3, provides dynamic resource oversubscription based on Node Profiling. To keep oversubscribed resources stable, however, Mid resources were sourced entirely from Prod Pods already allocated on a node. An empty node therefore offered no Mid resources at first, which was inconvenient for workloads that rely on them. Based on feedback and contributions from enterprise users, Koordinator v1.6 updates the oversubscription formula as follows:

MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio) + ProdUnallocated * unallocatedRatio
ProdReclaimable := min(max(0, ProdAllocated - ProdPeak * (1 + safeMargin)), NodeUnused)

There are two changes in the calculation logic:

  1. Unallocated resources can now be oversubscribed at a static ratio, which mitigates the cold-start problem.
  2. Node resources that are actually in use can no longer be oversubscribed, which avoids overestimated predictions in secondary oversubscription scenarios. For example, some users leverage Koordinator's node resource amplification to schedule more Prod Pods, so that ProdAllocated > NodeAllocatable and the MidAllocatable prediction deviates from the actual node load.
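
The formula can be checked with a few numbers. The sketch below transcribes it into Python; the function and parameter names are illustrative, the sample values are made up, and deriving ProdUnallocated as max(0, NodeAllocatable - ProdAllocated) is an assumption that the formula itself does not spell out.

```python
def prod_reclaimable(prod_allocated, prod_peak, safe_margin, node_unused):
    """ProdReclaimable := min(max(0, ProdAllocated - ProdPeak * (1 + safeMargin)), NodeUnused)"""
    return min(max(0.0, prod_allocated - prod_peak * (1 + safe_margin)), node_unused)


def mid_allocatable(node_allocatable, prod_allocated, prod_peak, node_unused,
                    threshold_ratio, unallocated_ratio, safe_margin):
    """MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)
                         + ProdUnallocated * unallocatedRatio"""
    reclaimable = prod_reclaimable(prod_allocated, prod_peak, safe_margin, node_unused)
    # Assumption: unallocated Prod capacity is whatever Prod Pods have not requested.
    prod_unallocated = max(0.0, node_allocatable - prod_allocated)
    return min(reclaimable, node_allocatable * threshold_ratio) + prod_unallocated * unallocated_ratio


# Empty 64-core node: nothing to reclaim, but the static ratio still yields Mid resources.
print(mid_allocatable(64, 0, 0, 64, threshold_ratio=0.5, unallocated_ratio=0.25, safe_margin=0.1))  # 16.0

# Amplified node: ProdAllocated (80) exceeds NodeAllocatable (64); NodeUnused (20) caps the result.
print(mid_allocatable(64, 80, 30, 20, threshold_ratio=0.5, unallocated_ratio=0.25, safe_margin=0.1))
```

The first call illustrates the cold-start change and the second the NodeUnused cap, which keeps the amplified node's Mid resources at about 20 cores instead of the 47 that ProdAllocated minus the padded peak would suggest.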

Additionally, for co-location QoS, Koordinator v1.6 enhances Pod-level QoS policy configuration. This applies to scenarios such as blocklisting interfering Pods on co-located nodes and rolling out co-location QoS changes gradually:

  1. Resctrl feature, supporting LLC and memory bandwidth isolation capabilities at the Pod level.
  2. CPU QoS feature, supporting CPU QoS configuration at the Pod level.

The Resctrl feature can be enabled at the Pod level as follows:

  1. Enable the Resctrl feature in Koordlet’s feature-gate.
  2. Configure LLC and memory bandwidth (MB) restriction policies via the Pod annotation node.koordinator.sh/resctrl. For example:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        node.koordinator.sh/resctrl: '{"llc": {"schemata": {"range": [0, 30]}}, "mb": {"schemata": {"percent": 20}}}'

Pod-level CPU QoS configuration can be enabled as follows:

  1. Enable CPU QoS as described in https://koordinator.sh/docs/user-manuals/cpu-qos/
  2. Configure Pod CPU QoS policies via the Pod annotation koordinator.sh/cpuQOS. For example:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        koordinator.sh/cpuQOS: '{"groupIdentity": 1}'

Special thanks to @kangclzjc, @j4ckstraw, @lijunxin559, @tan90github, @yangfeiyu20102011 and other community developers for their contributions to co-location related features!

7. Scheduling, Rescheduling: Continuously Improved Operational Efficiency

With the continuous development of cloud-native technologies, more and more enterprises are migrating core businesses to Kubernetes platforms, resulting in explosive growth in cluster scale and task numbers. This trend brings significant technical challenges, especially in terms of scheduling performance and rescheduling strategies:

  • Scheduling Performance Requirements: As cluster sizes expand, the number of tasks schedulers need to handle surges dramatically, placing higher demands on scheduler performance and scalability. For instance, in large-scale clusters, how to quickly complete Pod scheduling decisions and reduce scheduling latency becomes a key issue.
  • Rescheduling Strategy Requirements: In multi-tenant environments, intensified resource competition may cause frequent rescheduling, leading to workloads repeatedly migrating between nodes, thereby increasing system burden and affecting cluster stability. Additionally, how to reasonably allocate resources to avoid hotspot issues while ensuring stable operation of production tasks has become a critical consideration in designing rescheduling strategies.

To address these challenges, Koordinator comprehensively optimized the scheduler and rescheduler in v1.6.0, aiming to improve scheduling performance and enhance the stability and rationality of rescheduling strategies. Below are our optimizations for scheduler performance in the current version:

  1. Moving MinMember checks for PodGroups to PreEnqueue to reduce unnecessary scheduling cycles.
  2. Delaying resource returns for Reservations to the AfterPreFilter stage, performing resource returns only on nodes allowed by PreFilterResult to reduce algorithm complexity.
  3. Optimizing CycleState distributions for plugins like NodeNUMAResource, DeviceShare, and Reservation to reduce memory overhead.
  4. Adding latency metrics for additional extension points introduced by Koordinator, such as BeforePreFilter and AfterPreFilter.
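
The first optimization in the list above, checking MinMember before a Pod enters the scheduling queue, can be sketched as a simple gate. This is a toy model: `GangGate` and its methods are hypothetical names and do not mirror the scheduler framework's PreEnqueue plugin interface.

```python
from collections import defaultdict


class GangGate:
    """Hold back the Pods of a gang until enough members have been created."""

    def __init__(self):
        self._min_member = {}            # gang name -> required member count
        self._pods = defaultdict(set)    # gang name -> names of Pods seen so far

    def set_min_member(self, gang: str, min_member: int) -> None:
        self._min_member[gang] = min_member

    def add_pod(self, gang: str, pod: str) -> None:
        self._pods[gang].add(pod)

    def pre_enqueue(self, gang: str) -> bool:
        """Return True when Pods of this gang may enter the active scheduling queue."""
        return len(self._pods[gang]) >= self._min_member.get(gang, 1)


gate = GangGate()
gate.set_min_member("train-job", 3)
gate.add_pod("train-job", "worker-0")
gate.add_pod("train-job", "worker-1")
print(gate.pre_enqueue("train-job"))  # False
gate.add_pod("train-job", "worker-2")
print(gate.pre_enqueue("train-job"))  # True
```

Rejecting an incomplete gang at this point means the scheduler spends no scheduling cycles on Pods that could not be placed as a group anyway.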

As cluster scales continue to grow, the stability and rationality of the rescheduling process become focal concerns. Frequent evictions may cause workloads to repeatedly migrate between nodes, increasing system burden and posing stability risks. To this end, we conducted several optimizations for the rescheduler in v1.6.0:

  1. LowNodeLoad Plugin Optimization:
    1. The LowNodeLoad plugin now supports configuring ProdHighThresholds and ProdLowThresholds, combining Koordinator priorities (Priority) to manage workload resource utilization differently, reducing hotspot issues caused by production applications and achieving finer-grained load balancing;
    2. Optimized the sorting logic for eviction candidates: Pods are scored with a piecewise function so that the most suitable Pods are selected for eviction, which keeps resource allocation reasonable and avoids the stability issues caused by always evicting the Pods with the highest resource usage;
    3. Optimized pre-eviction checks for Pods; LowNodeLoad checks whether target nodes might become new hotspot nodes before evicting Pods, effectively preventing repeated rescheduling occurrences.
  2. MigrationController Enhancement:
    1. MigrationController provides an ObjectLimiter that controls how often a workload can be evicted within a given period. It now also supports namespace-level eviction throttling for finer-grained control over evictions within a namespace. In addition, ObjectLimiter was moved from Arbitrator into MigrationController, fixing potential throttling failures in concurrent scenarios;
    2. Added EvictAllBarePods configuration item, allowing users to enable eviction of Pods without OwnerRef, thus increasing rescheduling flexibility;
    3. Added MaxMigratingGlobally configuration item, enabling MigrationController to independently control the maximum number of Pod evictions, thereby reducing stability risks;
    4. Optimized the calculation logic of the GetMaxUnavailable method: when a workload's maximum number of unavailable replicas rounds down to 0, it is now adjusted to 1, so that the result stays consistent with the replica unavailability the user expects.
  3. Added the global rescheduling parameter MaxNoOfPodsToEvictTotal, which caps the total number of Pods the rescheduler may evict globally, reducing cluster burden and enhancing stability.
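
The rounding adjustment described for GetMaxUnavailable can be illustrated with a small Python sketch. It assumes the limit is given as an integer percentage of replicas; the real method's signature and inputs are not reproduced here, and `max_unavailable` is a hypothetical name.

```python
def max_unavailable(replicas: int, percent: int) -> int:
    """Round replicas * percent / 100 down, but never to 0 for a non-empty workload."""
    if replicas <= 0:
        return 0
    value = replicas * percent // 100  # integer floor
    return max(value, 1)


print(max_unavailable(3, 10))   # floor(0.3) would be 0, adjusted to 1
print(max_unavailable(20, 10))  # floor(2.0) stays 2
```

Without the adjustment, a small workload with a percentage-based limit could never have any replica migrated.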

Special thanks to community developers @AdrianMachao, @songtao98, @LY-today, @zwForrest, @JBinin, @googs1025, @bogo-y for their contributions to scheduling and rescheduling optimizations!

Future Plans

The Koordinator community will continue focusing on strengthening GPU resource management and scheduling functions, providing rescheduling plugins to further resolve GPU fragmentation issues caused by imbalanced resource allocation, and plans to introduce more new features and functionalities in the next version to support more complex workload scenarios; meanwhile, in resource reservation and co-location, we will further optimize to support finer-grained scenarios.

Currently planned Proposals in the community are as follows:

Key usage issues to be addressed include:

Long-term planned Proposals include:

We encourage user feedback on usage experiences and welcome more developers to participate in the Koordinator project, jointly driving its development!

· 12 min read
Rougang Han
Jianyu Wang

Background

Koordinator is an open source project born from more than two years of accumulated container scheduling experience at Alibaba. It has been iterating continuously to provide comprehensive solutions for workload consolidation, co-located resource scheduling, mixed resource isolation, and mixed performance tuning. It aims to help users optimize container performance, improve the efficiency of cluster resource usage and management, and optimize the running of latency-sensitive workloads and batch jobs.

Today, Koordinator v1.5.0 is released. It is the 13th major release of Koordinator since it was officially open-sourced in April 2022. The Koordinator community is grateful to all the excellent engineers from Alibaba, Ant Technology Group, Intel, XiaoHongShu, Xiaomi, iQiyi, 360, YouZan, etc., who have contributed great ideas, code, and various scenarios. In v1.5.0, Koordinator brings many feature improvements, including a Pod-level NUMA alignment strategy, network QoS, Core Scheduling, etc.

Besides, Koordinator has been accepted by the CNCF TOC members as a Sandbox project. CNCF (Cloud Native Computing Foundation) is an independent, non-profit organization that supports and promotes cloud native software such as Kubernetes and Prometheus.

Vote address: https://github.com/cncf/sandbox/issues/51

Key Features

Pod-level NUMA Policy

In v1.4.0, Koordinator allowed users to set different NUMA alignment policies for different nodes in the cluster. However, this means that users need to pre-split the nodes into node pools with different NUMA alignment policies, which causes additional node operation overhead.

In v1.5.0, Koordinator introduces Pod-level NUMA alignment policies to solve this problem. For example, we can set SingleNUMANode for pod-1:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
spec:
  containers:
  - name: container-1
    resources:
      requests:
        cpu: '1'
      limits:
        cpu: '1'

After introducing Pod-level NUMA policies, it is possible that there are multiple NUMA policies on the same node. For example, node-1 has two NUMA nodes, pod-1 uses SingleNUMANode policy on numa-0, and pod-2 uses Restricted policy on numa-0 and numa-1.

A Pod's resource requirements only cap the total resources it can use on the machine; they do not cap what it can use on a single NUMA node. So pod-2 may use more resources on numa-0 than it was allocated there, which leads to resource contention between pod-2 and pod-1 on numa-0.

To solve this problem, Koordinator supports configuring the exclusive policy for Pods with SingleNUMANode policy. For example, we can configure pod-1 to use SingleNUMANode policy and not co-exist with other Pods on the same machine:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    # singleNUMANodeExclusive accepts "Required" or "Preferred".
    scheduling.koordinator.sh/numa-topology-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode",
        "singleNUMANodeExclusive": "Required"
      }
spec:
  containers:
  - name: container-1
    resources:
      requests:
        cpu: '1'
      limits:
        cpu: '1'

Moreover, the introduction of Pod-level NUMA policies does not mean that the Node-level NUMA policies will be deprecated. Instead, they are compatible. If the Pod and Node-level NUMA policies are different, the Pod will not be scheduled to the node; if the Node-level NUMA policy is "", it means that the node can place any Pod; if the Pod-level NUMA policy is "", it means that the Pod can be scheduled to any node.

                   | SingleNUMANode node | Restricted node | BestEffort node
SingleNUMANode pod | ✓                   | x               | x
Restricted pod     | x                   | ✓               | x
BestEffort pod     | x                   | x               | ✓
""                 | ✓                   | ✓               | ✓

For more information about Pod-level NUMA policies, please see Proposal: Pod-level NUMA Policy.
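
The compatibility rules between Pod-level and Node-level policies can be summarized in a few lines of Python. This is a sketch of the rules as stated in this section, not the scheduler's code; `node_accepts_pod` is a hypothetical helper.

```python
def node_accepts_pod(pod_policy: str, node_policy: str) -> bool:
    """Apply the Pod-level / Node-level NUMA policy compatibility rules."""
    # An empty policy on either side imposes no constraint.
    if pod_policy == "" or node_policy == "":
        return True
    # Otherwise both sides must declare the same policy.
    return pod_policy == node_policy


for pod in ("SingleNUMANode", "Restricted", "BestEffort", ""):
    row = [node_accepts_pod(pod, node)
           for node in ("SingleNUMANode", "Restricted", "BestEffort")]
    print(repr(pod), row)
```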

Terway Net QoS

In v1.5.0, Koordinator cooperates with the Terway community to build the Network QoS.

Terway QoS was built to solve the network bandwidth contention problem in workload consolidation and co-location scenarios. It supports limiting the bandwidth of Pods or QoS classes, and differs from other solutions in the following ways:

  1. It supports limiting bandwidth according to the business type, which suits workload consolidation scenarios where multiple applications are co-located on the same node.
  2. It supports dynamic adjustment of Pod bandwidth limits.
  3. It can limit the bandwidth of the whole machine, supports multiple network cards, and supports limiting both container-network Pods and HostNetwork Pods.

Terway QoS has 3 types of network bandwidth priority, and the corresponding Koordinator default QoS mapping is as follows:

Koordinator QoS | Kubernetes QoS       | Terway Net QoS
SYSTEM          | --                   | L0
LSE             | Guaranteed           | L1
LSR             | Guaranteed           | L1
LS              | Guaranteed/Burstable | L1
BE              | BestEffort           | L2

In the co-location scenario, we want to ensure the maximum bandwidth of online applications to avoid contention. When the node is idle, offline jobs can also fully utilize all bandwidth resources.

Therefore, users can define business traffic as 3 priorities, from high to low: L0, L1, and L2. Contention is defined as the situation in which the combined bandwidth of L0, L1, and L2 exceeds the whole machine bandwidth.

L0's maximum bandwidth is adjusted dynamically according to the real-time bandwidth of L1 and L2. It can rise as high as the total machine bandwidth and fall as low as "total machine bandwidth - L1 minimum bandwidth - L2 minimum bandwidth". The bandwidth of L1 and L2 never exceeds their upper limits. In the contention scenario, the bandwidth of L1 and L2 does not drop below their lower limits, and bandwidth is limited in the order of L2, L1, and L0. Since Terway QoS has only three priorities, only machine-wide bandwidth limits for LS and BE can be configured. The remainder available to L0 is derived from the upper bandwidth limit of the whole machine.

Here is an example of the configuration:

# unit: bps
resource-qos-config: |
  {
    "clusterStrategy": {
      "policies": {"netQOSPolicy": "terway-qos"},
      "lsClass": {
        "networkQOS": {
          "enable": true,
          "ingressRequest": "50M",
          "ingressLimit": "100M",
          "egressRequest": "50M",
          "egressLimit": "100M"
        }
      },
      "beClass": {
        "networkQOS": {
          "enable": true,
          "ingressRequest": "10M",
          "ingressLimit": "200M",
          "egressRequest": "10M",
          "egressLimit": "200M"
        }
      }
    }
  }
system-config: |-
  {
    "clusterStrategy": {
      "totalNetworkBandwidth": "600M"
    }
  }
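
To make the L0 bounds concrete, here is a small Python sketch that plugs in the values from the example configuration. It rests on two assumptions that the text does not confirm: that the `ingressRequest` values act as the L1 and L2 minimum bandwidths, and that the `M` suffix means 10^6 bps. `parse_bps` and `l0_bandwidth_range` are hypothetical helpers, not part of Koordinator or Terway.

```python
# Assumed decimal suffixes; the real parser may treat units differently.
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_bps(quantity: str) -> int:
    """Convert a string such as "50M" into bits per second."""
    if quantity[-1] in UNITS:
        return int(quantity[:-1]) * UNITS[quantity[-1]]
    return int(quantity)


def l0_bandwidth_range(total: str, l1_min: str, l2_min: str) -> tuple:
    """Return (lowest, highest) values that L0's maximum bandwidth can take."""
    total_bps = parse_bps(total)
    lowest = total_bps - parse_bps(l1_min) - parse_bps(l2_min)
    return lowest, total_bps


# Ingress direction: total 600M, LS request 50M, BE request 10M.
print(l0_bandwidth_range("600M", "50M", "10M"))
```

Under these assumptions, L0's ingress ceiling moves between 540M and 600M depending on how much L1 and L2 are using.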

Besides, Koordinator supports Pod-level bandwidth limits through the following annotations:

Key                      | Value
koordinator.sh/networkQOS | '{"IngressLimit": "10M", "EgressLimit": "20M"}'

For more information about the Network QoS, please see Network Bandwidth Limitation Using Terway QoS and Terway Community.

Core Scheduling

In v1.5.0, Koordinator provides container-level Core Scheduling ability. It reduces the risk of Side Channel Attacks (SCA) in multi-tenant scenarios, and can be used as a CPU QoS enhancement for the co-location scenarios.

Linux Core Scheduling allows user space to define groups of tasks that may share a physical core. Tasks in the same group are assigned the same cookie as an identifier, and only tasks with the same cookie can run on a physical core (across its SMT siblings) at the same time. Applying this mechanism to security or performance scenarios makes it possible to:

  1. Isolate physical cores for tasks of different tenants.
  2. Avoid the contention between offline jobs and online services.

Koordinator uses the kernel's Core Scheduling mechanism to implement container-level group isolation policies, which yields the following two capabilities:

  1. Runtime isolation of physical cores: Pods can be grouped by tenant, so that Pods in different groups cannot share a physical core at the same time. This provides multi-tenant isolation.
  2. Next-gen CPU QoS policy: A new CPU QoS policy that ensures both CPU performance and security.

Runtime Isolation of Physical Core

Koordinator provides a Pod label protocol to identify the Core Scheduling group of Pods.

Key                              | Value
koordinator.sh/coreSchedulingGroup | "xxx-group"

Pods in different groups run exclusively at the physical core level, which avoids some side channel attacks on the physical cores, L1 cache, or L2 cache in multi-tenant scenarios.

Unlike cpuset scheduling, the set of CPUs a Pod runs on is not fixed. A physical core can run Pods of different groups at different moments, so physical cores are shared through time-division multiplexing.
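
The cookie rule and the time-division sharing it allows can be modeled with a short Python sketch. This is a toy model of the rule described above, not kernel or koordlet code; `schedule_core` and the task and cookie names are hypothetical.

```python
from typing import Dict, List


def schedule_core(siblings: int, runnable: List[str],
                  cookies: Dict[str, str]) -> List[str]:
    """Pick the tasks that run on one physical core during one time slice.

    The first runnable task decides the cookie for this slice; only tasks
    with the same cookie may occupy the remaining SMT siblings.
    """
    if not runnable:
        return []
    slice_cookie = cookies[runnable[0]]
    same_group = [task for task in runnable if cookies[task] == slice_cookie]
    return same_group[:siblings]


cookies = {"pod-a-1": "tenant-a", "pod-a-2": "tenant-a", "pod-b-1": "tenant-b"}

# Slice 1: tenant-a tasks fill both hyperthreads, pod-b-1 has to wait.
print(schedule_core(2, ["pod-a-1", "pod-b-1", "pod-a-2"], cookies))
# Slice 2: pod-b-1 runs alone; the sibling stays idle rather than host tenant-a.
print(schedule_core(2, ["pod-b-1", "pod-a-1"], cookies))
```

The same core serves both tenants over time, but never both within one slice.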

Next-Gen CPU QoS Policy

Koordinator builds a new CPU QoS policy based on the Core Scheduling and CGroup Idle mechanisms provided by the Anolis OS kernel.

  • BE containers enable the CGroup Idle feature to lower scheduling weights and priorities.
  • LSR/LS containers enable Core Scheduling feature to expel BE tasks of the same group on the physical cores.

Users can enable the Core Scheduling policy by specifying cpuPolicy="coreSched" in the slo-controller-config.

# Example of the slo-controller-config ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |
    {
      "clusterStrategy": {
        "policies": {
          "cpuPolicy": "coreSched"
        },
        "lsClass": {
          "cpuQOS": {
            "enable": true,
            "coreExpeller": true,
            "schedIdle": 0
          }
        },
        "beClass": {
          "cpuQOS": {
            "enable": true,
            "coreExpeller": false,
            "schedIdle": 1
          }
        }
      }
    }

For more information about the Core Scheduling, please see CPU QoS.

Other Changes

Koordinator v1.5.0 also includes the following enhancements and reliability improvements:

  1. Enhancements: Reservation Restricted mode supports controlling, through an annotation, which resources strictly follow the Restricted semantics. The NUMA alignment policy adapts to upstream. Coscheduling implements fair queuing so that Pods in the same GangGroup are dequeued together, while different Gangs and bare Pods are sorted by last scheduling time. NRI mode supports a reconnection mechanism. Koordlet improves its monitoring metrics and adds performance metrics. BlkioReconcile updates its configurations.
  2. BugFixes: Fix the memory leak in the koordlet CPUSuppress feature. Fix the panic in runtimeproxy. Revise the calculation logic of the CPICollector, BECPUEvict, and CPUBurst modules.
  3. Environment compatibility: All components are upgraded to K8S 1.28. Koordlet supports running on non-CUDA images. Koordlet adapts to the kubelet 1.28 configuration and optimizes the compatibility logic for the CPU manager. Koordlet adapts to the cri-o runtime.
  4. Refactoring and improvement: Koordlet improves the resctrl updating logic and the eviction logic. Revise the GPU resource and card model reporting. Revise the Batch resource calculation logic.
  5. CI/CD: Fix some flaky tests.

For more information about the v1.5.0 changes, please see v1.5.0 Release.

Contributors

Koordinator is an open source community. In v1.5.0, there are 10 new developers who contributed to Koordinator main repo. They are @georgexiang, @googs1025, @l1b0k, @ls-2018, @PeterChg, @sjtufl, @testwill, @yangfeiyu20102011, @zhifanggao, @zwForrest.

Koordinator community now has many enterprise contributors, some of which became Maintainers and Members. During the v1.5.0 release, the new Maintainers are

  • @songzh215
  • @j4ckstraw
  • @lucming
  • @kangclzjc

Thanks to the long-time contributors for their consistent efforts and to the newcomers for their active contributions. We welcome more contributors to join the Koordinator community.

Future Plan

In upcoming versions, Koordinator plans the following work:

  • Scheduling performance optimization: The scheduling performance is the key indicator of whether the scheduler can handle large-scale clusters. In the next version, Koordinator will provide a setup guide of the basic benchmark environment and common benchmark scenarios, and start to improve the scheduling performance of Koord-Scheduler.
  • Device union allocation: In distributed LLM training scenarios, GPUs on different machines usually need to communicate with each other through high-performance network cards, and a GPU and its network card should be allocated close to each other for better performance. Koordinator is working on union allocation for multiple heterogeneous resources. Union allocation is already supported in the protocol and the scheduling logic. The single-node logic for reporting network card resources is being explored.
  • Job-level quota preemption: In large-scale clusters, some quotas can be busy while others are idle. The ElasticQuota plugin already supports borrowing resources from idle quotas, but the scheduler does not yet consider Job information when the lending quotas need to take resources back. For Pods belonging to the same Job, the scheduler should preempt at the Job level to keep the job schedulable and improve efficiency.
  • Load-aware scheduling for in-flight pods: Currently, load-aware scheduling filters and scores nodes based on resource utilization. It improves the distribution of utilization across nodes and reduces the risk of scheduling pods to overloaded nodes. However, because node metrics are reported with a lag, the accuracy of the utilization can be affected by in-flight pods. In the coming version, load-aware scheduling will take in-flight pods into account, ensure that pods are not scheduled to overloaded nodes, and further improve the distribution of utilization across nodes.
  • Fine-grained isolation strategy for last-level cache and memory bandwidth: Contention of the last-level cache and memory bandwidth between containers can cause performance degradation of the memory access. By isolating the last-level cache and memory bandwidth in the QoS-level without exceeding the capacity of the RDT groups, koordlet provides the Resctrl QoS to reduce the contention between the offline workloads with the online services. In the next version, koordlet will enhance the isolation strategy based on NRI (Node Resource Interface) mode introduced in v1.3. It will provide the pod-level isolation capability, which greatly improves the feature's flexibility and timeliness.

Acknowledgement

Since the project was open-sourced, Koordinator has shipped more than 19 releases with contributions from 80+ contributors. The community is growing and has been continuously improving. We thank all the community members for their active participation and valuable feedback. We also want to thank the CNCF organization and related community members for supporting the project.

We welcome more developers and end users to join us! It is your participation and feedback that keep Koordinator improving. Whether you are a beginner or an expert in the Cloud Native communities, we look forward to hearing your voice!

· 20 min read
Jianyu Wang

Background

As an actively developed open source project, Koordinator has undergone multiple version iterations since the release of v0.1.0 in April 2022, continuously bringing innovations and enhancements to the Kubernetes ecosystem. The core objective of the project is to provide comprehensive solutions for orchestrating collocated workloads, scheduling resources, ensuring resource isolation, and tuning performance to help users optimize container performance and improve cluster resource utilization.

In past version iterations, the Koordinator community has continued to grow, receiving active participation and contributions from engineers at well-known companies. These include Alibaba, Ant Technology Group, Intel, Xiaomi, Xiaohongshu, iQIYI, Qihoo 360, Youzan, Quwan, Meiya Pico, PITS, among others. Each version has advanced through the collective efforts of the community, demonstrating the project's capability to address challenges in actual production environments.

Today, we are pleased to announce that Koordinator v1.4.0 is officially released. This version introduces several new features, including Kubernetes and YARN workload co-location, a NUMA topology alignment strategy, CPU normalization, and cold memory reporting. It also enhances features in key areas such as elastic quota management, QoS management for non-containerized applications on hosts, and descheduling protection strategies. These innovations and improvements aim to better support enterprise-level Kubernetes cluster environments, particularly in complex and diverse application scenarios.

The release of version v1.4.0 will bring users support for more types of computing workloads and more flexible resource management mechanisms. We look forward to these improvements helping users to address a broader range of enterprise resource management challenges. In the v1.4.0 release, a total of 11 new developers have joined the development of the Koordinator community. They are @shaloulcy, @baowj-678, @zqzten, @tan90github, @pheianox, @zxh326, @qinfustu, @ikaven1024, @peiqiaoWang, @bogo-y, and @xujihui1985. We thank all community members for their active participation and contributions during this period and for their ongoing commitment to the community.

Interpretation of Version Features

1. Support Kubernetes and YARN workload co-location

Koordinator already supports the co-location of online and offline workloads within the Kubernetes ecosystem. However, outside the Kubernetes ecosystem, a considerable number of big data workloads still run on traditional Hadoop YARN.

In response, the Koordinator community, together with developers from Alibaba Cloud, Xiaohongshu, and Ant Financial, has jointly launched the Hadoop YARN and Kubernetes co-location project, Koordinator YARN Copilot. This initiative enables the running of Hadoop NodeManager within Kubernetes clusters, fully leveraging the technical value of peak-shaving and resource reuse for different types of workloads. Koordinator YARN Copilot has the following features:

  • Embracing the open-source ecosystem: Built upon the open-source version of Hadoop YARN without any intrusive modifications to YARN.
  • Unified resource priority and QoS policy: YARN NodeManager utilizes Koordinator’s Batch priority resources and adheres to Koordinator's QoS management policies.
  • Node-level resource sharing: The co-location resources provided by Koordinator can be used by both Kubernetes pod and YARN tasks. Different types of offline applications can run on the same node.

For the detailed design of Koordinator YARN Copilot and its use in the Xiaohongshu production environment, please refer to Previous Articles and Official Community Documents.

2. Introducing NUMA topology alignment strategy

The workloads running in Kubernetes clusters are increasingly diverse, particularly in fields such as machine learning, where the demand for high-performance computing resources is on the rise. In these fields, a significant amount of CPU resources is required, as well as other high-speed computing resources like GPUs and RDMA. Moreover, to achieve optimal performance, these resources often need to be located on the same NUMA node or even the same PCIe bus.

Kubernetes' kubelet includes a topology manager that manages the NUMA topology of resource allocation. It attempts to align the topologies of multiple resources at the node level during the admission phase. However, because the node component lacks a global view of the scheduler and the timing of node selection for pods, pods may be scheduled on nodes that are unable to meet the topology alignment policy. This can result in pods failing to start due to topology affinity errors.

To solve this problem, Koordinator moves NUMA topology selection and alignment to the central scheduler, optimizing resource NUMA topology at the cluster level. In this release, Koordinator introduces NUMA-aware scheduling of CPU resources (including Batch resources) and NUMA-aware scheduling of GPU devices as alpha features. The entire suite of NUMA-aware scheduling features is rapidly evolving.

Koordinator enables users to configure the NUMA topology alignment strategy for multiple resources on a node through the node's labels. The configurable strategies are as follows:

  • None, the default strategy, does not perform any topological alignment.
  • BestEffort indicates that the node does not strictly allocate resources according to NUMA topology alignment. The scheduler can always allocate such nodes to pods as long as the remaining resources meet the pods' needs.
  • Restricted means that nodes allocate resources in strict accordance with NUMA topology alignment. In other words, the scheduler must select the same one or more NUMA nodes when allocating multiple resources, otherwise, the node should not be considered. For instance, if a pod requests 33 CPU cores and each NUMA node has 32 cores, it can be allocated to use two NUMA nodes. Moreover, if the pod also requests GPUs or RDMA, these must be on the same NUMA node as the CPU.
  • SingleNUMANode is similar to Restricted, adhering strictly to NUMA topology alignment, but it differs in that Restricted permits the use of multiple NUMA nodes, whereas SingleNUMANode restricts allocation to a single NUMA node.
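
The difference between Restricted and SingleNUMANode can be sketched in Python for a single resource (CPU). This is an illustration under simplifying assumptions, not the koord-scheduler algorithm: it looks at CPU only and prefers the smallest set of NUMA nodes that covers the request. `pick_numa_nodes` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Optional


def pick_numa_nodes(policy: str, cpu_request: int,
                    free_cpus: List[int]) -> Optional[List[int]]:
    """Return the NUMA node indices to allocate from, or None if the
    node cannot satisfy the policy."""
    if policy not in ("Restricted", "SingleNUMANode"):
        raise ValueError(f"unsupported policy: {policy}")
    # SingleNUMANode may use exactly one NUMA node; Restricted may span several.
    max_nodes = 1 if policy == "SingleNUMANode" else len(free_cpus)
    for size in range(1, max_nodes + 1):
        for combo in combinations(range(len(free_cpus)), size):
            if sum(free_cpus[i] for i in combo) >= cpu_request:
                return list(combo)
    return None


# The 33-core example from the text: two NUMA nodes with 32 cores each.
print(pick_numa_nodes("Restricted", 33, [32, 32]))
print(pick_numa_nodes("SingleNUMANode", 33, [32, 32]))
```

With Restricted the Pod spans both NUMA nodes, while with SingleNUMANode the node is rejected because no single NUMA node has 33 free cores.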

For example, to set the SingleNUMANode policy for node-0, you would do the following:

apiVersion: v1
kind: Node
metadata:
  labels:
    node.koordinator.sh/numa-topology-policy: "SingleNUMANode"
  name: node-0
spec:
  ...

In a production environment, users may have enabled kubelet's topology alignment policy, which will be reflected by the koordlet in the TopologyPolicies field of the NodeResourceTopology CR object. When kubelet's policy conflicts with the policy set by the user on the node, the kubelet policy shall take precedence. The koord-scheduler essentially adopts the same NUMA alignment policy semantics as the kubelet topology manager. The kubelet policies SingleNUMANodePodLevel and SingleNUMANodeContainerLevel are both mapped to SingleNUMANode.

After configuring the NUMA alignment strategy for the node, the scheduler can identify many suitable NUMA node allocation results for each pod. Koordinator currently provides the NodeNUMAResource plugin, which allows for configuring the NUMA node allocation result scoring strategy for CPU and memory resources. This includes LeastAllocated and MostAllocated strategies, with LeastAllocated being the default. Each resource can also be assigned a configurable weight. The scheduler will ultimately select the NUMA node allocation with the highest score. For instance, we can configure the NUMA node allocation result scoring strategy to MostAllocated, as shown in the following example:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeNUMAResource
        args:
          apiVersion: kubescheduler.config.k8s.io/v1beta2
          kind: NodeNUMAResourceArgs
          scoringStrategy: # Here configure Node level scoring strategy
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1
          numaScoringStrategy: # Here configure NUMA-Node level scoring strategy
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1
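
As a rough illustration of how LeastAllocated and MostAllocated differ, the Python sketch below scores one candidate NUMA allocation as a weighted average of per-resource scores. It is modeled on the general shape of the upstream kube-scheduler resource scoring strategies; whether the NodeNUMAResource plugin uses exactly this formula is an assumption, and the function names are hypothetical.

```python
from typing import Dict

MAX_SCORE = 100


def resource_score(strategy: str, requested: int, allocatable: int) -> int:
    """Score one resource; `requested` includes the pod being placed."""
    if allocatable <= 0:
        return 0
    requested = min(requested, allocatable)
    if strategy == "MostAllocated":
        return requested * MAX_SCORE // allocatable
    if strategy == "LeastAllocated":
        return (allocatable - requested) * MAX_SCORE // allocatable
    raise ValueError(f"unknown strategy: {strategy}")


def weighted_score(strategy: str, requested: Dict[str, int],
                   allocatable: Dict[str, int], weights: Dict[str, int]) -> int:
    """Weighted average of the per-resource scores."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0
    total = sum(
        resource_score(strategy, requested.get(name, 0), allocatable.get(name, 0)) * weight
        for name, weight in weights.items()
    )
    return total // total_weight


requested = {"cpu": 8, "memory": 16}
allocatable = {"cpu": 16, "memory": 64}
weights = {"cpu": 1, "memory": 1}
print(weighted_score("MostAllocated", requested, allocatable, weights))
print(weighted_score("LeastAllocated", requested, allocatable, weights))
```

MostAllocated favors packing Pods onto already-used NUMA nodes, while LeastAllocated favors spreading them.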

3. ElasticQuota evolves again

In order to fully utilize cluster resources and reduce management system costs, users often deploy workloads from multiple tenants in the same cluster. When cluster resources are limited, competition for these resources is inevitable between different tenants. As a result, the workloads of some tenants may always be satisfied, while others may never be executed, leading to demands for fairness. The quota mechanism is a very natural way to ensure fairness among tenants, where each tenant is allocated a specific quota, and they can use resources within that quota. Tasks exceeding the quota will not be scheduled or executed. However, simple quota management cannot fulfill tenants' expectations for elasticity in the cloud. Users hope that in addition to satisfying resource requests within the quota, requests for resources beyond the quota can also be met on demand.

In previous versions, Koordinator leveraged the upstream ElasticQuota protocol, which allowed tenants to set a 'Min' value to express their resource requests that must be satisfied, and a 'Max' value to limit the maximum resources they can use. 'Max' was also used to represent the shared weight of the remaining resources of the cluster when they were insufficient.
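
The Min/Max semantics can be illustrated with a simplified Python sketch for a single resource dimension. It is not the ElasticQuota plugin's algorithm: it assumes each quota is first guaranteed `min(request, Min)`, that the remaining capacity is then shared in proportion to `Max` as the weight, and that no quota receives more than `min(request, Max)`. `runtime_quota` is a hypothetical name.

```python
from typing import Dict, Tuple


def runtime_quota(total: float,
                  quotas: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    """quotas maps name -> (Min, Max, request); returns name -> granted amount."""
    # Step 1: every quota is guaranteed its request up to Min.
    granted = {name: min(req, mn) for name, (mn, mx, req) in quotas.items()}
    # No quota may receive more than its request or its Max.
    cap = {name: min(req, mx) for name, (mn, mx, req) in quotas.items()}
    remaining = total - sum(granted.values())
    active = {name for name in quotas if granted[name] < cap[name]}

    # Step 2: share what is left by weight (Max) until demand or capacity runs out.
    while remaining > 1e-9 and active:
        weight_sum = sum(quotas[name][1] for name in active)
        handed_out = 0.0
        for name in list(active):
            share = remaining * quotas[name][1] / weight_sum
            give = min(share, cap[name] - granted[name])
            granted[name] += give
            handed_out += give
            if cap[name] - granted[name] <= 1e-9:
                active.discard(name)
        remaining -= handed_out
        if handed_out <= 1e-9:
            break
    return granted


result = runtime_quota(100, {
    "team-a": (20, 60, 80),   # wants more than its Max
    "team-b": (30, 40, 35),   # wants slightly more than its Min
    "team-c": (10, 100, 5),   # uses less than its Min
})
print(result)
```

In this example team-c's unused Min is lent out, team-b gets its full request, and team-a is capped at its Max.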

In addition to offering a flexible quota mechanism that accommodates tenants' on-demand resource requests, Koordinator enhances ElasticQuota with annotations to organize it into a tree structure, thereby simplifying the expression of hierarchical organizational structures for users.

[Figure: a common quota tree in a cluster using Koordinator's elastic quota]

The figure above depicts a common quota tree in a cluster utilizing Koordinator's elastic quota. The root quota serves as the link between the quota system and the actual resources within the cluster. In previous iterations, the root quota existed only within the scheduler's logic. In this release, we have made the root quota accessible to users in the form of a Custom Resource (CR). Users can now view information about the root quota through the ElasticQuota CR named koordinator-root-quota.

3.1 Introducing Multi QuotaTree

Large clusters contain various types of nodes. For example, VMs provided by cloud vendors have different architectures, most commonly amd64 and arm64, and there are different models within the same architecture. Nodes also generally have location attributes such as availability zone. When nodes of different types are managed in the same quota tree, their unique attributes are lost, so the current ElasticQuota is not precise enough for users who want to manage those attributes in a refined manner. To meet users' requirements for flexible resource management or resource isolation, Koordinator allows users to divide the resources in the cluster into multiple parts, each managed by its own quota tree, as shown in the following figure:

[Figure: cluster resources divided into multiple quota trees]

Additionally, to help users simplify management complexity, Koordinator introduced the ElasticQuotaProfile mechanism in version 1.4.0. Users can quickly associate nodes with different quota trees through the nodeSelector, as shown in the following example:

apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: amd64
  name: amd64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: amd64 # amd64 node
  quotaName: amd64-root-quota # the name of root quota
---
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: arm64
  name: arm64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: arm64 # arm64 node
  quotaName: arm64-root-quota # the name of root quota

After associating nodes with the quota tree, the user utilizes the ElasticQuota in each quota tree as before. When a user submits a pod to the corresponding quota, they currently still need to configure the pod's NodeAffinity to ensure that the pod runs on the correct node. In the future, we plan to add a feature that will help users automatically manage the mapping relationship from quota to node.

3.2 Support non-preemptible

Koordinator ElasticQuota supports sharing the unused part of 'Min' in ElasticQuota with other ElasticQuotas to improve resource utilization. However, when resources are tight, the pod that borrows the quota will be preempted and evicted through the preemption mechanism to get the resources back.

In actual production environments, if some critical online services borrow this part of the quota from other ElasticQuotas and preemption subsequently occurs, the quality of service may be adversely affected. Such workloads should not be subject to preemption.

To implement this safeguard, Koordinator v1.4.0 introduced a new API. Users can simply label a pod with quota.scheduling.koordinator.sh/preemptible: "false" to indicate that the pod should not be preempted.

When the scheduler detects that a pod is declared non-preemptible, it ensures that the available quota for such a pod does not exceed its 'Min'. Thus, it is important to note that when enabling this feature, the 'Min' of an ElasticQuota should be set judiciously, and the cluster must have appropriate resource guarantees in place. This feature maintains compatibility with the original behavior of Koordinator.

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
    # label values must be strings, so the boolean is quoted
    quota.scheduling.koordinator.sh/preemptible: "false"
spec:
...
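The admission rule for non-preemptible pods can be sketched as follows. This is a minimal illustration rather than koord-scheduler code: the function and parameter names are hypothetical, and resources are reduced to a single dimension (CPU milli-cores).

```python
def can_admit_non_preemptible(quota_min: int, non_preemptible_used: int, request: int) -> bool:
    """Return True if a non-preemptible pod still fits within the quota's Min.

    All values are in milli-cores. Non-preemptible pods may never borrow,
    so their total usage is capped at Min (assumed single-resource model).
    """
    return non_preemptible_used + request <= quota_min


# A quota with Min = 4 cores that already runs 3 cores of non-preemptible pods
print(can_admit_non_preemptible(4000, 3000, 1000))  # fits exactly within Min
print(can_admit_non_preemptible(4000, 3500, 1000))  # would exceed Min
```

Preemptible pods are not subject to this cap and may continue to borrow unused quota as before.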

3.3 Other improvements

  1. The koord-scheduler previously supported the use of a single ElasticQuota object across multiple namespaces. However, in some cases, it is desirable for the same object to be shared by only a select few namespaces. To accommodate this need, users can now annotate the ElasticQuota CR with quota.scheduling.koordinator.sh/namespaces, assigning a JSON string array as the value.
  2. Performance optimization: Previously, whenever an ElasticQuota was modified, the ElasticQuota plugin would rebuild the entire quota tree. This process has been optimized in version 1.4.0.
  3. Support ignoring overhead: When a pod utilizes secure containers, an overhead declaration is typically added to the pod specification to account for the resource consumption of the secure container itself. However, whether these additional resource costs should be passed on to the end user depends on the resource pricing strategy. If it is expected that users should not be responsible for these costs, the ElasticQuota can be configured to disregard overhead. With version 1.4.0, this can be achieved by enabling the feature gate ElasticQuotaIgnorePodOverhead.

4. CPU normalization

With the diversification of node hardware in Kubernetes clusters, significant performance differences exist between CPUs of various architectures and generations. Therefore, even if a pod's CPU request is identical, the actual computing power it receives can vary greatly, potentially leading to resource waste or diminished application performance. The objective of CPU normalization is to ensure that each CPU unit in Kubernetes provides consistent computing power across heterogeneous nodes by standardizing the performance of allocatable CPUs.

To address this issue, Koordinator has implemented a CPU normalization mechanism in version 1.4.0. This mechanism adjusts the amount of CPU resources that can be allocated on a node according to the node's resource amplification strategy, ensuring that each allocatable CPU in the cluster delivers a consistent level of computing power. The overall architecture is depicted in the figure below:

[Figure: overall architecture of CPU normalization]

CPU normalization consists of two steps:

  1. CPU performance evaluation: To calculate the performance benchmarks of different CPUs, you can refer to the industrial performance evaluation standard, SPEC CPU. This part is not provided by the Koordinator project.
  2. Configuration of the CPU normalization ratio in Koordinator: The scheduling system schedules resources based on the normalization ratio, which is provided by Koordinator.

Configure the CPU normalization ratio information into slo-controller-config of koord-manager. The configuration example is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-normalization-config: |
    {
      "enable": true,
      "ratioModel": {
        "Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz": {
          "baseRatio": 1.29,
          "hyperThreadEnabledRatio": 0.82,
          "turboEnabledRatio": 1.52,
          "hyperThreadTurboEnabledRatio": 1.0
        },
        "Intel Xeon Platinum 8369B CPU @ 2.90GHz": {
          "baseRatio": 1.69,
          "hyperThreadEnabledRatio": 1.06,
          "turboEnabledRatio": 1.91,
          "hyperThreadTurboEnabledRatio": 1.20
        }
      }
    }
  # ...

For nodes configured with CPU normalization, Koordinator intercepts updates to Node.Status.Allocatable by kubelet through a webhook to achieve the amplification of CPU resources. This results in the display of the normalized amount of CPU resources available for allocation on the node.
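As a rough illustration of how a ratio model could translate into an amplified allocatable value, consider the sketch below. It assumes that the ratio is picked from the four fields according to whether hyper-threading and turbo are enabled (inferred from the field names) and that the result is rounded to whole milli-cores; both the selection rule and the helper names are assumptions, not Koordinator's implementation.

```python
def normalization_ratio(model: dict, hyper_thread: bool, turbo: bool) -> float:
    """Pick the ratio matching the node's CPU feature state (assumed rule)."""
    if hyper_thread and turbo:
        return model["hyperThreadTurboEnabledRatio"]
    if hyper_thread:
        return model["hyperThreadEnabledRatio"]
    if turbo:
        return model["turboEnabledRatio"]
    return model["baseRatio"]


def normalized_allocatable_milli(raw_milli: int, ratio: float) -> int:
    """Amplify the raw allocatable CPU (milli-cores) by the normalization ratio."""
    return round(raw_milli * ratio)


# Ratios copied from the 8369B entry of the example ConfigMap
model_8369b = {
    "baseRatio": 1.69,
    "hyperThreadEnabledRatio": 1.06,
    "turboEnabledRatio": 1.91,
    "hyperThreadTurboEnabledRatio": 1.20,
}
ratio = normalization_ratio(model_8369b, hyper_thread=True, turbo=True)
print(normalized_allocatable_milli(32000, ratio))
```

Under these assumptions, a 32-core node with both features enabled would advertise more normalized CPU than its physical core count, reflecting its higher per-core performance relative to the baseline.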

5. Improved descheduling protection strategies

Pod migration is a complex process that involves steps such as auditing, resource allocation, and application startup. It is often intertwined with application upgrades, scaling scenarios, and the resource operations and maintenance performed by cluster administrators. Consequently, if a large number of pods are migrated simultaneously, the system's stability may be compromised. Furthermore, migrating many pods from the same workload at once can also affect the application's stability. Additionally, simultaneous migrations of pods from multiple jobs may lead to a 'thundering herd' effect. Therefore, it is preferable to process the pods in each job sequentially.

To address these issues, Koordinator previously provided the PodMigrationJob function with some protection strategies. In version v1.4.0, Koordinator has enhanced these protection strategies into an arbitration mechanism. When there is a large number of executable PodMigrationJobs, the arbiter decides which ones can proceed by employing sorting and filtering techniques.

The sorting process is as follows:

  • Time elapsed since the migration started: the shorter the interval, the higher the ranking.
  • Pod priority of the PodMigrationJob: the lower the priority, the higher the ranking.
  • Workload dispersion: jobs are spread across workloads, while PodMigrationJobs that belong to the same job are kept close together.
  • Migration in progress: if other pods in the job that owns the PodMigrationJob's pod are already being migrated, the PodMigrationJob is ranked higher.

The filtering process is as follows:

  • Group and filter PodMigrationJobs based on workload, node, namespace, etc.
  • Check the number of running PodMigrationJobs in each workload; workloads that have reached the configured threshold are excluded.
  • Check whether the number of unavailable replicas in each workload exceeds the maximum number of unavailable replicas, and those that exceed the number will be excluded.
  • Check whether the number of pods being migrated on the node where the target pod is located exceeds the maximum migration amount of a single node, and those that exceed will be excluded.
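The sort-then-filter flow can be sketched as below. This is a simplified model under stated assumptions: the `MigrationJob` fields, the order in which the sort criteria are combined, and the two limits are all hypothetical, and the unavailable-replica check is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MigrationJob:
    name: str
    workload: str
    node: str
    age_seconds: float        # time elapsed since the migration started
    pod_priority: int
    workload_migrating: bool  # another pod of the same workload is already migrating


def arbitrate(jobs, max_per_workload, max_per_node):
    """Sort candidate jobs, then admit them while per-workload and per-node limits hold."""
    ordered = sorted(
        jobs,
        key=lambda j: (not j.workload_migrating, j.age_seconds, j.pod_priority, j.workload),
    )
    per_workload, per_node = Counter(), Counter()
    admitted = []
    for job in ordered:
        if per_workload[job.workload] >= max_per_workload:
            continue  # too many migrations already running in this workload
        if per_node[job.node] >= max_per_node:
            continue  # too many migrations already running on this node
        per_workload[job.workload] += 1
        per_node[job.node] += 1
        admitted.append(job.name)
    return admitted


jobs = [
    MigrationJob("a", "w1", "n1", 10, 5, False),
    MigrationJob("b", "w1", "n1", 5, 5, False),
    MigrationJob("c", "w2", "n2", 20, 1, True),
]
print(arbitrate(jobs, max_per_workload=1, max_per_node=1))
```

Jobs that are not admitted in one round simply remain pending and can be reconsidered in the next arbitration round.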

6. Cold Memory reporting

To improve system performance, the kernel generally tries not to free the page cache requested by an application but allocates as much as possible to the application. Although allocated by the kernel, this memory may no longer be accessed by applications and is referred to as cold memory.

Koordinator introduced the cold memory reporting function in version 1.4, primarily to lay the groundwork for future cold memory recycling capabilities. Cold memory recycling is designed to address two scenarios:

  1. In standard Kubernetes clusters, when the node memory level is too high, sudden memory requests can lead to direct memory recycling of the system. This can affect the performance of running containers and, in extreme cases, may result in out-of-memory (OOM) events if recycling is not timely. Therefore, maintaining a relatively free pool of node memory resources is crucial for runtime stability.
  2. In co-location scenarios, high-priority applications' unused requested resources can be recycled by lower-priority applications. Since memory not reclaimed by the operating system is invisible to the Koordinator scheduling system, reclaiming unused memory pages of a container is beneficial for improving resource utilization.

Koordlet has added a cold page collector to its collector plugins for reading the cgroup file memory.idle_stat, which is exported by kidled (Anolis kernel), kstaled (Google), or DAMON (Amazon). This file contains information about cold pages in the page cache and is present at every hierarchy level of memory. Koordlet already supports the kidled cold page collector and provides interfaces for other cold page collectors.

After collecting cold page information, the cold page collector stores the metrics, such as hot page usage and cold page size for nodes, pods, and containers, into metriccache. This data is then reported to the NodeMetric Custom Resource (CR).

Users can enable cold memory recycling and configure cold memory collection strategies through NodeMetric. Currently, three strategies are offered: usageWithHotPageCache, usageWithoutPageCache, and usageWithPageCache. For more details, please see the community Design Document.

7. QoS management for non-containerized applications

In the process of enterprise containerization, there may be non-containerized applications running on the host alongside those already running on Kubernetes. In order to be better compatible with enterprises in the containerization process, Koordinator has developed a node resource reservation mechanism. This mechanism can reserve resources and assign specific QoS (Quality of Service) levels to applications that have not yet been containerized. Unlike the resource reservation configuration provided by kubelet, Koordinator's primary goal is to address QoS issues that arise during the runtime of both non-containerized and containerized applications. The overall solution is depicted in the figure below:

[Figure: QoS management for non-containerized applications on the host]

Currently, applications need to start processes into the corresponding cgroup according to specifications, and Koordinator does not provide an automatic cgroup relocation tool. For host non-containerized applications, QoS is supported as follows:

  • LS (Latency Sensitive)

    • CPU QoS (Group Identity): The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet sets the Group Identity parameter for it according to the CPU QoS configuration;
    • CPUSet Allocation: The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet will set all CPU cores in the CPU share pool for it.
  • BE (Best-effort)

    • CPU QoS (Group Identity): The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet sets the Group Identity parameter for it according to the configuration of CPU QoS.

For detailed design of QoS management of non-containerized applications on the host, please refer to Community Documentation. In the future, we will gradually add support for other QoS strategies for host non-containerized applications.

8. Other features

In addition to the new features and functional enhancements mentioned above, Koordinator has also implemented the following bug fixes and optimizations in version 1.4.0:

  1. RequiredCPUBindPolicy: Fine-grained CPU orchestration now supports the configuration of the required CPU binding policy, which means that CPUs are allocated strictly in accordance with the specified CPU binding policy; otherwise, scheduling will fail.
  2. CI/CD: In v1.4.0, the Koordinator community provides a set of e2e testing pipelines, and an ARM64 image is now available.
  3. Batch resource calculation strategy optimization: There is support for the maxUsageRequest calculation strategy, which conservatively reclaims high-priority resources. This update also optimizes the underestimate of Batch allocatable when a large number of pods start and stop on a node in a short period of time and improves considerations for special circumstances such as host non-containerized application, third-party allocatable, and dangling pod usage.
  4. Others: Optimizations include using libpfm4 and perf groups to improve CPI collection, allowing SystemResourceCollector to support customized expiration time configuration, enabling BE pods to calculate CPU satisfaction based on the evictByAllocatable policy, repairing koordlet's CPUSetAllocator filtering logic for pods with LS and None QoS, and enhancing RDT resource control to retrieve the task IDs of sandbox containers.

For a comprehensive list of new features in version 1.4.0, please visit the v1.4.0 Release page.

Future plan

In upcoming versions, Koordinator has planned the following features:

  • Core Scheduling: On the runtime side, Koordinator has begun exploring the next generation of CPU QoS capabilities. By leveraging kernel mechanisms such as Linux Core Scheduling, it aims to enhance resource isolation at the physical core level and reduce the security risks associated with co-location. For more details on this work, see Issue #1728.
  • Joint Allocation of Devices: In scenarios involving AI large model distributed training, GPUs from different machines often need to communicate through high-performance network cards. Performance is improved when GPUs and high-performance network cards are allocated in close proximity. Koordinator is advancing the joint allocation of multiple heterogeneous resources. Currently, it supports joint allocation in terms of protocol and scheduler logic; the reporting logic for network card resources on the node side is being explored.

For more information, see Milestone v1.5.0.

Conclusion

Finally, we are immensely grateful to all the contributors and users of the Koordinator community. Your active participation and valuable advice have enabled Koordinator to continue improving. We eagerly look forward to your ongoing feedback and warmly welcome new contributors to join our ranks.

· 12 min read
Rougang Han

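Background

Koordinator is an open-source project that draws on Alibaba's years of experience in container scheduling to provide a complete co-location solution. It covers co-located workload orchestration, resource scheduling, resource isolation, and performance tuning, helping users optimize container performance, make full use of idle physical resources, improve resource efficiency, and increase the efficiency and reliability of latency-sensitive workloads and batch jobs.

We are pleased to announce the release of Koordinator v1.3.0. Since v0.1.0 was released in April 2022, Koordinator has shipped 11 versions and attracted contributions from many excellent engineers at Alibaba, Intel, Xiaomi, XiaoHongShu, iQIYI, 360, YouZan, and other companies. Koordinator v1.3.0 brings new features such as NRI (Node Resource Interface) support and Mid resource overcommitment, along with stability fixes, performance optimizations, and functional enhancements to resource reservation, load-aware scheduling, DeviceShare scheduling, load-aware descheduling, the scheduler framework, node metric collection, and the resource overcommitment framework.

In v1.3.0, 12 new developers joined the Koordinator community: @bowen-intel, @BUPT-wxq, @Gala-R, @haoyann, @kangclzjc, @Solomonwisdom, @stulzq, @TheBeatles1994, @Tiana2018, @VinceCui, @wenchezhao, and @zhouzijiang. Thank you for your active participation and contributions, and thanks to everyone for their continued investment in the community.

Key Features

Resource Reservation Enhancements

Resource reservation (Reservation) was first proposed in v0.5.0 and has been refined over a year of iteration. v1.3.0 strengthens the reservation mechanism for preemption, device reservation, and Coscheduling scenarios, and adds the allocatePolicy field to define different allocation policies for reserved resources. The latest Reservation API is shown below: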

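apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # The template field describes the reservation's resource requirements and affinity, just as if scheduling a pod.
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - cn-hangzhou-i
      schedulerName: koord-scheduler # use koord-scheduler to schedule the Reservation object.
  # Specify the owners that may allocate the reserved resources.
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo
  ttl: 1h
  # Whether the reserved resources can be allocated only once.
  allocateOnce: true
  # Allocation policy for the reserved resources. Currently supported:
  # - Default: no restriction on allocating reserved resources. A pod allocates from the reservation on the node first; if the reserved resources are insufficient, it continues with the node's free resources.
  # - Aligned: a pod allocates from the reservation on the node first; if the reserved resources are insufficient, it continues with the node's free resources, but that part must satisfy the pod's request. This policy can be used to keep a pod from allocating from multiple reservations at once.
  # - Restricted: for each resource dimension included in the reservation, the pod must allocate from the reservation; other dimensions may use the node's free resources. Includes the semantics of the Aligned policy.
  # The Default policy cannot yet coexist with the Aligned or Restricted policy on the same node.
  allocatePolicy: "Aligned"
  # Controls whether the reserved resources can be used.
  unschedulable: false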

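In addition, v1.3.0 includes the following compatibility and performance improvements to resource reservation:

  1. Enhanced Reservation preemption: pods inside a Reservation may preempt one another, while pods outside a Reservation are not allowed to preempt pods inside it.
  2. Enhanced device reservation: if device resources on a node are partially reserved and used by pods, the remaining resources can still be allocated.
  3. Reservations can now use Coscheduling.
  4. Added the Reservation Affinity protocol, which lets users require that resources be allocated from a Reservation.
  5. Improved Reservation compatibility: fixed an issue where Reservations caused native scoring plugins to stop taking effect.
  6. Fixed the scheduling performance regression introduced by Reservation.
  7. Fixed an issue where ports reserved by a Reservation were deleted by mistake.

For the design of resource reservation, see Designs - Resource Reservation.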

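Other Scheduling Enhancements

In v1.3.0, Koordinator also includes the following scheduling and descheduling enhancements:

  1. DeviceShare scheduling

    • Changed how GPU resources are requested: when using the GPU Share API, koordinator.sh/gpu-memory or koordinator.sh/gpu-memory-ratio must be declared, while koordinator.sh/gpu-core may be omitted.
    • Added scoring support, which can implement bin-packing or spread for both GPU Share and whole-card allocation, including at card granularity.
    • Fixed an issue where accidentally deleting the Device CRD left the scheduler's internal state inconsistent and caused devices to be allocated more than once.
  2. Load-aware scheduling: fixed the scheduling logic for pods that specify only requests.

  3. Scheduler framework: optimized Patch operations in the PreBind phase by merging the patches of multiple plugins into a single submission, which improves efficiency and reduces pressure on the APIServer.

  4. Descheduling

    • LowNodeLoad supports different load thresholds and parameters per node pool and remains compatible with existing configurations.
    • Pods whose schedulerName is not koord-scheduler are skipped; a different schedulerName can be configured.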

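NRI Resource Management Mode

Koordinator's runtime hooks support two modes, standalone and CRI proxy, each with its own limitations. Although standalone mode has been heavily optimized, obtaining more timely pod or container events, or injecting environment variables, still relies on proxy mode. Proxy mode, however, requires deploying a separate koord-runtime-proxy component to proxy CRI (Container Runtime Interface) requests, as well as changing the kubelet startup parameters and restarting the kubelet.

NRI (Node Resource Interface) is a general framework for plugin extensions of CRI-compatible container runtimes. It is independent of any specific container runtime (e.g. containerd, cri-o) and exposes interfaces for different lifecycle events, allowing users to add custom logic without modifying the runtime's source code. Notably, NRI 2.0 needs only a single plugin instance to handle all NRI events and requests. The container runtime communicates with the plugin over a Unix domain socket using a Protobuf-based protocol, which gives better performance than NRI 1.0 and makes stateful NRI plugins possible.

With NRI, Koordinator can subscribe to pod and container lifecycle events in a timely manner while avoiding intrusive changes to the kubelet. Koordinator v1.3.0 adopts NRI, the approach recommended by the community, to manage runtime hooks and resolve the problems of earlier versions. This greatly improves deployment flexibility and processing timeliness, and offers an elegant, standardized model for resource management in cloud-native systems.

[Figure: NRI resource management mode]

Note: NRI mode does not support the docker container runtime. Users running docker should continue to use standalone or proxy mode.

To deploy Koordinator with NRI enabled, see Installation - Enable NRI Mode Resource Management.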

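Node Profiling and Mid Resource Overcommitment

Koordinator divides node resources into four priority tiers: Prod, Mid, Batch, and Free. Lower-priority resources can reuse physical resources that higher-priority tiers have allocated but not used, which raises physical resource utilization. The higher the priority, the more stable the resources. For example, Batch resources are overcommitted from high-priority resources that are allocated but unused in the short term, while Mid resources come from high-priority resources that are allocated but unused over the long term. The resource priority model is shown below:

[Figure: resource priority model]

Koordinator v1.3.0 adds node profiling, which predicts peak usage from the historical resource usage of Prod pods to support overcommitted scheduling of Mid-tier resources. Mid overcommitment is computed as follows:

MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)
ProdReclaimable := max(0, ProdAllocated - ProdPeak * (1 + safeMargin))

  • ProdPeak: the peak usage of the Prod pods scheduled on the node over a medium-to-long period (e.g. 12h), as estimated by node profiling.
  • ProdReclaimable: the Prod resources estimated to be reusable over a medium-to-long period, based on the node profiling result.
  • MidAllocatable: the Mid resources allocatable on the node.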
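The two formulas translate directly into code. The sketch below is only an illustration of the arithmetic for a single resource dimension; the function names and the sample numbers are made up for the example.

```python
def prod_reclaimable(prod_allocated: float, prod_peak: float, safe_margin: float) -> float:
    """ProdReclaimable := max(0, ProdAllocated - ProdPeak * (1 + safeMargin))"""
    return max(0.0, prod_allocated - prod_peak * (1 + safe_margin))


def mid_allocatable(prod_allocated: float, prod_peak: float, node_allocatable: float,
                    safe_margin: float, threshold_ratio: float) -> float:
    """MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)"""
    reclaimable = prod_reclaimable(prod_allocated, prod_peak, safe_margin)
    return min(reclaimable, node_allocatable * threshold_ratio)


# Hypothetical node: 32 allocatable cores, 24 allocated to Prod, predicted Prod peak of 10 cores
print(mid_allocatable(24, 10, 32, safe_margin=0.5, threshold_ratio=0.25))  # capped by the threshold
print(mid_allocatable(24, 10, 32, safe_margin=0.5, threshold_ratio=0.5))   # capped by reclaimable
```

The `max(0, ...)` guard matters when the predicted peak plus the safety margin exceeds what Prod has been allocated, in which case nothing is offered to the Mid tier.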

In addition, node-level isolation guarantees for Mid resources will be completed in the next release; follow Issue #1442 for updates. In v1.3.0, users can view and request Mid-tier overcommitted resources, and can observe node profiling trends through the following Prometheus metrics.

# CPU profile of the node: the reclaimable metric is the predicted amount of reclaimable resources; predictor identifies the prediction model
koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="cpu", unit="core"}
# Memory profile of the node: the reclaimable metric is the predicted amount of reclaimable resources; predictor identifies the prediction model
koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="memory", unit="byte"}

$ kubectl get node test-node -o yaml
apiVersion: v1
kind: Node
metadata:
  name: test-node
status:
  # ...
  allocatable:
    cpu: '32'
    memory: 129636240Ki
    pods: '110'
    kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods
    kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods
  capacity:
    cpu: '32'
    memory: 129636240Ki
    pods: '110'
    kubernetes.io/mid-cpu: '16000'
    kubernetes.io/mid-memory: 64818120Ki

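For the design of Koordinator node profiling, see Design - Node Prediction.

Other Features

More new features included in v1.3.0 are listed on the v1.3.0 Release page.

Future Plans

The following features are currently planned for upcoming releases:

  • Hardware-topology-aware scheduling that considers the topology of CPU, memory, GPU, and other resource dimensions on a node and optimizes scheduling across the cluster.
  • An amplification mechanism for node allocatable resources.
  • Refinement and enhancement of the NRI resource management mode.

For more information, see Milestone v1.4.0.

Conclusion

Finally, Koordinator is an open community. Cloud-native enthusiasts are welcome to get involved in any way, at any time. Whether you are new to cloud native or already experienced, we look forward to hearing from you!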

· 13 min read
Zuowei Zhang

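Background

Koordinator is an open-source project incubated from Alibaba's years of accumulated experience in container scheduling. It improves container performance and reduces cluster resource costs. Through co-location, resource profiling, scheduling optimization, and related techniques, it improves the efficiency and reliability of latency-sensitive workloads and batch jobs and optimizes cluster resource usage.

Since its release in April 2022, Koordinator has shipped 10 versions and attracted contributions from many excellent engineers at Alibaba, Xiaomi, XiaoHongShu, iQIYI, 360, YouZan, and other companies. As spring 2023 arrives, Koordinator is celebrating its first anniversary, and we are pleased to announce the official release of Koordinator v1.2. The new version adds node resource reservation, is compatible with the descheduling policies of the Kubernetes community, and adds node-side support for L3 cache and memory bandwidth isolation on AMD platforms.

Twelve new developers joined the Koordinator community in this release: @Re-Grh, @chengweiv5, @kingeasternsun, @shelwinnn, @yuexian1234, @Syulin7, @tzzcfrank, @Dengerwei, @complone, @AlbeeSo, @xigang, and @leason00. Thank you for your contributions and participation.

Key Features

Node Resource Reservation

Co-location involves many kinds of applications. Besides cloud-native containers, there are many applications that have not yet been containerized and run as host processes alongside K8s containers. To reduce resource contention on the node between K8s applications and other applications, Koordinator can reserve part of the node's resources so that they take part neither in the scheduler's resource scheduling nor in node-side resource allocation, keeping the two kinds of usage separate. In v1.2, Koordinator supports reserving CPU and memory and allows the reserved CPU IDs to be specified directly, as described below.

Declaring node resource reservation

The amount of resources to reserve, or the specific CPU IDs, can be configured on the Node. For example: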

apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # 5 specific cores are selected, e.g. 0, 1, 2, 3, 4, and then reserved.
    node.koordinator.sh/reservation: '{"resources":{"cpu":"5"}}'
---
apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # the cores 0, 1, 2, 3 will be reserved.
    node.koordinator.sh/reservation: '{"reservedCPUs":"0-3"}'

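When reporting node resource topology, the node agent koordlet writes the reserved CPU IDs to an annotation on the NodeResourceTopology object.

Scheduling and descheduling adaptation

When allocating resources, the scheduler performs several kinds of resource checks, including quota management, node capacity checks, and CPU topology checks. All of these need to take node reserved resources into account. For example, when computing a node's CPU capacity, the scheduler must deduct the reserved resources: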

cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)

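The computation of Batch overcommitted resources must deduct these resources as well. Because system processes on the node also consume resources, koord-manager takes the maximum of the node reservation and the system usage: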

reserveRatio = (100-thresholdPercent) / 100.0
node.reserved = node.alloc * reserveRatio
system.used = max(node.used - pod.used, node.anno.reserved)
Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used
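A small worked example may make the Batch formula easier to follow. The sketch below evaluates it for a single resource dimension; the function name, parameter names, and sample values are illustrative assumptions, not koord-manager code.

```python
def batch_allocatable(node_alloc: float, node_used: float, pods_used: float,
                      ls_pods_used: float, anno_reserved: float,
                      threshold_percent: float) -> float:
    """Evaluate Node(BE).Alloc for one resource dimension (e.g. CPU cores)."""
    reserve_ratio = (100 - threshold_percent) / 100.0
    node_reserved = node_alloc * reserve_ratio
    # System usage is whatever is used outside pods, but never less than
    # the amount reserved through the node annotation.
    system_used = max(node_used - pods_used, anno_reserved)
    return node_alloc - node_reserved - system_used - ls_pods_used


# Hypothetical node: 32 cores allocatable, 20 in use (18 by pods, 10 of them by LS pods),
# 4 cores reserved via annotation, thresholdPercent = 75
print(batch_allocatable(32, 20, 18, 10, 4, 75))
```

In this example the measured system usage (2 cores) is below the annotated reservation (4 cores), so the reservation is the value that gets deducted.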

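For descheduling, each plugin policy needs to be aware of node reserved resources when computing node capacity and utilization. In addition, if containers already occupy reserved resources, descheduling should consider evicting them so that node capacity is managed correctly and resource contention is avoided. These descheduling capabilities will be supported in later versions, and contributions are welcome.

Node-side resource management

For LS pods, koordlet dynamically computes the shared CPU pool from CPU allocations and excludes the reserved CPU cores, keeping LS pods isolated from non-containerized processes. Node-side QoS policies also account for the reservation. For example, the CPUSuppress policy includes the reserved amount when calculating node utilization: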

suppress(BE) := node.Total * SLOPercent - pod(LS).Used - max(system.Used, node.anno.reserved)
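The suppress formula can be evaluated the same way. In this sketch SLOPercent is assumed to be expressed as a percentage (0 to 100), and all names and numbers are illustrative.

```python
def be_suppress_cpu(node_total: float, slo_percent: float, ls_used: float,
                    system_used: float, anno_reserved: float) -> float:
    """CPU (in cores) that BE pods may use under the suppress formula."""
    return node_total * slo_percent / 100 - ls_used - max(system_used, anno_reserved)


# Hypothetical node: 32 cores, SLOPercent = 50, LS pods use 6 cores,
# system processes use 1 core, 4 cores reserved via annotation
print(be_suppress_cpu(32, 50, 6, 1, 4))
```

As with the Batch calculation, the larger of the measured system usage and the annotated reservation is what reduces the BE budget.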

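For details on node resource reservation, see the design document.

Compatibility with community descheduling policies

As the Koordinator Descheduler framework has matured, Koordinator v1.2 introduces an interface adaptation mechanism that makes existing Kubernetes Descheduler plugins work seamlessly. You only need to deploy Koordinator Descheduler to use all upstream features.

Koordinator Descheduler imports the upstream code without any intrusive changes, so it stays fully compatible with all upstream plugins, their parameters, and their execution policies. Koordinator also lets users specify an enhanced evictor for upstream plugins, reusing the safety policies that Koordinator provides, such as resource reservation, workload availability guarantees, and global rate limiting.

The compatible plugins are: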

  • HighNodeUtilization
  • LowNodeUtilization
  • PodLifeTime
  • RemoveFailedPods
  • RemoveDuplicates
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingInterPodAntiAffinity
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingTopologySpreadConstraint
  • DefaultEvictor

The configuration looks like the following, using RemovePodsHavingTooManyRestarts as an example:

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
clientConnection:
  kubeconfig: "/Users/joseph/asi/koord-2/admin.kubeconfig"
leaderElection:
  leaderElect: false
  resourceName: test-descheduler
  resourceNamespace: kube-system
deschedulingInterval: 10s
dryRun: true
profiles:
  - name: koord-descheduler
    plugins:
      evict:
        enabled:
          - name: MigrationController
      deschedule:
        enabled:
          - name: RemovePodsHavingTooManyRestarts
    pluginConfig:
      - name: RemovePodsHavingTooManyRestarts
        args:
          apiVersion: descheduler/v1alpha2
          kind: RemovePodsHavingTooManyRestartsArgs
          podRestartThreshold: 10

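Enhanced resource reservation scheduling

Koordinator introduced the Reservation mechanism in an early version. By reserving resources and handing them to pods with specified characteristics, it helps make resource delivery deterministic. In descheduling, for example, an evicted pod should be guaranteed resources rather than being evicted and then finding none, which would cause stability problems. Or, when scaling out, some PaaS platforms want to confirm that resources satisfying the application's scheduling requirements exist before deciding whether to scale, or to do some preparation in advance.

A Koordinator Reservation is defined through a CRD. Inside koord-scheduler, each Reservation object is disguised as a pod for scheduling; we call such a pod a Reserve Pod. A Reserve Pod can reuse the existing scheduling and scoring plugins to find a suitable node and eventually occupy the corresponding resources in the scheduler's internal state. When a Reservation is created, it specifies which pods may later use the reserved resources: a specific pod, certain workload objects, or pods with certain labels. When these pods are scheduled by koord-scheduler, the scheduler finds the Reservation objects they can use and prefers the reserved resources. The Reservation status records which pods use it, and the pod annotations record which Reservation was used. After a Reservation has been consumed, its internal state is cleaned up automatically so that other pods are not left unschedulable because of it.

Koordinator v1.2 brings substantial improvements. First, we removed the restriction that a pod may use only the resources held by a Reservation. A pod can now cross the Reservation's resource boundary and use both the reserved resources and the remaining resources on the node. We also extended the Kubernetes Scheduler Framework non-intrusively to support reserving fine-grained resources such as CPU cores and GPU devices. Finally, we changed the default behavior from reusable Reservations to AllocateOnce: once a Reservation is used by a pod, it is discarded. AllocateOnce covers most scenarios, so making it the default simplifies usage.

L3 cache and memory bandwidth isolation on AMD

Koordinator has supported L3 cache and memory bandwidth isolation on Intel platforms since v0.3.0, and v1.2.0 adds support for AMD. The Linux kernel exposes L3 cache and memory bandwidth isolation through the unified resctrl interface on both Intel and AMD. The main difference is that Intel's memory bandwidth interface uses percentages while AMD's uses absolute values, as shown below.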

# Intel Format
# resctrl schema
L3:0=3ff;1=3ff
MB:0=100;1=100

# AMD Format
# resctrl schema
L3:0=ffff;1=ffff;2=ffff;3=ffff;4=ffff;5=ffff;6=ffff;7=ffff;8=ffff;9=ffff;10=ffff;11=ffff;12=ffff;13=ffff;14=ffff;15=ffff
MB:0=2048;1=2048;2=2048;3=2048;4=2048;5=2048;6=2048;7=2048;8=2048;9=2048;10=2048;11=2048;12=2048;13=2048;14=2048;15=2048

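The interface has two parts. L3 gives the cache "ways" available to each socket or CCD as a hexadecimal bitmask in which each bit represents one way. MB gives the memory bandwidth each socket or CCD may use: a percentage from 0 to 100 on Intel, and an absolute value in Gb/s on AMD, where 2048 means unlimited. Koordinator exposes a unified percentage-based interface and detects whether the node is an AMD platform to decide which format to write to the resctrl interface.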

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
          "resctrlQOS": {
            "enable": true,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 100,
            "MBAPercent": 100
          }
        },
        "beClass": {
          "resctrlQOS": {
            "enable": true,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 30,
            "MBAPercent": 100
          }
        }
      }
    }
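To give a feel for how a percentage range could map onto the hexadecimal L3 way mask, here is a minimal sketch. The rounding rule (round the way index up) and both helper names are assumptions made for illustration and may differ from what koordlet actually does; the conversion of MBAPercent to AMD's absolute format is not covered.

```python
import math


def l3_mask(full_mask: int, start_percent: int, end_percent: int) -> str:
    """Map a [start, end) percentage range of cache ways to a contiguous hex bitmask."""
    ways = full_mask.bit_length()  # e.g. 0x3ff -> 10 ways, 0xffff -> 16 ways
    start_way = math.ceil(ways * start_percent / 100)
    end_way = math.ceil(ways * end_percent / 100)
    return format((1 << end_way) - (1 << start_way), "x")


def schemata_l3(domains: int, mask: str) -> str:
    """Render one L3 schemata line applying the same mask to every cache domain."""
    return "L3:" + ";".join(f"{i}={mask}" for i in range(domains))


# lsClass (0-100%) and beClass (0-30%) on a 2-socket machine with 10 ways per L3
print(schemata_l3(2, l3_mask(0x3FF, 0, 100)))
print(schemata_l3(2, l3_mask(0x3FF, 0, 30)))
```

With these assumptions, the LS class keeps the full mask shown in the Intel example, while the BE class is confined to the lowest three ways.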

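Other Features

More new features in this release are listed on the v1.2 release page.

Future Plans

Upcoming releases of Koordinator will focus on the following:

  • Hardware-topology-aware scheduling that considers the topology of CPU, memory, GPU, and other resource dimensions on a node and optimizes scheduling across the cluster.
  • Better observability and traceability for the descheduler.
  • Enhanced GPU resource scheduling.

Koordinator is an open community. Cloud-native enthusiasts are very welcome to get involved in any way. Whether you are new to cloud native or already experienced, we look forward to hearing from you!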

· 10 min read
Erwei Deng

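What Is CPU Co-location

CPU co-location means running different types of workloads on the same machine so that they share its CPU resources, which raises CPU utilization and lowers hardware purchase and operating costs. Some tasks are very latency-sensitive, such as e-commerce, search, or web services. They have strict real-time requirements but usually consume few resources; we call them online tasks. Other tasks focus on computation or batch processing, have no latency requirements, and consume more resources; we call them offline tasks.

When both kinds of tasks are deployed on the same machine, the offline tasks' heavy resource usage causes contention that significantly hurts the latency of online tasks. On hyper-threaded machines, even when offline and online tasks run on different hyper-threads, contention for the pipeline and caches still affects online tasks. CPU co-location techniques were developed to remove the impact of offline tasks on online latency while further raising CPU utilization.

Figure 1: CPU utilization with single-node co-location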

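Kernel CPU Co-location Techniques

CPU co-location is implemented mainly by the node's OS scheduler, which decides how much CPU a task gets based on its type. The OS distributions mainly used in the Koordinator community are Alibaba Cloud Linux 2/3 (Alinux2/3) and CentOS 7.9. Alinux2/3 uses the OpenAnolis community's Group Identity technology, which provides CPU co-location in the OS kernel. Group Identity adds a second run queue to the CFS scheduler to distinguish online from offline tasks, and, to avoid interference from offline tasks on the sibling CPU (on hyper-threaded architectures), it evicts them. Group Identity has been proven in large events such as Alibaba's Double 11 and in large-scale commercial use, and its co-location capability is widely recognized by users and developers.

CentOS, however, has so far provided no CPU co-location technology. There are several possible ways to fill this gap:

  • Build a CentOS derivative that includes CPU co-location;
  • Migrate to the Alibaba Cloud Linux 2/3 distribution.

The first option requires downloading the kernel source from a CentOS mirror, porting the CPU co-location technology to the kernel, compiling and installing it, and rebooting the system. This involves workload migration and downtime, which is costly for the business. The second option takes some migration effort, but Alinux2/3 and Anolis OS include a complete co-location isolation solution (CPU co-location is only one part of it), and the benefits far outweigh the migration cost. Moreover, CentOS is reaching end of life. To address this, the OpenAnolis community released Anolis OS, a distribution that is fully compatible with CentOS and allows seamless migration.

OpenAnolis CPU Co-location Plugin

To address the lack of CPU co-location on CentOS nodes for Koordinator, OpenAnolis developers offer another option: using plugsched, a scheduler live-upgrade technology, to deliver CPU co-location as a scheduler plugin package. The plugin contains Alibaba Cloud's early (2017) CPU co-location technology, bvt + noise clean, which uses a throttle mechanism. When the scheduler picks the next task, it checks the type of the task on the sibling CPU and the type of the task running on the current CPU. If online and offline tasks coexist, it throttles the offline task and then continues to pick the next task, ensuring that online tasks run first and are not disturbed by offline tasks on the sibling CPU. The plugin can be installed directly on CentOS 7.9 with no downtime or workload migration.

The Plugsched SDK

Plugsched is a scheduler live-upgrade development toolkit (SDK) from the OpenAnolis community. It decouples the scheduler from the Linux kernel into an independent module. CPU co-location is then ported into that module to form a scheduler plugin, which can be installed directly on a running system. Plugsched can dynamically add, remove, and modify kernel scheduler features to meet business needs without workload migration or downtime, and changes can be rolled back. Kernel developers can use the plugsched SDK to produce scheduler plugins for different scenarios.

The plugsched paper, "Efficient Scheduler Live Update for Linux Kernel with Modularization", was accepted at ASPLOS. It describes the principles and practical value of plugsched in detail, along with comprehensive tests and evaluation. Plugins produced with plugsched are deployed at scale at Ant Group, Alibaba Cloud, and a large Chinese internet company.

Plugsched source: https://gitee.com/anolis/plugsched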

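Testing the CPU Co-location Plugin

Developers tested CPU co-location with this scheduler plugin. The server configuration was:

  • Test machine: Alibaba Cloud X-Dragon bare-metal server, 104 CPUs, 384 GB memory
  • System: CentOS 7.9, kernel 3.10, with the CPU co-location scheduler plugin installed
  • Workloads: the online task is an Nginx service in a container with 80 CPUs and 10 GB of memory, running 80 Nginx workers; the offline task is ffmpeg video transcoding in a container with 50 CPUs and 20 GB of memory, running 50 threads.
  • Test cases:
    • Baseline: start only the Nginx container
    • Control group: start both the Nginx and ffmpeg containers without setting priorities (co-location disabled)
    • Experimental group: start both the Nginx and ffmpeg containers, with Nginx set to high-priority online and ffmpeg to low-priority offline (co-location enabled)

Requests were sent to the Nginx service with the wrk tool from a separate load-generation machine. The results are as follows (response times in ms):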

          Baseline   Control group       Experimental group
RT-P50    0.223      0.245 (+9.86%)      0.224 (+0.44%)
RT-P75    0.322      0.387 (+20.18%)     0.338 (+4.96%)
RT-P90    0.444      0.575 (+29.50%)     0.504 (+13.51%)
RT-P99    0.706      1.7 (+140.79%)      0.88 (+24.64%)
CPU%      25.15%     71.7%               49.15%

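The results show that without the CPU co-location plugin, offline tasks heavily affect online tasks: P99 latency more than doubles. With the plugin installed, the impact on P99 tail latency drops significantly, and CPU utilization is still close to 50%.

Although the plugin markedly reduces interference from offline tasks, it still falls short of OpenAnolis's Group Identity technology. Group Identity keeps interference with online tasks below 5% and raises whole-machine utilization further, to above 60% (see the Koordinator co-location best practices guide). The differences have two causes: 1) the kernels differ: CentOS 7.9 uses the older 3.10 kernel while Anolis uses 4.19/5.10, and the 3.10 scheduler itself performs worse than that of 4.19/5.10; 2) Group Identity's design suits CPU co-location better than noise clean.

Conclusion

Finally, engineers, open-source enthusiasts, and readers are welcome to join the Koordinator and OpenAnolis communities and benefit from their technology. Both Group Identity and plugsched can deliver substantial value. We invite you to help build the community and to exchange ideas and grow together with it.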

· 17 min read
Siyu Wang

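Background

Koordinator aims to provide a complete solution for co-located workload orchestration, resource scheduling, resource isolation, and performance tuning. It helps users improve the runtime performance of latency-sensitive services, and discovers idle node resources and allocates them to the computing tasks that really need them, improving overall resource utilization.

Since its release in April 2022, Koordinator has shipped 9 versions. Over more than half a year of development, the community has welcomed many excellent engineers from Alibaba, Xiaomi, XiaoHongShu, iQIYI, 360, YouZan, and other companies, who have contributed ideas, code, and use cases and helped the project mature.

Today we are pleased to announce the official release of Koordinator v1.1. It includes load-aware scheduling and descheduling, cgroup v2 support, interference detection metric collection, and a series of other optimizations. The sections below explain these new features in depth.

Key Features in Depth

Load-Aware Scheduling

Accounting and balancing load by workload type

In Koordinator v1.0 and earlier, load-aware scheduling provided basic utilization threshold filtering, which keeps heavily loaded nodes from degrading further and hurting workload runtime quality, and an estimation mechanism that prevents cold nodes from being overloaded. This already solves many common problems, but as an optimization technique, load-aware scheduling still has a number of scenarios to improve.

Current load-aware scheduling mainly balances load at the whole-node level, but special cases can arise: a node runs many offline pods, which raises overall node utilization, while the overall utilization of online workloads remains low. If a new online pod arrives while cluster resources are tight, the following problems occur:

  1. The pod may fail to be scheduled onto the node because overall utilization exceeds the node's safety threshold;
  2. Conversely, a node's utilization may be relatively low while it runs only online applications. From the perspective of online applications the utilization is already high, yet the current policy keeps scheduling pods onto it, so the node accumulates a large number of online applications and overall performance suffers.

In Koordinator v1.1, koord-scheduler is aware of workload types and applies different thresholds and policies when scheduling.

In the Filter phase, a new threshold setting, prodUsageThresholds, specifies the safety threshold for online applications. It is empty by default. If the pod being scheduled is a Prod pod, koord-scheduler sums the utilization of all online applications from the node's NodeMetric and filters the node out if the sum exceeds prodUsageThresholds. For offline pods, or when prodUsageThresholds is not configured, the original logic applies and whole-node utilization is used.

In the Score phase, a new switch, scoreAccordingProdUsage, controls whether scoring balances on Prod utilization. It is disabled by default. When it is enabled and the current pod is a Prod pod, koord-scheduler considers only Prod pods in the estimation algorithm, sums the current utilization of the other online pods recorded in NodeMetrics that are not handled by the estimation mechanism, and uses the total in the final score. If scoreAccordingProdUsage is not enabled, or the pod is an offline pod, the original logic applies and whole-node utilization is used.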
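The Filter-phase decision can be summarized in a few lines. The sketch below is a simplified model rather than the LoadAware plugin itself: thresholds and usages are plain dictionaries of percentages keyed by resource name, and the function name is hypothetical.

```python
def passes_load_filter(pod_is_prod: bool,
                       node_usage: dict, prod_usage: dict,
                       usage_thresholds: dict, prod_usage_thresholds: dict) -> bool:
    """Return True if the node survives the load-aware filter for this pod."""
    if pod_is_prod and prod_usage_thresholds:
        # Prod pod with prodUsageThresholds set: compare summed Prod utilization
        usage, thresholds = prod_usage, prod_usage_thresholds
    else:
        # Offline pod, or prodUsageThresholds empty: fall back to whole-node utilization
        usage, thresholds = node_usage, usage_thresholds
    return all(usage.get(res, 0) <= limit for res, limit in thresholds.items())


node = {"cpu": 80, "memory": 60}  # whole-node utilization in percent
prod = {"cpu": 30, "memory": 40}  # utilization of Prod pods only
print(passes_load_filter(True, node, prod, {"cpu": 65}, {"cpu": 55}))   # judged on Prod usage
print(passes_load_filter(False, node, prod, {"cpu": 65}, {"cpu": 55}))  # judged on node usage
```

In the example the node is busy overall because of offline pods, so it is filtered out for an offline pod but remains eligible for a Prod pod.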

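Balancing on percentile utilization

Koordinator v1.0 and earlier filtered and scored on the average utilization reported by koordlet. Averages hide a lot of information, so in Koordinator v1.1 koordlet adds utilization data aggregated by percentile, and the scheduler has been adapted accordingly.

In the configuration of the scheduler's LoadAware plugin, aggregated indicates that filtering and scoring use percentile-aggregated data. aggregated.usageThresholds is the threshold used for filtering; aggregated.usageAggregationType is the percentile type used in the filter phase, and supports avg, p99, p95, p90, and p50; aggregated.usageAggregatedDuration is the aggregation period expected in the filter phase, and if it is not set, the scheduler uses the data for the longest period reported in NodeMetrics; aggregated.scoreAggregationType is the percentile type expected in the score phase; aggregated.scoreAggregatedDuration is the aggregation period expected in the score phase, and if it is not set, the scheduler uses the data for the longest period reported in NodeMetrics.

In the Filter phase, if aggregated.usageThresholds and the corresponding aggregation type are configured, the scheduler filters on that percentile statistic.

In the Score phase, if aggregated.scoreAggregationType is configured, the scheduler scores on that percentile statistic. Percentile-based filtering is not yet supported for Prod pods.

Load-Aware Descheduling

Over the past several releases, Koordinator has continued to evolve the descheduler, open-sourcing the complete framework and strengthening safety so that excessive pod eviction does not affect the stability of online applications. This focus slowed progress on descheduling features, and until now Koordinator had limited capacity to build them out. That is about to change.

Koordinator v1.1 adds load-aware descheduling. The new plugin, LowNodeLoad, works with the scheduler's load-aware scheduling to form a closed loop: load-aware scheduling picks the best node at scheduling time, and as time passes and the cluster environment and workload traffic change, load-aware descheduling steps in to relieve nodes whose load exceeds the safety threshold. Unlike the K8s descheduler plugin LowNodeUtilization, which decides based on resource allocation rate, LowNodeLoad decides based on actual node utilization.

The two most important parameters of the LowNodeLoad plugin are highThresholds and lowThresholds:

  • highThresholds is the alert threshold for load. Pods on nodes above this threshold take part in descheduling;
  • lowThresholds is the safe load level. Pods on nodes below this threshold are not descheduled.

In the figure below, lowThresholds is 45% and highThresholds is 70%. Nodes below 45% are safe because their load is already low. Load between 45% and 70% is the range we expect. Nodes above 70% are unsafe, and the portion above 70% should be shed. If node A's load is 85%, for example, then 85% - 70% = 15% of load should be removed by selecting pods and migrating them.

[Figure: LowNodeLoad example]

During migration, the nodes below 45% are the ones that will host the migrated pods, so the total load of the migrated pods must not exceed what these low-load nodes can absorb. That capacity is highThresholds minus the node's current load. If node B's load is 20%, then 70% - 20% = 50% is the capacity it can absorb. Each time a pod is evicted, its current load, estimated load, or profiled value (matching the values used in load-aware scheduling) is deducted from this capacity, which ensures that no more pods are migrated than the targets can hold.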
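The threshold arithmetic above can be sketched as follows. This is an illustration of the bookkeeping only: loads are integer percentages of a node, the pod list is assumed to be already sorted, and all function names are hypothetical rather than taken from the LowNodeLoad plugin.

```python
def classify(usage: int, low: int, high: int) -> str:
    """Classify a node by its load relative to lowThresholds and highThresholds."""
    if usage < low:
        return "underutilized"
    if usage > high:
        return "overutilized"
    return "normal"


def headroom(usage: int, high: int) -> int:
    """Load an underutilized node can still absorb: highThresholds - current load."""
    return max(0, high - usage)


def select_evictions(pod_loads, source_usage: int, high: int, target_capacity: int):
    """Pick pods until the source drops to highThresholds or target capacity runs out."""
    selected = []
    for name, load in pod_loads:
        if source_usage <= high:
            break  # the source node is back within the expected range
        if load > target_capacity:
            continue  # low-load nodes cannot absorb this pod
        selected.append(name)
        source_usage -= load
        target_capacity -= load
    return selected


# Node A at 85% must shed 15%; node B at 20% can absorb up to 50%
print(classify(85, 45, 70), classify(20, 45, 70))
capacity = headroom(20, 70)
print(select_evictions([("p1", 10), ("p2", 8), ("p3", 5)], 85, 70, capacity))
```

Deducting each selected pod from the remaining target capacity is what prevents the descheduler from evicting more load than the low-load nodes can take.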

如果一个集群总是可能会出现某些节点的负载就是比较高,而且数量并不多,这个时候如果频繁的重调度这些节点,也会带来安全隐患,因此可以让用户按需设置 numberOfNodes

另外,LowNodeLoad 识别出超过阈值的节点后会筛选 Pod,当筛选 Pod 时,可以配置要支持或者过滤的 namespace,或者配置 pod selector 筛选,也可以配置 nodeFit 检查每个备选 Pod 对应的 Node Affinity/Node Selector/Toleration 是否有与之匹配的 Node,如果没有的话,这种节点也会被忽略。当然可以考虑不启用这个能力,通过配置 nodeFit 为 false 后即可禁用,此时完全由底层的 MigrationController 通过 Koordinator Reservation 预留资源;

当筛选出 Pod 后,会对这些 Pod 进行排序。会依靠Koordinator QoSClass、Kubernetes QoSClass、Priority、用量和创建时间等多个维度排序。

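cgroup v2 Support

Background

Many of Koordinator's node-level QoS capabilities and resource suppression/elasticity policies are built on the Linux Control Group (cgroups) mechanism, for example CPU QoS (cpu), Memory QoS (memory), CPU Burst (cpu), and CPU Suppress (cpu, cpuset). Through cgroups (v1), the koordlet component can limit the time slices, weights, priorities, topology, and other attributes of the resources available to containers. Newer Linux kernels keep enhancing and iterating on cgroups, which led to cgroups v2: it unifies the cgroups directory hierarchy, improves the cooperation between different subsystems/cgroup controllers in v1, and further strengthens resource management and monitoring for some subsystems. Kubernetes has treated cgroups v2 as a GA (general availability) feature since v1.25. With the feature enabled, the kubelet manages container resources through cgroups v2, sets container resource isolation parameters under the unified cgroups hierarchy, and supports the enhanced MemoryQoS feature.

cgroup v1/v2 structure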

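In Koordinator v1.1, the node agent koordlet adds support for cgroups v2, which includes the following work:

  • Refactored the Resource Executor module to unify file operations on identical or similar cgroup interfaces across v1 and v2, making it easier for koordlet features to support cgroups v2 and to merge conflicting reads and writes.
  • Adapted the node-level features released so far to cgroups v2, replaced direct cgroup operations with the new Resource Executor module, and improved error logs across different system environments.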

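Most koordlet features in Koordinator v1.1 are already compatible with cgroups v2, including but not limited to:

  • Resource utilization collection
  • Dynamic resource overcommitment
  • Batch resource isolation (BatchResource, which deprecates BECgroupReconcile)
  • CPU QoS (GroupIdentity)
  • Memory QoS (CgroupReconcile)
  • Dynamic CPU suppression (BECPUSuppress)
  • Memory eviction (BEMemoryEvict)
  • CPU Burst (CPUBurst)
  • L3 Cache and memory bandwidth isolation (RdtResctrl)

Features that are not yet compatible, such as PSICollector, will be adapted in the upcoming v1.2 release; follow issue#407 for the latest progress. Upcoming Koordinator releases will also gradually introduce more cgroups v2 enhancements, so stay tuned.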

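Using cgroups v2

In Koordinator v1.1, koordlet's cgroups v2 adaptation is transparent to upper-layer feature configuration. Apart from the feature-gates of deprecated features, you do not need to change the ConfigMap slo-controller-config or any other feature-gate configuration. When koordlet runs on a node with cgroups v2 enabled, the corresponding node-level features automatically switch to the cgroups-v2 system interfaces.

In addition, cgroups v2 is a feature of newer Linux kernels (>=5.8 recommended) and depends on both the kernel version and the Kubernetes version. We recommend a Linux distribution that enables cgroups v2 by default, together with Kubernetes v1.24 or later.

For more information on how to enable cgroups v2, refer to the Kubernetes community documentation.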
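To check which cgroup version a node is running before rolling out koordlet, one common heuristic is to look for the cgroup.controllers file, which exists at the root of a cgroups v2 unified hierarchy. The sketch below is an assumption-laden helper for illustration, not koordlet's own detection logic; the function name and the default mount point /sys/fs/cgroup are choices made for this example.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Best-effort guess of the cgroup version mounted at `root`.

    Returns "v2" when the unified hierarchy marker file is present,
    "v1" when the directory exists without it (this also covers hybrid
    setups, which this simple check does not distinguish), and
    "unknown" when the mount point is missing.
    """
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return "v2"
    if os.path.isdir(root):
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_version())
```

The `root` parameter exists mainly so the helper can be pointed at a host path mounted into a container.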

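Interference Detection Metrics Collection

In real production environments, the runtime state of a single node is a "chaotic system", and application interference caused by resource contention cannot be completely avoided. Koordinator is building interference detection and optimization capabilities: it extracts metrics about application runtime state, analyzes and detects interference in real time, and applies more targeted strategies to the affected application and the interference source once interference is found.

Koordinator has already implemented a series of Performance Collectors that gather low-level metrics on each node that are highly correlated with application runtime state and expose them through Prometheus, supporting both interference detection and cluster-level application scheduling.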

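Metrics Collection

Performance Collector is controlled by several feature-gates. Koordinator currently provides the following collectors:

  • CPICollector: controls the CPI collector. CPI stands for Cycles Per Instruction, the average number of clock cycles an instruction takes to execute. The CPI collector is built on the Cycles and Instructions Kernel PMU (Performance Monitoring Unit) events and the perf_event_open(2) system call.
  • PSICollector: controls the PSI collector. PSI stands for Pressure Stall Information. It indicates the number of tasks in a container that were blocked during the collection interval while waiting for cpu, memory, or IO resources. Before using the PSI collector, you need to enable PSI in Anolis OS; refer to the documentation for how to enable it.

Performance Collector is disabled by default. You can enable it by modifying the feature-gates argument of Koordlet; this change does not affect other feature-gates: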

kubectl edit ds koordlet -n koordinator-system
...
spec:
  ...
    spec:
      containers:
      - args:
        ...
        # modify here
        # - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true
        - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,CPICollector=true,PSICollector=true

ServiceMonitor

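Koordinator v1.1.0 adds ServiceMonitor support for Koordlet, exposing the collected metrics through Prometheus so that users can scrape them to analyze and manage their applications.

ServiceMonitor is a resource introduced by Prometheus, so the helm chart does not install it by default. You can install the ServiceMonitor with the following command: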

helm install koordinator https://... --set koordlet.enableServiceMonitor=true

After deployment, the corresponding Targets appear in the Prometheus UI.

# HELP koordlet_container_cpi Container cpi collected by koordlet
# TYPE koordlet_container_cpi gauge
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="cycles",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 2.228107503e+09
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="instructions",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 4.1456092e+09
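Because koordlet exports cycles and instructions as separate samples distinguished by the cpi_field label, the CPI value itself has to be derived by the consumer. The following is a minimal sketch of that derivation from the text exposition format shown above; the function names are hypothetical, and it assumes both samples of a container cover the same collection window.

```python
import re

# Matches one koordlet_container_cpi sample line: metric{labels} value
SAMPLE_RE = re.compile(r'^koordlet_container_cpi\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_cpi_samples(text):
    """Group samples by container_id: {container_id: {cpi_field: value}}."""
    grouped = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a CPI sample
        labels = dict(LABEL_RE.findall(match.group("labels")))
        fields = grouped.setdefault(labels.get("container_id"), {})
        fields[labels.get("cpi_field")] = float(match.group("value"))
    return grouped


def cpi(fields):
    """Cycles per instruction, or None when either counter is missing or zero."""
    cycles = fields.get("cycles")
    instructions = fields.get("instructions")
    if cycles is None or not instructions:
        return None
    return cycles / instructions
```

For the koordlet container in the sample output, this yields roughly 2.228e9 / 4.146e9, a CPI of about 0.54.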

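Interference detection in more complex real-world scenarios will need additional metrics, and Koordinator will keep investing in metrics collection for memory, disk IO, and many other resources.

Other Updates

See the v1.1 release page for more of the new features included in this release.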

· 7 min read
Joseph

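Since Koordinator was open-sourced in March this year, it has shipped 7 releases, gradually bringing the core capabilities of the co-location system used inside Alibaba and Alibaba Cloud to the open-source community. Along the way it has drawn attention from the Kubernetes, big data, high-performance computing, and machine learning communities, and it has gained the support of a number of contributors. Some companies have also started using Koordinator in production to solve real cost and co-location problems. Thanks to the efforts of the Koordinator community, we are thrilled to announce the official release of Koordinator 1.0.

In its early days, the Koordinator project focused on the core co-location capability, differentiated SLO. To make co-location easier to adopt, Koordinator provides the ClusterColocationProfile mechanism, which lets users co-locate different workloads without modifying existing code and become familiar with co-location step by step. Koordinator then strengthened the node-side QoS guarantees, providing capabilities including but not limited to CPU Suppress, CPU Burst, Memory QoS, L3 Cache/MBA resource isolation, and satisfaction-based eviction, which resolve most node-side workload stability problems. Used together with the Koordinator Runtime Proxy component, these capabilities work better alongside the native management mechanisms of the Kubernetes kubelet.

Koordinator also offers several innovations in task scheduling, QoS-aware scheduling, and descheduling. It provides fine-grained CPU scheduling that is fully compatible with the Kubernetes CPU management mechanism, as well as balanced scheduling based on the actual load of nodes. To help users manage resources, Koordinator provides resource reservation (Reservation), and it further enhances the Coscheduling and ElasticQuota Scheduling capabilities that already exist in the Kubernetes community, bringing new vitality to task scheduling. Koordinator also provides a brand-new descheduler framework that focuses on the extensibility and safety of the Descheduler.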

Install or Upgrade to Koordinator v1.0.0

Install with Helm

Koordinator can be installed easily with helm v3.5+. Helm is a simple command-line tool, and you can get it from here.

# First, add the koordinator charts repository if you haven't done so.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 1.0.0

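Release Feature Overview

Koordinator v1.0 does not add many new features overall. The main changes are as follows.

Standalone API Repo

To make the APIs defined by Koordinator easier to integrate and use, and to avoid pulling in extra dependencies or dependency conflicts by depending on Koordinator itself, we created a standalone API repo: koordinator-sh/apis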

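New ElasticQuota Webhook

In Koordinator v0.7, we made many enhancements on top of the ElasticQuota provided by Kubernetes sig-scheduler, adding a tree-structured management mechanism and fairness guarantees that help solve the problems you run into when using ElasticQuota. In Koordinator v1.0, we additionally provide the ElasticQuota Webhook. When you use the tree-structured management mechanism, it ensures that new ElasticQuota objects follow the rules and constraints defined by Koordinator:

  1. Except for the root node, the sum of the min values of all child nodes must be less than the min of their parent node.
  2. The max of a child node is not restricted, and a child's max may be larger than its parent's max. Consider a cluster with 2 ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the whole dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota under dev-parent.
  3. Pods cannot use a parent ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism extremely complex, so this scenario is not supported for now.
  4. Only parent nodes can have child nodes attached; a child node cannot have children of its own.
  5. Changing the quota.scheduling.koordinator.sh/is-parent attribute of an ElasticQuota is not allowed for now.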
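To make rules 1, 3, and 4 concrete, here is a small validation sketch over an in-memory quota tree. It is not the webhook's code: the `Quota` type, the single `min_cpu` resource, and the function names are hypothetical, and rule 1 is applied literally as a strict "less than" (whether equality is accepted by the real webhook is not stated here).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Quota:  # hypothetical, simplified stand-in for an ElasticQuota object
    name: str
    parent: Optional[str]  # None means attached directly to the root
    is_parent: bool
    min_cpu: int


def validate_tree(quotas: List[Quota]) -> List[str]:
    """Return human-readable violations of rules 1 and 4."""
    by_name = {q.name: q for q in quotas}
    child_min_sum: Dict[str, int] = {}
    errors = []
    for q in quotas:
        if q.parent is None:
            continue  # children of the root are exempt from rule 1
        parent = by_name.get(q.parent)
        if parent is None:
            errors.append(f"{q.name}: unknown parent {q.parent}")
            continue
        if not parent.is_parent:
            # Rule 4: only parent nodes may have children.
            errors.append(f"{q.name}: {parent.name} is not a parent quota")
        child_min_sum[parent.name] = child_min_sum.get(parent.name, 0) + q.min_cpu
    for name, total in child_min_sum.items():
        # Rule 1, read literally: the sum must be strictly less than the parent's min.
        if total >= by_name[name].min_cpu:
            errors.append(f"{name}: children min sum {total} is not below min {by_name[name].min_cpu}")
    return errors


def pod_may_use(quota: Quota) -> bool:
    """Rule 3: Pods can only be bound to leaf (non-parent) quotas."""
    return not quota.is_parent


tree = [
    Quota("dev-parent", None, True, 40),
    Quota("dev-a", "dev-parent", False, 15),
    Quota("dev-b", "dev-parent", False, 20),
]
violations = validate_tree(tree)
```

With the tree above the children's min values sum to 35, which is below dev-parent's 40, so no violation is reported.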

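Further Improvements to ElasticQuota Scheduling

In Koordinator v0.7, both the leader and the standby Pods of koord-scheduler started the ElasticQuota Controller and updated ElasticQuota objects. Koordinator v1.0 fixes this so that only the leader Pod starts the Controller and updates ElasticQuota objects. We also addressed the ElasticQuota Controller's tendency to update ElasticQuota objects too often: it now updates an object only when it detects that the ElasticQuota data has changed in some dimension, which reduces the pressure that frequent updates put on the APIServer.

Further Improvements to Device Share Scheduling

In Koordinator v1.0, koordlet reports the GPU model and driver version to the Device CRD object, and koord-manager syncs them to the Node object by appending the corresponding labels.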

apiVersion: v1
kind: Node
metadata:
  labels:
    kubernetes.io/gpu-driver: 460.91.03
    kubernetes.io/gpu-model: Tesla-T4
    ...
  name: cn-hangzhou.10.0.4.164
spec:
  ...
status:
  ...

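Improved Compatibility in Koordinator Runtime Proxy

In earlier Koordinator versions, when koord-runtime-proxy was installed together with koordlet, Pods newly scheduled to a node could fail to create containers if koordlet was unhealthy or was being uninstalled or reinstalled. To solve this problem, koord-runtime-proxy now checks whether a Pod carries the special label runtimeproxy.koordinator.sh/skip-hookserver=true. If the label is present, koord-runtime-proxy forwards the CRI request directly to the runtime, such as containerd or docker.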
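The routing decision described above boils down to a single label check. The sketch below illustrates it in Python under stated assumptions: only the label key and value come from the text, while the function names and the "runtime"/"hookserver" return values are made up for the example and do not mirror koord-runtime-proxy's real code.

```python
# Label key and value as described in the text above.
SKIP_HOOKSERVER_LABEL = "runtimeproxy.koordinator.sh/skip-hookserver"


def should_skip_hook_server(pod_labels):
    """True when the Pod opts out of the hook server via the label."""
    return (pod_labels or {}).get(SKIP_HOOKSERVER_LABEL) == "true"


def route_cri_request(pod_labels):
    """Hypothetical router: pick where a CRI request for this Pod goes first."""
    if should_skip_hook_server(pod_labels):
        return "runtime"     # forwarded straight to containerd/docker
    return "hookserver"      # normal path through the hook server (koordlet)
```

A Pod without labels, or with the label set to any value other than "true", follows the normal path in this sketch.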

Other Changes

See the GitHub release page for more changes, along with their authors and commit history.