Koordinator v1.6: Enhanced Heterogeneous Resource Scheduling Capabilities for AI/ML Scenarios

February 24, 2025

Jianyu Wang

Koordinator member

Rougang Han

Koordinator member

Tao Song

Koordinator member

Background

With the explosive popularity of large models like DeepSeek, the demand for heterogeneous device resource scheduling in AI and high-performance computing fields has grown rapidly, whether it's for GPUs, NPUs, or RDMA devices. Efficiently managing and scheduling these resources has become a core concern in the industry. In response to this demand, Koordinator actively addresses community requests and continues to deepen its capabilities in heterogeneous device scheduling. In the latest v1.6 release, a series of innovative features have been introduced to help customers solve complex heterogeneous resource scheduling challenges.

In v1.6, we have enhanced device topology scheduling capabilities, supporting awareness of more machine types' GPU topologies, significantly accelerating GPU interconnect performance within AI applications. Collaborating with the open-source project HAMi, we have introduced end-to-end GPU & RDMA joint allocation capabilities as well as strong GPU isolation, effectively improving cross-machine interconnect efficiency for typical AI training tasks and increasing deployment density for inference tasks. This ensures better application performance and higher cluster resource utilization. Additionally, enhancements were made to the Kubernetes community’s resource plugins, enabling different resource configurations to apply distinct node scoring strategies. This feature significantly reduces GPU fragmentation when GPU and CPU tasks coexist in a single cluster.

Since its official open-source release in April 2022, Koordinator has iterated through 14 major versions, attracting contributions from outstanding engineers at companies such as Alibaba, Ant Group, Intel, Xiaohongshu, Xiaomi, iQIYI, 360, Youzan, and more. Their rich ideas, code contributions, and real-world application scenarios have greatly propelled the project's development. Notably, in the v1.6.0 release, ten new developers actively contributed to the Koordinator community: @LY-today, @AdrianMachao, @TaoYang526, @dongjiang1989, @chengjoey, @JBinin, @clay-wangzhi, @ferris-cx, @nce3xin, and @lijunxin559. We sincerely thank them for their contributions and all community members for their ongoing dedication and support!

Key Features

1. GPU Topology-Aware Scheduling: Accelerating GPU Interconnects Within AI Applications

With the rapid development of deep learning and high-performance computing (HPC), GPUs have become a core resource for many compute-intensive workloads. Efficient GPU utilization is crucial for enhancing application performance in Kubernetes clusters. However, GPU performance is not uniform and is influenced by hardware topology and resource allocation. For example:

In multi-NUMA node systems, physical connections between GPUs, CPUs, and memory can affect data transfer speeds and computational efficiency.
For NVIDIA cards like the L20 and L40S, GPU-to-GPU communication efficiency depends on whether the cards are attached to the same PCIe switch or the same NUMA node.
For Huawei’s Ascend NPU and virtualized environments using SharedNVSwitch mode with NVIDIA H-series machines, GPU allocation must adhere to predefined Partition rules.

To address these device scenarios, Koordinator provides rich device topology scheduling APIs to meet Pods’ GPU topology requirements. Below are examples of how to use these APIs:

Allocating GPUs, CPUs, and memory within the same NUMA Node:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: '{"numaTopologyPolicy":"Restricted", "singleNUMANodeExclusive":"Preferred"}'
spec:
  containers:
  - resources:
      limits:
        koordinator.sh/gpu: 200
        cpu: 64
        memory: 500Gi
      requests:
        koordinator.sh/gpu: 200
        cpu: 64
        memory: 500Gi

Allocating GPUs within the same PCIe:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.koordinator.sh/device-allocate-hint: |-
      {
        "gpu": {
          "requiredTopologyScope": "PCIe"
        }
      }
spec:
  containers:
  - resources:
      limits:
        koordinator.sh/gpu: 200

Allocating GPUs within the same NUMA Node:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.koordinator.sh/device-allocate-hint: |-
      {
        "gpu": {
          "requiredTopologyScope": "NUMANode"
        }
      }
spec:
  containers:
  - resources:
      limits:
        koordinator.sh/gpu: 400
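
To make the requiredTopologyScope hint more concrete, the following Python sketch shows one way a scheduler could look for GPUs that share a topology scope. It is an illustration, not Koordinator's implementation: the inventory GPUS, the helper pick_gpus, and the "prefer the smallest fitting group" tie-break are all hypothetical, and it assumes that 100 units of koordinator.sh/gpu correspond to one whole GPU (so 200 means two GPUs).

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical node inventory: each GPU minor with its NUMA node and PCIe switch.
GPUS = [
    {"minor": 0, "numa": 0, "pcie": "pcie-0"},
    {"minor": 1, "numa": 0, "pcie": "pcie-0"},
    {"minor": 2, "numa": 0, "pcie": "pcie-1"},
    {"minor": 3, "numa": 0, "pcie": "pcie-1"},
    {"minor": 4, "numa": 1, "pcie": "pcie-2"},
    {"minor": 5, "numa": 1, "pcie": "pcie-2"},
    {"minor": 6, "numa": 1, "pcie": "pcie-3"},
    {"minor": 7, "numa": 1, "pcie": "pcie-3"},
]

# Maps the scope names used in the annotation to keys of the inventory above.
SCOPE_KEY = {"PCIe": "pcie", "NUMANode": "numa"}


def pick_gpus(gpus: List[Dict], count: int, scope: str) -> Optional[List[int]]:
    """Return `count` GPU minors that share one topology scope, or None."""
    groups = defaultdict(list)
    for gpu in gpus:
        groups[gpu[SCOPE_KEY[scope]]].append(gpu["minor"])
    fitting = [sorted(minors) for minors in groups.values() if len(minors) >= count]
    if not fitting:
        return None  # a required scope cannot be satisfied on this node
    # Assumption: prefer the smallest group that fits, to limit fragmentation.
    smallest = min(fitting, key=lambda minors: (len(minors), minors))
    return smallest[:count]


print(pick_gpus(GPUS, 2, "PCIe"))      # two GPUs behind one PCIe switch
print(pick_gpus(GPUS, 4, "PCIe"))      # no switch hosts four GPUs
print(pick_gpus(GPUS, 4, "NUMANode"))  # four GPUs on one NUMA node
```

In this toy inventory a four-GPU request cannot be satisfied at PCIe scope but can at NUMANode scope, which mirrors the difference between the two Pod examples above.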

Allocating GPUs according to predefined Partitions:

Predefined GPU Partition rules are typically determined by the GPU model or system configuration, and may also depend on the GPU setup of individual nodes. The scheduler cannot discern hardware models or GPU types; instead, it relies on node-level components to report these predefined rules to the Device custom resource (CR), as shown below:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  annotations:
    scheduling.koordinator.sh/gpu-partitions: |
      {
        "1": [
            "NVLINK": {
                {
                  # Which GPUs are included
                  "minors": [
                      0
                  ],
                  # GPU Interconnect Type
                  "gpuLinkType": "NVLink",
                  # Here we take the bottleneck bandwidth between GPUs in the Ring algorithm. BusBandwidth can be referenced from https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
                  "ringBusBandwidth": "400Gi",
                  # Indicates the overall allocation quality of the node after this partition has been assigned.
                  "allocationScore": "1"
                },
                ...
            }
            ...
        ],
        "2": [
            ...
        ],
        "4": [
            ...
        ],
        "8": [
            ...
        ]
      }
  labels:
    # Indicates whether the Partition rule must be followed
    node.koordinator.sh/gpu-partition-policy: "Honor"
  name: node-1

When multiple Partition options are available, Koordinator allows users to decide whether to allocate based on the optimal Partition:

apiVersion: v1
kind: Pod
metadata:
  name: hello-gpu
  annotations:
    # allocatePolicy accepts BestEffort or Restricted
    scheduling.koordinator.sh/gpu-partition-spec: |
      {
        "allocatePolicy": "Restricted"
      }
spec:
  containers:
    - name: main
      resources:
        limits:
          koordinator.sh/gpu: 100

If users do not need to allocate based on the optimal Partition, the scheduler will allocate resources in a Binpack manner as much as possible.
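
The difference between the two allocatePolicy values can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the scheduler's actual algorithm: the table PARTITIONS and the helper choose_partition are hypothetical, allocationScore is treated as an integer, and "Restricted" is assumed to mean that only the highest-scoring partition of the requested size is acceptable, while "BestEffort" falls back to any partition whose GPUs are still free.

```python
from typing import Dict, List, Optional, Set

# Hypothetical, simplified partition table keyed by the number of GPUs requested.
PARTITIONS: Dict[int, List[dict]] = {
    2: [
        {"minors": [0, 1], "allocationScore": 2},
        {"minors": [2, 3], "allocationScore": 1},
    ],
    4: [
        {"minors": [0, 1, 2, 3], "allocationScore": 1},
    ],
}


def choose_partition(
    table: Dict[int, List[dict]],
    count: int,
    free_minors: Set[int],
    policy: str = "BestEffort",
) -> Optional[List[int]]:
    """Pick a predefined partition of `count` GPUs whose minors are all free."""
    candidates = table.get(count, [])
    if not candidates:
        return None
    best_score = max(p["allocationScore"] for p in candidates)
    usable = [p for p in candidates if set(p["minors"]) <= free_minors]
    if policy == "Restricted":
        # Assumption: Restricted only accepts the optimal (highest-scoring) partition.
        usable = [p for p in usable if p["allocationScore"] == best_score]
    if not usable:
        return None
    return max(usable, key=lambda p: p["allocationScore"])["minors"]


print(choose_partition(PARTITIONS, 2, {0, 1, 2, 3}, "Restricted"))
print(choose_partition(PARTITIONS, 2, {2, 3}, "Restricted"))
print(choose_partition(PARTITIONS, 2, {2, 3}, "BestEffort"))
```

With GPUs 0 and 1 already taken, the Restricted request stays unschedulable on this node, whereas the BestEffort request accepts the lower-scoring partition.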

For more details on GPU topology-aware scheduling, please refer to the following design documents:

Special thanks to community developer @eahydra for contributing to this feature!

2. End-to-End GDR Support: Enhancing Cross-Machine Task Interconnect Performance

In AI model training scenarios, GPUs frequently require collective communication to synchronize updated weights during training iterations. GDR (GPUDirect RDMA) aims to solve the efficiency problem of exchanging data between multi-machine GPU devices. By using GDR technology, GPUs can exchange data directly without involving CPUs or memory, significantly reducing CPU/Memory overhead while lowering latency. To achieve this goal, Koordinator v1.6.0 introduces GPU/RDMA device joint scheduling capabilities, with the overall architecture outlined below:

Koordlet detects GPUs and RDMA devices on nodes and reports relevant information to the Device CR.
Koord-Manager synchronizes resources from the Device CR to node.status.allocatable.
Koord-Scheduler allocates GPUs and RDMA based on device topology and annotates allocation results onto Pods.
Multus-CNI accesses Koordlet PodResources Proxy to obtain RDMA devices allocated to Pods and attaches corresponding NICs to the Pods' network namespaces.
Koordlet provides an NRI plugin to mount devices into containers.

Due to the involvement of numerous components and complex environments, Koordinator v1.6.0 provides best practices showcasing step-by-step deployments of Koordinator, Multus-CNI, and SRIOV-CNI. After deploying the necessary components, users can simply adopt the following Pod configuration to request joint GPU and RDMA allocations from the scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: pod-vf01
  namespace: kubeflow
  annotations:
    scheduling.koordinator.sh/device-joint-allocate: |-
      {
        "deviceTypes": ["gpu","rdma"]
      }
    # An empty vfSelector requests a virtual function (VF)
    scheduling.koordinator.sh/device-allocate-hint: |-
      {
       "rdma": {
         "vfSelector": {}
       }
      }
spec:
  schedulerName: koord-scheduler
  containers:
  - name: container-vf
    resources:
      requests:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100
      limits:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100

For further end-to-end testing of GDR tasks using Koordinator, you can refer to the sample steps in the best practices. Special thanks to community developer @ferris-cx for contributing to this feature!

3. GPU Sharing with Strong Isolation: Improving Deployment Density for AI Inference Tasks

In AI applications, GPUs are indispensable core devices for large model training and inference, providing powerful computational capabilities for compute-intensive tasks. However, this powerful computing capability often comes with high costs. In production environments, we frequently encounter situations where small models or lightweight inference tasks only require a fraction of GPU resources (e.g., 20% of compute power or GPU memory), yet a high-performance GPU card must be exclusively occupied to run these tasks. This resource usage method not only wastes valuable GPU computing power but also significantly increases enterprise costs.

This situation is particularly common in the following scenarios:

Online Inference Services: Many online inference tasks have low computational demands but require low-latency responses, so they are often deployed on high-performance GPUs to meet real-time requirements.
Development and Testing Environments: Developers debugging models usually only need a small amount of GPU resources, but traditional scheduling methods lead to low resource utilization.
Multi-Tenant Shared Clusters: In multi-user or multi-team shared GPU clusters, each task monopolizing a GPU leads to uneven resource distribution, making it difficult to fully utilize hardware capabilities.

To address this issue, Koordinator, combined with HAMi, provides GPU sharing and isolation capabilities, allowing multiple Pods to share a single GPU card. This approach not only significantly improves GPU resource utilization but also reduces enterprise costs while meeting flexible resource demands for different tasks. For example, under Koordinator’s GPU sharing mode, users can precisely allocate GPU cores or memory ratios, ensuring each task receives the required resources while avoiding interference.

HAMi is a CNCF Sandbox project aimed at providing a device management middleware for Kubernetes. HAMi-Core, its core module, hijacks API calls between CUDA-Runtime (libcudart.so) and CUDA-Driver (libcuda.so) to provide GPU sharing and isolation capabilities. In v1.6.0, Koordinator leverages HAMi-Core’s GPU isolation features to offer an end-to-end GPU sharing solution.

You can install HAMi-Core on the corresponding nodes by deploying a DaemonSet with the YAML file below:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-core-distribute
  namespace: default
spec:
  selector:
    matchLabels:
      koord-app: hami-core-distribute
  template:
    metadata:
      labels:
        koord-app: hami-core-distribute
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - "gpu"
      containers:
      - command:
        - /bin/sh
        - -c
        - |
          cp -f /k8s-vgpu/lib/nvidia/libvgpu.so /usr/local/vgpu && sleep 3600000
        image: docker.m.daocloud.io/projecthami/hami:v2.4.0
        imagePullPolicy: Always
        name: name
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: "0"
            memory: "0"
        volumeMounts:
        - mountPath: /usr/local/vgpu
          name: vgpu-hook
        - mountPath: /tmp/vgpulock
          name: vgpu-lock
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /usr/local/vgpu
          type: DirectoryOrCreate
        name: vgpu-hook
     # https://github.com/Project-HAMi/HAMi/issues/696
      - hostPath:
          path: /tmp/vgpulock
          type: DirectoryOrCreate
        name: vgpu-lock

The Koordinator scheduler's GPU Binpack capability is enabled by default. After installing Koordinator and HAMi-Core, users can request a shared GPU and enable HAMi-Core isolation as follows:

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    koordinator.sh/gpu-isolation-provider: hami-core
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu-shared: 1
        koordinator.sh/gpu-core: 50
        koordinator.sh/gpu-memory-ratio: 50
      requests:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu-shared: 1
        koordinator.sh/gpu-core: 50
        koordinator.sh/gpu-memory-ratio: 50
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
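
To illustrate how percentage-based requests such as koordinator.sh/gpu-core: 50 and koordinator.sh/gpu-memory-ratio: 50 let several Pods share one card, here is a minimal bookkeeping sketch in Python. The SharedGPU class and its method are hypothetical names introduced for this example; the real accounting lives in the scheduler and is more involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SharedGPU:
    """Tracks the unallocated share of one physical GPU, in percent."""

    core_free: int = 100          # compute share still available
    memory_ratio_free: int = 100  # GPU memory share still available
    pods: List[str] = field(default_factory=list)

    def try_allocate(self, pod: str, core: int, memory_ratio: int) -> bool:
        """Reserve the requested shares if both still fit on this GPU."""
        if core > self.core_free or memory_ratio > self.memory_ratio_free:
            return False
        self.core_free -= core
        self.memory_ratio_free -= memory_ratio
        self.pods.append(pod)
        return True


gpu = SharedGPU()
print(gpu.try_allocate("pod-a", core=50, memory_ratio=50))  # fits
print(gpu.try_allocate("pod-b", core=50, memory_ratio=50))  # fits, GPU now full
print(gpu.try_allocate("pod-c", core=10, memory_ratio=10))  # rejected
```

Two Pods shaped like the example above fill one card exactly; a third request has to be placed on another GPU.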

For guidance on enabling HAMi GPU sharing isolation capabilities in Koordinator, please refer to:

Device Scheduling - GPU Share With HAMi

Special thanks to HAMi community maintainer @wawa0210 for supporting this feature!

4. Differentiated GPU Scheduling Policies: Effectively Reducing GPU Fragmentation

In modern Kubernetes clusters, various types of resources (such as CPU, memory, and GPU) are typically managed on a unified platform. However, the usage patterns and demands for different resources often vary significantly, leading to differing needs for stacking (Packing) and spreading (Spreading) strategies. For example:

GPU Resources: In AI model training or inference tasks, to maximize GPU utilization and reduce fragmentation, users generally prefer to schedule GPU tasks onto nodes that already have GPUs allocated ("stacking" strategy). This prevents resource waste caused by overly dispersed GPU distributions.
CPU and Memory Resources: In contrast, CPU and memory resource demands are more diverse. For some online services or batch processing tasks, users tend to distribute tasks across multiple nodes ("spreading" strategy) to avoid hotspots on individual nodes, thereby improving overall cluster stability and performance.

Additionally, in mixed workload scenarios, different tasks’ resource demands can influence each other. For instance:

In a cluster running both GPU training tasks and regular CPU-intensive tasks, if CPU-intensive tasks are scheduled onto GPU nodes and consume significant CPU and memory resources, subsequent GPU tasks may fail to start due to insufficient non-GPU resources, remaining in a Pending state.
In multi-tenant environments, some users may only request CPU and memory resources, while others need GPU resources. If the scheduler cannot distinguish these needs, it may lead to resource contention and unfair resource allocation.

The native Kubernetes NodeResourcesFit plugin currently supports only a single scoring strategy that applies to all resources, as shown below:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeResourcesFit
        args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: NodeResourcesFitArgs
          scoringStrategy:
            type: LeastAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: nvidia.com/gpu
                weight: 1

However, in practical production settings, this design may not always be suitable. For example, in AI scenarios, GPU-requesting jobs prefer to fill entire GPU machines to prevent GPU fragmentation, whereas CPU and memory jobs prefer spreading to reduce CPU hotspots. In v1.6.0, Koordinator introduces the NodeResourcesFitPlus plugin to support differentiated scoring strategies for different resources. Users can configure it when installing the Koordinator scheduler as follows:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesFitPlusArgs
      resources: 
        nvidia.com/gpu:
          type: MostAllocated
          weight: 2
        cpu:
          type: LeastAllocated
          weight: 1
        memory:
          type: LeastAllocated
          weight: 1
    name: NodeResourcesFitPlus
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
  schedulerName: koord-scheduler
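
The effect of mixing strategies per resource can be checked with a small Python sketch. The two per-resource formulas follow the usual LeastAllocated and MostAllocated definitions (free share and used share of allocatable, scaled to 0..100). How NodeResourcesFitPlus combines them is an assumption here: the sketch takes a weighted average over the resources the Pod actually requests, and POLICY and node_score are names introduced only for this example.

```python
from typing import Dict, Tuple

MAX_NODE_SCORE = 100


def least_allocated(requested: int, allocatable: int) -> int:
    # Higher score for emptier nodes (spreading).
    return (allocatable - requested) * MAX_NODE_SCORE // allocatable


def most_allocated(requested: int, allocatable: int) -> int:
    # Higher score for fuller nodes (packing).
    return requested * MAX_NODE_SCORE // allocatable


STRATEGIES = {"LeastAllocated": least_allocated, "MostAllocated": most_allocated}

# Mirrors the plugin arguments above: resource name -> (strategy, weight).
POLICY: Dict[str, Tuple[str, int]] = {
    "nvidia.com/gpu": ("MostAllocated", 2),
    "cpu": ("LeastAllocated", 1),
    "memory": ("LeastAllocated", 1),
}


def node_score(pod: Dict[str, int], used: Dict[str, int], alloc: Dict[str, int]) -> int:
    """Weighted average of per-resource scores (assumed combination rule)."""
    total, weight_sum = 0, 0
    for name, (strategy, weight) in POLICY.items():
        if pod.get(name, 0) == 0 or alloc.get(name, 0) == 0:
            continue  # assumption: only resources the Pod requests are scored
        after = used.get(name, 0) + pod[name]
        total += STRATEGIES[strategy](after, alloc[name]) * weight
        weight_sum += weight
    return total // weight_sum if weight_sum else 0


alloc = {"nvidia.com/gpu": 8, "cpu": 64}
gpu_pod = {"nvidia.com/gpu": 1, "cpu": 8}
busy = node_score(gpu_pod, {"nvidia.com/gpu": 6, "cpu": 48}, alloc)
empty = node_score(gpu_pod, {"nvidia.com/gpu": 0, "cpu": 0}, alloc)
print(busy, empty)
```

Because the GPU term carries twice the weight and uses MostAllocated, the GPU Pod scores the node that already has six GPUs in use higher than the empty node, which is the packing behavior the configuration aims for.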

Moreover, CPU and memory jobs should preferably spread to non-GPU machines so that they do not consume excessive CPU and memory on GPU machines, which could leave actual GPU tasks Pending due to insufficient non-GPU resources. In v1.6.0, Koordinator introduces the ScarceResourceAvoidance plugin to support this requirement. Users can configure the scheduler as follows, declaring nvidia.com/gpu a scarce resource: Pods that do not request this resource should avoid nodes that provide it.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: ScarceResourceAvoidanceArgs
      resources: 
      - nvidia.com/gpu
    name: ScarceResourceAvoidance
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
      - name: ScarceResourceAvoidance
        weight: 2
      disabled:
      - name: "*"
  schedulerName: koord-scheduler
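
The intent of ScarceResourceAvoidance can be approximated with the following Python sketch. The scoring rule is an assumption made for illustration, not a formula confirmed by this article: a node loses score in proportion to the number of scarce resource types it offers that the Pod does not request. The function name scarce_avoidance_score is hypothetical.

```python
from typing import Dict, Set

MAX_NODE_SCORE = 100


def scarce_avoidance_score(
    pod_request: Dict[str, int],
    node_allocatable: Dict[str, int],
    scarce: Set[str],
) -> int:
    """Score a node lower when it holds scarce resources the Pod does not need."""
    node_types = {name for name, qty in node_allocatable.items() if qty > 0}
    if not node_types:
        return 0
    requested = {name for name, qty in pod_request.items() if qty > 0}
    # Scarce resource types present on the node but unused by this Pod.
    wasted = {name for name in node_types if name in scarce and name not in requested}
    return (len(node_types) - len(wasted)) * MAX_NODE_SCORE // len(node_types)


SCARCE = {"nvidia.com/gpu"}
gpu_node = {"cpu": 64, "memory": 512, "nvidia.com/gpu": 8}
cpu_node = {"cpu": 64, "memory": 512}
cpu_pod = {"cpu": 4, "memory": 8}
gpu_pod = {"cpu": 4, "memory": 8, "nvidia.com/gpu": 1}

print(scarce_avoidance_score(cpu_pod, gpu_node, SCARCE))  # penalized
print(scarce_avoidance_score(cpu_pod, cpu_node, SCARCE))  # full score
print(scarce_avoidance_score(gpu_pod, gpu_node, SCARCE))  # full score
```

Under this rule a CPU-only Pod prefers the CPU node, while a GPU Pod is not penalized on the GPU node.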

For detailed designs and user guides on GPU resource differentiated scheduling policies, please refer to:

Special thanks to community developer @LY-today for contributing to this feature.

5. Fine-Grained Resource Reservation: Meeting Efficient Operation Needs for AI Tasks

Efficient utilization of heterogeneous resources often relies on precise alignment with closely coupled CPU and NUMA resources. For example:

GPU-Accelerated Tasks: In multi-NUMA node servers, if the physical connection between GPU and CPU or memory spans NUMA boundaries, it may increase data transmission latency, significantly reducing task performance. Therefore, such tasks typically require GPU, CPU, and memory to be allocated on the same NUMA node.
AI Inference Services: Online inference tasks are highly sensitive to latency and need to ensure GPU and CPU resource allocations are as close as possible to minimize cross-NUMA node communication overhead.
Scientific Computing Tasks: Some high-performance computing tasks (e.g., molecular dynamics simulations or weather forecasting) require high-bandwidth, low-latency memory access, necessitating strict alignment of CPU cores and local memory.

These requirements extend beyond task scheduling to resource reservation scenarios. In production environments, resource reservation is an important mechanism for locking resources in advance for critical tasks, ensuring smooth operation at a future point in time. However, simple resource reservation mechanisms often fail to meet fine-grained orchestration needs in heterogeneous resource scenarios. For example:

Certain tasks may need to reserve specific NUMA node CPU and GPU resources to guarantee optimal performance upon task startup.
In multi-tenant clusters, different users may need to reserve different combinations of resources (e.g., GPU + CPU + memory) and expect these resources to be strictly aligned.
When reserved resources are not fully utilized, how to flexibly allocate remaining resources to other tasks without affecting reserved task resource guarantees is another important challenge.

To address these complex scenarios, Koordinator comprehensively enhances resource reservation functionality in v1.6, providing more refined and flexible resource orchestration capabilities. Specific improvements include:

Supporting fine-grained CPU and GPU resource reservations and preemption.
Supporting exact matching of reserved resource quantities for Pods.
Reservation affinity supports specifying reservation names and taint tolerance attributes.
Resource reservation supports limiting the number of Pods.
Supporting preempting lower-priority Pods with reserved resources.

Changes to plugin extension interfaces:

The reservation validation interface ReservationFilterPlugin is moved from the PreScore phase to the Filter phase to ensure more accurate filtering results.
The ReservationRestorePlugin interface, which returns reserved resources to the scheduling ledger, deprecates unnecessary methods to improve scheduling efficiency.

Below are examples of new functionalities:

Exact-Match Reservation. Specify Pods to exactly match reserved resource quantities, which can narrow down the matching relationship between a group of Pods and a group of reservations, making reservation allocation more controllable.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Specify the resource categories for which the Pod exactly matches reserved resources; Pods can only match Reservation objects whose reserved resource quantities and Pod specifications are completely equal in these resource categories; e.g., specify "cpu", "memory", "nvidia.com/gpu"
    scheduling.koordinator.sh/exact-match-reservation: '{"resourceNames":["cpu","memory","nvidia.com/gpu"]}'
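
The matching rule described in the comment above can be expressed as a short Python predicate. This is a sketch under the assumption that quantities have already been normalized to plain integers (for example, CPU in millicores and memory in bytes); exactly_matches is a hypothetical helper, not a Koordinator function.

```python
from typing import Dict, Iterable


def exactly_matches(
    pod_requests: Dict[str, int],
    reserved: Dict[str, int],
    resource_names: Iterable[str],
) -> bool:
    """True if Pod and Reservation agree on every listed resource quantity."""
    return all(pod_requests.get(name) == reserved.get(name) for name in resource_names)


names = ["cpu", "memory", "nvidia.com/gpu"]
pod = {"cpu": 8000, "memory": 32 * 2**30, "nvidia.com/gpu": 1}
same_shape = {"cpu": 8000, "memory": 32 * 2**30, "nvidia.com/gpu": 1}
larger = {"cpu": 16000, "memory": 32 * 2**30, "nvidia.com/gpu": 1}

print(exactly_matches(pod, same_shape, names))  # identical quantities
print(exactly_matches(pod, larger, names))      # CPU differs, no match
print(exactly_matches(pod, larger, ["memory"])) # only memory is compared
```

Listing fewer resource names loosens the match, which is how the annotation narrows or widens the set of Reservations a Pod may consume.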

Ignore Resource Reservations (reservation-ignored). Specify that Pods ignore resource reservations, so that they can fill resources that are reserved on a node but not yet allocated. This complements preemption and reduces resource fragmentation.

apiVersion: v1
kind: Pod
metadata:
  labels:
    # Specify that the Pod’s scheduling can ignore resource reservations
    scheduling.koordinator.sh/reservation-ignored: "true"

Specify Reservation Name Affinity (ReservationAffinity)

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Specify the name of the resource reservation matched by the Pod
    scheduling.koordinator.sh/reservation-affinity: '{"name":"test-reservation"}'

Specify Taints and Tolerations for Resource Reservations

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: test-reservation
spec:
  # Specify Taints for the Reservation; its reserved resources can only be allocated to Pods tolerating this taint
  taints:
  - effect: NoSchedule
    key: test-taint-key
    value: test-taint-value
  # ...
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Specify the Pod’s toleration for resource reservation taints
    scheduling.koordinator.sh/reservation-affinity: '{"tolerations":[{"key":"test-taint-key","operator":"Equal","value":"test-taint-value","effect":"NoSchedule"}]}'

Enable Reservation Preemption

Note: Currently, high-priority Pods preempting low-priority Reservations is not supported.

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: Reservation
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: ReservationArgs
      enablePreemption: true
  # ...
  plugins:
    postFilter:
      # Disable the DefaultPreemption plugin’s preemption in the scheduler configuration and enable the Reservation plugin’s preemption
      disabled:
      - name: DefaultPreemption
      # ...
      enabled:
      - name: Reservation

Special thanks to community developer @saintube for contributing to this feature!

6. Co-location: Mid-tier Supports Idle Resource Reallocation, Enhances Pod-Level QoS Configuration

In modern data centers, co-location technology has become an important means of improving resource utilization. By mixing latency-sensitive tasks (e.g., online services) with resource-intensive tasks (e.g., offline batch processing) on the same cluster, enterprises can significantly reduce hardware costs and improve resource efficiency. However, as resource utilization in co-located clusters continues to rise, ensuring resource isolation between different types of tasks becomes a key challenge.

In co-location scenarios, the core objectives of resource isolation capabilities are:

Guaranteeing High-Priority Task Performance: For example, online services require stable CPU, memory, and I/O resources to meet low-latency requirements.
Fully Utilizing Idle Resources: Offline tasks should utilize as much unused resource from high-priority tasks as possible without interfering with them.
Dynamically Adjusting Resource Allocation: Real-time adjustment of resource allocation strategies based on node load changes to avoid resource contention or waste.

To achieve these goals, Koordinator continuously builds and refines resource isolation capabilities. In v1.6, we focused on optimizing resource oversubscription and co-location QoS with a series of functional optimizations and bug fixes, specifically including:

Optimizing calculation logic for Mid resource oversubscription and node profiling features, supporting oversubscription of unallocated node resources to avoid double oversubscription of node resources.
Optimizing metric degradation logic for load-aware scheduling.
Supporting Pod-level configuration for CPU QoS and Resctrl QoS.
Supplementing Prometheus metrics for out-of-band load management to enhance observability.
Bugfixes for Blkio QoS, resource amplification, and other features.

Mid resource oversubscription has been available since Koordinator v1.3 and provides dynamic resource oversubscription based on node profiling. To keep oversubscribed resources stable, however, Mid resources were sourced entirely from Prod Pods already allocated on a node. An empty node therefore offered no Mid resources at first, which was inconvenient for workloads that rely on them. Based on feedback and contributions from enterprise users, Koordinator v1.6 updates the oversubscription formula as follows:

MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio) + ProdUnallocated * unallocatedRatio
ProdReclaimable := min(max(0, ProdAllocated - ProdPeak * (1 + safeMargin)), NodeUnused)

There are two changes in the calculation logic:

Supporting static proportional oversubscription of unallocated resources to improve cold start issues.
Disallowing oversubscription of actually used node resources to avoid overestimated predictions caused by secondary oversubscription scenarios; for example, some users leverage Koordinator’s node resource amplification capabilities to schedule more Prod pods, causing ProdAllocated > NodeAllocatable, leading to MidAllocatable predictions deviating from actual node loads.
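
The updated formula can be evaluated directly. The Python sketch below transcribes the two expressions above; the function and parameter names, and all input numbers (CPU cores, ratios, margins), are made up for illustration and do not reflect Koordinator defaults.

```python
def prod_reclaimable(prod_allocated: float, prod_peak: float,
                     safe_margin: float, node_unused: float) -> float:
    # ProdReclaimable := min(max(0, ProdAllocated - ProdPeak * (1 + safeMargin)), NodeUnused)
    return min(max(0.0, prod_allocated - prod_peak * (1 + safe_margin)), node_unused)


def mid_allocatable(prod_allocated: float, prod_peak: float, safe_margin: float,
                    node_unused: float, node_allocatable: float,
                    threshold_ratio: float, prod_unallocated: float,
                    unallocated_ratio: float) -> float:
    # MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)
    #                   + ProdUnallocated * unallocatedRatio
    reclaimable = prod_reclaimable(prod_allocated, prod_peak, safe_margin, node_unused)
    return (min(reclaimable, node_allocatable * threshold_ratio)
            + prod_unallocated * unallocated_ratio)


# A partly loaded 64-core node: 40 cores allocated to Prod, peak usage 20 cores.
loaded = mid_allocatable(prod_allocated=40, prod_peak=20, safe_margin=0.1,
                         node_unused=40, node_allocatable=64, threshold_ratio=0.5,
                         prod_unallocated=24, unallocated_ratio=0.25)

# An empty 64-core node: nothing to reclaim, but unallocated capacity counts.
empty = mid_allocatable(prod_allocated=0, prod_peak=0, safe_margin=0.1,
                        node_unused=64, node_allocatable=64, threshold_ratio=0.5,
                        prod_unallocated=64, unallocated_ratio=0.25)

print(round(loaded, 2), round(empty, 2))
```

With these inputs the loaded node offers roughly 24 cores of Mid resources (about 18 reclaimed plus 6 from unallocated capacity), and the empty node offers 16 cores instead of none, which is the cold-start improvement described above.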

Additionally, for co-location QoS, Koordinator v1.6 enhances Pod-level QoS policy configuration. This applies to scenarios such as blocklisting interfering Pods on co-located nodes and gradually rolling out co-location QoS:

Resctrl feature, supporting LLC and memory bandwidth isolation capabilities at the Pod level.
CPU QoS feature, supporting CPU QoS configuration at the Pod level.

The Resctrl feature can be enabled at the Pod level as follows:

Enable the Resctrl feature in Koordlet’s feature-gate.
Configure LLC and memory bandwidth (MB) restriction policies via the Pod annotation node.koordinator.sh/resctrl. For example:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    node.koordinator.sh/resctrl: '{"llc": {"schemata": {"range": [0, 30]}}, "mb": {"schemata": {"percent": 20}}}'

Pod-level CPU QoS configuration can be enabled as follows:

Enable CPU QoS, please refer to: https://koordinator.sh/docs/user-manuals/cpu-qos/
Configure Pod CPU QoS policies via the Pod annotation koordinator.sh/cpuQOS. For example:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    koordinator.sh/cpuQOS: '{"groupIdentity": 1}'

Special thanks to @kangclzjc, @j4ckstraw, @lijunxin559, @tan90github, @yangfeiyu20102011 and other community developers for their contributions to co-location related features!

7. Scheduling, Rescheduling: Continuously Improved Operational Efficiency

With the continuous development of cloud-native technologies, more and more enterprises are migrating core businesses to Kubernetes platforms, resulting in explosive growth in cluster scale and task numbers. This trend brings significant technical challenges, especially in terms of scheduling performance and rescheduling strategies:

Scheduling Performance Requirements: As cluster sizes expand, the number of tasks schedulers need to handle surges dramatically, placing higher demands on scheduler performance and scalability. For instance, in large-scale clusters, how to quickly complete Pod scheduling decisions and reduce scheduling latency becomes a key issue.
Rescheduling Strategy Requirements: In multi-tenant environments, intensified resource competition may cause frequent rescheduling, leading to workloads repeatedly migrating between nodes, thereby increasing system burden and affecting cluster stability. Additionally, how to reasonably allocate resources to avoid hotspot issues while ensuring stable operation of production tasks has become a critical consideration in designing rescheduling strategies.

To address these challenges, Koordinator comprehensively optimized the scheduler and rescheduler in v1.6.0, aiming to improve scheduling performance and enhance the stability and rationality of rescheduling strategies. Below are our optimizations for scheduler performance in the current version:

Moving MinMember checks for PodGroups to PreEnqueue to reduce unnecessary scheduling cycles.
Delaying resource returns for Reservations to the AfterPreFilter stage, performing resource returns only on nodes allowed by PreFilterResult to reduce algorithm complexity.
Optimizing CycleState distributions for plugins like NodeNUMAResource, DeviceShare, and Reservation to reduce memory overhead.
Adding delay metrics for additional extension points introduced by Koordinator, such as BeforePreFilter and AfterPreFilter.

As cluster scales continue to grow, the stability and rationality of the rescheduling process become focal concerns. Frequent evictions may cause workloads to repeatedly migrate between nodes, increasing system burden and posing stability risks. To this end, we conducted several optimizations for the rescheduler in v1.6.0:

LowNodeLoad Plugin Optimization:
1. The LowNodeLoad plugin now supports configuring ProdHighThresholds and ProdLowThresholds, combining Koordinator priorities (Priority) to manage workload resource utilization differently, reducing hotspot issues caused by production applications and achieving finer-grained load balancing;
2. Optimized sorting logic for candidate eviction Pods, selecting the most suitable Pods for eviction through segmented function scoring algorithms to ensure reasonable resource allocation and avoid stability issues caused by evicting the most resource-utilized Pods;
3. Optimized pre-eviction checks for Pods; LowNodeLoad checks whether target nodes might become new hotspot nodes before evicting Pods, effectively preventing repeated rescheduling occurrences.
MigrationController Enhancement:
1. MigrationController's ObjectLimiter controls how often a workload's Pods can be evicted within a given period. It now also supports namespace-level eviction throttling, giving finer-grained control over evictions within a namespace. In addition, ObjectLimiter was moved from the Arbitrator into MigrationController, fixing potential throttling failures in concurrent scenarios;
2. Added EvictAllBarePods configuration item, allowing users to enable eviction of Pods without OwnerRef, thus increasing rescheduling flexibility;
3. Added MaxMigratingGlobally configuration item, enabling MigrationController to independently control the maximum number of Pod evictions, thereby reducing stability risks;
4. Optimized the calculation logic of the GetMaxUnavailable method: when a workload's maximum unavailable replica count rounds down to 0, it is now adjusted to 1, so that the result stays accurate and consistent with the replica unavailability the user intended to allow.
Added the global rescheduling parameter MaxNoOfPodsToEvictTotal, which caps the total number of Pods the rescheduler may evict globally, reducing cluster burden and enhancing stability.
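
The GetMaxUnavailable adjustment described above can be illustrated with a small sketch. The function name `max_unavailable` and its string-based input are assumptions made for illustration, not the actual Go signature; the sketch also assumes that the floor of 1 applies only to percentage values that round down to 0.

```python
def max_unavailable(replicas: int, setting: str) -> int:
    """Resolve an absolute or percentage setting to a number of Pods.

    Percentages are rounded down; a result of 0 is raised to 1 so that a
    small workload can still have one replica migrated at a time.
    """
    if setting.endswith("%"):
        value = replicas * int(setting[:-1]) // 100
        return max(value, 1)
    return int(setting)


print(max_unavailable(5, "10%"))   # 1 (5 * 10% rounds down to 0, raised to 1)
print(max_unavailable(20, "10%"))  # 2
print(max_unavailable(5, "2"))     # 2
```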

Special thanks to community developers @AdrianMachao, @songtao98, @LY-today, @zwForrest, @JBinin, @googs1025, @bogo-y for their contributions to scheduling and rescheduling optimizations!

Future Plans

The Koordinator community will continue to strengthen GPU resource management and scheduling, and will provide rescheduling plugins to further resolve GPU fragmentation caused by imbalanced resource allocation. The next version is planned to introduce more features to support more complex workload scenarios. Meanwhile, resource reservation and co-location will be further optimized to support finer-grained scenarios.

Currently planned Proposals in the community are as follows:

Key usage issues to be addressed include:

NRI Plugin Conflicts

Long-term planned Proposals include:

Providing an End-to-End Evolvable Device Management Solution

We encourage user feedback on usage experiences and welcome more developers to participate in the Koordinator project, jointly driving its development!

Koordinator v1.5: continuous optimization, join CNCF Sandbox

June 18, 2024 · 12 min read

Rougang Han

Koordinator member

Jianyu Wang

Koordinator member

Background

Koordinator is an open source project born from the experience Alibaba has accumulated in container scheduling over more than two years. It has been iterating continuously to provide comprehensive solutions for workload consolidation, co-located resource scheduling, mixed resource isolation, and mixed performance tuning. It aims to help users optimize container performance, improve the efficiency of cluster resource usage and management, and optimize the running of latency-sensitive workloads and batch jobs.

Today, Koordinator v1.5.0 is released. It is the 13th major release of Koordinator since it was officially open-sourced in April 2022. The Koordinator community is grateful to all the excellent engineers from Alibaba, Ant Technology Group, Intel, XiaoHongShu, Xiaomi, iQiyi, 360, YouZan, etc., who have contributed great ideas, code, and various scenarios. Koordinator v1.5.0 brings many feature improvements, including Pod-level NUMA alignment strategy, network QoS, Core Scheduling, etc.

Besides, Koordinator has been accepted by the CNCF TOC members as a Sandbox project. CNCF (Cloud Native Computing Foundation) is an independent, non-profit organization that supports and promotes cloud native software such as Kubernetes and Prometheus.

Vote address: https://github.com/cncf/sandbox/issues/51

Key Features

Pod-level NUMA Policy

In v1.4.0, Koordinator allowed users to set different NUMA alignment policies for different nodes in the cluster. However, this means that users need to pre-split the nodes into node pools with different NUMA alignment policies, which adds overhead to node operations.

In v1.5.0, Koordinator introduces Pod-level NUMA alignment policies to solve this problem. For example, we can set SingleNUMANode for pod-1:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
spec:
  containers:
    - name: container-1
      resources:
        requests:
          cpu: '1'
        limits:
          cpu: '1'

After introducing Pod-level NUMA policies, it is possible that there are multiple NUMA policies on the same node. For example, node-1 has two NUMA nodes, pod-1 uses SingleNUMANode policy on numa-0, and pod-2 uses Restricted policy on numa-0 and numa-1.

A Pod's resource requirements only limit the maximum resources it can use on the machine; they do not limit what it can use on a particular NUMA node. So pod-2 may use more resources on numa-0 than were allocated to it there, which leads to resource contention between pod-2 and pod-1 on numa-0.

To solve this problem, Koordinator supports configuring the exclusive policy for Pods with SingleNUMANode policy. For example, we can configure pod-1 to use SingleNUMANode policy and not co-exist with other Pods on the same machine:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    # singleNUMANodeExclusive accepts Required or Preferred
    scheduling.koordinator.sh/numa-topology-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode",
        "singleNUMANodeExclusive": "Required"
      }
spec:
  containers:
    - name: container-1
      resources:
        requests:
          cpu: '1'
        limits:
          cpu: '1'

Moreover, the introduction of Pod-level NUMA policies does not mean that the Node-level NUMA policies will be deprecated. Instead, they are compatible. If the Pod and Node-level NUMA policies are different, the Pod will not be scheduled to the node; if the Node-level NUMA policy is "", it means that the node can place any Pod; if the Pod-level NUMA policy is "", it means that the Pod can be scheduled to any node.

	SingleNUMANode node	Restricted node	BestEffort node
SingleNUMANode pod	[✓]	[x]	[x]
Restricted pod	[x]	[✓]	[x]
BestEffort pod	[x]	[x]	[✓]
""	[✓]	[✓]	[✓]

For more information about Pod-level NUMA policies, please see Proposal: Pod-level NUMA Policy.
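
The compatibility rules in the table above can be written as a short predicate. This is a minimal sketch of the rules as stated in this post, with a hypothetical function name; it is not the scheduler's actual code.

```python
def numa_policies_compatible(pod_policy: str, node_policy: str) -> bool:
    """Return True if a Pod with pod_policy may be placed on a node with node_policy.

    An empty string means "no policy" and is compatible with anything;
    otherwise the Pod-level and Node-level policies must be identical.
    """
    if pod_policy == "" or node_policy == "":
        return True
    return pod_policy == node_policy


print(numa_policies_compatible("SingleNUMANode", "SingleNUMANode"))  # True
print(numa_policies_compatible("Restricted", "BestEffort"))          # False
print(numa_policies_compatible("", "Restricted"))                    # True
```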

Terway Net QoS

In v1.5.0, Koordinator cooperates with the Terway community to build Network QoS.

Terway QoS was created to solve the network bandwidth contention problem in workload consolidation and co-location scenarios. It supports limiting the bandwidth of Pods or QoS classes, and differs from other solutions in the following ways:

It supports limiting the bandwidth according to the business type, which is suitable for workload consolidation scenarios where multiple applications are co-located on the same node.
It supports dynamic adjustment of Pod bandwidth limits.
It can limit bandwidth for the whole machine, supports multiple network cards, and can limit both container-network and HostNetwork Pods.

Terway QoS has 3 types of network bandwidth priority, and the corresponding Koordinator default QoS mapping is as follows:

Koordinator QoS	Kubernetes QoS	Terway Net QoS
SYSTEM	--	L0
LSE	Guaranteed	L1
LSR	Guaranteed	L1
LS	Guaranteed/Burstable	L1
BE	BestEffort	L2

In the co-location scenario, we want to ensure the maximum bandwidth of online applications to avoid contention. When the node is idle, offline jobs can also fully utilize all bandwidth resources.

Therefore, users can define business traffic as three priorities, from high to low: L0, L1, and L2. We define the contention scenario as the case where the sum of the bandwidth of L0, L1, and L2 exceeds the whole machine bandwidth.

L0's maximum bandwidth is dynamically adjusted according to the real-time bandwidth of L1 and L2. It can be as high as the total machine bandwidth and as low as "total machine bandwidth - L1 minimum bandwidth - L2 minimum bandwidth". In any case, the bandwidth of L1 and L2 will not exceed their upper limits. In the contention scenario, the bandwidth of L1 and L2 will not fall below their lower limits, and bandwidth is limited in the order of L2, L1, and L0. Since Terway QoS has only three priorities, total bandwidth limits can be set only for LS and BE; the remaining share for L0 is derived from the upper bandwidth limit of the whole machine.

Here is an example of the configuration:

# unit: bps
resource-qos-config: |
  {
    "clusterStrategy": {
      "policies": {"netQOSPolicy":"terway-qos"},
      "lsClass": {
        "networkQOS": {
          "enable": true,
          "ingressRequest": "50M",
          "ingressLimit": "100M",
          "egressRequest": "50M",
          "egressLimit": "100M"
        }
      },
      "beClass": {
        "networkQOS": {
          "enable": true,
          "ingressRequest": "10M",
          "ingressLimit": "200M",
          "egressRequest": "10M",
          "egressLimit": "200M"
        }
      }
    }
  }
system-config: |-
  {
    "clusterStrategy": {
      "totalNetworkBandwidth": "600M"
    }
  }
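
As a worked example of the L0 range described earlier, the following sketch computes the lower and upper bound of L0's maximum bandwidth from the values in this configuration. It assumes that the `ingressRequest`/`egressRequest` fields act as the minimum bandwidths of L1 (LS) and L2 (BE) and that suffixes are decimal units; the helper names are hypothetical.

```python
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_bps(value: str) -> int:
    """Parse strings such as "600M" into bits per second (decimal units assumed)."""
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def l0_bandwidth_range(total: str, l1_min: str, l2_min: str) -> tuple:
    """Return (lowest, highest) value that L0's maximum bandwidth can take."""
    upper = parse_bps(total)
    lower = upper - parse_bps(l1_min) - parse_bps(l2_min)
    return lower, upper


# totalNetworkBandwidth=600M, lsClass request=50M, beClass request=10M
print(l0_bandwidth_range("600M", "50M", "10M"))  # (540000000, 600000000)
```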

Besides, Koordinator supports Pod-level bandwidth limits through the following annotations:

Key	Value
koordinator.sh/networkQOS	'{"IngressLimit": "10M", "EgressLimit": "20M"}'

For more information about the Network QoS, please see Network Bandwidth Limitation Using Terway QoS and Terway Community.

Core Scheduling

In v1.5.0, Koordinator provides container-level Core Scheduling ability. It reduces the risk of Side Channel Attacks (SCA) in multi-tenant scenarios, and can be used as a CPU QoS enhancement for the co-location scenarios.

Linux Core Scheduling supports defining, in user space, a group of tasks that can share physical cores. Tasks belonging to the same group are assigned the same cookie as an identifier, and only tasks with the same cookie can run on a physical core (SMT dimension) at the same time. Applying this mechanism to security or performance scenarios makes the following possible:

Isolate physical cores for tasks of different tenants.
Avoid the contention between offline jobs and online services.

Koordinator enables the kernel mechanism Core Scheduling to achieve container-level group isolation policies, and finally forms the following two capabilities:

Runtime isolation of physical core: Pods can be grouped by the tenants, so pods in different groups cannot share physical cores at the same time for multi-tenant isolation.
Next-gen CPU QoS policy: It can achieve a new CPU QoS policy which ensures both the CPU performance and the security.

Runtime Isolation of Physical Core

Koordinator provides a Pod label protocol to identify the Core Scheduling group of Pods.

Key	Value
koordinator.sh/coreSchedulingGroup	"xxx-group"

Pods in different groups run exclusively at the physical core level, which avoids some side channel attacks on physical cores, L1 cache, or L2 cache in multi-tenant scenarios.

[Figure: container-level Core Scheduling]

Unlike cpuset scheduling, the set of CPUs a Pod runs on is not fixed. A physical core can run Pods of different groups at different moments, so physical cores are shared by time-division multiplexing.
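
The cookie rule and the time-division sharing can be illustrated with a toy model. This is not how the kernel scheduler is implemented; it only shows that the SMT siblings of one physical core run tasks of a single cookie per time slice, while different cookies take turns. The function name and the task tuples are hypothetical.

```python
from collections import defaultdict


def time_slices(tasks, smt_width=2):
    """Group runnable tasks into time slices for one physical core.

    tasks is a list of (task_name, cookie). Each returned slice holds at most
    smt_width tasks, all carrying the same cookie.
    """
    by_cookie = defaultdict(list)
    for name, cookie in tasks:
        by_cookie[cookie].append(name)
    slices = []
    for cookie, names in by_cookie.items():
        for i in range(0, len(names), smt_width):
            slices.append((cookie, names[i:i + smt_width]))
    return slices


tasks = [("a1", "tenant-a"), ("b1", "tenant-b"), ("a2", "tenant-a")]
print(time_slices(tasks))
# [('tenant-a', ['a1', 'a2']), ('tenant-b', ['b1'])]
```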

Next-Gen CPU QoS Policy

Koordinator builds a new CPU QoS policy based on the Core Scheduling and CGroup Idle mechanisms provided by the Anolis OS kernel.

BE containers enable the CGroup Idle feature to lower scheduling weights and priorities.
LSR/LS containers enable Core Scheduling feature to expel BE tasks of the same group on the physical cores.

Users can enable the Core Scheduling policy by specifying cpuPolicy="coreSched" in the slo-controller-config.

# Example of the slo-controller-config ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |
    {
      "clusterStrategy": {
        "policies": {
          "cpuPolicy": "coreSched"
        },
        "lsClass": {
          "cpuQOS": {
            "enable": true,
            "coreExpeller": true,
            "schedIdle": 0
          }
        },
        "beClass": {
          "cpuQOS": {
            "enable": true,
            "coreExpeller": false,
            "schedIdle": 1
          }
        }
      }
    }

For more information about the Core Scheduling, please see CPU QoS.

Other Changes

Koordinator v1.5.0 also includes the following enhancements and reliability improvements:

Enhancements: Reservation Restricted mode supports controlling which resources strictly follow the Restricted semantic through an annotation. The NUMA alignment policy adapts to upstream. Coscheduling implements fair queuing to ensure that Pods in the same GangGroup are dequeued together, and different Gangs and bare Pods are sorted by last scheduling time. NRI mode supports a reconnection mechanism. Koordlet improves the monitoring metrics and adds performance metrics. BlkioReconcile updates the configurations.
BugFixes: Fix the memory leak of the koordlet CPUSuppress feature. Fix the panic problem of runtimeproxy. Revise the calculation logic of the CPICollector, BECPUEvict, and CPUBurst modules.
Environment compatibility: All components are upgraded to K8S 1.28. Koordlet supports running on non-CUDA images. Koordlet adapts to the kubelet 1.28 configuration and optimizes the compatibility logic for the CPU manager. Koordlet adapts to the cri-o runtime.
Refactoring and improvement: Koordlet improves the resctrl updating logic. Koordlet improves the eviction logic. Revise the GPU resources and card model reporting. Revise the Batch resource calculation logic.
CI/CD: Fix some flaky tests.

For more information about the v1.5.0 changes, please see v1.5.0 Release.

Contributors

Koordinator is an open source community. In v1.5.0, there are 10 new developers who contributed to Koordinator main repo. They are @georgexiang, @googs1025, @l1b0k, @ls-2018, @PeterChg, @sjtufl, @testwill, @yangfeiyu20102011, @zhifanggao, @zwForrest.

The Koordinator community now has many enterprise contributors, some of whom have become Maintainers and Members. During the v1.5.0 release, the new Maintainers are

@songzh215
@j4ckstraw
@lucming
@kangclzjc

Thanks to the long-time contributors for their consistent efforts and to the newcomers for their active contributions. We welcome more contributors to join the Koordinator community.

Future Plan

In upcoming versions, Koordinator plans the following work:

Scheduling performance optimization: The scheduling performance is the key indicator of whether the scheduler can handle large-scale clusters. In the next version, Koordinator will provide a setup guide of the basic benchmark environment and common benchmark scenarios, and start to improve the scheduling performance of Koord-Scheduler.
Device union allocation: In LLM distributed training, GPUs on different machines usually need to communicate with each other through high-performance network cards, and a GPU and its high-performance network card should be allocated close to each other for better performance. Koordinator is working on union allocation for multiple heterogeneous resources. Union allocation is already supported in the protocol and the scheduling logic; the single-node logic for reporting network card resources is being explored.
Job-level quota preemption: In large-scale clusters, some quotas can be busy while others are idle. The ElasticQuota plugin already supports borrowing resources from idle quotas, but the scheduler does not yet consider Job information when the lent resources are taken back. For Pods belonging to the same Job, the scheduler should preempt at the Job level to ensure job scheduling and improve efficiency.
Load-aware scheduling for in-flight pods: Currently, load-aware scheduling filters and scores nodes based on resource utilization. This improves the distribution of utilization across nodes and reduces the risk of scheduling pods to overloaded nodes. However, the accuracy of the utilization can be affected by in-flight pods, since node metrics reporting lags. In the coming version, load-aware scheduling will take in-flight pods into consideration, keep pods from being scheduled to overloaded nodes, and further improve the distribution of utilization across nodes.
Fine-grained isolation strategy for last-level cache and memory bandwidth: Contention for the last-level cache and memory bandwidth between containers can degrade memory access performance. By isolating the last-level cache and memory bandwidth at the QoS level without exceeding the capacity of the RDT groups, koordlet provides Resctrl QoS to reduce contention between offline workloads and online services. In the next version, koordlet will enhance the isolation strategy based on the NRI (Node Resource Interface) mode introduced in v1.3. It will provide pod-level isolation, which greatly improves the feature's flexibility and timeliness.

Acknowledgement

Since the project was open-sourced, Koordinator has released more than 19 versions, with 80+ contributors involved. The community is growing and has been continuously improving. We thank all the community members for their active participation and valuable feedback. We also want to thank the CNCF organization and related community members for supporting the project.

We welcome more developers and end users to join us! It is your participation and feedback that keep Koordinator improving. Whether you are a beginner or an expert in the Cloud Native communities, we look forward to hearing your voice!

Koordinator v1.4: more types of computing workloads and more flexible resource management mechanisms

January 15, 2024 · 20 min read

Jianyu Wang

Koordinator member

Background

As an actively developing open source project, Koordinator has undergone multiple version iterations since the release of v0.1.0 in April 2022, continuously bringing innovations and enhancements to the Kubernetes ecosystem. The core objective of the project is to provide comprehensive solutions for orchestrating collocated workloads, scheduling resources, ensuring resource isolation, and tuning performance to help users optimize container performance and improve cluster resource utilization.

In past version iterations, the Koordinator community has continued to grow, receiving active participation and contributions from engineers at well-known companies. These include Alibaba, Ant Technology Group, Intel, Xiaomi, Xiaohongshu, iQIYI, Qihoo 360, Youzan, Quwan, Meiya Pico, PITS, among others. Each version has advanced through the collective efforts of the community, demonstrating the project's capability to address challenges in actual production environments.

Today, we are pleased to announce that Koordinator v1.4.0 is officially released. This version introduces several new features, including Kubernetes and YARN workload co-location, a NUMA topology alignment strategy, CPU normalization, and cold memory reporting. It also enhances features in key areas such as elastic quota management, QoS management for non-containerized applications on hosts, and descheduling protection strategies. These innovations and improvements aim to better support enterprise-level Kubernetes cluster environments, particularly in complex and diverse application scenarios.

The release of version v1.4.0 will bring users support for more types of computing workloads and more flexible resource management mechanisms. We look forward to these improvements helping users to address a broader range of enterprise resource management challenges. In the v1.4.0 release, a total of 11 new developers have joined the development of the Koordinator community. They are @shaloulcy, @baowj-678, @zqzten, @tan90github, @pheianox, @zxh326, @qinfustu, @ikaven1024, @peiqiaoWang, @bogo-y, and @xujihui1985. We thank all community members for their active participation and contributions during this period and for their ongoing commitment to the community.

Interpretation of Version Features

1. Support Kubernetes and YARN workload co-location

Koordinator already supports the co-location of online and offline workloads within the Kubernetes ecosystem. However, outside the Kubernetes ecosystem, a considerable number of big data workloads still run on traditional Hadoop YARN.

In response, the Koordinator community, together with developers from Alibaba Cloud, Xiaohongshu, and Ant Financial, has jointly launched the Hadoop YARN and Kubernetes co-location project, Koordinator YARN Copilot. This initiative enables the running of Hadoop NodeManager within Kubernetes clusters, fully leveraging the technical value of peak-shaving and resource reuse for different types of workloads. Koordinator YARN Copilot has the following features:

Embracing the open-source ecosystem: Built upon the open-source version of Hadoop YARN without any intrusive modifications to YARN.
Unified resource priority and QoS policy: YARN NodeManager utilizes Koordinator’s Batch priority resources and adheres to Koordinator's QoS management policies.
Node-level resource sharing: The co-location resources provided by Koordinator can be used by both Kubernetes pod and YARN tasks. Different types of offline applications can run on the same node.

For the detailed design of Koordinator YARN Copilot and its use in the Xiaohongshu production environment, please refer to Previous Articles and Official Community Documents.

2. Introducing NUMA topology alignment strategy

The workloads running in Kubernetes clusters are increasingly diverse, particularly in fields such as machine learning, where the demand for high-performance computing resources is on the rise. In these fields, a significant amount of CPU resources is required, as well as other high-speed computing resources like GPUs and RDMA. Moreover, to achieve optimal performance, these resources often need to be located on the same NUMA node or even the same PCIe bus.

Kubernetes' kubelet includes a topology manager that manages the NUMA topology of resource allocation. It attempts to align the topologies of multiple resources at the node level during the admission phase. However, because the node component lacks a global view of the scheduler and the timing of node selection for pods, pods may be scheduled on nodes that are unable to meet the topology alignment policy. This can result in pods failing to start due to topology affinity errors.

To solve this problem, Koordinator moves NUMA topology selection and alignment to the central scheduler, optimizing resource NUMA topology at the cluster level. In this release, Koordinator introduces NUMA-aware scheduling of CPU resources (including Batch resources) and NUMA-aware scheduling of GPU devices as alpha features. The entire suite of NUMA-aware scheduling features is rapidly evolving.

Koordinator enables users to configure the NUMA topology alignment strategy for multiple resources on a node through the node's labels. The configurable strategies are as follows:

None, the default strategy, does not perform any topological alignment.
BestEffort indicates that the node does not strictly allocate resources according to NUMA topology alignment. The scheduler can always allocate such nodes to pods as long as the remaining resources meet the pods' needs.
Restricted means that nodes allocate resources in strict accordance with NUMA topology alignment. In other words, the scheduler must select the same one or more NUMA nodes when allocating multiple resources, otherwise, the node should not be considered. For instance, if a pod requests 33 CPU cores and each NUMA node has 32 cores, it can be allocated to use two NUMA nodes. Moreover, if the pod also requests GPUs or RDMA, these must be on the same NUMA node as the CPU.
SingleNUMANode is similar to Restricted, adhering strictly to NUMA topology alignment, but it differs in that Restricted permits the use of multiple NUMA nodes, whereas SingleNUMANode restricts allocation to a single NUMA node.
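
The difference between Restricted and SingleNUMANode in the 33-core example can be checked with a few lines. This is a simplified sketch that looks at CPU only and assumes an empty node with equally sized NUMA nodes; the function names are hypothetical and do not come from the scheduler.

```python
import math


def numa_nodes_needed(cpu_request: int, cpus_per_numa: int) -> int:
    """Smallest number of NUMA nodes whose CPUs can cover the request."""
    return math.ceil(cpu_request / cpus_per_numa)


def allowed(policy: str, cpu_request: int, cpus_per_numa: int, numa_count: int) -> bool:
    """Return True if the request can be placed under the given policy."""
    needed = numa_nodes_needed(cpu_request, cpus_per_numa)
    if needed > numa_count:
        return False  # the node does not have enough CPUs at all
    if policy == "SingleNUMANode":
        return needed <= 1  # must fit inside one NUMA node
    return True  # Restricted may span several NUMA nodes


print(allowed("Restricted", 33, 32, 2))      # True, uses two NUMA nodes
print(allowed("SingleNUMANode", 33, 32, 2))  # False, does not fit in one
print(allowed("SingleNUMANode", 32, 32, 2))  # True
```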

For example, to set the SingleNUMANode policy for node-0, you would do the following:

apiVersion: v1
kind: Node
metadata:
  labels:
    node.koordinator.sh/numa-topology-policy: "SingleNUMANode"
  name: node-0
spec:
  ...

In a production environment, users may have enabled kubelet's topology alignment policy, which will be reflected by the koordlet in the TopologyPolicies field of the NodeResourceTopology CR object. When kubelet's policy conflicts with the policy set by the user on the node, the kubelet policy shall take precedence. The koord-scheduler essentially adopts the same NUMA alignment policy semantics as the kubelet topology manager. The kubelet policies SingleNUMANodePodLevel and SingleNUMANodeContainerLevel are both mapped to SingleNUMANode.

After configuring the NUMA alignment strategy for the node, the scheduler can identify many suitable NUMA node allocation results for each pod. Koordinator currently provides the NodeNUMAResource plugin, which allows for configuring the NUMA node allocation result scoring strategy for CPU and memory resources. This includes LeastAllocated and MostAllocated strategies, with LeastAllocated being the default. Each resource can also be assigned a configurable weight. The scheduler will ultimately select the NUMA node allocation with the highest score. For instance, we can configure the NUMA node allocation result scoring strategy to MostAllocated, as shown in the following example:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeNUMAResource
        args:
          apiVersion: kubescheduler.config.k8s.io/v1beta2
          kind: NodeNUMAResourceArgs
          scoringStrategy:  # Here configure Node level scoring strategy
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1
          numaScoringStrategy: # Here configure NUMA-Node level scoring strategy
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1

3. ElasticQuota evolves again

In order to fully utilize cluster resources and reduce management system costs, users often deploy workloads from multiple tenants in the same cluster. When cluster resources are limited, competition for these resources is inevitable between different tenants. As a result, the workloads of some tenants may always be satisfied, while others may never be executed, leading to demands for fairness. The quota mechanism is a very natural way to ensure fairness among tenants, where each tenant is allocated a specific quota, and they can use resources within that quota. Tasks exceeding the quota will not be scheduled or executed. However, simple quota management cannot fulfill tenants' expectations for elasticity in the cloud. Users hope that in addition to satisfying resource requests within the quota, requests for resources beyond the quota can also be met on demand.

In previous versions, Koordinator leveraged the upstream ElasticQuota protocol, which allowed tenants to set a 'Min' value to express their resource requests that must be satisfied, and a 'Max' value to limit the maximum resources they can use. 'Max' was also used to represent the shared weight of the remaining resources of the cluster when they were insufficient.

In addition to offering a flexible quota mechanism that accommodates tenants' on-demand resource requests, Koordinator enhances ElasticQuota with annotations to organize it into a tree structure, thereby simplifying the expression of hierarchical organizational structures for users.

The figure above depicts a common quota tree in a cluster utilizing Koordinator's elastic quota. The root quota serves as the link between the quota system and the actual resources within the cluster. In previous iterations, the root quota existed only within the scheduler's logic. In this release, we have made the root quota accessible to users in the form of a Custom Resource (CR). Users can now view information about the root quota through the ElasticQuota CR named koordinator-root-quota.

3.1 Introducing Multi QuotaTree

In large clusters, there are various types of nodes. For example, VMs provided by cloud vendors come in different architectures, most commonly amd64 and arm64, and there are different models within the same architecture. In addition, nodes generally have location attributes such as availability zone. When nodes of different types are managed in the same quota tree, their unique attributes are lost, so when users want to manage machines by these attributes in a refined manner, the current ElasticQuota is not precise enough. To meet users' requirements for flexible resource management or resource isolation, Koordinator allows users to divide the resources in the cluster into multiple parts, each managed by its own quota tree, as shown in the following figure:

Additionally, to help users simplify management complexity, Koordinator introduced the ElasticQuotaProfile mechanism in version 1.4.0. Users can quickly associate nodes with different quota trees through the nodeSelector, as shown in the following example:

apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: amd64
  name: amd64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: amd64 # amd64 node
  quotaName: amd64-root-quota   # the name of root quota
---
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: arm64   
  name: arm64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: arm64  # arm64 node
  quotaName: arm64-root-quota    # the name of root quota

After associating nodes with the quota tree, the user utilizes the ElasticQuota in each quota tree as before. When a user submits a pod to the corresponding quota, they currently still need to configure the pod's NodeAffinity to ensure that the pod runs on the correct node. In the future, we plan to add a feature that will help users automatically manage the mapping relationship from quota to node.
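
The association between nodes and quota trees can be sketched as a simple label match. The dictionaries below mirror the `nodeSelector.matchLabels` and `quotaName` fields of the profiles above; the function itself is hypothetical and only illustrates the matching rule, not the controller's implementation.

```python
def quota_tree_for_node(node_labels: dict, profiles: list):
    """Return the root quota name of the first profile whose matchLabels all match."""
    for profile in profiles:
        selector = profile["nodeSelector"]["matchLabels"]
        if all(node_labels.get(key) == value for key, value in selector.items()):
            return profile["quotaName"]
    return None  # the node belongs to no profile


profiles = [
    {"nodeSelector": {"matchLabels": {"kubernetes.io/arch": "amd64"}},
     "quotaName": "amd64-root-quota"},
    {"nodeSelector": {"matchLabels": {"kubernetes.io/arch": "arm64"}},
     "quotaName": "arm64-root-quota"},
]

print(quota_tree_for_node({"kubernetes.io/arch": "arm64"}, profiles))  # arm64-root-quota
```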

3.2 Support non-preemptible

Koordinator ElasticQuota supports sharing the unused part of 'Min' in ElasticQuota with other ElasticQuotas to improve resource utilization. However, when resources are tight, the pod that borrows the quota will be preempted and evicted through the preemption mechanism to get the resources back.

In actual production environments, if some critical online services borrow this part of the quota from other ElasticQuotas and preemption subsequently occurs, the quality of service may be adversely affected. Such workloads should not be subject to preemption.

To implement this safeguard, Koordinator v1.4.0 introduced a new API. Users can simply add the label quota.scheduling.koordinator.sh/preemptible: false to a pod to indicate that the pod should not be preempted.

When the scheduler detects that a pod is declared non-preemptible, it ensures that the available quota for such a pod does not exceed its 'Min'. Thus, it is important to note that when enabling this feature, the 'Min' of an ElasticQuota should be set judiciously, and the cluster must have appropriate resource guarantees in place. This feature maintains compatibility with the original behavior of Koordinator.

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
    quota.scheduling.koordinator.sh/preemptible: "false"
spec:
...
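
The admission rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Quota class, its field names, and the admit function are hypothetical, and checking borrowing pods against 'Max' is a simplification rather than the scheduler's actual runtime-quota logic.

```python
from dataclasses import dataclass, field


@dataclass
class Quota:
    """Hypothetical, simplified view of one ElasticQuota."""
    min: dict                      # guaranteed resources, e.g. {"cpu": 4}
    max: dict                      # upper bound, e.g. {"cpu": 10}
    used: dict = field(default_factory=dict)                  # all pods
    non_preemptible_used: dict = field(default_factory=dict)  # non-preemptible pods only


def admit(quota: Quota, request: dict, preemptible: bool) -> bool:
    """Return True if the pod request fits the quota.

    Non-preemptible pods may only consume up to Min; preemptible pods are
    checked against Max here (an assumption made for brevity).
    """
    limit = quota.max if preemptible else quota.min
    used = quota.used if preemptible else quota.non_preemptible_used
    return all(used.get(res, 0) + amount <= limit.get(res, 0)
               for res, amount in request.items())
```

With Min set to 4 CPUs and 3 CPUs already held by non-preemptible pods, a non-preemptible request for 2 more CPUs is rejected even though the quota as a whole could still borrow up to Max.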

3.3 Other improvements

The koord-scheduler previously supported the use of a single ElasticQuota object across multiple namespaces. However, in some cases, it is desirable for the same object to be shared by only a select few namespaces. To accommodate this need, users can now annotate the ElasticQuota CR with quota.scheduling.koordinator.sh/namespaces, assigning a JSON string array as the value.
Performance optimization: Previously, whenever an ElasticQuota was modified, the ElasticQuota plugin would rebuild the entire quota tree. This process has been optimized in version 1.4.0.
Support ignoring overhead: When a pod utilizes secure containers, an overhead declaration is typically added to the pod specification to account for the resource consumption of the secure container itself. However, whether these additional resource costs should be passed on to the end user depends on the resource pricing strategy. If it is expected that users should not be responsible for these costs, the ElasticQuota can be configured to disregard overhead. With version 1.4.0, this can be achieved by enabling the feature gate ElasticQuotaIgnorePodOverhead.

4. CPU normalization

With the diversification of node hardware in Kubernetes clusters, significant performance differences exist between CPUs of various architectures and generations. Therefore, even if a pod's CPU request is identical, the actual computing power it receives can vary greatly, potentially leading to resource waste or diminished application performance. The objective of CPU normalization is to ensure that each CPU unit in Kubernetes provides consistent computing power across heterogeneous nodes by standardizing the performance of allocatable CPUs.

To address this issue, Koordinator has implemented a CPU normalization mechanism in version 1.4.0. This mechanism adjusts the amount of CPU resources that can be allocated on a node according to the node's resource amplification strategy, ensuring that each allocatable CPU in the cluster delivers a consistent level of computing power. The overall architecture is depicted in the figure below:

CPU normalization consists of two steps:

CPU performance evaluation: To calculate the performance benchmarks of different CPUs, you can refer to the industrial performance evaluation standard, SPEC CPU. This part is not provided by the Koordinator project.
Configuration of the CPU normalization ratio in Koordinator: The scheduling system schedules resources based on the normalization ratio, which is provided by Koordinator.

Configure the CPU normalization ratio information into slo-controller-config of koord-manager. The configuration example is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-normalization-config: |
    {
      "enable": true,
      "ratioModel": {
         "Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz": {
           "baseRatio": 1.29,
           "hyperThreadEnabledRatio": 0.82,
           "turboEnabledRatio": 1.52,
           "hyperThreadTurboEnabledRatio": 1.0
         },
         "Intel Xeon Platinum 8369B CPU @ 2.90GHz": {
           "baseRatio": 1.69,
           "hyperThreadEnabledRatio": 1.06,
           "turboEnabledRatio": 1.91,
           "hyperThreadTurboEnabledRatio": 1.20
         }
      }
    }
  # ...

For nodes configured with CPU normalization, Koordinator intercepts updates to Node.Status.Allocatable by kubelet through a webhook to achieve the amplification of CPU resources. This results in the display of the normalized amount of CPU resources available for allocation on the node.
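
To make the amplification concrete, the following Python sketch picks a ratio from a ratioModel entry like the one above and scales a node's CPU. The function names, the way the hyper-threading and turbo flags select a ratio, and the rounding are assumptions for illustration, not Koordinator's implementation.

```python
# One entry copied from the example configuration above.
RATIO_MODEL = {
    "Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz": {
        "baseRatio": 1.29,
        "hyperThreadEnabledRatio": 0.82,
        "turboEnabledRatio": 1.52,
        "hyperThreadTurboEnabledRatio": 1.0,
    },
}


def pick_ratio(model: dict, hyper_thread: bool, turbo: bool) -> float:
    """Select the ratio matching the node's feature flags (assumed mapping)."""
    if hyper_thread and turbo:
        return model["hyperThreadTurboEnabledRatio"]
    if hyper_thread:
        return model["hyperThreadEnabledRatio"]
    if turbo:
        return model["turboEnabledRatio"]
    return model["baseRatio"]


def normalized_cpu_milli(raw_milli: int, ratio: float) -> int:
    """Scale raw allocatable CPU (milli-cores) by the normalization ratio."""
    return int(round(raw_milli * ratio))
```

For example, a 32-core node of this model with hyper-threading enabled and turbo disabled would expose 32000 * 0.82 milli-cores under these assumptions.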

5. Improved descheduling protection strategies

Pod migration is a complex process that involves steps such as auditing, resource allocation, and application startup. It is often intertwined with application upgrades, scaling scenarios, and the resource operations and maintenance performed by cluster administrators. Consequently, if a large number of pods are migrated simultaneously, the system's stability may be compromised. Furthermore, migrating many pods from the same workload at once can also affect the application's stability. Additionally, simultaneous migrations of pods from multiple jobs may lead to a 'thundering herd' effect. Therefore, it is preferable to process the pods in each job sequentially.

To address these issues, Koordinator previously provided the PodMigrationJob function with some protection strategies. In version v1.4.0, Koordinator has enhanced these protection strategies into an arbitration mechanism. When there is a large number of executable PodMigrationJobs, the arbiter decides which ones can proceed by employing sorting and filtering techniques.

The sorting process is as follows:

Elapsed time since the migration started: the shorter the interval, the higher the ranking.
Pod priority of the PodMigrationJob: the lower the priority, the higher the ranking.
Workload grouping: PodMigrationJobs are dispersed by workload, and those that belong to the same job are kept close together.
Ongoing migration: if some pods in the job that contains the PodMigrationJob's pod are already being migrated, the PodMigrationJob ranks higher.

The filtering process is as follows:

Group and filter PodMigrationJobs based on workload, node, namespace, etc.
Check the number of running PodMigrationJobs in each workload, and exclude those that reach a certain threshold.
Check whether the number of unavailable replicas in each workload exceeds the maximum number of unavailable replicas, and those that exceed the number will be excluded.
Check whether the number of pods being migrated on the node where the target pod is located exceeds the maximum migration amount of a single node, and those that exceed will be excluded.
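
The sort-then-filter flow can be sketched as follows. This is a minimal Python illustration, not the arbiter's code: the MigrationJob fields, the lexicographic sort key, and the two limits are assumptions, and the unavailable-replica check is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MigrationJob:
    """Hypothetical, simplified PodMigrationJob."""
    name: str
    workload: str
    node: str
    pod_priority: int
    seconds_since_start: float
    workload_is_migrating: bool = False


def sort_jobs(jobs):
    """Order jobs by the listed criteria, treated here as one sort key."""
    return sorted(jobs, key=lambda j: (j.seconds_since_start,
                                       j.pod_priority,
                                       j.workload,
                                       not j.workload_is_migrating))


def arbitrate(jobs, max_per_workload, max_per_node):
    """Return the jobs allowed to proceed after per-workload and per-node limits."""
    per_workload, per_node = Counter(), Counter()
    allowed = []
    for job in sort_jobs(jobs):
        if per_workload[job.workload] >= max_per_workload:
            continue  # too many migrations in this workload
        if per_node[job.node] >= max_per_node:
            continue  # too many migrations on this node
        per_workload[job.workload] += 1
        per_node[job.node] += 1
        allowed.append(job)
    return allowed
```

Jobs that are filtered out are not dropped permanently in this sketch; a caller would simply re-run the arbitration in the next round.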

6. Cold Memory reporting

To improve system performance, the kernel generally tries not to free the page cache requested by an application but allocates as much as possible to the application. Although allocated by the kernel, this memory may no longer be accessed by applications and is referred to as cold memory.

Koordinator introduced the cold memory reporting function in version 1.4, primarily to lay the groundwork for future cold memory recycling capabilities. Cold memory recycling is designed to address two scenarios:

In standard Kubernetes clusters, when the node memory level is too high, sudden memory requests can lead to direct memory recycling of the system. This can affect the performance of running containers and, in extreme cases, may result in out-of-memory (OOM) events if recycling is not timely. Therefore, maintaining a relatively free pool of node memory resources is crucial for runtime stability.
In co-location scenarios, high-priority applications' unused requested resources can be recycled by lower-priority applications. Since memory not reclaimed by the operating system is invisible to the Koordinator scheduling system, reclaiming unused memory pages of a container is beneficial for improving resource utilization.

Koordlet has added a cold page collector to its collector plugins for reading the cgroup file memory.idle_stat, which is exported by kidled (Anolis kernel), kstaled (Google), or DAMON (Amazon). This file contains information about cold pages in the page cache and is present at every hierarchy level of memory. Koordlet already supports the kidled cold page collector and provides interfaces for other cold page collectors.

After collecting cold page information, the cold page collector stores the metrics, such as hot page usage and cold page size for nodes, pods, and containers, into metriccache. This data is then reported to the NodeMetric Custom Resource (CR).

Users can enable cold memory recycling and configure cold memory collection strategies through NodeMetric. Currently, three strategies are offered: usageWithHotPageCache, usageWithoutPageCache and usageWithPageCache. For more details, please see the community Design Document.

7. QoS management for non-containerized applications

In the process of enterprise containerization, there may be non-containerized applications running on the host alongside those already running on Kubernetes. In order to be better compatible with enterprises in the containerization process, Koordinator has developed a node resource reservation mechanism. This mechanism can reserve resources and assign specific QoS (Quality of Service) levels to applications that have not yet been containerized. Unlike the resource reservation configuration provided by kubelet, Koordinator's primary goal is to address QoS issues that arise during the runtime of both non-containerized and containerized applications. The overall solution is depicted in the figure below:

Currently, applications need to start processes into the corresponding cgroup according to specifications, and Koordinator does not provide an automatic cgroup relocation tool. For host non-containerized applications, QoS is supported as follows:

LS (Latency Sensitive)
- CPU QoS (Group Identity): The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet sets the Group Identity parameter for it according to the CPU QoS configuration;
- CPUSet Allocation: The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet will set all CPU cores in the CPU share pool for it.
BE (Best-effort)
- CPU QoS (Group Identity): The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet sets the Group Identity parameter for it according to the configuration of CPU QoS.

For detailed design of QoS management of non-containerized applications on the host, please refer to Community Documentation. In the future, we will gradually add support for other QoS strategies for host non-containerized applications.

8. Other features

In addition to the new features and functional enhancements mentioned above, Koordinator has also implemented the following bug fixes and optimizations in version 1.4.0:

RequiredCPUBindPolicy: Fine-grained CPU orchestration now supports the configuration of the required CPU binding policy, which means that CPUs are allocated strictly in accordance with the specified CPU binding policy; otherwise, scheduling will fail.
CICD: In v1.4.0, the Koordinator community provides a set of e2e testing pipelines, and an ARM64 image is now available.
Batch resource calculation strategy optimization: There is support for the maxUsageRequest calculation strategy, which conservatively reclaims high-priority resources. This update also optimizes the underestimate of Batch allocatable when a large number of pods start and stop on a node in a short period of time and improves considerations for special circumstances such as host non-containerized application, third-party allocatable, and dangling pod usage.
Others: Optimizations include using libpfm4 and perf groups to improve CPI collection, allowing SystemResourceCollector to support customized expiration time configuration, enabling BE pods to calculate CPU satisfaction based on the evictByAllocatable policy, repairing koordlet's CPUSetAllocator filtering logic for pods with LS and None QoS, and enhancing RDT resource control to retrieve the task IDs of sandbox containers.

For a comprehensive list of new features in version 1.4.0, please visit the v1.4.0 Release page.

Future plan

In upcoming versions, Koordinator has planned the following features:

Core Scheduling: On the runtime side, Koordinator has begun exploring the next generation of CPU QoS capabilities. By leveraging kernel mechanisms such as Linux Core Scheduling, it aims to enhance resource isolation at the physical core level and reduce the security risks associated with co-location. For more details on this work, see Issue #1728.
Joint Allocation of Devices: In scenarios involving AI large model distributed training, GPUs from different machines often need to communicate through high-performance network cards. Performance is improved when GPUs and high-performance network cards are allocated in close proximity. Koordinator is advancing the joint allocation of multiple heterogeneous resources. Currently, it supports joint allocation in terms of protocol and scheduler logic; the reporting logic for network card resources on the node side is being explored.

For more information, see Milestone v1.5.0.

Conclusion

Finally, we are immensely grateful to all the contributors and users of the Koordinator community. Your active participation and valuable advice have enabled Koordinator to continue improving. We eagerly look forward to your ongoing feedback and warmly welcome new contributors to join our ranks.

Koordinator v1.3: Enhanced Resource Reservation, NRI Support, and Mid-Tier Resource Overcommitment Based on Node Profiling

August 16, 2023 · 12 min read

Rougang Han

Koordinator member

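Background

Koordinator is an open-source project that draws on Alibaba's years of experience in container scheduling to provide a complete co-location solution. It covers co-located workload orchestration, resource scheduling, resource isolation, and performance tuning, helping users optimize container performance, make full use of idle physical resources, improve resource efficiency, and enhance the running efficiency and reliability of latency-sensitive workloads and batch jobs.

We are pleased to announce the release of Koordinator v1.3.0. Since v0.1.0 was released in April 2022, Koordinator has shipped 11 versions and attracted contributions from many outstanding engineers at companies including Alibaba, Intel, Xiaomi, Xiaohongshu, iQIYI, 360, and Youzan. Koordinator v1.3.0 brings new features such as NRI (Node Resource Interface) support and Mid resource overcommitment, along with stability fixes, performance optimizations, and functional enhancements for resource reservation, load-aware scheduling, DeviceShare scheduling, load-aware descheduling, the scheduler framework, node-level metric collection, and the resource overcommitment framework.

In v1.3.0, 12 new developers joined the Koordinator community: @bowen-intel, @BUPT-wxq, @Gala-R, @haoyann, @kangclzjc, @Solomonwisdom, @stulzq, @TheBeatles1994, @Tiana2018, @VinceCui, @wenchezhao, and @zhouzijiang. We thank them for their active participation and contributions, and we thank all community members for their continued dedication.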

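Feature overview

Resource reservation enhancements

The resource reservation (Reservation) capability was proposed in v0.5.0 and has been refined over a year of iteration. v1.3.0 enhances the reservation mechanism for scenarios such as preemption, device reservation, and Coscheduling, and adds the allocatePolicy field to define different allocation policies for reserved resources. The latest resource reservation API is as follows: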

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # The template field describes the resource requirements and affinity of the Reservation object, just as you would when scheduling a pod.
  template:
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - cn-hangzhou-i
      schedulerName: koord-scheduler # use koord-scheduler to schedule the Reservation object.
  # Specify the owners that may allocate the reserved resources.
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo
  ttl: 1h
  # Whether the reserved resources can be allocated only once.
  allocateOnce: true
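  # Allocation policy for the reserved resources. The following policies are currently supported:
  # - Default: the default policy. It places no restriction on allocating reserved resources: a pod allocates from the reserved resources on the node first and, if they are insufficient, continues to allocate from the node's free resources.
  # - Aligned: a pod allocates from the reserved resources on the node first; if they are insufficient, it continues to allocate from the node's free resources, but that portion must satisfy the pod's requirements. This policy can be used to prevent a pod from allocating resources from multiple Reservations at the same time.
  # - Restricted: for each resource dimension included in the reserved resources, the pod must allocate from the reserved resources; other resource dimensions may be allocated from the node's free resources. This policy includes the semantics of Aligned.
  # Default cannot yet coexist with Aligned or Restricted on the same node.
  allocatePolicy: "Aligned"
  # Controls whether the reserved resources can be used.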
  unschedulable: false

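In addition, resource reservation in v1.3.0 includes the following compatibility and performance improvements:

Enhanced preemption for Reservations: pods inside a Reservation may preempt one another, while pods outside a Reservation are not allowed to preempt pods inside it.
Enhanced device reservation: if device resources on a node are partially reserved and used by pods, the remaining resources can still be allocated.
Reservations can now be used with Coscheduling.
Added the Reservation Affinity protocol, which lets users require that resources be allocated from a Reservation.
Improved Reservation compatibility by fixing an issue where Reservations caused native scoring plugins to stop working.
Fixed the scheduling performance regression introduced by Reservations.
Fixed an issue where ports reserved by a Reservation were deleted by mistake.

For the design of resource reservation, see Designs - Resource Reservation.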

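Other scheduling enhancements

In v1.3.0, Koordinator also includes the following scheduling and descheduling enhancements:

DeviceShare scheduling
- Changed how GPU resources are requested: when using the GPU Share API, koordinator.sh/gpu-memory or koordinator.sh/gpu-memory-ratio must be declared, while koordinator.sh/gpu-core may be omitted.
- Added scoring support, which can implement bin-packing or spread for both GPU Share and whole-card allocation, including bin-packing or spread at card granularity.
- Fixed an issue where accidentally deleting the Device CRD left the scheduler's internal state inconsistent and caused devices to be allocated more than once.
Load-aware scheduling: fixed the scheduling logic for pods that specify only requests.
Scheduler framework: optimized Patch operations in the PreBind stage by merging the Patch operations of multiple plugins into a single submission, which improves efficiency and reduces pressure on the APIServer.
Descheduling
- LowNodeLoad supports different load thresholds and parameters per node pool, and remains automatically compatible with existing configurations.
- Pods whose schedulerName is not koord-scheduler are skipped, and a different schedulerName can be configured.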

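NRI resource management mode

Koordinator's runtime hooks support two modes, standalone and CRI proxy, but each has limitations. Although standalone mode has been heavily optimized, obtaining more timely pod or container events, or injecting environment variables, still requires proxy mode. Proxy mode, however, requires deploying a separate koord-runtime-proxy component to proxy CRI (Container Runtime Interface) requests, as well as changing the kubelet startup parameters and restarting the kubelet.

NRI (Node Resource Interface) is a general framework for plugin extensions of CRI-compatible container runtimes. It is independent of any specific container runtime (e.g. containerd, cri-o) and provides interfaces for different lifecycle events, allowing users to add custom logic without modifying the container runtime's source code. In particular, NRI 2.0 needs only one running plugin instance to handle all NRI events and requests. The container runtime communicates with the plugin over a Unix domain socket using Protobuf-based protocol data, which offers higher performance than NRI 1.0 and makes stateful NRI plugins possible.

Introducing NRI makes it possible to subscribe to pod and container lifecycle events in a timely manner while avoiding intrusive changes to the kubelet. In Koordinator v1.3.0, we adopt NRI, the approach recommended by the community, to manage runtime hooks and resolve the problems of earlier versions. This greatly improves the flexibility of deploying Koordinator and the timeliness of its processing, and provides an elegant, standardized resource management pattern for cloud-native systems.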

nri

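Note: NRI mode does not support the docker container runtime. Users of docker should continue to use standalone mode or proxy mode.

For how to deploy Koordinator with NRI enabled, see Installation - Enable NRI Mode Resource Management.

Node profiling and Mid resource overcommitment

Koordinator divides node resources into four resource priority models: Prod, Mid, Batch, and Free. Lower-priority resources can reuse physical resources that higher priorities have been allocated but are not using, which improves physical resource utilization. At the same time, the higher the resource priority, the more stable the resources provided. For example, Batch resources come from overcommitted resources that higher priorities have been allocated but have not used in the short term, whereas Mid resources come from overcommitted resources that higher priorities have been allocated but have not used over the long term. The resource priority models are shown in the figure below: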

resource-priority-model

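Koordinator v1.3.0 adds a node profiling capability that predicts peaks from the historical resource usage of Prod in order to support Mid-tier overcommitted scheduling. Mid resources are calculated as follows: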

MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)
ProdReclaimable := max(0, ProdAllocated - ProdPeak * (1 + safeMargin))

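ProdPeak: the peak usage, estimated through node profiling, of the Prod pods already scheduled on the node over a medium-to-long period (e.g. 12h).
ProdReclaimable: the Prod resources estimated, based on the node profiling result, to be reusable over a medium-to-long period.
MidAllocatable: the Mid resources allocatable on the node.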
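
As a worked illustration of the two formulas above, here is a minimal Python sketch. The function and parameter names are chosen for this example, and the numbers in the usage note are made up.

```python
def prod_reclaimable(prod_allocated: float, prod_peak: float, safe_margin: float) -> float:
    """ProdReclaimable := max(0, ProdAllocated - ProdPeak * (1 + safeMargin))."""
    return max(0.0, prod_allocated - prod_peak * (1 + safe_margin))


def mid_allocatable(node_allocatable: float, prod_allocated: float,
                    prod_peak: float, safe_margin: float,
                    threshold_ratio: float) -> float:
    """MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)."""
    reclaimable = prod_reclaimable(prod_allocated, prod_peak, safe_margin)
    return min(reclaimable, node_allocatable * threshold_ratio)
```

For a node with 32000 milli-cores allocatable, 24000 allocated to Prod, a predicted Prod peak of 10000, a 10% safety margin, and a threshold ratio of 0.25, roughly 13000 milli-cores are reclaimable, but the threshold caps the Mid allocatable amount at 8000.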

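In addition, node-level isolation guarantees for Mid resources will be completed in the next version; follow Issue #1442 for updates. In v1.3.0, users can view and request Mid-tier overcommitted resources, and can observe node profiling trends through the following Prometheus metrics.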

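# View the node's CPU profile. The reclaimable metric is the predicted amount of reclaimable resources; predictor identifies the prediction model.
koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="cpu", unit="core"}
# View the node's memory profile. The reclaimable metric is the predicted amount of reclaimable resources; predictor identifies the prediction model.
koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="memory", unit="byte"}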

$ kubectl get node test-node -o yaml
apiVersion: v1
kind: Node
metadata:
  name: test-node
status:
  # ...
  allocatable:
    cpu: '32'
    memory: 129636240Ki
    pods: '110'
    kubernetes.io/mid-cpu: '16000' # allocatable cpu milli-cores for Mid-tier pods
    kubernetes.io/mid-memory: 64818120Ki # allocatable memory bytes for Mid-tier pods
  capacity:
    cpu: '32'
    memory: 129636240Ki
    pods: '110'
    kubernetes.io/mid-cpu: '16000'
    kubernetes.io/mid-memory: 64818120Ki

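For the design of Koordinator node profiling, see Design - Node Prediction.

Other features

See the v1.3.0 Release page for more of the new features included in v1.3.0.

Future plans

Koordinator currently plans the following features for upcoming versions:

Hardware topology-aware scheduling, which considers the topology of multiple resource dimensions on a node, such as CPU, memory, and GPU, to optimize scheduling across the cluster.
An amplification mechanism for node allocatable resources.
Improvements and enhancements to the NRI resource management mode.

For more information, see Milestone v1.4.0.

Conclusion

Finally, Koordinator is an open community. Cloud-native enthusiasts are welcome to get involved in any way at any time. Whether you are new to cloud native or an experienced practitioner, we look forward to hearing from you!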

Koordinator v1.2: Node Resource Reservation and Compatibility with Community Descheduling Policies

April 7, 2023 · 13 min read

Zuowei Zhang

Koordinator maintainer

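Background

Koordinator is an open-source project incubated from Alibaba's years of accumulated experience in container scheduling. It improves container performance and reduces cluster resource costs. Through co-location, resource profiling, scheduling optimization, and related techniques, it improves the running efficiency and reliability of latency-sensitive workloads and batch jobs and optimizes cluster resource usage.

Since its release in April 2022, Koordinator has shipped 10 versions and attracted contributions from many outstanding engineers at Alibaba, Xiaomi, Xiaohongshu, iQIYI, 360, Youzan, and others. With the arrival of spring 2023, Koordinator also celebrates its first anniversary, and we are pleased to announce the official release of Koordinator v1.2. The new version supports node resource reservation, is compatible with the descheduling policies of the Kubernetes community, and adds node-side support for L3 cache and memory bandwidth isolation on AMD platforms.

In the new version, 12 new developers joined the Koordinator community: @Re-Grh, @chengweiv5, @kingeasternsun, @shelwinnn, @yuexian1234, @Syulin7, @tzzcfrank, @Dengerwei, @complone, @AlbeeSo, @xigang, and @leason00. We thank them for their contributions and participation.

Feature overview

Node resource reservation

Co-location scenarios contain many kinds of applications. Besides cloud-native containers, there are many applications that have not yet been containerized, and these run as processes on the host alongside Kubernetes containers. To reduce node-side resource contention between Kubernetes applications and other types of applications, Koordinator supports reserving a portion of resources so that they take part in neither the scheduler's resource scheduling nor node-side resource allocation, thereby separating resource usage. In v1.2, Koordinator supports reservation in the CPU and memory dimensions and allows the reserved CPU IDs to be specified directly, as described below.

Declaring node resource reservation

The amount of resources to reserve, or the specific CPU IDs, can be configured on the Node. For example: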

apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # five specific cores are selected, e.g. 0, 1, 2, 3, 4, and then reserved.
    node.koordinator.sh/reservation: '{"resources":{"cpu":"5"}}'
---
apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # the cores 0, 1, 2, 3 will be reserved.
    node.koordinator.sh/reservation: '{"reservedCPUs":"0-3"}'

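When koordlet, the node-side component, reports node resource topology information, it writes the specific reserved CPU IDs into an annotation of the NodeResourceTopology object.

Adapting scheduling and descheduling

During resource allocation, the scheduler performs resource checks in many situations, including quota management, node capacity checks, and CPU topology checks. All of these must now take node reserved resources into account. For example, when calculating node CPU capacity, the scheduler must deduct the resources reserved on the node.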

cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)

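In addition, the calculation of Batch co-location overcommitted resources must also deduct these resources. Because the node also includes the resource consumption of some system processes, koord-manager takes the maximum of the node reservation and the system usage in its calculation: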

reserveRatio = (100-thresholdPercent) / 100.0
node.reserved = node.alloc * reserveRatio
system.used = max(node.used - pod.used, node.anno.reserved)
Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used
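
The four lines above can be read as a single function. The Python sketch below mirrors them term by term; the function and parameter names are illustrative, and clamping the result at zero is an added assumption rather than something the formula states.

```python
def batch_allocatable(node_alloc: float, node_used: float, pod_used: float,
                      ls_pod_used: float, anno_reserved: float,
                      threshold_percent: float) -> float:
    """Compute Node(BE).Alloc from the formula above (all values in one unit)."""
    reserve_ratio = (100 - threshold_percent) / 100.0
    node_reserved = node_alloc * reserve_ratio
    # System usage is the larger of measured non-pod usage and the annotated reservation.
    system_used = max(node_used - pod_used, anno_reserved)
    # Clamp at zero (assumption): a negative allocatable amount is not meaningful.
    return max(0.0, node_alloc - node_reserved - system_used - ls_pod_used)
```

With 32 cores allocatable, a threshold of 50%, 2 cores of measured system usage, 4 cores reserved by annotation, and 8 cores used by LS pods, the reservation wins the max and the Batch allocatable amount is 32 - 16 - 4 - 8 = 4 cores.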

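For descheduling, each plugin policy needs to be aware of the node's reserved resources when computing node capacity, utilization, and so on. In addition, if containers already occupy the node's reserved resources, descheduling needs to consider evicting them so that node capacity is managed correctly and resource contention is avoided. We will support these descheduling features in later versions, and contributors are welcome to join in.

Node-side resource management

For LS pods, koordlet dynamically calculates the shared CPU pool according to CPU allocation and excludes the CPU cores reserved on the node, ensuring that LS pods are isolated from other non-containerized processes. Node-side QoS policies also take the reserved amount into account; for example, the CPUSuppress policy does so when calculating node utilization.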

suppress(BE) := node.Total * SLOPercent - pod(LS).Used - max(system.Used, node.anno.reserved)
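
A small numeric sketch of this formula in Python follows. The function name is illustrative, and treating SLOPercent as a value between 0 and 100 is an assumption of this example.

```python
def be_suppress_cpu(node_total: float, slo_percent: float, ls_pod_used: float,
                    system_used: float, anno_reserved: float) -> float:
    """CPU left for BE pods: node.Total * SLOPercent - pod(LS).Used - max(system.Used, reserved)."""
    return node_total * slo_percent / 100.0 - ls_pod_used - max(system_used, anno_reserved)
```

For a 32-core node with SLOPercent at 50, LS pods using 6 cores, 2 cores of system usage, and 4 cores reserved by annotation, BE pods are limited to 16 - 6 - 4 = 6 cores.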

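For details on node resource reservation, see the design document.

Compatibility with community descheduling policies

Thanks to the growing maturity of the Koordinator Descheduler framework, Koordinator v1.2 introduces an interface adaptation mechanism that makes it seamlessly compatible with the existing plugins of the Kubernetes Descheduler. You only need to deploy Koordinator Descheduler to use all of the upstream functionality.

In terms of implementation, Koordinator Descheduler imports the upstream code without any intrusive changes, which guarantees full compatibility with all upstream plugins, parameter configurations, and running policies. Koordinator also allows users to specify an enhanced evictor for upstream plugins, so that they can reuse the safety policies Koordinator provides, such as resource reservation, workload availability guarantees, and global rate limiting.

The compatible plugins include: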

HighNodeUtilization
LowNodeUtilization
PodLifeTime
RemoveFailedPods
RemoveDuplicates
RemovePodsHavingTooManyRestarts
RemovePodsViolatingInterPodAntiAffinity
RemovePodsViolatingNodeAffinity
RemovePodsViolatingNodeTaints
RemovePodsViolatingTopologySpreadConstraint
DefaultEvictor

The plugins can be configured as follows, taking RemovePodsHavingTooManyRestarts as an example:

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
clientConnection:
  kubeconfig: "/Users/joseph/asi/koord-2/admin.kubeconfig"
leaderElection:
  leaderElect: false
  resourceName: test-descheduler
  resourceNamespace: kube-system
deschedulingInterval: 10s
dryRun: true
profiles:
- name: koord-descheduler
  plugins:
    evict:
      enabled:
        - name: MigrationController
    deschedule:
      enabled:
        - name: RemovePodsHavingTooManyRestarts
  pluginConfig:
    - name: RemovePodsHavingTooManyRestarts
      args:
        apiVersion: descheduler/v1alpha2
        kind: RemovePodsHavingTooManyRestartsArgs
        podRestartThreshold: 10

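Enhanced resource reservation scheduling

Koordinator introduced the Reservation mechanism in an early version. By reserving resources and making them available to pods with specified characteristics, it helps make resource delivery deterministic. For example, in descheduling, an evicted pod is expected to have resources available rather than cause stability problems because none are left after eviction. Or, when scaling out, some PaaS platforms want to confirm that the resources for scheduling and orchestrating the application can be satisfied before deciding whether to scale out, or want to make preparations in advance.

Koordinator Reservation is defined through a CRD. Each Reservation object is disguised as a pod inside koord-scheduler for scheduling; such a pod is called a Reserve Pod. The Reserve Pod can reuse the existing scheduling and scoring plugins to find a suitable node, and finally occupies the corresponding resources in the scheduler's internal state. When a Reservation is created, it specifies which pods the reserved resources are for: a specific pod, certain workload objects, or pods with certain labels. When these pods are scheduled by koord-scheduler, the scheduler finds the Reservation objects that the pod can use and uses the Reservation's resources first. The Reservation Status records which pod used it, and the pod's annotations record which Reservation was used. After a Reservation is used, its internal state is cleaned up automatically so that other pods are not left unschedulable because of the Reservation.

Koordinator v1.2 brings substantial improvements. First, we removed the restriction that a pod could use only the resources held by a Reservation. A pod may now cross the Reservation's resource boundary and use both the resources reserved by the Reservation and the remaining resources on the node. We also extended the Kubernetes Scheduler Framework in a non-intrusive way to support reserving fine-grained resources, such as CPU cores and GPU devices. Finally, we changed the default behavior, under which a Reservation could be reused, to AllocateOnce: once a Reservation is used by a pod, it is discarded. We made this change because AllocateOnce covers most scenarios, so making it the default is simpler for users.

L3 cache and memory bandwidth isolation on AMD platforms

Koordinator has supported L3 cache and memory bandwidth isolation on Intel platforms since v0.3.0, and v1.2.0 adds support for AMD platforms. The Linux kernel exposes L3 cache and memory bandwidth isolation through the unified resctrl interface, which supports both Intel and AMD. The main difference is that Intel's memory bandwidth isolation interface uses a percentage format, while AMD's uses an absolute value format, as shown below.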

# Intel Format
# resctrl schema
L3:0=3ff;1=3ff
MB:0=100;1=100

# AMD Format
# resctrl schema
L3:0=ffff;1=ffff;2=ffff;3=ffff;4=ffff;5=ffff;6=ffff;7=ffff;8=ffff;9=ffff;10=ffff;11=ffff;12=ffff;13=ffff;14=ffff;15=ffff
MB:0=2048;1=2048;2=2048;3=2048;4=2048;5=2048;6=2048;7=2048;8=2048;9=2048;10=2048;11=2048;12=2048;13=2048;14=2048;15=2048

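The interface format has two parts. L3 is the set of "ways" available to the corresponding socket or CCD, expressed in hexadecimal, with each bit representing one way. MB is the memory bandwidth range available to the corresponding socket or CCD: Intel accepts a percentage from 0 to 100, while AMD uses an absolute value in Gb/s, where 2048 means unlimited. Koordinator provides a unified percentage-based interface and automatically detects whether the node is an AMD platform to decide which format to write into the resctrl interface.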

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
           "resctrlQOS": {
             "enable": true,
             "catRangeStartPercent": 0,
             "catRangeEndPercent": 100,
             "MBAPercent": 100
           }
         },
        "beClass": {
           "resctrlQOS": {
             "enable": true,
             "catRangeStartPercent": 0,
             "catRangeEndPercent": 30,
             "MBAPercent": 100
           }
         }
      }
    }
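
To show how a percentage range such as catRangeStartPercent and catRangeEndPercent could map onto an L3 way mask, here is a minimal Python sketch. The function name and the round-up rule are assumptions for illustration and may differ from what koordlet actually does; the only hard requirement reflected here is that the resulting mask is a contiguous run of bits.

```python
def cat_l3_mask(full_mask: int, start_percent: int, end_percent: int) -> str:
    """Map a percentage range onto a contiguous L3 way mask (hex string).

    full_mask is the platform's full capacity bitmask, e.g. 0x3ff for 10 ways.
    """
    ways = bin(full_mask).count("1")
    # Integer ceiling of ways * percent / 100 (assumed rounding rule).
    start_way = (ways * start_percent + 99) // 100
    end_way = (ways * end_percent + 99) // 100
    if end_way <= start_way:
        raise ValueError("percent range selects no cache ways")
    # Bits [start_way, end_way) set, all others cleared.
    return format((1 << end_way) - (1 << start_way), "x")
```

Under these assumptions, the beClass range 0 to 30 on a 10-way cache (full mask 3ff) yields the mask 7, and the lsClass range 0 to 100 yields 3ff.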

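Other features

See the v1.2 release page for more of the new features included in this version.

Future plans

Koordinator plans to focus on the following features in upcoming versions:

Hardware topology-aware scheduling, which considers the topology of multiple resource dimensions on a node, such as CPU, memory, and GPU, to optimize scheduling across the cluster.
Enhanced observability and traceability for the descheduler.
Enhanced GPU resource scheduling.

Koordinator is an open community, and cloud-native enthusiasts are very welcome to get involved in any way. Whether you are new to cloud native or an experienced practitioner, we look forward to hearing from you!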

How the OpenAnolis plugsched Tool Powers Koordinator Cloud-Native Node-Level Co-location: Kernel CPU QoS Explained

February 28, 2023 · 10 min read

Erwei Deng

Openanolis developer

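What is CPU co-location

CPU co-location means deploying different types of workloads on the same machine so that they share its CPU resources, raising CPU utilization and lowering the cost of purchasing and operating machines. Some tasks are very sensitive to latency, such as e-commerce, search, or web services. They have strict real-time requirements but usually do not consume many resources; we call them online tasks. Other tasks focus on computation or batch processing. They have no latency requirements but consume relatively more resources; we call them offline tasks.

When both types of tasks are deployed on the same machine, offline tasks occupy more resources, and the resulting contention significantly increases the latency of online tasks. Moreover, on machines with hyper-threading, even if offline and online tasks run on different hyper-threads, contention for the pipeline and cache still affects online tasks. CPU co-location technology emerged to eliminate the impact of offline tasks on online task latency while further improving CPU utilization.

Figure 1: CPU utilization with node-level co-location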

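Kernel CPU co-location technology

CPU co-location is implemented mainly by the operating system scheduler on each node, which decides how much CPU a task receives based on its type. The node operating system distributions mainly used in the Koordinator community are Alibaba Cloud Linux 2/3 (Alinux2/3 for short) and CentOS 7.9. Alinux2/3 uses the OpenAnolis community's Group Identity CPU co-location technology, which provides CPU co-location in the OS kernel. Group Identity adds another run queue to the original CFS scheduler to distinguish online from offline tasks and, to avoid interference from offline tasks on the sibling CPU (on hyper-threaded architectures), evicts those offline tasks. Group Identity has been validated in large events such as Alibaba's Double 11 and in large-scale commercial use, and its CPU co-location capability is recognized by many users and developers.

CentOS, however, has not provided any CPU co-location technology or capability so far. There are a couple of possible solutions to this gap:

Build a CentOS derivative that includes CPU co-location technology;
Migrate to the Alibaba Cloud Linux 2/3 distribution.

The first option requires downloading the kernel source from a CentOS mirror, porting the CPU co-location technology to the kernel, compiling and installing it, and then rebooting the system. This involves workload migration and downtime, which inevitably comes at a high cost to the business. The second option also requires some migration work, but Alinux2/3 and Anolis OS include a complete co-location resource isolation solution (CPU co-location is only one part of it), and the benefits of the technology far outweigh the migration cost. In addition, CentOS is approaching end of life. To address this, the OpenAnolis community released the Anolis OS distribution, which is fully compatible with CentOS and allows seamless migration.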

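OpenAnolis CPU co-location plugin

To fill the CPU co-location gap on CentOS for Koordinator, OpenAnolis developers offer another solution: a scheduler plugin package that delivers CPU co-location through plugsched scheduler live-upgrade technology. The plugin contains Alibaba Cloud's early (2017) CPU co-location technology, bvt + noise clean, which uses a throttle mechanism. When the scheduler picks the next task, it checks the type of the task on the sibling CPU and the type of the task running on the current CPU. If online and offline tasks are both present, it throttles the offline task and then picks the next task to schedule, ensuring that online tasks run first and are not disturbed by offline tasks on the sibling CPU. The plugin can be installed directly on CentOS 7.9 without downtime or workload migration.

The plugsched SDK

plugsched is the scheduler live-upgrade development tool released by the OpenAnolis community. It decouples the scheduler from the Linux kernel into an independent module. CPU co-location technology can then be ported to this scheduler module to form a scheduler plugin, which is installed directly into a running system. plugsched can dynamically add, remove, and modify kernel scheduler features to meet business needs, without workload migration or downtime upgrades, and changes can be rolled back. Kernel developers can use the plugsched SDK to produce all kinds of scheduler plugins for different business scenarios.

The plugsched paper, "Efficient Scheduler Live Update for Linux Kernel with Modularization", has been accepted at ASPLOS. It describes the principles and practical value of plugsched in detail, along with comprehensive testing and evaluation. Plugins produced with plugsched are deployed at scale at Ant Group, Alibaba Cloud, and a large Chinese internet company.

plugsched is open source at: https://gitee.com/anolis/plugsched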

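Testing the CPU co-location plugin

Developers tested CPU co-location with this scheduler plugin. The server was configured as follows:

Test machine: Alibaba Cloud X-Dragon bare metal server, 104 CPUs, 384 GB memory
System configuration: CentOS 7.9, kernel 3.10, with the CPU co-location scheduler plugin installed
Workloads: the online task is an Nginx service in a container with 80 CPUs and 10 GB of memory, running 80 Nginx workers; the offline task is ffmpeg video transcoding in a container with 50 CPUs and 20 GB of memory, running 50 threads.
Test cases:
- Baseline: start only the Nginx container
- Control group: start the Nginx and ffmpeg containers together without setting priorities (co-location disabled)
- Experimental group: start the Nginx and ffmpeg containers together, with Nginx set to online high priority and ffmpeg to offline low priority (co-location enabled)

On a separate load-testing machine, wrk was used to send requests to the Nginx service. The results are as follows (unit: ms):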

	Baseline	Control group	Experimental group
RT-P50	0.223	0.245 (+9.86%)	0.224 (+0.44%)
RT-P75	0.322	0.387 (+20.18%)	0.338 (+4.96%)
RT-P90	0.444	0.575 (+29.50%)	0.504 (+13.51%)
RT-P99	0.706	1.7 (+140.79%)	0.88 (+24.64%)
CPU%	25.15%	71.7%	49.15%

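These results show that without the CPU co-location plugin, offline tasks have a large impact on online tasks: P99 latency more than doubles. With the plugin installed, the impact on P99 tail latency drops significantly, and CPU utilization is close to 50%.

Although the plugin significantly reduces the interference of offline tasks on online tasks, it still falls short of the OpenAnolis Group Identity technology. Group Identity keeps the interference on online tasks below 5% and raises whole-machine utilization further, to above 60% (see the Koordinator co-location best practices manual). The reasons for the difference are: 1) the kernels differ: CentOS 7.9 uses the older 3.10 kernel, while Anolis uses 4.19/5.10 kernels, and the 3.10 scheduler itself performs worse than 4.19/5.10; 2) the implementation of Group Identity is better suited to CPU co-location than noise clean.

Conclusion

Finally, we welcome engineers, open-source enthusiasts, and readers to join the Koordinator and OpenAnolis communities and benefit from the technology they offer. Whether it is Group Identity or plugsched, both will bring considerable benefits and value. We invite everyone to build the community together and to share, grow, and develop with it.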

Koordinator v1.1: Load-Aware Scheduling and Interference Detection Metric Collection

January 3, 2023 · 17 min read

Siyu Wang

Koordinator maintainer

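Background

Koordinator aims to provide a complete solution for co-located workload orchestration, co-located resource scheduling, co-located resource isolation, and performance tuning. It helps users improve the runtime performance of latency-sensitive services, and it finds idle node resources and allocates them to the computing tasks that truly need them, improving overall resource utilization.

Since its release in April 2022, Koordinator has shipped 9 versions. Over more than half a year of development, the community has welcomed many outstanding engineers from Alibaba, Xiaomi, Xiaohongshu, iQIYI, 360, Youzan, and others, who have contributed numerous ideas, code, and scenarios and have helped the project mature.

Today, we are pleased to announce the official release of Koordinator v1.1, which includes load-aware scheduling and descheduling, cgroup v2 support, interference detection metric collection, and a series of other optimizations. The following sections explain these new features in depth.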

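In-depth look at the new features

Load-aware scheduling

Tracking and balancing load by workload type

In Koordinator v1.0 and earlier, load-aware scheduling provided basic utilization threshold filtering, which keeps heavily loaded nodes from deteriorating further and degrading workload runtime quality, and an estimation mechanism that prevents cold nodes from being overloaded. The existing load-aware scheduling solves many common problems, but as an optimization technique it still has many scenarios left to improve.

Current load-aware scheduling mainly balances load at the whole-machine level within the cluster, but some special cases can occur: a node runs many offline pods, which raises whole-machine utilization, while the overall utilization of online workloads stays low. If a new online pod arrives while cluster resources are tight, the following problems can arise:

The pod may fail to be scheduled to this node because whole-machine utilization exceeds the whole-machine safety threshold;
A node may have relatively low utilization but run only online applications. From the perspective of online applications its utilization is already high, yet under the current scheduling policy the pod is still scheduled there, so the node accumulates a large number of online applications and overall performance suffers.

In Koordinator v1.1, koord-scheduler is aware of workload types and applies different thresholds and policies when scheduling.

In the Filter stage, a new threshold setting, prodUsageThresholds, represents the safety threshold for online applications and is empty by default. If the pod being scheduled is of type Prod, koord-scheduler sums the utilization of all online applications from the current node's NodeMetric and filters out the node if the sum exceeds prodUsageThresholds. If the pod is an offline pod, or prodUsageThresholds is not configured, the original logic applies and whole-machine utilization is used.

In the Score stage, a new switch, scoreAccordingProdUsage, controls whether scoring balances by Prod utilization. It is disabled by default. When it is enabled and the current pod is of type Prod, koord-scheduler handles only Prod pods in the estimation algorithm and sums the current utilization of the other online pods recorded in NodeMetric that are not handled by the estimation mechanism; the sum is used in the final score. If scoreAccordingProdUsage is not enabled, or the pod is an offline pod, the original logic applies and whole-machine utilization is used.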
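
The Filter-stage decision can be sketched as follows. This Python example is a simplified illustration rather than the plugin's code: the function name, the dictionary-based usage inputs, and the percent units are assumptions.

```python
def node_passes_filter(pod_is_prod: bool, node_usage: dict, prod_usage: dict,
                       usage_thresholds: dict, prod_usage_thresholds=None) -> bool:
    """Return True if the node stays a candidate for the pod.

    Usage and thresholds are utilization percentages keyed by resource name,
    e.g. {"cpu": 65, "memory": 95}.
    """
    if pod_is_prod and prod_usage_thresholds:
        # Prod pod with prodUsageThresholds set: compare summed Prod usage only.
        usage, thresholds = prod_usage, prod_usage_thresholds
    else:
        # Offline pod, or no Prod thresholds: fall back to whole-machine usage.
        usage, thresholds = node_usage, usage_thresholds
    return all(usage.get(res, 0) <= limit for res, limit in thresholds.items())
```

On a node at 80% whole-machine CPU utilization but only 30% Prod utilization, a Prod pod passes when prodUsageThresholds is set to 55%, whereas the same pod is filtered out under a 65% whole-machine threshold if prodUsageThresholds is left empty.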

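Balancing by percentile utilization

Koordinator v1.0 and earlier filter and score based on the average utilization reported by koordlet. Averages hide a lot of information, so in Koordinator v1.1 koordlet adds utilization aggregates computed by percentile, and the scheduler has been adapted accordingly.

In the configuration of the scheduler's LoadAware plugin, aggregated means that filtering and scoring use percentile-aggregated data. aggregated.usageThresholds is the threshold used for filtering; aggregated.usageAggregationType is the percentile type used in the filter stage, with avg, p99, p95, p90, and p50 supported; aggregated.usageAggregatedDuration is the aggregation period to use in the filter stage, and if it is not configured, the scheduler uses the data of the longest period reported in NodeMetric; aggregated.scoreAggregationType is the percentile type to use in the scoring stage; aggregated.scoreAggregatedDuration is the aggregation period to use in the scoring stage, and if it is not configured, the scheduler uses the data of the longest period reported in NodeMetric.

In the Filter stage, if aggregated.usageThresholds and the corresponding aggregation type are configured, the scheduler filters by that percentile statistic.

In the Score stage, if aggregated.scoreAggregationType is configured, the scheduler scores by that percentile statistic. Prod pods do not currently support percentile-based filtering.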

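Load-aware descheduling

Over the past several versions, Koordinator has kept evolving the descheduler: it open-sourced the complete framework and strengthened safety to avoid harming the stability of online applications through excessive pod eviction. This also slowed the progress of descheduling features, and Koordinator previously had little capacity to build out descheduling capabilities. That is about to change.

Koordinator v1.1 adds load-aware descheduling. The new plugin, LowNodeLoad, works with the scheduler's load-aware scheduling to form a closed loop: load-aware scheduling picks the best node at scheduling time, and as time passes and the cluster environment and the traffic or requests faced by workloads change, load-aware descheduling steps in to help optimize nodes whose load exceeds the safety threshold. LowNodeLoad differs from the Kubernetes descheduler plugin LowNodeUtilization in that LowNodeLoad decides based on the actual utilization of nodes, whereas LowNodeUtilization decides based on the resource allocation rate.

LowNodeLoad has two key parameters, highThresholds and lowThresholds:

highThresholds is the warning threshold for load. Pods on nodes above this threshold take part in descheduling;
lowThresholds is the safe level for load. Pods on nodes below this threshold are not descheduled.

Take the figure below as an example, where lowThresholds is 45% and highThresholds is 70%. Nodes below 45% are safe because their load is already low. The range above 45% but below 70% is the load level we expect. Nodes above 70% are unsafe, and the portion above 70% should be removed: assuming node A's load is 85%, the load should be reduced by 85% - 70% = 15%, so pods are selected and then migrated.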

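LowNodeLoad example

During migration, we must also consider that the nodes below 45% are the ones that will host new pods after descheduling, so the total load of the migrated pods must not exceed the capacity limit of these low-load nodes. That limit is highThresholds minus the node's current load. Assuming node B's load is 20%, then 70% - 20% = 50%, and this 50% is the capacity it can take on. Each time a pod is evicted, this capacity should therefore be reduced by the current load, estimated load, or profiled value of the pod being descheduled (these values correspond to those used in load-aware scheduling). This ensures that no more pods are migrated than necessary.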
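
The bookkeeping described above can be sketched in Python. This is an illustration under simplifying assumptions, not the LowNodeLoad implementation: loads are single integer percentages rather than per-resource values, all nodes are assumed to be the same size, and the function and parameter names are hypothetical.

```python
def plan_evictions(source_load: int, candidates: list, other_node_loads: list,
                   low: int = 45, high: int = 70) -> list:
    """Pick pods to evict from one overloaded node.

    candidates: (pod_name, pod_load_percent) pairs, already sorted in eviction order.
    other_node_loads: current load percentages of the other nodes.
    """
    excess = source_load - high  # load that must leave the node
    # Only nodes below lowThresholds receive pods; each can absorb up to high - load.
    capacity = sum(high - load for load in other_node_loads if load < low)
    evicted = []
    for name, pod_load in candidates:
        if excess <= 0:
            break  # node is back under highThresholds
        if pod_load > capacity:
            continue  # low-load nodes cannot absorb this pod
        evicted.append(name)
        excess -= pod_load
        capacity -= pod_load
    return evicted
```

With node A at 85% and node B at 20%, the excess is 15% and the spare capacity is 50%, so evicting two pods of 10% each brings node A under the threshold and leaves the third candidate in place.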

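If a cluster always has a few nodes whose load is high, and there are not many of them, descheduling these nodes frequently also carries risk, so users can set numberOfNodes as needed.

In addition, after LowNodeLoad identifies the nodes above the threshold, it selects Pods. When selecting Pods, you can configure the namespaces to include or exclude, or configure a pod selector. You can also configure nodeFit to check whether any Node matches each candidate Pod's Node Affinity/Node Selector/Toleration; candidates without a matching Node are ignored. You may also choose not to use this check: setting nodeFit to false disables it, and resource reservation is then left entirely to the underlying MigrationController through Koordinator Reservation.

After Pods are selected, they are sorted along several dimensions, including Koordinator QoSClass, Kubernetes QoSClass, Priority, usage, and creation time.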

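cgroup v2 support

Background

Many of Koordinator's node-level QoS capabilities and resource suppression/elasticity policies are built on the Linux Control Group (cgroups) mechanism, such as CPU QoS (cpu), Memory QoS (memory), CPU Burst (cpu), and CPU Suppress (cpu, cpuset). The koordlet component can use cgroups (v1) to limit the time slices, weights, priorities, topology, and other attributes of the resources available to containers. Newer Linux kernels keep enhancing and iterating on cgroups and have introduced cgroups v2, which unifies the cgroups directory hierarchy, improves the cooperation between different subsystems/cgroup controllers in v1, and further strengthens the resource management and monitoring of some subsystems. Kubernetes has made cgroups v2 a GA (general availability) feature since 1.25; the Kubelet uses it for container resource management, sets container resource isolation parameters under the unified cgroups hierarchy, and supports the enhanced MemoryQoS feature.

cgroup v1/v2 structure

In Koordinator v1.1, the node component koordlet adds support for cgroups v2, including the following work:

Refactored the Resource Executor module to unify file operations on identical or similar cgroup interfaces across v1 and v2, which makes it easier for koordlet features to be compatible with cgroups v2 and to merge conflicting reads and writes.
Adapted the node-level features released so far to cgroups v2, replaced cgroup operations with the new Resource Executor module, and improved error logs in different system environments.

Most koordlet features in Koordinator v1.1 are already compatible with cgroups v2, including but not limited to:

Resource utilization collection
Dynamic resource overcommitment
Batch resource isolation (BatchResource, which deprecates BECgroupReconcile)
CPU QoS (GroupIdentity)
Memory QoS (CgroupReconcile)
Dynamic CPU suppression (BECPUSuppress)
Memory eviction (BEMemoryEvict)
CPU Burst (CPUBurst)
L3 Cache and memory bandwidth isolation (RdtResctrl)

Remaining incompatible features such as PSICollector will be adapted in the upcoming v1.2; you can follow issue#407 for the latest progress. Upcoming Koordinator releases will also gradually introduce more cgroups v2 enhancements.

Using cgroups v2

In Koordinator v1.1, koordlet's adaptation to cgroups v2 is transparent to upper-level feature configuration. Apart from the feature-gates of deprecated features, you do not need to change the ConfigMap slo-controller-config or other feature-gate settings. When koordlet runs on a node with cgroups v2 enabled, the corresponding node-level features automatically switch to the cgroups v2 system interfaces.

Also, cgroups v2 is a feature of newer Linux kernels (>= 5.8 recommended) and depends on the kernel version and the Kubernetes version. We recommend a Linux distribution that enables cgroups v2 by default and Kubernetes v1.24 or later.

For more on how to enable cgroups v2, see the Kubernetes community documentation.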

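Interference detection metrics collection

In a real production environment, the runtime state of a single node is a "chaotic system", and application interference caused by resource contention cannot be avoided entirely. Koordinator is building interference detection and optimization capabilities: it extracts metrics of application runtime state, analyzes and detects in real time, and applies more targeted policies to the affected application and the interference source once interference is found.

Koordinator has implemented a series of Performance Collectors that collect, on the node, low-level metrics highly correlated with application runtime state and expose them through Prometheus to support interference detection and cluster application scheduling.

Metrics collection

Performance Collector is controlled by several feature-gates. Koordinator currently provides the following collectors:

CPICollector: controls the CPI collector. CPI stands for Cycles Per Instruction, the average number of clock cycles an instruction takes to execute. The CPI collector is built on the Cycles and Instructions Kernel PMU (Performance Monitoring Unit) events and the perf_event_open(2) system call.
PSICollector: controls the PSI collector. PSI stands for Pressure Stall Information. It indicates the number of tasks of a container that were blocked during the collection interval while waiting for CPU, memory, or IO resources. Before using the PSI collector, you need to enable PSI in Anolis OS; see the documentation for how to enable it.

Performance Collector is disabled by default. You can enable it by modifying koordlet's feature-gates; this change does not affect other feature-gates.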

kubectl edit ds koordlet -n koordinator-system

...
spec:
  ...
    spec:
      containers:
      - args:
        ...
        # modify here
        # - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true
        - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,CPICollector=true,PSICollector=true

ServiceMonitor

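Koordinator v1.1.0 adds ServiceMonitor support for koordlet, exposing the collected metrics through Prometheus so that users can collect them to analyze and manage application systems.

ServiceMonitor is introduced by Prometheus, so the helm chart does not install it by default. You can install ServiceMonitor with the following command: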

helm install koordinator https://... --set koordlet.enableServiceMonitor=true

After deployment, you can find the target in the Prometheus UI.

# HELP koordlet_container_cpi Container cpi collected by koordlet
# TYPE koordlet_container_cpi gauge
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="cycles",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 2.228107503e+09
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="instructions",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 4.1456092e+09
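As a small worked example, the two samples above can be combined into a CPI value by dividing cycles by instructions, following the definition of CPI given earlier. The helper name is hypothetical, and the sketch assumes both samples cover the same collection interval.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI = cycles / instructions over the same sampling window."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


# Values taken from the koordlet_container_cpi samples shown above.
cpi = cycles_per_instruction(2.228107503e+09, 4.1456092e+09)
print(f"CPI = {cpi:.3f}")
```

For these samples the ratio is roughly 0.54 cycles per instruction. A sustained rise in CPI for the same workload is one possible signal of interference, since the same instructions are taking more cycles to complete.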

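Koordinator's interference detection still needs more metrics to cover more complex real-world scenarios, and we will keep investing in metrics collection for other resources such as memory and disk IO.

Other updates

See the v1.1 release page for more of the new features included in this version.

Koordinator v1.0: Official Release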

November 3, 2022 · 7 min read

Joseph

Koordinator maintainer

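Since Koordinator was open-sourced in March this year, it has published 7 releases, gradually bringing the core capabilities of the co-location system used inside Alibaba and Alibaba Cloud to the open-source community. Along the way it has drawn attention from the Kubernetes, big data, high-performance computing, and machine learning fields and communities. The Koordinator community has gained support from a number of contributors, and some companies have started using Koordinator in production to address cost and co-location problems. Thanks to the efforts of the Koordinator community, we are excited to announce the official release of Koordinator 1.0.

In its early days the Koordinator project focused on the core co-location capability, differentiated SLO. To make co-location easier to adopt, Koordinator provides the ClusterColocationProfile mechanism, which lets users co-locate different workloads without modifying existing code and get familiar with co-location step by step. Koordinator then strengthened node-side QoS guarantees, providing capabilities including but not limited to CPU Suppress, CPU Burst, Memory QoS, L3 Cache/MBA resource isolation, and satisfaction-based eviction, which solve most node-side workload stability problems. Used with the Koordinator Runtime Proxy component, it is more compatible with the native management mechanism of the Kubernetes kubelet.

Koordinator also offers new approaches to task scheduling, QoS-aware scheduling, and descheduling. It has built fine-grained CPU scheduling that is fully compatible with Kubernetes CPU management, and balanced scheduling based on actual node load. To help users manage resources, Koordinator provides resource reservation (Reservation), and it further enhances the Kubernetes community's Coscheduling and ElasticQuota Scheduling, bringing new vitality to task scheduling. Koordinator also provides a brand-new descheduler framework that focuses on the extensibility and safety of the Descheduler.

Install or upgrade to Koordinator v1.0.0

Install with Helm

You can easily install Koordinator with helm v3.5+. Helm is a simple command-line tool, and you can get it from here.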

# First, add the koordinator charts repository if you haven't done so.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 1.0.0

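Interpretation of Version Features

Koordinator v1.0 does not add many new features overall. The main changes are as follows.

Independent API Repo

To make Koordinator's API definitions easier to integrate and use, and to avoid extra dependencies or dependency conflicts caused by depending on Koordinator, we created an independent API repo: koordinator-sh/apis

New ElasticQuota Webhook

In Koordinator v0.7, we made many enhancements based on the ElasticQuota provided by Kubernetes sig-scheduler, adding a tree-structured management mechanism, a fairness guarantee mechanism, and more, which help solve problems encountered when using ElasticQuota. In Koordinator v1.0 we further provide the ElasticQuota Webhook, which ensures that new ElasticQuota objects follow the rules and constraints defined by Koordinator when tree-structured management is used:

Except for the root node, the sum of the min values of all child nodes must be less than the min of their parent node.
The max of child nodes is not restricted; a child's max may be greater than its parent's max. Consider this scenario: the cluster has two ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the whole dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota in the dev-parent subtree.
Pods cannot use a parent ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism extremely complicated, so this scenario is not supported for now.
Only parent nodes can have child nodes; a child node cannot have children of its own.
The quota.scheduling.koordinator.sh/is-parent attribute of an ElasticQuota cannot be changed for now.

Further improvements to ElasticQuota Scheduling

In Koordinator v0.7, both the leader and standby Pods of koord-scheduler started the ElasticQuota Controller and updated ElasticQuota objects. Koordinator v1.0 fixes this so that only the leader Pod starts the Controller and updates ElasticQuota objects. We also addressed the ElasticQuota Controller's potentially frequent updates of ElasticQuota objects: it now updates an object only when it detects a change in one of its dimensions, which reduces the pressure that frequent updates put on the APIServer.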

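Further improvements to Device Share Scheduling

In Koordinator v1.0, koordlet reports the GPU model and driver version to the Device CRD object, and koord-manager syncs them to the Node object by adding the corresponding labels.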

apiVersion: v1
kind: Node
metadata:
  labels:
    kubernetes.io/gpu-driver: 460.91.03
    kubernetes.io/gpu-model: Tesla-T4
    ...
  name: cn-hangzhou.10.0.4.164
spec:
  ...
status:
  ...

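Koordinator Runtime Proxy compatibility improvements

In earlier versions of Koordinator, when koord-runtime-proxy and koordlet were installed together, newly scheduled Pods could fail to create containers if koordlet was abnormal or was being uninstalled or reinstalled. To solve this, koord-runtime-proxy checks whether a Pod has the special label runtimeproxy.koordinator.sh/skip-hookserver=true; if it does, koord-runtime-proxy forwards the CRI request directly to the runtime, such as containerd or docker.

Other changes

You can view more changes, along with their authors and commits, on the Github release page.

Koordinator v0.7: Bringing New Vitality to Task Scheduling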

September 23, 2022 · 34 min read

Joseph

Koordinator maintainer

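Following the v0.6 release[2], and thanks to the efforts of the Koordinator community, Koordinator[1] now ships the significant v0.7 release. This version focuses on the task scheduling capabilities needed by machine learning and big data scenarios, such as CoScheduling, ElasticQuota, and fine-grained GPU sharing scheduling. It also improves scheduling diagnosis and analysis, and the descheduler is much safer, lowering the risk of descheduling.

Interpretation of Version Features

1. Task scheduling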

1.1 Enhanced Coscheduling

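Gang scheduling is a strategy in concurrent systems for scheduling multiple related processes to run simultaneously on different processors. Its main principle is to ensure that all related processes can start at the same time, so that the failure of some processes does not block the whole group. For example, a submitted Job produces multiple tasks that are expected to either all be scheduled or all fail. This requirement is called All-or-Nothing, and the corresponding implementation is called Gang Scheduling (or Coscheduling).
From the start, Koordinator has aimed to support co-located scheduling of many kinds of Kubernetes workloads and to improve their runtime efficiency and reliability, including the All-or-Nothing jobs that are common in machine learning and big data. To meet this need, Koordinator v0.7.0 implements Enhanced Coscheduling on top of the community Coscheduling.
Following Koordinator's principle of community compatibility, Enhanced Coscheduling is fully compatible with the community Coscheduling and the PodGroup CRD it depends on. Users who already use PodGroup can upgrade to Koordinator seamlessly.
Beyond that, Enhanced Coscheduling adds the following enhancements:

Support for `Strict` and `NonStrict` modes

The difference is that in Strict mode (the default), a scheduling failure rejects all Pods that have been allocated resources and are in the Wait state, while NonStrict mode does not trigger a Reject. In NonStrict mode, when Pod A and Pod B of the same PodGroup are scheduled, a failure of Pod A does not affect Pod B, which continues to be scheduled. NonStrict mode is friendlier to large Jobs and lets them finish scheduling faster, but it also increases the risk of resource deadlock. Koordinator will later provide a way to resolve deadlocks in NonStrict mode.
To enable NonStrict mode, add the annotation gang.scheduling.koordinator.sh/mode=NonStrict to the PodGroup or Pod.

Improved handling of PodGroup scheduling failures for more efficient retries

For example, PodGroup A is associated with 5 Pods. The first 3 Pods pass Filter/Score and enter the Wait stage, and the 4th Pod fails to schedule. When the 5th Pod is scheduled, the scheduler finds that the 4th Pod has failed and rejects it. In the community Coscheduling, a failed PodGroup is added to a cache-based lastDeniedPG object; while the cache entry has not expired, scheduling is rejected, and once it expires scheduling is allowed again. The cache expiration time is therefore critical: if it is too long, Pods wait a long time for another chance, and if it is too short, there are frequent futile scheduling attempts.
Enhanced Coscheduling instead implements a retry mechanism based on ScheduleCycle. In the scenario above, the ScheduleCycle of the 5 Pods starts at 0 and the ScheduleCycle of the PodGroup starts at 1. Each time a Pod is attempted, its ScheduleCycle is updated to the PodGroup ScheduleCycle. If one Pod fails, the current PodGroup ScheduleCycle is marked invalid, and every Pod whose ScheduleCycle is less than the PodGroup ScheduleCycle is then rejected. Once all Pods in the PodGroup have been attempted in this round, so that every Pod ScheduleCycle equals the current PodGroup ScheduleCycle, the PodGroup ScheduleCycle is incremented and scheduling is allowed again. This avoids the drawbacks of expiration times, and retries depend entirely on the scheduling queue configuration.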
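The ScheduleCycle bookkeeping described above can be sketched as a small state machine. This is a simplified illustration, not the koord-scheduler implementation: the class and method names are hypothetical, feasibility is passed in as a boolean instead of running Filter/Score, and the Strict-mode rejection of Pods already in the Wait state is left out.

```python
class GangCycle:
    """Minimal model of the ScheduleCycle retry mechanism for one PodGroup."""

    def __init__(self, pods):
        self.cycle = 1                          # PodGroup ScheduleCycle starts at 1
        self.valid = True                       # whether the current cycle still allows scheduling
        self.pod_cycle = {p: 0 for p in pods}   # Pod ScheduleCycle starts at 0

    def attempt(self, pod, feasible):
        """Record one scheduling attempt and return its outcome."""
        self.pod_cycle[pod] = self.cycle        # the Pod has now been tried in this cycle
        if not self.valid:
            outcome = "rejected"                # an earlier Pod already failed in this cycle
        elif not feasible:
            self.valid = False                  # mark the current cycle invalid
            outcome = "failed"
        else:
            outcome = "waiting"
        # Once every Pod has been tried in this cycle, start the next one.
        if all(c == self.cycle for c in self.pod_cycle.values()):
            self.cycle += 1
            self.valid = True
        return outcome


gang = GangCycle(["p1", "p2", "p3", "p4", "p5"])
feasible = {"p1": True, "p2": True, "p3": True, "p4": False, "p5": True}
results = [gang.attempt(p, feasible[p]) for p in ["p1", "p2", "p3", "p4", "p5"]]
```

Replaying the five-Pod scenario, the first three Pods wait, the fourth fails, the fifth is rejected without being evaluated, and the group then moves on to cycle 2 with scheduling allowed again, with no expiration timer involved.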

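Support grouping multiple PodGroups for Gang Scheduling

Some complex Jobs have multiple roles. Each role manages a batch of tasks that require All-or-Nothing, each role has its own MinMember requirement, and All-or-Nothing is also required across roles. As a result, each role has its own PodGroup, and a PodGroup that is already satisfied must still wait until the PodGroups of the other roles are satisfied. The community Coscheduling cannot handle this scenario. Enhanced Coscheduling lets users associate multiple PodGroups by adding an annotation to them, and supports associations across Namespaces. For example, if a user has two PodGroups named podGroupA and podGroupB, they can be associated as follows: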

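apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: podGroupA
  namespace: default
  annotations:
    # Annotation values must be strings, so the group list is written as a JSON-encoded string.
    gang.scheduling.koordinator.sh/groups: '["namespaceA/podGroupA", "namespaceB/podGroupB"]'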
spec:
    ...

Support for a lightweight Gang protocol

If you find creating a PodGroup too cumbersome, you can instead add the same annotation gang.scheduling.koordinator.sh/name=<podGroupName> to a group of Pods to indicate that they are scheduled with Coscheduling. To set minMember, add the annotation gang.scheduling.koordinator.sh/min-available=<availableNum>. For example:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    gang.scheduling.koordinator.sh/name: "pod-group-a"
    gang.scheduling.koordinator.sh/min-available: "5"
  name: demo-pod
  namespace: default
spec:
    ...

1.2 ElasticQuota Scheduling

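A medium or large company has multiple product and R&D teams sharing several fairly large Kubernetes clusters, whose large pools of CPU/Memory/Disk resources are managed centrally by a resource operations team. Before purchasing resources, the operations team usually asks each team to submit a quota budget based on its needs, and business teams budget according to their current business and their expectations for the future. Ideally every unit of quota would be used, but in practice that rarely happens. Typical problems are:

Team A overestimates its business growth and requests more quota than it can use.
Team B underestimates its business growth and does not have enough quota.
Team C is running a campaign and is short on quota, but the campaign lasts only a few weeks, so requesting a lot of quota and resources would be wasteful.
Team D has sub-teams and businesses of its own, each of which can run into the situations of teams A, B, and C. Some of them suddenly need to submit computing tasks to deliver to customers but have no quota left, and there is no time to go through quota budget approval.
......

These are everyday situations. In co-location and big data scenarios, temporary bursts of demand are frequent, and they make quota management very challenging. Good quota management has to avoid over-purchasing resources to keep costs down, while purchasing no resources, or as few as possible, when quota is needed temporarily. It also must not let quota limit resource utilization: if quota is managed poorly, even good resource-reuse technology cannot deliver its value. In short, quota management is a long-term problem that companies and organizations must face.
Kubernetes ResourceQuota solves part of the quota management problem. The native Kubernetes ResourceQuota API specifies the maximum resource quota for each Namespace and enforces it through the admission mechanism. If the total resources allocated in a Namespace exceed the quota specified by ResourceQuota, Pod creation is rejected. The Kubernetes ResourceQuota design has a limitation: quota usage is aggregated from Pod Requests. While this guarantees that actual resource consumption never exceeds the ResourceQuota limit, it can lead to low resource utilization, because some Pods may have claimed resources but failed to be scheduled.
Kubernetes Scheduler-Sig later proposed and implemented a design called ElasticQuota, inspired by Yarn Capacity Scheduling. It lets users set max and min:

max is the upper limit of resources the user can consume
min is the minimum amount of resources guaranteed so that the user's basic functionality/performance is met

These two parameters help users meet the following needs:

When min < max and there is a burst of demand, the user can keep creating new Pods for new tasks even if the total usage of the ElasticQuota exceeds min, as long as it has not reached max.
When a user needs more resources, the user can "borrow" min that is guaranteed to other ElasticQuotas but not yet used.
When an ElasticQuota needs its min resources back, they are reclaimed from borrowers through preemption, that is, by evicting some Pods of other ElasticQuotas whose usage exceeds min.

ElasticQuota still has limitations: it does not guarantee fairness well. If one ElasticQuota creates a large number of Pods, it may consume all the Quota that can be borrowed, so later Pods may not get any Quota. The only remedy then is to reclaim some Quota through preemption.
In addition, both ElasticQuota and Kubernetes ResourceQuota are Namespace-oriented and do not support multi-level tree structures, so enterprises and organizations with complex organizational structures cannot use ElasticQuota/Kubernetes ResourceQuota well for quota management.
To address these quota management problems, Koordinator provides an elastic Quota management mechanism with multi-level management (multi hierarchy quota management), built on the community ElasticQuota. It has the following features:

Compatible with the community ElasticQuota API, so users can upgrade to Koordinator seamlessly.
Supports managing Quota in a tree structure.
Supports fairness based on shared weight.
Lets users decide whether Quota may be lent to other consumers.

Associating Pods with ElasticQuota

This capability is easy to use. You can follow the ElasticQuota convention and set one ElasticQuota object per Namespace, or add a Label to the Pod to associate it with an ElasticQuota: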

apiVersion: v1
kind: Pod
metadata:
  labels:
    quota.scheduling.koordinator.sh/name: "elastic-quota-a"
  name: demo-pod
  namespace: default
spec:
    ...

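Tree-structured management and usage

To manage Quota in a tree structure, add the Label quota.scheduling.koordinator.sh/is-parent to the ElasticQuota to indicate whether it is a parent node, and quota.scheduling.koordinator.sh/parent to give the name of its parent ElasticQuota. For example:

We create an ElasticQuota named parentA as the parent node, with a total of 100 CPU cores and 200Gi of memory, and a child node named childA1.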

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: parentA
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 100
    memory: 200Gi
  min:
    cpu: 100
    memory: 200Gi
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: childA1
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "parentA"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 40
    memory: 100Gi
  min:
    cpu: 20
    memory: 40Gi

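There are some constraints to follow when managing ElasticQuota in a tree structure:

Except for the root node, the sum of the min values of all child nodes must be less than the min of their parent node.
The max of child nodes is not restricted; a child's max may be greater than its parent's max. Consider this scenario: the cluster has two ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the whole dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota in the dev-parent subtree.
Pods cannot use a parent ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism extremely complicated, so this scenario is not supported for now.
Only parent nodes can have child nodes; a child node cannot have children of its own.
The quota.scheduling.koordinator.sh/is-parent attribute of an ElasticQuota cannot be changed for now.

We will enforce these constraints through a webhook in the next release.

Fairness guarantee mechanism

To make the fairness guarantee mechanism easier to follow, we first define a few new concepts:

request is the total resource request of all Pods associated with the same ElasticQuota. If the request of ElasticQuota A is less than its min and the request of ElasticQuota B is greater than its min, the unused part of ElasticQuota A, min - request, is lent to ElasticQuota B through the fairness guarantee mechanism. When ElasticQuota A needs the lent amount, ElasticQuota B must return it according to the fairness guarantee mechanism.
runtime is the actual amount of resources the ElasticQuota can currently use. If request is less than min, runtime equals request; this means the min semantics must be honored and the request satisfied unconditionally. If request is greater than min and min is less than max, the fairness guarantee mechanism assigns a runtime between min and max, that is, max >= runtime >= min.
shared-weight is the competitiveness of an ElasticQuota, and defaults to the ElasticQuota's max.

The following example walks through the fairness guarantee mechanism. Assume the cluster has 100 CPU cores in total and 4 ElasticQuotas, as shown in the figure below, where the green part is the Request: A's current request is 5, B's is 20, C's is 30, and D's is 70.

Note that the min values of A, B, C, and D sum to 60, leaving 40 idle quota units. A can also lend 5 units to B, C, and D, so 45 units in total are shared by B, C, and D. With shared-weights of 60, 50, and 80 for B, C, and D respectively, the amount each can share is:

B gets 14 units: 45 * 60 / (60 + 50 + 80) ≈ 14
C gets 12 units: 45 * 50 / (60 + 50 + 80) ≈ 12
D gets 19 units: 45 * 80 / (60 + 50 + 80) ≈ 19

Note, however, that C and D need more quota, while B needs only 5 more units to satisfy its Request, since B's min is 15. So we give B only 5 units and split the remaining 9 units between C and D.

C gets 3 units: 9 * 50 / (50 + 80) ≈ 3
D gets 6 units: 9 * 80 / (50 + 80) ≈ 6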
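The two rounds of weighted sharing above can be reproduced with a short Python sketch. The function name is hypothetical, and rounding to the nearest integer is an assumption chosen because it matches the figures in the text; the sketch covers only the proportional split, not the full runtime calculation.

```python
def share_by_weight(total, weights):
    """Split `total` quota units in proportion to each quota's shared-weight."""
    weight_sum = sum(weights.values())
    return {name: round(total * w / weight_sum) for name, w in weights.items()}


# Round 1: 45 shareable units split among B, C, and D.
first = share_by_weight(45, {"B": 60, "C": 50, "D": 80})

# B only needs 5 of its units, so the rest is redistributed.
leftover = first["B"] - 5

# Round 2: the leftover units are split between C and D.
second = share_by_weight(leftover, {"C": 50, "D": 80})
```

Rounded shares are not guaranteed to add up to the total for arbitrary inputs, so a real implementation would need its own rule for handling remainders.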

The final allocation is:

A runtime = 5
B runtime = 20
C runtime = 35
D runtime = 40

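To summarize the process:

When request < min, the quota lends out lent-to-quotas; when request > min, it borrows borrowed-quotas.
Sum over all Quotas with runtime < min; the total is the amount that can be lent.
Compute how much each ElasticQuota can borrow according to its shared-weight.
If the latest runtime > request, the remaining runtime - request can be lent to those that need it more.

Another situation comes up in production: the total resources of the cluster shrink because of node failures, resource operations, and so on, so that the sum of min over all ElasticQuotas exceeds the total. In that case the min requirements cannot all be guaranteed, so we scale down the min of each ElasticQuota proportionally until the sum of all min values is less than or equal to the actual total.

Preemption mechanism

If Koordinator's ElasticQuota mechanism finds that Quota is insufficient during scheduling, it enters the preemption stage and, in priority order, preempts lower-priority Pods within the same ElasticQuota. Preempting Pods across ElasticQuotas is not supported, but another mechanism is provided for reclaiming Quota from the ElasticQuotas that borrowed it.
For example, a cluster has two ElasticQuotas, ElasticQuota A {min = 50, max = 100} and ElasticQuota B {min = 50, max = 100}. At 10 a.m. a user submits a Job with Request = 100 using ElasticQuota A. Since nobody is using ElasticQuota B, A can borrow 50 Quota from B, which satisfies Request = 100, and now Used = 100. At 11 a.m. another user submits a Job with Request = 100 using ElasticQuota B. Because B's min = 50 must be guaranteed, the fairness guarantee mechanism sets the runtime of both A and B to 50. For ElasticQuota A, Used = 100 is now greater than runtime = 50, so we provide a Controller that evicts some Pods until A's Used drops to the level of its runtime.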

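2. Fine-grained resource scheduling

Machine learning relies on large numbers of powerful GPU devices for model training, but GPUs are very expensive. How to make better use of GPU devices, get full value from them, and reduce costs is a pressing problem. In the existing GPU allocation mechanism of the Kubernetes community, GPUs are allocated by kubelet, and only one or more whole GPU instances can be allocated. This approach is simple and reliable, but like CPU and Memory, GPUs are not always highly utilized, so resources are wasted. Koordinator therefore aims to let multiple workloads share GPU devices to save costs. GPUs also have special characteristics. For example, the NVLink supported by NVIDIA GPUs, shown below, and oversubscription scenarios both require a central decision by the scheduler to obtain a globally optimal allocation.

The figure shows that although the node has 8 GPU instances of model A100/V100, data transfer speeds between GPU instances differ. When a Pod needs multiple GPU instances, we can assign it the combination of GPU instances with the highest data transfer speed. Moreover, when we want the GPU instances of a group of Pods to have the highest data transfer speed, the scheduler should allocate the best GPU instances to these Pods in a batch and place them on the same node.

GPU resource protocol

Koordinator is compatible with the community's existing nvidia.com/gpu resource protocol and also defines extended resource protocols that let users allocate GPU resources at a finer granularity.

kubernetes.io/gpu-core represents the computing power of a GPU. Similar to Kubernetes MilliCPU, we abstract the total computing power of one GPU as 100, and users can request the amount of GPU computing power they need.
kubernetes.io/gpu-memory represents the GPU memory capacity, in bytes.
kubernetes.io/gpu-memory-ratio represents the percentage of GPU memory.

Assume a node has 4 GPU device instances, each with 8Gi of GPU memory. To request a whole GPU instance, besides using nvidia.com/gpu, you can also request it as follows: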

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits: 
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"
      requests:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"

To use only half of the resources of one GPU instance, request it as follows:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits: 
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"
      requests:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"

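Device information and device capacity reporting

In Koordinator v0.7.0, once koordlet is installed on a node it automatically detects whether the node has GPU devices. If it does, koordlet reports the Minor ID, UUID, computing power, and memory size of these GPU devices to a CRD of type Device. Each node has one Device CRD instance. The Device CRD can describe not only GPUs but also device types such as FPGA/RDMA; v0.7.0 supports only GPUs for now.
The Device CRD is consumed by the NodeResource controller in koord-manager and by koord-scheduler. Based on the information in the Device CRD, the NodeResource controller converts it into the resource protocols supported by Koordinator, kubernetes.io/gpu-core and kubernetes.io/gpu-memory, and updates the Node.Status.Allocatable and Node.Status.Capacity fields to help the scheduler and kubelet with resource scheduling. gpu-core represents the computing power of a GPU device instance, and a whole instance has a computing power of 100. If a node has 8 GPU device instances, its gpu-core capacity is 8 * 100 = 800. gpu-memory represents the memory size of a GPU device instance, in bytes; likewise, the total allocatable GPU memory of a node is the number of devices * the capacity of each instance. For example, if one GPU device has 8G of memory and the node has 8 GPU instances, the total is 8 * 8G = 64G.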

apiVersion: v1
kind: Node
metadata:
  name: node-a
status:
  capacity:
    koordinator.sh/gpu-core: 800
    koordinator.sh/gpu-memory: "64Gi"
    koordinator.sh/gpu-memory-ratio: 800
  allocatable:
    koordinator.sh/gpu-core: 800
    koordinator.sh/gpu-memory: "64Gi"
    koordinator.sh/gpu-memory-ratio: 800
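The capacity arithmetic behind this Node status can be sketched as follows. The function and dictionary keys are hypothetical shorthand for the resource names, and treating gpu-memory-ratio as 100 per device is an assumption inferred from the value 800 shown above for 8 GPUs.

```python
GIB = 1024 ** 3


def node_gpu_capacity(num_gpus, memory_bytes_per_gpu):
    """Aggregate per-device GPU resources into node-level capacity."""
    return {
        "gpu-core": num_gpus * 100,                     # each whole GPU counts as 100
        "gpu-memory": num_gpus * memory_bytes_per_gpu,  # total GPU memory in bytes
        "gpu-memory-ratio": num_gpus * 100,             # assumed: 100 percent per GPU
    }


capacity = node_gpu_capacity(8, 8 * GIB)
print(capacity["gpu-core"], capacity["gpu-memory"] // GIB)
```

For a node with 8 GPUs of 8Gi each, this yields a gpu-core capacity of 800 and 64Gi of gpu-memory, matching the values in the example.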

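Central scheduling and allocation of device resources

In the device scheduling mechanism natively provided by the Kubernetes community, the scheduler only checks whether device capacity satisfies the Pod. This is enough for simple device types, but finer-grained GPU allocation needs support from the central scheduler to achieve a global optimum.
The Koordinator scheduler, koord-scheduler, adds a scheduling plugin called DeviceShare for fine-grained device resource scheduling. DeviceShare consumes the Device CRD and records the devices that can be allocated on each node. At scheduling time, DeviceShare converts the Pod's GPU resource request into Koordinator's resource protocol and filters the unallocated GPU device instances on each node. After confirming that resources are available, it updates its internal state in the Reserve stage and updates the Pod Annotation in the PreBind stage to record which GPU devices the Pod should use.
DeviceShare will support Binpacking and Spread strategies in later versions for better device resource scheduling.

Precise device binding on the node

The Kubernetes community provides the DevicePlugin mechanism in kubelet, which gives device vendors a chance to obtain device information after kubelet allocates devices and to fill it into environment variables or update mount paths. It cannot, however, support centralized fine-grained GPU scheduling.
To address this, Koordinator extends koord-runtime-proxy so that when kubelet creates a container, it updates the environment variables and injects the GPU device information allocated by the scheduler.

3. Scheduler diagnosis and analysis

When using Kubernetes, people often run into scheduling questions such as:

Why can't my Pod be scheduled?
Why was this Pod scheduled to this node? Shouldn't another scoring plugin have affected the result?
I developed a new plugin and the scheduling result is not what I expected, but I don't know what went wrong.

Diagnosing these problems requires not only understanding Kubernetes' basic scheduling and resource allocation mechanisms but also support from the scheduler itself. The diagnostic capabilities of Kubernetes kube-scheduler are limited, and sometimes there are hardly any logs to look at. kube-scheduler natively supports changing the log level over HTTP to get more log information. For example, the following command changes the log level to 5: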

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/v --data '5' 
successfully set klog.logging.verbosity to 5

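To address these problems, Koordinator implements a set of RESTful APIs that help users diagnose and analyze problems more efficiently.

Analyzing Score results

PUT /debug/flags/s lets users turn on the Debug Score switch. After scoring finishes, the scores given by each plugin to the TopN nodes are printed in Markdown format. For example: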

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/s --data '100'
successfully set debugTopNScores to 100

When a new Pod is scheduled, the scheduler log shows information like the following:

| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration |
| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 |
| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 |
| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 |
| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 |

With a Markdown tool, this can be rendered as the following table:

#	Pod	Node	Score	LoadAwareScheduling	NodeResourcesFit	PodTopologySpread
0	default/curlimage-545745d8f8-rngp7	cn-hangzhou.10.0.4.51	577	87	94	200
1	default/curlimage-545745d8f8-rngp7	cn-hangzhou.10.0.4.50	574	85	93	200
2	default/curlimage-545745d8f8-rngp7	cn-hangzhou.10.0.4.19	541	55	91	200
3	default/curlimage-545745d8f8-rngp7	cn-hangzhou.10.0.4.18	487	15	82	200

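Exporting internal state from scheduling plugins

Plugins inside koord-scheduler such as NodeNUMAResource, DeviceShare, and ElasticQuota maintain internal state to help with scheduling. koord-scheduler defines a new plugin extension interface (defined below). After initializing a plugin, it checks whether the plugin implements this interface and, if so, calls it so that the plugin can register the RESTful APIs it wants to expose. For example, the NodeNUMAResource plugin provides two endpoints, /cpuTopologyOptions/:nodeName and /availableCPUs/:nodeName, for viewing the CPU topology information and allocation results recorded by the plugin.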

type APIServiceProvider interface {
    RegisterEndpoints(group *gin.RouterGroup)
}

To view the data, build the URL as /apis/v1/plugins/<pluginName>/<pluginEndpoints>. For example, to view /cpuTopologyOptions/:nodeName:

$ curl schedulerLeaderIP:10252/apis/v1/plugins/NodeNUMAResource/cpuTopologyOptions/node-1
{"cpuTopology":{"numCPUs":32,"numCores":16,"numNodes":1,"numSockets":1,"cpuDetails":....

Viewing the currently supported plugin APIs

For convenience, koord-scheduler provides /apis/v1/__services__ to list the supported API Endpoints.

$ curl schedulerLeaderIP:10251/apis/v1/__services__
{
    "GET": [
        "/apis/v1/__services__",
        "/apis/v1/nodes/:nodeName",
        "/apis/v1/plugins/Coscheduling/gang/:namespace/:name",
        "/apis/v1/plugins/DeviceShare/nodeDeviceSummaries",
        "/apis/v1/plugins/DeviceShare/nodeDeviceSummaries/:name",
        "/apis/v1/plugins/ElasticQuota/quota/:name",
        "/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName",
        "/apis/v1/plugins/NodeNUMAResource/cpuTopologyOptions/:nodeName"
    ]
}

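4. Safer descheduling

In Koordinator v0.6 we released the brand-new koord-descheduler, which supports implementing descheduling strategies and custom eviction mechanisms as plugins and has a built-in migration controller for PodMigrationJob. It reserves resources through the Koordinator Reservation mechanism so that eviction starts only when resources are available, which solves the problem of Pods having no resources after eviction and hurting application availability.
In Koordinator v0.7, koord-descheduler makes descheduling safer:

Supports Evict rate limiting. Users can configure a rate-limiting policy as needed, such as how many Pods may be evicted per minute.
Supports rolling out descheduling gradually by Namespace, so users can canary it with more confidence.
Supports configuring the number of evictions per Node/Namespace. For example, if at most two evictions are allowed per node, requests from plugins to evict more Pods on that node are rejected.
Workload-aware: if a Workload is being rolled out or scaled in, already has a certain number of Pods being evicted, or has some Pods that are NotReady, the descheduler rejects new descheduling requests. Native Deployment and StatefulSet, as well as Kruise CloneSet and Kruise AdvancedStatefulSet, are currently supported.

Later, the descheduler will also improve fairness to avoid repeatedly descheduling the same workload and to minimize the impact of descheduling on application availability.

5. Other changes

Koordinator further enhances fine-grained CPU scheduling and is fully compatible with the static policy of the kubelet (<= v1.22) CPU Manager. When the scheduler allocates CPUs it avoids those reserved by kubelet, and on the node koordlet fully adapts to kubelet's allocation policies from 1.18 to 1.22, which effectively avoids CPU conflicts.
The resource reservation mechanism supports AllocateOnce semantics for one-time reservation scenarios. The Reservation status semantics are also improved to describe the current state of a Reservation object more accurately.
The way offline resources (Batch CPU/Memory) are declared is improved to support a limit greater than the request, which makes it easy to convert existing burstable tasks to run in co-location mode.

You can view more changes, along with their authors and commits, on the Github release[6] page.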
