
· 12 min read
Rougang Han
Jianyu Wang

Background

Koordinator is an open source project born from Alibaba's more than two years of accumulated experience in container scheduling. It iterates continuously to provide a comprehensive solution for workload consolidation, co-located resource scheduling, resource isolation, and performance tuning, helping users optimize container performance, improve cluster resource efficiency, and run latency-sensitive workloads and batch jobs more effectively.

Today, Koordinator v1.5.0 is released. It is the 13th major release since the project was officially open-sourced in April 2022. The Koordinator community is grateful to all the excellent engineers from Alibaba, Ant Technology Group, Intel, XiaoHongShu, Xiaomi, iQiyi, 360, YouZan, and other companies who have contributed great ideas, code, and scenarios. In v1.5.0, Koordinator brings many feature improvements, including a Pod-level NUMA alignment strategy, network QoS, and Core Scheduling.

Besides, Koordinator has been accepted by the CNCF TOC as a Sandbox project. CNCF (Cloud Native Computing Foundation) is an independent, non-profit organization that supports and promotes cloud native software such as Kubernetes and Prometheus.

[Figure: Koordinator joins the CNCF Sandbox] Vote address: https://github.com/cncf/sandbox/issues/51

Key Features

Pod-level NUMA Policy

In v1.4.0, Koordinator allowed users to set different NUMA alignment policies on different nodes in the cluster. However, this required pre-splitting the nodes into node pools with different NUMA alignment policies, which adds extra node-operation overhead.

In v1.5.0, Koordinator introduces Pod-level NUMA alignment policies to solve this problem. For example, we can set SingleNUMANode for pod-1:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
spec:
  containers:
    - name: container-1
      resources:
        requests:
          cpu: '1'
        limits:
          cpu: '1'

After introducing Pod-level NUMA policies, it is possible that there are multiple NUMA policies on the same node. For example, node-1 has two NUMA nodes, pod-1 uses SingleNUMANode policy on numa-0, and pod-2 uses Restricted policy on numa-0 and numa-1.

Since a Pod's resource requirements only limit the maximum resources it can use on the machine, not the resources it can use on a particular NUMA node, pod-2 may use more resources on numa-0 than were allocated there. This leads to resource contention between pod-2 and pod-1 on numa-0.

To solve this problem, Koordinator supports configuring an exclusive policy for Pods with the SingleNUMANode policy (the singleNUMANodeExclusive field accepts Required or Preferred). For example, we can configure pod-1 to use the SingleNUMANode policy and not co-exist with other Pods on the same machine:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode",
        "singleNUMANodeExclusive": "Required"
      }
spec:
  containers:
    - name: container-1
      resources:
        requests:
          cpu: '1'
        limits:
          cpu: '1'

Moreover, the introduction of Pod-level NUMA policies does not mean that the Node-level NUMA policies will be deprecated. Instead, they are compatible. If the Pod and Node-level NUMA policies are different, the Pod will not be scheduled to the node; if the Node-level NUMA policy is "", it means that the node can place any Pod; if the Pod-level NUMA policy is "", it means that the Pod can be scheduled to any node.

|                    | SingleNUMANode node | Restricted node | BestEffort node |
|--------------------|---------------------|-----------------|-----------------|
| SingleNUMANode pod | ✓                   | x               | x               |
| Restricted pod     | x                   | ✓               | x               |
| BestEffort pod     | x                   | x               | ✓               |
| "" pod             | ✓                   | ✓               | ✓               |

For more information about Pod-level NUMA policies, please see Proposal: Pod-level NUMA Policy.

Terway Net QoS

In v1.5.0, Koordinator cooperates with the Terway community to provide network QoS capabilities.

Terway QoS was created to solve the network bandwidth contention problem in workload consolidation and co-location scenarios. It supports limiting the bandwidth of Pods or QoS classes, and differs from other solutions in the following ways:

  1. It supports limiting the bandwidth according to the business type, which is suitable for workload consolidation scenarios where multiple applications can be co-located at the same node.
  2. It supports dynamic adjustment of Pod bandwidth limits.
  3. It can limit the whole-machine bandwidth, supports multiple network cards, and can limit both container-network Pods and HostNetwork Pods.

Terway QoS has 3 types of network bandwidth priority, and the corresponding Koordinator default QoS mapping is as follows:

| Koordinator QoS | Kubernetes QoS       | Terway Net QoS |
|-----------------|----------------------|----------------|
| SYSTEM          | -                    | L0             |
| LSE             | Guaranteed           | L1             |
| LSR             | Guaranteed           | L1             |
| LS              | Guaranteed/Burstable | L1             |
| BE              | BestEffort           | L2             |

In the co-location scenario, we want to ensure the maximum bandwidth of online applications to avoid contention. When the node is idle, offline jobs can also fully utilize all bandwidth resources.

Therefore, users can classify business traffic into three priorities, from high to low: L0, L1, and L2. The contention scenario is defined as the moment when the sum of the L0, L1, and L2 bandwidth exceeds the whole-machine bandwidth.

L0's maximum bandwidth is adjusted dynamically according to the real-time bandwidth of L1 and L2. It can be as high as the total machine bandwidth and as low as "total machine bandwidth - L1 minimum bandwidth - L2 minimum bandwidth". In any case, the bandwidth of L1 and L2 never exceeds their upper limits. In the contention scenario, the bandwidth of L1 and L2 never falls below their lower limits, and bandwidth is throttled in the order L2, L1, L0. Since Terway QoS has only three priorities, only the whole-machine bandwidth limits can be set for LS and BE; the remaining bandwidth for L0 is derived from the whole-machine bandwidth limit.

Here is an example of the configuration:

# unit: bps
resource-qos-config: |
  {
    "clusterStrategy": {
      "policies": {"netQOSPolicy": "terway-qos"},
      "lsClass": {
        "networkQOS": {
          "enable": true,
          "ingressRequest": "50M",
          "ingressLimit": "100M",
          "egressRequest": "50M",
          "egressLimit": "100M"
        }
      },
      "beClass": {
        "networkQOS": {
          "enable": true,
          "ingressRequest": "10M",
          "ingressLimit": "200M",
          "egressRequest": "10M",
          "egressLimit": "200M"
        }
      }
    }
  }
system-config: |-
  {
    "clusterStrategy": {
      "totalNetworkBandwidth": "600M"
    }
  }

Besides, Koordinator supports Pod-level bandwidth limits through the following annotations:

| Key                       | Value                                            |
|---------------------------|--------------------------------------------------|
| koordinator.sh/networkQOS | '{"IngressLimit": "10M", "EgressLimit": "20M"}'  |
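
For illustration, here is a hedged sketch of a Pod carrying this annotation; the Pod name, container, and image are made up for the example:

apiVersion: v1
kind: Pod
metadata:
  name: web-demo   # hypothetical Pod name for illustration
  annotations:
    koordinator.sh/networkQOS: '{"IngressLimit": "10M", "EgressLimit": "20M"}'
spec:
  containers:
    - name: web
      image: nginx:1.25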

For more information about the Network QoS, please see Network Bandwidth Limitation Using Terway QoS and Terway Community.

Core Scheduling

In v1.5.0, Koordinator provides container-level Core Scheduling ability. It reduces the risk of Side Channel Attacks (SCA) in multi-tenant scenarios, and can be used as a CPU QoS enhancement for the co-location scenarios.

Linux Core Scheduling allows user space to define task groups that can share physical cores. Tasks belonging to the same group are assigned the same cookie as an identifier, and only tasks with the same cookie can run on a physical core (across its SMT siblings) at the same time. By applying this mechanism to security or performance scenarios, we can achieve the following:

  1. Isolate physical cores for tasks of different tenants.
  2. Avoid the contention between offline jobs and online services.

Koordinator enables the kernel mechanism Core Scheduling to achieve container-level group isolation policies, and finally forms the following two capabilities:

  1. Runtime isolation of physical core: Pods can be grouped by the tenants, so pods in different groups cannot share physical cores at the same time for multi-tenant isolation.
  2. Next-gen CPU QoS policy: It can achieve a new CPU QoS policy which ensures both the CPU performance and the security.

Runtime Isolation of Physical Core

Koordinator provides a Pod label protocol to identify the Core Scheduling group of a Pod.

| Key                                | Value       |
|------------------------------------|-------------|
| koordinator.sh/coreSchedulingGroup | "xxx-group" |
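
As an illustration, a hedged sketch of a Pod placed into a core scheduling group via this label; the group name, Pod name, and image are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-app    # hypothetical Pod name
  labels:
    koordinator.sh/coreSchedulingGroup: "tenant-a-group"   # Pods sharing this value form one core scheduling group
spec:
  containers:
    - name: app
      image: nginx:1.25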

Different groups of Pods are running exclusively at the physical core level, which can avoid some side channel attacks on the physical cores, L1 cache or L2 cache for multi-tenant scenarios.

[Figure: container-level Core Scheduling]

Unlike cpuset scheduling, the set of CPUs a Pod can run on is not fixed. A physical core can run Pods from different groups at different moments, so physical cores are shared through time-division multiplexing.

Next-Gen CPU QoS Policy

Koordinator builds a new CPU QoS policy based on the Core Scheduling and cgroup idle mechanisms provided by the Anolis OS kernel.

  • BE containers enable the CGroup Idle feature to lower scheduling weights and priorities.
  • LSR/LS containers enable Core Scheduling feature to expel BE tasks of the same group on the physical cores.

Users can enable the Core Scheduling policy by specifying cpuPolicy="coreSched" in the slo-controller-config.

# Example of the slo-controller-config ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |
    {
      "clusterStrategy": {
        "policies": {
          "cpuPolicy": "coreSched"
        },
        "lsClass": {
          "cpuQOS": {
            "enable": true,
            "coreExpeller": true,
            "schedIdle": 0
          }
        },
        "beClass": {
          "cpuQOS": {
            "enable": true,
            "coreExpeller": false,
            "schedIdle": 1
          }
        }
      }
    }

For more information about the Core Scheduling, please see CPU QoS.

Other Changes

Koordinator v1.5.0 also includes the following enhancements and reliability improvements:

  1. Enhancements: the Reservation Restricted mode supports controlling, through an annotation, which resources strictly follow the Restricted semantics; the NUMA alignment policy is adapted to the upstream semantics; Coscheduling implements fair queuing so that Pods in the same GangGroup are dequeued together, while different Gangs and bare Pods are sorted by their last scheduling time; the NRI mode supports a reconnection mechanism; koordlet improves its monitoring metrics and adds performance metrics; the BlkioReconcile configuration is updated.
  2. BugFixes: Fix the memory leak of koordlet CPUSuppress feature. Fix the panic problem of runtimeproxy. Revise the calculation logic of CPICollector, BECPUEvict, and CPUBurst modules.
  3. Environment compatibility: All components are upgraded to Kubernetes 1.28. Koordlet supports running on non-CUDA images, adapts to the kubelet 1.28 configuration, optimizes the compatibility logic for the CPU manager, and adapts to the cri-o runtime.
  4. Refactoring and improvement: Koordlet improves the resctrl updating logic. Koordlet improves the eviction logic. Revise the GPU resources and card model reporting. Revise the Batch resource calculation logic.
  5. CI/CD: Fix some flaky tests.

For more information about the v1.5.0 changes, please see v1.5.0 Release.

Contributors

Koordinator is an open source community. In v1.5.0, there are 10 new developers who contributed to Koordinator main repo. They are @georgexiang, @googs1025, @l1b0k, @ls-2018, @PeterChg, @sjtufl, @testwill, @yangfeiyu20102011, @zhifanggao, @zwForrest.

Koordinator community now has many enterprise contributors, some of which became Maintainers and Members. During the v1.5.0 release, the new Maintainers are

  • @songzh215
  • @j4ckstraw
  • @lucming
  • @kangclzjc

Thanks to the veteran members for their consistent efforts and to the newcomers for their active contributions. We welcome more contributors to join the Koordinator community.

Future Plan

In upcoming versions, Koordinator plans the following work:

  • Scheduling performance optimization: The scheduling performance is the key indicator of whether the scheduler can handle large-scale clusters. In the next version, Koordinator will provide a setup guide of the basic benchmark environment and common benchmark scenarios, and start to improve the scheduling performance of Koord-Scheduler.
  • Device union allocation: In LLM distributed training scenarios for AI, GPUs on different machines usually need to communicate with each other through high-performance network cards, and performance improves when a GPU and a high-performance NIC are allocated close to each other. Koordinator is working on joint allocation of multiple heterogeneous resources: it is already supported in the protocol and the scheduling logic, while the node-side logic for reporting NIC resources is being explored.
  • Job-level quota preemption: In large clusters, some quotas can be busy while others are idle. The ElasticQuota plugin already supports borrowing resources from idle quotas, but the scheduler does not yet consider Job information when a lender quota wants to take its resources back. For Pods belonging to the same Job, the scheduler should preempt at the Job level to keep the job schedulable and improve efficiency.
  • Load-aware scheduling for in-flight pods: Currently, load-aware scheduling filters and scores nodes based on resource utilization. It improves the distribution of utilization across nodes and reduces the risk of scheduling Pods onto overloaded nodes. However, the accuracy of the utilization can be affected by in-flight Pods because node metrics are reported with a lag. In the coming version, load-aware scheduling will take in-flight Pods into account, keep Pods away from overloaded nodes, and further improve the distribution of utilization across nodes.
  • Fine-grained isolation strategy for last-level cache and memory bandwidth: Contention for the last-level cache and memory bandwidth between containers can degrade memory-access performance. By isolating the last-level cache and memory bandwidth at the QoS level without exceeding the capacity of the RDT groups, koordlet provides Resctrl QoS to reduce contention between offline workloads and online services. In the next version, koordlet will enhance the isolation strategy based on the NRI (Node Resource Interface) mode introduced in v1.3 and provide pod-level isolation, which greatly improves the feature's flexibility and timeliness.

Acknowledgement

Since the project was open-sourced, Koordinator has released more than 19 versions and attracted 80+ contributors. The community keeps growing and improving. We thank all the community members for their active participation and valuable feedback, and we also thank the CNCF organization and related community members for supporting the project.

Welcome more developers and end users to join us! It is your participation and feedback that make Koordinator keep improving. Whether you are a beginner or an expert in the Cloud Native communities, we look forward to hearing your voice!

· 20 min read
Jianyu Wang

Background

As an actively developing open source project, Koordinator has undergone multiple version iterations since the release of v0.1.0 in April 2022, continuously bringing innovations and enhancements to the Kubernetes ecosystem. The core objective of the project is to provide comprehensive solutions for orchestrating collocated workloads, scheduling resources, ensuring resource isolation, and tuning performance to help users optimize container performance and improve cluster resource utilization.

In past version iterations, the Koordinator community has continued to grow, receiving active participation and contributions from engineers at well-known companies. These include Alibaba, Ant Technology Group, Intel, Xiaomi, Xiaohongshu, iQIYI, Qihoo 360, Youzan, Quwan, Meiya Pico, PITS, among others. Each version has advanced through the collective efforts of the community, demonstrating the project's capability to address challenges in actual production environments.

Today, we are pleased to announce that Koordinator v1.4.0 is officially released. This version introduces several new features, including Kubernetes and YARN workload co-location, a NUMA topology alignment strategy, CPU normalization, and cold memory reporting. It also enhances features in key areas such as elastic quota management, QoS management for non-containerized applications on hosts, and descheduling protection strategies. These innovations and improvements aim to better support enterprise-level Kubernetes cluster environments, particularly in complex and diverse application scenarios.

The release of version v1.4.0 will bring users support for more types of computing workloads and more flexible resource management mechanisms. We look forward to these improvements helping users to address a broader range of enterprise resource management challenges. In the v1.4.0 release, a total of 11 new developers have joined the development of the Koordinator community. They are @shaloulcy, @baowj-678, @zqzten, @tan90github, @pheianox, @zxh326, @qinfustu, @ikaven1024, @peiqiaoWang, @bogo-y, and @xujihui1985. We thank all community members for their active participation and contributions during this period and for their ongoing commitment to the community.

Interpretation of Version Features

1. Support Kubernetes and YARN workload co-location

Koordinator already supports the co-location of online and offline workloads within the Kubernetes ecosystem. However, outside the Kubernetes ecosystem, a considerable number of big data workloads still run on traditional Hadoop YARN.

In response, the Koordinator community, together with developers from Alibaba Cloud, Xiaohongshu, and Ant Financial, has jointly launched the Hadoop YARN and Kubernetes co-location project, Koordinator YARN Copilot. This initiative enables the running of Hadoop NodeManager within Kubernetes clusters, fully leveraging the technical value of peak-shaving and resource reuse for different types of workloads. Koordinator YARN Copilot has the following features:

  • Embracing the open-source ecosystem: Built upon the open-source version of Hadoop YARN without any intrusive modifications to YARN.
  • Unified resource priority and QoS policy: YARN NodeManager utilizes Koordinator’s Batch priority resources and adheres to Koordinator's QoS management policies.
  • Node-level resource sharing: The co-location resources provided by Koordinator can be used by both Kubernetes pod and YARN tasks. Different types of offline applications can run on the same node.


For the detailed design of Koordinator YARN Copilot and its use in the Xiaohongshu production environment, please refer to Previous Articles and Official Community Documents.

2. Introducing NUMA topology alignment strategy

The workloads running in Kubernetes clusters are increasingly diverse, particularly in fields such as machine learning, where the demand for high-performance computing resources is on the rise. In these fields, a significant amount of CPU resources is required, as well as other high-speed computing resources like GPUs and RDMA. Moreover, to achieve optimal performance, these resources often need to be located on the same NUMA node or even the same PCIe bus.

Kubernetes' kubelet includes a topology manager that manages the NUMA topology of resource allocation. It attempts to align the topologies of multiple resources at the node level during the admission phase. However, because the node component lacks a global view of the scheduler and the timing of node selection for pods, pods may be scheduled on nodes that are unable to meet the topology alignment policy. This can result in pods failing to start due to topology affinity errors.

To solve this problem, Koordinator moves NUMA topology selection and alignment to the central scheduler, optimizing resource NUMA topology at the cluster level. In this release, Koordinator introduces NUMA-aware scheduling of CPU resources (including Batch resources) and NUMA-aware scheduling of GPU devices as alpha features. The entire suite of NUMA-aware scheduling features is rapidly evolving.

Koordinator enables users to configure the NUMA topology alignment strategy for multiple resources on a node through the node's labels. The configurable strategies are as follows:

  • None, the default strategy, does not perform any topological alignment.
  • BestEffort indicates that the node does not strictly allocate resources according to NUMA topology alignment. The scheduler can always allocate such nodes to pods as long as the remaining resources meet the pods' needs.
  • Restricted means that nodes allocate resources in strict accordance with NUMA topology alignment. In other words, the scheduler must select the same one or more NUMA nodes when allocating multiple resources, otherwise, the node should not be considered. For instance, if a pod requests 33 CPU cores and each NUMA node has 32 cores, it can be allocated to use two NUMA nodes. Moreover, if the pod also requests GPUs or RDMA, these must be on the same NUMA node as the CPU.
  • SingleNUMANode is similar to Restricted, adhering strictly to NUMA topology alignment, but it differs in that Restricted permits the use of multiple NUMA nodes, whereas SingleNUMANode restricts allocation to a single NUMA node.

For example, to set the SingleNUMANode policy for node-0, you would do the following:

apiVersion: v1
kind: Node
metadata:
  labels:
    node.koordinator.sh/numa-topology-policy: "SingleNUMANode"
  name: node-0
spec:
  ...

In a production environment, users may have enabled kubelet's topology alignment policy, which will be reflected by the koordlet in the TopologyPolicies field of the NodeResourceTopology CR object. When kubelet's policy conflicts with the policy set by the user on the node, the kubelet policy shall take precedence. The koord-scheduler essentially adopts the same NUMA alignment policy semantics as the kubelet topology manager. The kubelet policies SingleNUMANodePodLevel and SingleNUMANodeContainerLevel are both mapped to SingleNUMANode.

After configuring the NUMA alignment strategy for the node, the scheduler can identify many suitable NUMA node allocation results for each pod. Koordinator currently provides the NodeNUMAResource plugin, which allows for configuring the NUMA node allocation result scoring strategy for CPU and memory resources. This includes LeastAllocated and MostAllocated strategies, with LeastAllocated being the default. Each resource can also be assigned a configurable weight. The scheduler will ultimately select the NUMA node allocation with the highest score. For instance, we can configure the NUMA node allocation result scoring strategy to MostAllocated, as shown in the following example:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeNUMAResource
        args:
          apiVersion: kubescheduler.config.k8s.io/v1beta2
          kind: NodeNUMAResourceArgs
          scoringStrategy: # Here configure the node-level scoring strategy
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1
          numaScoringStrategy: # Here configure the NUMA-node-level scoring strategy
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1

3. ElasticQuota evolves again

In order to fully utilize cluster resources and reduce management system costs, users often deploy workloads from multiple tenants in the same cluster. When cluster resources are limited, competition for these resources is inevitable between different tenants. As a result, the workloads of some tenants may always be satisfied, while others may never be executed, leading to demands for fairness. The quota mechanism is a very natural way to ensure fairness among tenants, where each tenant is allocated a specific quota, and they can use resources within that quota. Tasks exceeding the quota will not be scheduled or executed. However, simple quota management cannot fulfill tenants' expectations for elasticity in the cloud. Users hope that in addition to satisfying resource requests within the quota, requests for resources beyond the quota can also be met on demand.

In previous versions, Koordinator leveraged the upstream ElasticQuota protocol, which allowed tenants to set a 'Min' value to express their resource requests that must be satisfied, and a 'Max' value to limit the maximum resources they can use. 'Max' was also used to represent the shared weight of the remaining resources of the cluster when they were insufficient.

In addition to offering a flexible quota mechanism that accommodates tenants' on-demand resource requests, Koordinator enhances ElasticQuota with annotations to organize it into a tree structure, thereby simplifying the expression of hierarchical organizational structures for users.


The figure above depicts a common quota tree in a cluster utilizing Koordinator's elastic quota. The root quota serves as the link between the quota system and the actual resources within the cluster. In previous iterations, the root quota existed only within the scheduler's logic. In this release, we have made the root quota accessible to users in the form of a Custom Resource (CR). Users can now view information about the root quota through the ElasticQuota CR named koordinator-root-quota.

3.1 Introducing Multi QuotaTree

In large clusters, there are various types of nodes. For example, VMs provided by cloud vendors will have different architectures. The most common ones are amd64 and arm64. There are also different models with the same architecture. In addition, nodes generally have location attributes such as availability zone. When nodes of different types are managed in the same quota tree, their unique attributes will be lost. When users want to manage the unique attributes of machines in a refined manner, the current ElasticQuota appears not to be accurate enough. In order to meet users' requirements for flexible resource management or resource isolation, Koordinator supports users to divide the resources in the cluster into multiple parts, each part is managed by a quota tree, as shown in the following figure:


Additionally, to help users simplify management complexity, Koordinator introduced the ElasticQuotaProfile mechanism in version 1.4.0. Users can quickly associate nodes with different quota trees through the nodeSelector, as shown in the following example:

apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: amd64
  name: amd64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: amd64  # amd64 nodes
  quotaName: amd64-root-quota    # the name of the root quota
---
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: arm64
  name: arm64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: arm64  # arm64 nodes
  quotaName: arm64-root-quota    # the name of the root quota

After associating nodes with the quota tree, the user utilizes the ElasticQuota in each quota tree as before. When a user submits a pod to the corresponding quota, they currently still need to configure the pod's NodeAffinity to ensure that the pod runs on the correct node. In the future, we plan to add a feature that will help users automatically manage the mapping relationship from quota to node.
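
For illustration, here is a hedged sketch of a Pod submitted to a quota inside the amd64 quota tree; the quota name "quota-example" is a hypothetical child quota under amd64-root-quota, and the Pod name and image are made up. The Pod carries the quota label and a node affinity matching the profile's nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: pod-in-amd64-tree    # hypothetical Pod name
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"   # a quota under amd64-root-quota (assumption)
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                  - amd64
  containers:
    - name: app
      image: nginx:1.25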

3.2 Support non-preemptible

Koordinator ElasticQuota supports sharing the unused part of 'Min' in ElasticQuota with other ElasticQuotas to improve resource utilization. However, when resources are tight, the pod that borrows the quota will be preempted and evicted through the preemption mechanism to get the resources back.

In actual production environments, if some critical online services borrow this part of the quota from other ElasticQuotas and preemption subsequently occurs, the quality of service may be adversely affected. Such workloads should not be subject to preemption.

To implement this safeguard, Koordinator v1.4.0 introduced a new API. Users can simply label a pod with quota.scheduling.koordinator.sh/preemptible: "false" to indicate that the pod should not be preempted.

When the scheduler detects that a pod is declared non-preemptible, it ensures that the available quota for such a pod does not exceed its 'Min'. Thus, it is important to note that when enabling this feature, the 'Min' of an ElasticQuota should be set judiciously, and the cluster must have appropriate resource guarantees in place. This feature maintains compatibility with the original behavior of Koordinator.

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
    quota.scheduling.koordinator.sh/preemptible: "false"
spec:
  ...

3.3 Other improvements

  1. The koord-scheduler previously supported the use of a single ElasticQuota object across multiple namespaces. However, in some cases, it is desirable for the same object to be shared by only a select few namespaces. To accommodate this need, users can now annotate the ElasticQuota CR with quota.scheduling.koordinator.sh/namespaces, assigning a JSON string array as the value (see the sketch after this list).
  2. Performance optimization: Previously, whenever an ElasticQuota was modified, the ElasticQuota plugin would rebuild the entire quota tree. This process has been optimized in version 1.4.0.
  3. Support ignoring overhead: When a pod utilizes secure containers, an overhead declaration is typically added to the pod specification to account for the resource consumption of the secure container itself. However, whether these additional resource costs should be passed on to the end user depends on the resource pricing strategy. If it is expected that users should not be responsible for these costs, the ElasticQuota can be configured to disregard overhead. With version 1.4.0, this can be achieved by enabling the feature gate ElasticQuotaIgnorePodOverhead.
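
For the namespace sharing described in item 1 above, a hedged sketch of an annotated ElasticQuota; the quota name, namespaces, and resource values are illustrative, and the upstream ElasticQuota API group is assumed:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example            # hypothetical quota name
  namespace: default
  annotations:
    quota.scheduling.koordinator.sh/namespaces: '["namespace-a", "namespace-b"]'
spec:
  max:
    cpu: "40"
    memory: 40Gi
  min:
    cpu: "10"
    memory: 20Gi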

4. CPU normalization

With the diversification of node hardware in Kubernetes clusters, significant performance differences exist between CPUs of various architectures and generations. Therefore, even if a pod's CPU request is identical, the actual computing power it receives can vary greatly, potentially leading to resource waste or diminished application performance. The objective of CPU normalization is to ensure that each CPU unit in Kubernetes provides consistent computing power across heterogeneous nodes by standardizing the performance of allocatable CPUs.

To address this issue, Koordinator has implemented a CPU normalization mechanism in version 1.4.0. This mechanism adjusts the amount of CPU resources that can be allocated on a node according to the node's resource amplification strategy, ensuring that each allocatable CPU in the cluster delivers a consistent level of computing power. The overall architecture is depicted in the figure below:


CPU normalization consists of two steps:

  1. CPU performance evaluation: To calculate the performance benchmarks of different CPUs, you can refer to the industrial performance evaluation standard, SPEC CPU. This part is not provided by the Koordinator project.
  2. Configuration of the CPU normalization ratio in Koordinator: The scheduling system schedules resources based on the normalization ratio, which is provided by Koordinator.

Configure the CPU normalization ratio information into slo-controller-config of koord-manager. The configuration example is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-normalization-config: |
    {
      "enable": true,
      "ratioModel": {
        "Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz": {
          "baseRatio": 1.29,
          "hyperThreadEnabledRatio": 0.82,
          "turboEnabledRatio": 1.52,
          "hyperThreadTurboEnabledRatio": 1.0
        },
        "Intel Xeon Platinum 8369B CPU @ 2.90GHz": {
          "baseRatio": 1.69,
          "hyperThreadEnabledRatio": 1.06,
          "turboEnabledRatio": 1.91,
          "hyperThreadTurboEnabledRatio": 1.20
        }
      }
    }
  # ...

For nodes configured with CPU normalization, Koordinator intercepts updates to Node.Status.Allocatable by kubelet through a webhook to achieve the amplification of CPU resources. This results in the display of the normalized amount of CPU resources available for allocation on the node.
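
For intuition, a worked example with a hypothetical ratio: if a node's CPU model resolves to a normalization ratio of 1.20 and the kubelet originally reports 32000m allocatable CPU, the webhook amplifies the node's allocatable CPU to roughly 32000m x 1.20 = 38400m, and the scheduler then allocates against this normalized amount.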

5. Improved descheduling protection strategies

Pod migration is a complex process that involves steps such as auditing, resource allocation, and application startup. It is often intertwined with application upgrades, scaling scenarios, and the resource operations and maintenance performed by cluster administrators. Consequently, if a large number of pods are migrated simultaneously, the system's stability may be compromised. Furthermore, migrating many pods from the same workload at once can also affect the application's stability. Additionally, simultaneous migrations of pods from multiple jobs may lead to a 'thundering herd' effect. Therefore, it is preferable to process the pods in each job sequentially.

To address these issues, Koordinator previously provided the PodMigrationJob function with some protection strategies. In version v1.4.0, Koordinator has enhanced these protection strategies into an arbitration mechanism. When there is a large number of executable PodMigrationJobs, the arbiter decides which ones can proceed by employing sorting and filtering techniques.

The sorting process is as follows:

  • The time interval between the migration start time and now: the smaller the interval, the higher the ranking.
  • The Pod priority of the PodMigrationJob: the lower the priority, the higher the ranking.
  • Disperse jobs by workload so that PodMigrationJobs belonging to the same job are ranked close together.
  • If some Pods in the job that contains the PodMigrationJob's Pod are already being migrated, the PodMigrationJob ranks higher.

The filtering process is as follows:

  • Group and filter PodMigrationJobs based on workload, node, namespace, etc.
  • Check the number of running podMigrationJobs in each workload, and those that reach a certain threshold will be excluded.
  • Check whether the number of unavailable replicas in each workload exceeds the maximum number of unavailable replicas, and those that exceed the number will be excluded.
  • Check whether the number of pods being migrated on the node where the target pod is located exceeds the maximum migration amount of a single node, and those that exceed will be excluded.

6. Cold Memory reporting

To improve system performance, the kernel generally tries not to free the page cache requested by an application but allocates as much as possible to the application. Although allocated by the kernel, this memory may no longer be accessed by applications and is referred to as cold memory.

Koordinator introduced the cold memory reporting function in version 1.4, primarily to lay the groundwork for future cold memory recycling capabilities. Cold memory recycling is designed to address two scenarios:

  1. In standard Kubernetes clusters, when the node memory level is too high, sudden memory requests can lead to direct memory recycling of the system. This can affect the performance of running containers and, in extreme cases, may result in out-of-memory (OOM) events if recycling is not timely. Therefore, maintaining a relatively free pool of node memory resources is crucial for runtime stability.
  2. In co-location scenarios, high-priority applications' unused requested resources can be recycled by lower-priority applications. Since memory not reclaimed by the operating system is invisible to the Koordinator scheduling system, reclaiming unused memory pages of a container is beneficial for improving resource utilization.

Koordlet has added a cold page collector to its collector plugins for reading the cgroup file memory.idle_stat, which is exported by kidled (Anolis kernel), kstaled (Google), or DAMON (Amazon). This file contains information about cold pages in the page cache and is present at each level of the memory cgroup hierarchy. Koordlet already supports the kidled cold page collector and provides interfaces for other cold page collectors.

After collecting cold page information, the cold page collector stores the metrics, such as hot page usage and cold page size for nodes, pods, and containers, into metriccache. This data is then reported to the NodeMetric Custom Resource (CR).

Users can enable cold memory recycling and configure the cold memory collection strategy through NodeMetric. Currently, three strategies are offered: usageWithHotPageCache, usageWithoutPageCache, and usageWithPageCache. For more details, please see the community design document.
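
As a heavily hedged illustration only: the exact field names below, such as spec.collectPolicy.nodeMemoryCollectPolicy, are assumptions and may differ in your release, so please verify against the design document before use. The collection strategy might look like this on a NodeMetric object:

apiVersion: slo.koordinator.sh/v1alpha1
kind: NodeMetric
metadata:
  name: test-node                # NodeMetric objects are named after the node
spec:
  collectPolicy:
    nodeMemoryCollectPolicy: "usageWithoutPageCache"   # assumed field name; one of the three strategies above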

7. QoS management for non-containerized applications

In the process of enterprise containerization, there may be non-containerized applications running on the host alongside those already running on Kubernetes. In order to be better compatible with enterprises in the containerization process, Koordinator has developed a node resource reservation mechanism. This mechanism can reserve resources and assign specific QoS (Quality of Service) levels to applications that have not yet been containerized. Unlike the resource reservation configuration provided by kubelet, Koordinator's primary goal is to address QoS issues that arise during the runtime of both non-containerized and containerized applications. The overall solution is depicted in the figure below:


Currently, applications need to start processes into the corresponding cgroup according to specifications, and Koordinator does not provide an automatic cgroup relocation tool. For host non-containerized applications, QoS is supported as follows:

  • LS (Latency Sensitive)

    • CPU QoS (Group Identity): The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet sets the Group Identity parameter for it according to the CPU QoS configuration;
    • CPUSet Allocation: The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet will set all CPU cores in the CPU share pool for it.
  • BE (Best-effort)

    • CPU QoS (Group Identity): The application runs the process in the CPU subsystem of the cgroup according to the specification, and the koordlet sets the Group Identity parameter for it according to the configuration of CPU QoS.

For detailed design of QoS management of non-containerized applications on the host, please refer to Community Documentation. In the future, we will gradually add support for other QoS strategies for host non-containerized applications.

8. Other features

In addition to the new features and functional enhancements mentioned above, Koordinator has also implemented the following bug fixes and optimizations in version 1.4.0:

  1. RequiredCPUBindPolicy: Fine-grained CPU orchestration now supports the configuration of the required CPU binding policy, which means that CPUs are allocated strictly in accordance with the specified CPU binding policy; otherwise, scheduling will fail.
  2. CI/CD: The Koordinator community provides a set of e2e test pipelines in v1.4.0, and an ARM64 image is now available.
  3. Batch resource calculation strategy optimization: There is support for the maxUsageRequest calculation strategy, which conservatively reclaims high-priority resources. This update also optimizes the underestimate of Batch allocatable when a large number of pods start and stop on a node in a short period of time and improves considerations for special circumstances such as host non-containerized application, third-party allocatable, and dangling pod usage.
  4. Others: Optimizations include using libpfm4 and perf groups to improve CPI collection, allowing SystemResourceCollector to support customized expiration time configuration, enabling BE pods to calculate CPU satisfaction based on the evictByAllocatable policy, repairing koordlet's CPUSetAllocator filtering logic for pods with LS and None QoS, and enhancing RDT resource control to retrieve the task IDs of sandbox containers.

For a comprehensive list of new features in version 1.4.0, please visit the v1.4.0 Release page.

Future plan

In upcoming versions, Koordinator has planned the following features:

  • Core Scheduling: On the runtime side, Koordinator has begun exploring the next generation of CPU QoS capabilities. By leveraging kernel mechanisms such as Linux Core Scheduling, it aims to enhance resource isolation at the physical core level and reduce the security risks associated with co-location. For more details on this work, see Issue #1728.
  • Joint Allocation of Devices: In scenarios involving AI large model distributed training, GPUs from different machines often need to communicate through high-performance network cards. Performance is improved when GPUs and high-performance network cards are allocated in close proximity. Koordinator is advancing the joint allocation of multiple heterogeneous resources. Currently, it supports joint allocation in terms of protocol and scheduler logic; the reporting logic for network card resources on the node side is being explored.

For more information, please pay attention to Milestone v1.5.0.

Conclusion

Finally, we are immensely grateful to all the contributors and users of the Koordinator community. Your active participation and valuable advice have enabled Koordinator to continue improving. We eagerly look forward to your ongoing feedback and warmly welcome new contributors to join our ranks.

· 12 min read
Rougang Han

Background

Koordinator is an open source project that, building on Alibaba's years of experience in container scheduling, provides a complete co-location solution covering co-located workload orchestration, resource scheduling, resource isolation, and performance tuning. It helps users optimize container performance, fully exploit idle physical resources, improve resource efficiency, and enhance the running efficiency and reliability of latency-sensitive workloads and batch jobs.

We are pleased to announce the release of Koordinator v1.3.0. Since v0.1.0 was released in April 2022, Koordinator has iterated through 11 releases and attracted many excellent engineers from companies including Alibaba, Intel, Xiaomi, Xiaohongshu, iQIYI, Qihoo 360, and Youzan. In v1.3.0, Koordinator brings new features such as NRI (Node Resource Interface) support and Mid-tier resource overcommitment, along with stability fixes, performance optimizations, and enhancements to resource reservation, load-aware scheduling, DeviceShare scheduling, load-aware descheduling, the scheduler framework, node-side metrics collection, and the resource overcommitment framework.

In v1.3.0, 12 new developers joined the Koordinator community: @bowen-intel, @BUPT-wxq, @Gala-R, @haoyann, @kangclzjc, @Solomonwisdom, @stulzq, @TheBeatles1994, @Tiana2018, @VinceCui, @wenchezhao, and @zhouzijiang. Thanks to everyone in the community for their active participation and contributions, and for their continued investment in the community.

Interpretation of Version Features

Resource reservation enhancements

The resource Reservation capability was introduced in v0.5.0 and has been polished and iterated on for a year. In v1.3.0 the reservation mechanism is enhanced for preemption, device reservation, and Coscheduling scenarios, and a new allocatePolicy field defines different allocation policies for the reserved resources. The latest resource reservation API is as follows:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # The template describes the resource requirements and affinity of the reservation, just like scheduling a pod.
  template:
    metadata:
      namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - cn-hangzhou-i
      schedulerName: koord-scheduler # Let koord-scheduler schedule the reservation object.
  # The owners that are allowed to allocate the reserved resources.
  owners:
    - labelSelector:
        matchLabels:
          app: app-demo
  ttl: 1h
  # Whether the reserved resources can only be allocated once.
  allocateOnce: true
  # The allocation policy of the reserved resources. Currently supported policies:
  # - Default: no restriction; a pod first allocates from the reserved resources on the node and falls back to the node's idle resources if the reservation is insufficient.
  # - Aligned: a pod first allocates from the reserved resources on the node and falls back to the node's idle resources if the reservation is insufficient, but the idle resources must satisfy the pod's remaining requirements. This policy can be used to avoid a pod allocating from multiple reservations at the same time.
  # - Restricted: for every resource dimension contained in the reservation, the pod must allocate from the reservation; other dimensions may use the node's idle resources. It includes the semantics of Aligned.
  # The Default policy cannot yet coexist with the Aligned or Restricted policy on the same node.
  allocatePolicy: "Aligned"
  # Controls whether the reserved resources are available for allocation.
  unschedulable: false

In addition, resource reservation in v1.3.0 also includes the following compatibility and performance optimizations:

  1. Enhanced Reservation preemption: preemption is allowed between Pods inside a Reservation, while Pods outside a Reservation are not allowed to preempt Pods inside it.
  2. Enhanced device reservation: if the device resources on a node are partially reserved and in use by Pods, the remaining device resources can still be allocated.
  3. Reservations can be used together with Coscheduling.
  4. Added the Reservation Affinity protocol, which lets users require that a Pod allocate resources only from a Reservation.
  5. Improved Reservation compatibility and fixed an issue where a Reservation caused the native scoring plugins to become ineffective.
  6. Optimized the scheduling performance regression introduced by Reservation support.
  7. Fixed an issue where ports reserved by a Reservation were deleted by mistake.

For the design of resource reservation, see Designs - Resource Reservation.

Other scheduling enhancements

In v1.3.0, Koordinator also includes the following scheduling and descheduling enhancements:

  1. DeviceShare scheduling

    • Changed how GPU resources are used: when using the GPU Share API, koordinator.sh/gpu-memory or koordinator.sh/gpu-memory-ratio must be declared, while koordinator.sh/gpu-core may be omitted.
    • Added scoring support, which can be used to implement bin-packing or spreading for both GPU Share and whole-card allocation, including bin-packing or spreading at card granularity.
    • Fixed an issue where deleting the Device CRD by mistake left the scheduler's internal state inconsistent and caused devices to be allocated repeatedly.
  2. Load-aware scheduling: fixed the scheduling logic for Pods that only specify Requests.

  3. Scheduler framework: optimized the Patch operations in the PreBind phase by merging the Patch operations of multiple plugins into a single commit, improving efficiency and reducing pressure on the APIServer.

  4. Descheduling

    • LowNodeLoad supports setting different load thresholds and parameters per node pool, and stays automatically compatible with the previous configuration.
    • Pods whose schedulerName is not koord-scheduler are skipped, and different schedulerNames can be configured.

NRI resource management mode

Koordinator's runtime hooks support two modes, standalone and CRI proxy, but both have limitations. Although many optimizations have been made to the standalone mode, obtaining Pod or container events in a timely fashion or injecting environment variables still requires the proxy mode. The proxy mode, however, requires deploying a separate koord-runtime-proxy component to proxy CRI (Container Runtime Interface) requests, changing the kubelet startup parameters, and restarting the kubelet.

NRI (Node Resource Interface) is a common framework for CRI-compatible container runtime plugin extensions. It is independent of the concrete container runtime (e.g. containerd, cri-o), provides hooks for different lifecycle events, and allows users to add custom logic without modifying the container runtime source code. Notably, NRI 2.0 only needs to run a single plugin instance to handle all NRI events and requests; the container runtime communicates with the plugin over a Unix domain socket using a Protobuf-based protocol. Compared with NRI 1.0 it offers higher performance and makes stateful NRI plugins possible.

With NRI, Koordinator can subscribe to Pod and container lifecycle events in time while avoiding intrusive modifications to the kubelet. In Koordinator v1.3.0 we adopt NRI, the approach recommended by the community, to manage runtime hooks. This solves the problems encountered in previous versions, greatly improves the flexibility of Koordinator deployments and the timeliness of processing, and provides an elegant, standardized resource management pattern for cloud-native systems.

[Figure: NRI mode resource management]

Note: the NRI mode does not support the Docker container runtime. Users of Docker should continue to use the standalone or proxy mode.

For how to deploy Koordinator with NRI enabled, see Installation - Enable NRI Mode Resource Management.

Node profiling and Mid-tier resource overcommitment

Koordinator divides node resources into four priority models: Prod, Mid, Batch, and Free. Lower-priority resources can reuse physical resources that have been allocated to higher priorities but are not being used, which improves physical resource utilization; at the same time, the higher the resource priority, the more stable the resources it provides. For example, Batch resources come from the short-term allocated-but-unused overcommitted resources of higher priorities, while Mid resources come from the long-term allocated-but-unused overcommitted resources of higher priorities. The resource priority models are shown in the figure below:

[Figure: resource priority model]

Koordinator v1.3.0 adds a node profiling capability, which predicts peak usage from the historical resource usage of Prod workloads to support overcommitted scheduling of Mid-tier resources. The Mid-tier overcommitment is computed as follows:

MidAllocatable := min(ProdReclaimable, NodeAllocatable * thresholdRatio)
ProdReclaimable := max(0, ProdAllocated - ProdPeak * (1 + safeMargin))

  • ProdPeak: the predicted peak usage, over a mid-to-long horizon (e.g. 12h), of the Prod Pods already scheduled on the node, estimated through node profiling.
  • ProdReclaimable: the Prod resources predicted to be reusable over the mid-to-long horizon, based on the profiling result.
  • MidAllocatable: the Mid-tier resources allocatable on the node.
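
For intuition, a worked example with made-up numbers: on a node with 100 allocatable CPU cores and thresholdRatio = 0.6, if Prod Pods have 60 cores allocated, the predicted ProdPeak is 40 cores, and safeMargin is 0.1, then:

ProdReclaimable = max(0, 60 - 40 * (1 + 0.1)) = 16 cores
MidAllocatable  = min(16, 100 * 0.6)          = 16 cores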

In addition, the node-side isolation guarantee for Mid resources will be completed in the next release; follow Issue #1442 for updates. In v1.3.0, users can view and request Mid-tier overcommitted resources, and can observe the trend of the node profiling through the following Prometheus metrics.

# CPU profiling of the node; the reclaimable metric is the predicted amount of reclaimable resources, and predictor corresponds to different prediction models.
koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="cpu", unit="core"}
# Memory profiling of the node; the reclaimable metric is the predicted amount of reclaimable resources, and predictor corresponds to different prediction models.
koordlet_node_predicted_resource_reclaimable{node="test-node", predictor="minPredictor", resource="memory", unit="byte"}
$ kubectl get node test-node -o yaml
apiVersion: v1
kind: Node
metadata:
  name: test-node
status:
  # ...
  allocatable:
    cpu: '32'
    memory: 129636240Ki
    pods: '110'
    kubernetes.io/mid-cpu: '16000'        # allocatable CPU milli-cores for Mid-tier pods
    kubernetes.io/mid-memory: 64818120Ki  # allocatable memory bytes for Mid-tier pods
  capacity:
    cpu: '32'
    memory: 129636240Ki
    pods: '110'
    kubernetes.io/mid-cpu: '16000'
    kubernetes.io/mid-memory: 64818120Ki

For the design of node profiling, see Design - Node Prediction.

Other features

See the v1.3.0 Release page for the full list of new features included in v1.3.0.

Future plan

In the next releases, Koordinator plans the following features:

  • Hardware topology-aware scheduling, which considers the topology of multiple resource dimensions such as node CPU, memory, and GPU, and optimizes scheduling cluster-wide.
  • An amplification mechanism for node allocatable resources.
  • Improvement and enhancement of the NRI resource management mode.

For more information, please follow Milestone v1.4.0.

Conclusion

Finally, Koordinator is an open community, and all cloud-native enthusiasts are welcome to join and build it in whatever way they prefer. Whether you are a beginner or an expert in the cloud-native field, we look forward to hearing your voice!

· 13 min read
Zuowei Zhang

Background

Koordinator is an open source project incubated from Alibaba's years of experience in container scheduling. It improves container performance and lowers cluster resource costs. Through co-location, resource profiling, scheduling optimization, and other capabilities, it improves the running efficiency and reliability of latency-sensitive workloads and batch jobs and optimizes cluster resource usage.

Since its release in April 2022, Koordinator has iterated through 10 versions and attracted many excellent engineers from companies including Alibaba, Xiaomi, Xiaohongshu, iQIYI, Qihoo 360, and Youzan. With the arrival of spring 2023, Koordinator celebrates its first anniversary, and we are pleased to announce that Koordinator v1.2 is officially released. In the new version, Koordinator adds node resource reservation, becomes compatible with the descheduling policies of the Kubernetes community, and adds node-side support for L3 cache and memory bandwidth isolation in AMD environments.

In the new version, 12 new developers joined the Koordinator community: @Re-Grh, @chengweiv5, @kingeasternsun, @shelwinnn, @yuexian1234, @Syulin7, @tzzcfrank, @Dengerwei, @complone, @AlbeeSo, @xigang, and @leason00. Thanks to all of them for their contributions and participation.

Interpretation of Version Features

Node resource reservation

Co-location scenarios contain a wide variety of application forms. Besides containers that have already become cloud native, many applications are not yet containerized and run as processes on the host alongside Kubernetes containers. To reduce node-side resource contention between Kubernetes applications and other types of applications, Koordinator supports reserving part of the resources so that they participate in neither the scheduler's resource scheduling nor the node-side resource allocation, achieving separated resource usage. In v1.2, Koordinator supports reserving CPU and memory resources and allows directly specifying the reserved CPU IDs, as described below.

Declaring node resource reservation

The amount of resources to reserve, or the specific CPU IDs, can be configured on the Node, for example:

apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # specific 5 cores will be calculated, e.g. 0, 1, 2, 3, 4, and then those cores will be reserved.
    node.koordinator.sh/reservation: '{"resources":{"cpu":"5"}}'
---
apiVersion: v1
kind: Node
metadata:
  name: fake-node
  annotations: # the cores 0, 1, 2, 3 will be reserved.
    node.koordinator.sh/reservation: '{"reservedCPUs":"0-3"}'

When reporting the node resource topology, the node-side koordlet component writes the specific reserved CPU IDs into the annotations of the NodeResourceTopology object.

Adaptation in scheduling and descheduling scenarios

While allocating resources, the scheduler performs resource validation for many cases, including quota management, node capacity validation, and CPU topology validation. All of these need to take the node's reserved resources into account; for example, when calculating the node's CPU capacity, the scheduler must deduct the reserved resources:

cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)

In addition, the calculation of Batch overcommitted resources for co-location also needs to deduct these reserved resources. Considering that the node also carries the resource consumption of some system processes, koord-manager takes the maximum of the node reservation and the system usage:

reserveRatio = (100-thresholdPercent) / 100.0
node.reserved = node.alloc * reserveRatio
system.used = max(node.used - pod.used, node.anno.reserved)
Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used
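
A worked example with made-up numbers: on a node with Node.Alloc = 100 cores, thresholdPercent = 70, a node usage of 60 cores, an LS Pod usage of 50 cores, and an annotated reservation of 15 cores:

reserveRatio   = (100 - 70) / 100.0 = 0.3
node.reserved  = 100 * 0.3          = 30 cores
system.used    = max(60 - 50, 15)   = 15 cores
Node(BE).Alloc = 100 - 30 - 15 - 50 = 5 cores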

For descheduling, each plugin's strategy needs to be aware of the reserved resources when computing node capacity, utilization, and so on. Moreover, if containers already occupy the node's reserved resources, descheduling should consider evicting them to make sure the node capacity is managed correctly and to avoid resource contention. This descheduling-related work will be supported in later versions, and contributions from the community are welcome.

Node-side resource management

For LS Pods, the node-side koordlet component dynamically computes the shared CPU pool based on the CPU allocation status and excludes the CPU cores reserved on the node, which isolates LS Pods from other non-containerized processes. Meanwhile, node-side QoS strategies, such as the CPUSuppress strategy, take the reserved resources into account when computing node utilization:

suppress(BE) := node.Total * SLOPercent - pod(LS).Used - max(system.Used, node.anno.reserved)

For a detailed description of node resource reservation, please refer to the design document.

Compatibility with community descheduling policies

Thanks to the increasingly mature framework of the Koordinator Descheduler, Koordinator v1.2 introduces an interface adaptation mechanism that seamlessly supports the existing plugins of the Kubernetes Descheduler; you only need to deploy the Koordinator Descheduler to use all the upstream functionality.

In the implementation, the Koordinator Descheduler imports the upstream code without any intrusive change, which guarantees full compatibility with all upstream plugins, parameter configurations, and their running policies. Meanwhile, Koordinator allows users to specify an enhanced evictor for upstream plugins, so that they can reuse the safety policies provided by Koordinator, such as resource reservation, workload availability guarantees, and global flow control.

The list of compatible plugins includes:

  • HighNodeUtilization
  • LowNodeUtilization
  • PodLifeTime
  • RemoveFailedPods
  • RemoveDuplicates
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingInterPodAntiAffinity
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingTopologySpreadConstraint
  • DefaultEvictor

An example configuration, using RemovePodsHavingTooManyRestarts as an example:

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
clientConnection:
  kubeconfig: "/Users/joseph/asi/koord-2/admin.kubeconfig"
leaderElection:
  leaderElect: false
  resourceName: test-descheduler
  resourceNamespace: kube-system
deschedulingInterval: 10s
dryRun: true
profiles:
  - name: koord-descheduler
    plugins:
      evict:
        enabled:
          - name: MigrationController
      deschedule:
        enabled:
          - name: RemovePodsHavingTooManyRestarts
    pluginConfig:
      - name: RemovePodsHavingTooManyRestarts
        args:
          apiVersion: descheduler/v1alpha2
          kind: RemovePodsHavingTooManyRestartsArgs
          podRestartThreshold: 10

Enhanced resource reservation scheduling

Koordinator introduced the Reservation mechanism in an early version. By reserving resources and then reusing them for Pods with specified characteristics, it helps solve the problem of deterministic resource delivery. For example, in descheduling scenarios, the Pod that is expected to be evicted must have resources available afterwards, instead of being evicted with no resources to run on, which would cause stability problems; or, before scaling out, some PaaS platforms want to first confirm whether the resources required for scheduling the application are available and only then decide whether to scale out, or do some preparation work in advance.

A Koordinator Reservation is defined via a CRD. Inside koord-scheduler, each Reservation object is faked as a Pod for scheduling; we call such a Pod a Reserve Pod. A Reserve Pod can reuse the existing filtering and scoring plugins to find a suitable node and finally occupies the corresponding resources in the scheduler's internal state. When a Reservation is created, it specifies which Pods the reserved resources are for: a specific Pod, certain workload objects, or Pods carrying certain labels. When such Pods are scheduled by koord-scheduler, the scheduler finds the Reservation objects they can use and allocates the Reservation's resources first. The Reservation Status records which Pod uses it, and the Pod annotations record which Reservation is used. After a Reservation is used, its internal state is cleaned up automatically, so that other Pods will not fail to schedule because of it.

In Koordinator v1.2 we made substantial optimizations. First, we lifted the restriction that only the resources held by the Reservation can be used, allowing allocation to cross the Reservation boundary: a Pod can use both the resources reserved by the Reservation and the remaining idle resources on the node. Second, we extended the Kubernetes Scheduler Framework in a non-intrusive way to support reserving fine-grained resources such as CPU cores and GPU devices. We also changed the default behavior from allowing a Reservation to be reused repeatedly to AllocateOnce: once a Reservation is used by a Pod, it is discarded. This change was made because AllocateOnce covers most scenarios better, so using it as the default makes the feature simpler to use.

L3 cache and memory bandwidth isolation on AMD

In v0.3.0, Koordinator supported L3 cache and memory bandwidth isolation for Intel environments; in v1.2.0 we add support for AMD environments. The Linux kernel's L3 cache and memory bandwidth isolation capability exposes a unified resctrl interface that supports both Intel and AMD. The main difference is that Intel's memory bandwidth isolation interface uses a percentage format, while AMD's uses an absolute-value format, as shown below.

# Intel Format
# resctrl schema
L3:0=3ff;1=3ff
MB:0=100;1=100

# AMD Format
# resctrl schema
L3:0=ffff;1=ffff;2=ffff;3=ffff;4=ffff;5=ffff;6=ffff;7=ffff;8=ffff;9=ffff;10=ffff;11=ffff;12=ffff;13=ffff;14=ffff;15=ffff
MB:0=2048;1=2048;2=2048;3=2048;4=2048;5=2048;6=2048;7=2048;8=2048;9=2048;10=2048;11=2048;12=2048;13=2048;14=2048;15=2048

The interface format has two parts. L3 indicates the cache "ways" available to the corresponding socket or CCD, expressed in hexadecimal, with each bit representing one way. MB indicates the memory bandwidth range the corresponding socket or CCD can use: Intel accepts a percentage from 0 to 100, while AMD uses an absolute value in Gb/s, where 2048 means unlimited. Koordinator uniformly exposes a percentage-format interface and automatically detects whether the node is an AMD environment to decide which format to write into the resctrl interface.

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
          "resctrlQOS": {
            "enable": true,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 100,
            "MBAPercent": 100
          }
        },
        "beClass": {
          "resctrlQOS": {
            "enable": true,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 30,
            "MBAPercent": 100
          }
        }
      }
    }

Other features

See the v1.2 release page for more new features included in this release.

Future plan

In the next versions, Koordinator plans the following features:

  • Hardware topology-aware scheduling, which considers the topology of node CPU, memory, GPU, and other resource dimensions to optimize scheduling cluster-wide.
  • Better observability and traceability for the descheduler.
  • Enhanced GPU resource scheduling.

Koordinator is an open community. We warmly welcome cloud-native enthusiasts to join and build it in whatever way they prefer. Whether you are a beginner or an expert in the cloud-native field, we look forward to hearing your voice!

· 10 min read
Erwei Deng

What is CPU co-location

CPU co-location means deploying different types of workloads on the same machine and letting them share the machine's CPU resources to raise CPU utilization, thereby lowering machine procurement and operating costs. However, some workloads are very latency-sensitive, such as e-commerce, search, or web services; they require high real-time performance but usually do not consume many resources. We call them online tasks. Another class of workloads focuses on computation or batch processing; they have no latency requirement but consume relatively more resources. We call them offline tasks.

When these two kinds of tasks are deployed on the same machine, the offline tasks occupy more resources, and the resulting resource contention significantly increases the latency of the online tasks. Moreover, on machines with hyper-threading, even if offline and online tasks run on different hyper-threaded CPUs, contention in the pipeline and cache still affects the online tasks. CPU co-location technology was therefore created to eliminate the impact of offline tasks on online-task latency while further improving CPU utilization.

[Figure 1: CPU utilization of co-location on a single machine]

Kernel CPU co-location technology

CPU co-location is mainly implemented by the node operating system scheduler, which decides the allocated CPU resources based on the task type. The node OS distributions mainly used in the Koordinator community are Alibaba Cloud Linux 2/3 (Alinux2/3) and CentOS 7.9. Alinux2/3 uses the Group Identity CPU co-location technology from the OpenAnolis community, which provides CPU co-location capability in the kernel. Group Identity adds another run queue to the original CFS scheduler to distinguish online and offline tasks, and to avoid interference from offline tasks on the sibling CPU (in a hyper-threading architecture), Group Identity evicts them. OpenAnolis' Group Identity technology has been validated in large events such as Alibaba's Double 11 and in large-scale commercial use, and its CPU co-location capability is recognized by a large number of users and developers.

The CentOS distribution, however, has so far not provided any CPU co-location technology. To make up for the missing CPU co-location capability in CentOS, there are several possible solutions:

  • Build a CentOS derivative that includes the CPU co-location technology;
  • Migrate to the Alibaba Cloud Linux 2/3 distribution.

For the first solution, you need to download the kernel source from the CentOS mirror, port the CPU co-location technology into the kernel, compile and install it, and then reboot the system to use it. This involves workload migration and downtime, which is an expensive price for the business to pay. For the second solution, although the migration takes some work, Alinux2/3 and Anolis OS contain a complete set of co-location resource isolation capabilities (CPU co-location is only one of them), and the benefits of this technology dividend far outweigh the migration cost. Moreover, CentOS is approaching end of life; to address this, the OpenAnolis community has launched the Anolis OS distribution, which is fully compatible with CentOS and allows seamless migration.

The OpenAnolis CPU co-location plugin

To address the missing CPU co-location capability of the CentOS node OS in Koordinator's cloud-native scenarios, developers from the OpenAnolis community offer another option: using the plugsched scheduler live-upgrade technology to deliver a scheduler plugin package that provides CPU co-location. The plugin contains Alibaba Cloud's early (2017) CPU co-location technology, bvt + noise clean, which uses a throttling mechanism: when the scheduler picks the next task, it checks the task type on the sibling CPU and the task type currently running on this CPU, and if online and offline tasks coexist, the offline task is throttled and the scheduler continues to pick the next task, ensuring that online tasks run first and are not disturbed by offline tasks on the sibling CPU. The CPU co-location scheduler plugin can be installed directly on CentOS 7.9 without downtime or workload migration.

The Plugsched SDK

Plugsched scheduler live upgrade is an SDK and development tool for scheduler live upgrades launched by the OpenAnolis community. It decouples the scheduler from the Linux kernel into an independent module; the CPU co-location technology is then ported into that scheduler module to form a scheduler plugin, which can be installed directly into a running system. Plugsched can dynamically add, remove, and modify kernel scheduler features to meet business needs, without workload migration or downtime for upgrades, and it also supports rollback. Kernel developers can use the plugsched SDK to produce various kinds of scheduler plugins for different business scenarios.

The plugsched paper, "Efficient Scheduler Live Update for Linux Kernel with Modularization", has been accepted at ASPLOS; it describes the technical principles and application value of plugsched in detail, together with comprehensive tests and evaluations. Plugins produced with plugsched have been deployed at scale at Ant Group, Alibaba Cloud, and a large Internet company in China.

Plugsched open source repository: https://gitee.com/anolis/plugsched

Testing the CPU co-location plugin

The developers tested CPU co-location with this scheduler plugin. The server-side configuration:

  • Test machine: Alibaba Cloud Shenlong bare-metal server, 104 CPUs, 384 GB memory
  • System configuration: CentOS 7.9 distribution, kernel version 3.10, with the CPU co-location scheduler plugin installed
  • Test workloads: the online task is an Nginx service in a container configured with 80C 10GB and 80 Nginx workers; the offline task is ffmpeg video transcoding in a container configured with 50C 20GB and 50 threads.
  • Test cases:
    • Baseline: start only the Nginx container
    • Control group: start the Nginx and ffmpeg containers together without setting priorities (co-location disabled)
    • Experimental group: start the Nginx and ffmpeg containers together, setting Nginx to online high priority and ffmpeg to offline low priority (co-location enabled)

On another load-generation machine, the wrk tool was used to send requests to the Nginx service. The results are as follows (unit: ms):

|        | Baseline | Control group   | Experimental group |
|--------|----------|-----------------|--------------------|
| RT-P50 | 0.223    | 0.245 (+9.86%)  | 0.224 (+0.44%)     |
| RT-P75 | 0.322    | 0.387 (+20.18%) | 0.338 (+4.96%)     |
| RT-P90 | 0.444    | 0.575 (+29.50%) | 0.504 (+13.51%)    |
| RT-P99 | 0.706    | 1.7 (+140.79%)  | 0.88 (+24.64%)     |
| CPU%   | 25.15%   | 71.7%           | 49.15%             |

From the results above, without the CPU co-location plugin the offline task severely affects the online task, with P99 latency more than doubling; after installing the CPU co-location plugin, the impact on P99 tail latency drops significantly and CPU utilization approaches 50%.

Although the plugin significantly reduces the interference of offline tasks on online tasks, it is still inferior to the Group Identity technology of the OpenAnolis community. Group Identity keeps the interference on online tasks below 5%, and it also raises whole-machine utilization further, to above 60% (see the Koordinator co-location best practices manual). The reasons for the gap are: 1) differences in the kernel itself, since CentOS 7.9 uses the older 3.10 kernel while OpenAnolis uses 4.19/5.10, and the 3.10 scheduler performs worse than 4.19/5.10; 2) the design of Group Identity is better suited to CPU co-location scenarios than noise clean.

Conclusion

Finally, we welcome engineers, open source enthusiasts, and users to join the Koordinator and openanolis communities and enjoy the technology they bring. Whether it is Group Identity or the plugsched tool, they will bring unexpected benefits and value. Everyone is welcome to build the community together and to communicate, grow, and develop with it.

· 17 min read
Siyu Wang

Background

Koordinator aims to provide a complete solution for co-located workload orchestration, co-located resource scheduling, co-located resource isolation, and performance tuning, helping users improve the runtime performance of latency-sensitive services, mine idle node resources and allocate them to the computing tasks that really need them, and thereby improve global resource utilization.

Since its release in April 2022, Koordinator has published 9 versions. Over the past half year and more, the community has attracted many excellent engineers from Alibaba, Xiaomi, XiaoHongShu, iQiyi, 360, YouZan, and others, who have contributed numerous ideas, code, and scenarios and jointly pushed the project toward maturity.

Today we are pleased to announce that Koordinator v1.1 has been officially released. It includes load-aware scheduling/descheduling, cgroup v2 support, interference detection metrics collection, and a series of other optimizations. Below we dive into these new features.

In-depth Interpretation of Release Features

Load-aware Scheduling

Counting and balancing load watermarks by workload type

Koordinator v1.0 and earlier provided load-aware scheduling with a basic utilization-threshold filter that protects nodes already at a high load level from deteriorating further and harming the runtime quality of workloads, plus an estimation mechanism that handles overload on cold nodes. The existing load-aware scheduling already solves many common scenarios, but as an optimization technique there are still quite a few scenarios it needs to cover.

The current load-aware scheduling mainly balances load at the whole-node level across the cluster, but special situations can still arise: a node may be running many offline Pods which push up the overall utilization, while the overall utilization of the online workloads on it is low. When a new online Pod arrives and cluster resources are tight, the following problems can occur:

  1. The Pod may fail to be scheduled onto this node simply because the whole-node utilization exceeds the node safety threshold;
  2. A node may have relatively low overall utilization while everything running on it is online applications; from the online applications' perspective the utilization is already high, yet under the current scheduling policy more online Pods would still be scheduled there, piling up online applications on that node and leading to poor overall performance.

In Koordinator v1.1, koord-scheduler can recognize the workload type and apply different watermarks and policies when scheduling.

In the Filter phase, a new threshold configuration, prodUsageThresholds, represents the safety threshold for online applications and is empty by default. If the Pod being scheduled is of the Prod type, koord-scheduler sums the utilization of all online applications on the node from its NodeMetric and filters out the node if the sum exceeds prodUsageThresholds. For offline Pods, or when prodUsageThresholds is not configured, the original whole-node utilization logic is kept.

In the Score phase, a new switch, scoreAccordingProdUsage, controls whether to score and balance by the utilization of Prod-type Pods. It is disabled by default. When it is enabled and the current Pod is of the Prod type, koord-scheduler only processes Prod-type Pods in the estimation algorithm, sums the current utilization of the other online Pods recorded in NodeMetrics that are not handled by the estimation mechanism, and uses the summed value in the final score. If scoreAccordingProdUsage is not enabled, or the Pod is offline, the original whole-node utilization logic is kept.
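To make these two knobs concrete, here is a minimal sketch of how they might appear in the koord-scheduler configuration. Only the field names prodUsageThresholds and scoreAccordingProdUsage come from the description above; the plugin name LoadAwareScheduling appears in the score table later in these notes, while the KubeSchedulerConfiguration envelope, the apiVersion, and the threshold values are illustrative assumptions, so please check the reference configuration shipped with your release.

apiVersion: kubescheduler.config.k8s.io/v1beta2   # assumed envelope, for illustration only
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    pluginConfig:
      - name: LoadAwareScheduling
        args:
          # Filter: reject a node if the summed utilization of Prod Pods exceeds these thresholds
          prodUsageThresholds:
            cpu: 55
            memory: 65
          # Score: balance by Prod-type utilization instead of whole-node utilization
          scoreAccordingProdUsage: true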

Balancing by percentile utilization

Koordinator v1.0 and earlier filtered and scored based on the average utilization reported by koordlet. Averages hide a lot of information, so in Koordinator v1.1 koordlet adds utilization aggregates based on percentile statistics, and the scheduler has been adapted accordingly.

In the configuration of the scheduler's LoadAware plugin, aggregated means scoring and filtering with percentile-aggregated data. aggregated.usageThresholds is the watermark threshold used for filtering; aggregated.usageAggregationType is the percentile type used in the filtering phase, supporting avg, p99, p95, p90, and p50; aggregated.usageAggregatedDuration is the aggregation period to use in the filtering phase (if not configured, the scheduler uses the longest period reported in NodeMetrics); aggregated.scoreAggregationType is the percentile type to use in the scoring phase; aggregated.scoreAggregatedDuration is the aggregation period to use in the scoring phase (if not configured, the scheduler uses the longest period reported in NodeMetrics).

In the Filter phase, if aggregated.usageThresholds and the corresponding aggregation type are configured, the scheduler filters by that percentile statistic.

In the Score phase, if aggregated.scoreAggregationType is configured, the scheduler scores by that percentile statistic. Percentile-based filtering is not yet supported for Prod Pods.
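Similarly, a hedged sketch of the percentile-based configuration described above; the aggregated field names follow the text, while the surrounding envelope and the concrete values are illustrative assumptions:

          # inside the same LoadAwareScheduling args as in the previous sketch
          aggregated:
            # Filter phase: filter by the p95 utilization over the chosen aggregation period
            usageThresholds:
              cpu: 65
              memory: 75
            usageAggregationType: p95
            usageAggregatedDuration: 5m
            # Score phase: balance by the p90 utilization
            scoreAggregationType: p90
            scoreAggregatedDuration: 10m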

Load-aware Descheduling

Over the past several releases Koordinator has kept evolving the descheduler: it open-sourced a complete framework and strengthened its safety to avoid hurting the stability of online applications by over-evicting Pods. This also constrained the pace of descheduling features; Koordinator previously did not have much capacity to build descheduling capabilities, and that is now changing.

Koordinator v1.1 adds a load-aware descheduling feature. The new plugin is called LowNodeLoad. Working together with the scheduler's load-aware scheduling, it forms a closed loop: load-aware scheduling picks the best node at scheduling time, but as time passes and the cluster environment and the traffic/requests faced by the workloads change, load-aware descheduling can step in and help rebalance nodes whose load watermark exceeds the safety threshold. Unlike the K8s descheduler plugin LowNodeUtilization, LowNodeLoad makes descheduling decisions based on the actual node utilization, while LowNodeUtilization decides based on the resource allocation rate.

The LowNodeLoad plugin has two most important parameters, highThresholds and lowThresholds:

  • highThresholds is the warning threshold of the load watermark; Pods on nodes above this threshold take part in descheduling.
  • lowThresholds is the safe watermark of the load; Pods on nodes below this threshold are not descheduled.

Take the following figure as an example: with lowThresholds at 45% and highThresholds at 70%, nodes below 45% are safe because the load is already low; nodes between 45% and 70% are within the desired load range; nodes above 70% are unsafe, and the portion above 70% should be brought down. Suppose node A is at 85%: then 85% - 70% = 15% of the load should be reduced, and Pods are selected and migrated accordingly.

LowNodeLoad example

When migrating, note that the nodes below 45% are the ones that will host the migrated Pods, so we must make sure that the total load of the migrated Pods does not exceed the spare capacity of these low-load nodes. That spare capacity is highThresholds minus the node's current load: if node B is at 20%, then 70% - 20% = 50% is the capacity it can accept. For every Pod evicted during migration, this capacity is therefore reduced by the current, estimated, or profiled load of the Pod being descheduled (these values correspond to the ones used by load-aware scheduling). This ensures we never migrate too much.

If a cluster always has a small number of nodes whose load is simply higher, frequently descheduling those nodes would also bring risk, so users can set numberOfNodes as needed.

In addition, after LowNodeLoad identifies the nodes above the threshold, it selects the Pods to evict. When selecting Pods, you can configure the namespaces to include or exclude, configure a pod selector, or enable nodeFit to check whether each candidate Pod's Node Affinity/Node Selector/Toleration has a matching node; if not, that Pod is ignored. You can also disable this check by setting nodeFit to false, in which case the underlying MigrationController alone reserves resources through Koordinator Reservation.

Once the Pods are selected, they are sorted by multiple dimensions such as Koordinator QoSClass, Kubernetes QoSClass, Priority, usage, and creation time.
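A hedged sketch of what a LowNodeLoad configuration might look like. highThresholds, lowThresholds, numberOfNodes, and nodeFit are the parameters named above; the DeschedulerConfiguration envelope, the profile layout, and the concrete values are illustrative assumptions, so consult the configuration shipped with your release.

apiVersion: descheduler/v1alpha2    # assumed envelope, for illustration only
kind: DeschedulerConfiguration
profiles:
  - name: koord-descheduler
    plugins:
      balance:
        enabled:
          - name: LowNodeLoad
    pluginConfig:
      - name: LowNodeLoad
        args:
          # nodes above 70% take part in descheduling; nodes below 45% may receive migrated Pods
          highThresholds:
            cpu: 70
            memory: 70
          lowThresholds:
            cpu: 45
            memory: 45
          # only act when at least this many nodes exceed the threshold
          numberOfNodes: 3
          # check Node Affinity/Node Selector/Toleration of candidate Pods before evicting
          nodeFit: true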

cgroup v2 Support

Background

Many of Koordinator's node-side QoS capabilities and resource throttling/elastic policies are built on the Linux Control Group (cgroups) mechanism, such as CPU QoS (cpu), Memory QoS (memory), CPU Burst (cpu), and CPU Suppress (cpu, cpuset). The koordlet component uses cgroups (v1) to restrict the time share, weight, priority, topology, and other attributes of the resources available to containers. Newer Linux kernels keep enhancing and iterating on the cgroups mechanism and introduced cgroups v2, which unifies the cgroups directory structure, improves cooperation between the different subsystems/cgroup controllers of v1, and further enhances resource management and monitoring for some subsystems. Kubernetes has treated cgroups v2 as a GA (general availability) feature since 1.25; the kubelet uses it for container resource management, sets container resource isolation parameters under the unified cgroups hierarchy, and supports the enhanced MemoryQoS feature.

cgroup v1/v2 structure

In Koordinator v1.1, the node-side component koordlet adds support for cgroups v2, including the following work:

  • Refactored the Resource Executor module to unify file operations on identical or similar cgroup interfaces across v1 and v2, making it easier for koordlet features to be compatible with cgroups v2 and to merge read/write conflicts.
  • Adapted the currently available node-side features to cgroups v2, replacing cgroup operations with the new Resource Executor module and improving error logs under different system environments.

Most koordlet features in Koordinator v1.1 are already compatible with cgroups v2, including but not limited to:

  • Resource utilization collection
  • Dynamic resource overcommitment
  • Batch resource isolation (BatchResource, deprecating BECgroupReconcile)
  • CPU QoS (GroupIdentity)
  • Memory QoS (CgroupReconcile)
  • CPU dynamic suppression (BECPUSuppress)
  • Memory eviction (BEMemoryEvict)
  • CPU Burst (CPUBurst)
  • L3 cache and memory bandwidth isolation (RdtResctrl)

Remaining incompatible features such as PSICollector will be adapted in the upcoming v1.2 release; you can follow issue#407 for the latest progress. More cgroups v2 enhancements will also be introduced gradually in future Koordinator releases; stay tuned.

Using cgroups v2

In Koordinator v1.1, koordlet's adaptation to cgroups v2 is transparent to upper-level feature configuration. Except for the feature-gates of deprecated features, there is no need to change the ConfigMap slo-controller-config or any other feature-gate configuration. When koordlet runs on a node with cgroups v2 enabled, the corresponding node-side features automatically switch to the cgroups-v2 system interfaces.

In addition, cgroups v2 is a feature of newer Linux kernels (>= 5.8 recommended) and has dependencies on the system kernel version and the Kubernetes version. It is recommended to use a Linux distribution with cgroups v2 enabled by default and Kubernetes v1.24 or above.

For more instructions on enabling cgroups v2, please refer to the Kubernetes community documentation.

Interference Detection Metrics Collection

In a real production environment, the runtime state of a node is a "chaotic system", and application interference caused by resource contention cannot be completely avoided. Koordinator is building interference detection and optimization capabilities: it extracts metrics about the application runtime state, analyzes and detects them in real time, and takes more targeted measures on the target applications and the interference sources once interference is found.

Koordinator has implemented a series of Performance Collectors that collect, on each node, low-level metrics highly correlated with application runtime state and expose them through Prometheus, supporting interference detection and cluster-level application scheduling.

Metrics Collection

The Performance Collectors are controlled by several feature-gates. Koordinator currently provides the following collectors:

  • CPICollector: controls the CPI metrics collector. CPI: Cycles Per Instruction, the average number of clock cycles an instruction takes to execute. The CPI collector is implemented based on the two Kernel PMU (Performance Monitoring Unit) events Cycles and Instructions and the perf_event_open(2) system call.
  • PSICollector: controls the PSI metrics collector. PSI: Pressure Stall Information, which indicates the number of tasks in a container blocked waiting for CPU, memory, or IO resources during the collection interval. Before using the PSI collector, you need to enable the PSI feature in Anolis OS; see the documentation for how to enable it.

The Performance Collectors are disabled by default. You can enable them by modifying koordlet's feature-gates; this change does not affect other feature-gates:

kubectl edit ds koordlet -n koordinator-system
...
spec:
...
spec:
containers:
- args:
...
# modify here
# - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true
- -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,CPICollector=true,PSICollector=true

ServiceMonitor

Koordinator v1.1.0 adds the ServiceMonitor capability to koordlet, exposing the collected metrics through Prometheus. Users can rely on it to collect the metrics for application analysis and management.

ServiceMonitor is introduced by Prometheus, so it is not installed by default in the helm chart. You can install the ServiceMonitor with the following command:

helm install koordinator https://... --set koordlet.enableServiceMonitor=true

After deployment, you can find this target in the Prometheus UI.

# HELP koordlet_container_cpi Container cpi collected by koordlet
# TYPE koordlet_container_cpi gauge
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="cycles",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 2.228107503e+09
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="instructions",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 4.1456092e+09

As can be expected, Koordinator's interference detection will need more metrics for more complex real-world scenarios. We will keep investing in collecting metrics for other resources such as memory and disk IO.

Other Updates

More new features included in this release can be found on the v1.1 release page.

· 7 min read
Joseph

Since Koordinator was open-sourced in March this year, it has released 7 versions, gradually bringing the core capabilities of the internal co-location system of Alibaba and Alibaba Cloud to the open source community. It has attracted attention from communities and practitioners in the Kubernetes, big data, high-performance computing, and machine learning fields; the Koordinator community has gradually received support from contributors, and some enterprises have started using Koordinator in production to solve real cost and co-location problems. Thanks to the community's efforts, we are very excited to announce the official release of Koordinator 1.0.

In its early days the Koordinator project focused on building the core co-location capability, differentiated SLO. To make its co-location capabilities easier to adopt, Koordinator provides the ClusterColocationProfile mechanism, which lets users co-locate different workloads without modifying existing code, so they can get familiar with co-location step by step. Koordinator then strengthened the node-side QoS guarantee mechanisms, providing capabilities including but not limited to CPU Suppress, CPU Burst, Memory QoS, L3 Cache/MBA resource isolation, and satisfaction-based eviction, which address most node-side workload stability problems. Used together with the Koordinator Runtime Proxy component, it integrates well with the native management mechanism of the Kubernetes kubelet.

Koordinator also brings innovative solutions in task scheduling, QoS-aware scheduling, and descheduling: fine-grained CPU scheduling fully compatible with the Kubernetes CPU management mechanism, and scheduling that balances against the actual node load. To help users manage resources better, Koordinator also provides a resource reservation capability (Reservation), and it further enhances the existing Coscheduling and ElasticQuota Scheduling capabilities from the Kubernetes community, injecting new vitality into the task scheduling field. Koordinator provides a brand-new descheduler framework, with a focus on the extensibility and safety of the Descheduler.

Install or Upgrade to Koordinator v1.0.0

Install with Helm

Koordinator can be easily installed with helm v3.5+, a simple command-line tool that you can get from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 1.0.0

Interpretation of Release Features

Koordinator v1.0 does not add many features overall; the main changes are as follows.

Independent API Repo

To make it easier to integrate and use the APIs defined by Koordinator, and to avoid extra dependencies or dependency conflicts caused by depending on Koordinator itself, we created an independent API repo: koordinator-sh/apis

New ElasticQuota Webhook

In Koordinator v0.7 we made many enhancements on top of the ElasticQuota provided by Kubernetes sig-scheduler, such as a tree-based management mechanism and a fairness guarantee mechanism, which help solve the problems encountered when using ElasticQuota. In Koordinator v1.0 we further provide an ElasticQuota Webhook that, when you use the tree-based management mechanism, ensures new ElasticQuota objects follow the specifications and constraints defined by Koordinator (a small example follows the list below):

  1. Except for the root node, the sum of min of all child nodes must be less than the min of their parent.
  2. The max of a child node is not limited and may be larger than the max of its parent. Consider the following scenario: the cluster has two ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the whole dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota under it.
  3. A Pod cannot use a parent-node ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism extremely complicated, so this scenario is not supported for now.
  4. Only parent nodes can have children; a child node cannot have its own children.
  5. Changing the quota.scheduling.koordinator.sh/is-parent property of an ElasticQuota is not allowed for now.
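As a small illustration of constraint 1, here is a sketch (the names and quantities are made up) of a child ElasticQuota the webhook would reject if its parent, say quota-parent, declared a min of cpu: 10, since the child's min alone already exceeds the parent's min:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-child
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "quota-parent"
spec:
  max:
    cpu: 20
  min:
    cpu: 16    # rejected: the children's min (16) would exceed the parent's min (10)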

Further Improvements to ElasticQuota Scheduling

In Koordinator v0.7, both the leader and the follower Pods of koord-scheduler started the ElasticQuota Controller, and both updated ElasticQuota objects. In Koordinator v1.0 we fixed this issue so that only the leader Pod starts the Controller and updates ElasticQuota objects. We also optimized the ElasticQuota Controller's potentially frequent updates to ElasticQuota objects: an update is issued only when a change is detected in some dimension of the ElasticQuota data, reducing the pressure frequent updates place on the APIServer.

Further Improvements to Device Share Scheduling

In Koordinator v1.0, koordlet reports the GPU model and driver version to the Device CRD object, and koord-manager syncs them to the Node object, appending the corresponding labels.

apiVersion: v1
kind: Node
metadata:
labels:
kubernetes.io/gpu-driver: 460.91.03
kubernetes.io/gpu-model: Tesla-T4
...
name: cn-hangzhou.10.0.4.164
spec:
...
status:
...

Koordinator Runtime Proxy Compatibility Improvements

In previous Koordinator versions, when koord-runtime-proxy and koordlet were installed together, Pods newly scheduled to a node could fail to create containers if koordlet was abnormal or was being uninstalled/reinstalled. To solve this problem, koord-runtime-proxy now checks whether a Pod carries the special label runtimeproxy.koordinator.sh/skip-hookserver=true; if so, koord-runtime-proxy forwards the CRI requests directly to the runtime such as containerd/docker.
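For example, a Pod that should bypass the hook server might carry the label like this (a sketch; the Pod name and the rest of the spec are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: bypass-demo
  labels:
    # tell koord-runtime-proxy to forward CRI requests directly to the container runtime
    runtimeproxy.koordinator.sh/skip-hookserver: "true"
spec:
  containers:
    - name: main
      image: nginx   # illustrative image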

Other Changes

You can view more changes, together with their authors and commit records, on the Github release page.

· 34 min read
Joseph

After the previous v0.6 release[2], thanks to the efforts of the Koordinator community, we have reached the significant v0.7 release of Koordinator[1]. This version focuses on the task scheduling capabilities needed by machine learning and big data scenarios, such as CoScheduling, ElasticQuota, and fine-grained GPU sharing scheduling. It also improves scheduling diagnosis and analysis, and the descheduler greatly improves safety and reduces the risk of descheduling.

Interpretation of Release Features

1. Task Scheduling

1.1 Enhanced Coscheduling

Gang scheduling is a strategy in concurrent systems that schedules multiple correlated processes onto different processors to run at the same time. Its primary principle is to ensure that all correlated processes can start together, preventing anomalies in some processes from blocking the whole group. For example, submitting a Job produces multiple tasks, and these tasks are expected to either all schedule successfully or all fail. This requirement is called All-or-Nothing, and the corresponding implementation is called Gang Scheduling (or Coscheduling).
From the beginning, Koordinator has aimed to support the co-located scheduling of various Kubernetes workloads and to improve their runtime efficiency and reliability, which includes the All-or-Nothing workloads widespread in machine learning and big data. To address the All-or-Nothing scheduling requirement, Koordinator v0.7.0 implements Enhanced Coscheduling based on the existing community Coscheduling.
Enhanced Coscheduling follows Koordinator's principle of community compatibility: it is fully compatible with community Coscheduling and the PodGroup CRD it relies on. Users already using PodGroup can upgrade to Koordinator seamlessly.
In addition, Enhanced Coscheduling implements the following enhancements:

Support for both Strict and NonStrict modes

The difference between the two modes is that in Strict mode (the default), a scheduling failure rejects all Pods that have been allocated resources and are in the Wait state, whereas NonStrict mode does not issue a Reject. In NonStrict mode, when Pod A and Pod B belonging to the same PodGroup are being scheduled, a scheduling failure of Pod A does not affect Pod B, which continues to be scheduled. NonStrict mode is friendly to very large Jobs and lets them finish scheduling faster, but it also increases the risk of resource deadlock. Koordinator will provide a solution for deadlocks under NonStrict mode in a later release.
Users can enable NonStrict mode by adding the annotation gang.scheduling.koordinator.sh/mode=NonStrict to the PodGroup or the Pod, as sketched below.
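A minimal sketch, mirroring the PodGroup example shown later in this post; only the mode annotation is the point here, the rest of the object is illustrative:

apiVersion: v1alpha1
kind: PodGroup
metadata:
  name: gang-demo
  namespace: default
  annotations:
    # switch this PodGroup from the default Strict mode to NonStrict
    gang.scheduling.koordinator.sh/mode: NonStrict
spec:
  ...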

Improved handling of PodGroup scheduling failures for more efficient retries

For example, PodGroup A is associated with 5 Pods: the first 3 pass Filter/Score and enter the Wait phase, the 4th Pod fails to schedule, and when the 5th Pod is scheduled it is rejected because the 4th has already failed. In the community Coscheduling implementation, a PodGroup that failed scheduling is added to a cache-based lastDeniedPG object; scheduling is rejected while the cache has not expired and allowed again once it expires. The cache expiration time is therefore critical: if it is too long, Pods wait a long time for another chance to schedule; if it is too short, there is frequent, useless scheduling.
Enhanced Coscheduling instead implements a retry mechanism based on ScheduleCycle. In the scenario above, the ScheduleCycle of the 5 Pods starts at 0 and the ScheduleCycle of the PodGroup starts at 1. Every time a Pod is attempted, its ScheduleCycle is updated to the PodGroup's ScheduleCycle. If one of the Pods fails to schedule, the current PodGroup ScheduleCycle is marked invalid, and all Pods whose ScheduleCycle is smaller than the PodGroup ScheduleCycle are then rejected. Once all Pods in the PodGroup have attempted one round of scheduling, their ScheduleCycles have all been updated to the current PodGroup ScheduleCycle; the PodGroup ScheduleCycle is then advanced and scheduling is marked as allowed again. This approach effectively avoids the drawbacks of expiration-based retries and makes retry scheduling depend entirely on the scheduling queue configuration.
image.png

Support for Gang Scheduling across a group of PodGroups

Some complex Jobs have multiple roles. Each role manages a batch of tasks, each role's tasks require All-or-Nothing, the MinMember requirement differs per role, and All-or-Nothing is also required across roles. As a result, each role has its own PodGroup, and even when one PodGroup is satisfied it still has to wait for the PodGroups of the other roles to be satisfied. Community Coscheduling cannot meet this requirement. Koordinator's Enhanced Coscheduling supports associating multiple PodGroups by adding an annotation to them, and supports crossing Namespaces. For example, a user with two PodGroups named PodGroupA and PodGroupB can associate them as follows:

apiVersion: v1alpha1
kind: PodGroup
metadata:
name: podGroupA
namespace: default
annotations:
gang.scheduling.koordinator.sh/groups: ["namespaceA/podGroupA", "namespaceB/podGroupB"]
spec:
...

Support for a lightweight Gang protocol

If users do not want to create a PodGroup and find creating one too cumbersome, they can instead add the same annotation gang.scheduling.koordinator.sh/name=<podGroupName> to a group of Pods to indicate that the group should be scheduled with Coscheduling. To set minMember, add the annotation gang.scheduling.koordinator.sh/min-available=<availableNum>. For example:

apiVersion: v1
kind: Pod
metadata:
annotations:
gang.scheduling.koordinator.sh/name: "pod-group-a"
gang.scheduling.koordinator.sh/min-available: "5"
name: demo-pod
namespace: default
spec:
...

1.2 ElasticQuota Scheduling

A medium or large company has multiple products and R&D teams sharing several fairly large Kubernetes clusters, whose large amounts of CPU/Memory/Disk resources are managed centrally by a resource operations team. Before purchasing resources, the operations team usually runs a quota budgeting process in which each team submits a quota budget according to its needs. Business teams typically budget based on their current state and their expectations for the future. Ideally every unit of quota would be used, but reality tells us this is not realistic. Problems that often appear are:

  1. Team A overestimated its business growth and requested more quota than it can use.
  2. Team B underestimated its business growth and its quota is not enough.
  3. Team C has planned a campaign and does not have enough quota, but the campaign only lasts a few weeks, so requesting a lot of extra quota and resources would also be wasted.
  4. Team D has sub-teams and businesses of its own, each of which can hit the same situations as teams A, B, and C. Some of these sub-teams may suddenly need to submit computing jobs for a customer, but they have no quota left and the budget approval process cannot cover it either.
  5. ......

The scenarios above are ones we run into every day. In co-location and big data scenarios, temporary bursts of demand appear all the time, and these resource demands pose a great challenge to quota management. Good quota management must, on the one hand, avoid over-purchasing resources to reduce cost, and avoid purchasing (or purchase as few) resources when quota is needed only temporarily; on the other hand, it must not limit resource utilization because of quota issues, otherwise even good technology for reusing resources cannot deliver its value. In short, quota management is a long-term problem that most companies and organizations have to face.
Kubernetes ResourceQuota can solve part of the quota management problem. The native Kubernetes ResourceQuota API specifies the maximum resource quota per Namespace and enforces it through the admission mechanism; if the total resources allocated in a Namespace exceed the ResourceQuota, creating a Pod is rejected. The Kubernetes ResourceQuota design has a limitation: quota usage is aggregated by Pod Requests. Although this guarantees that actual resource consumption never exceeds the ResourceQuota, it can lead to low utilization, because some Pods may have requested resources but failed to schedule.
Kubernetes Scheduler-Sig later proposed a design borrowed from Yarn Capacity Scheduling, called ElasticQuota, and provided an implementation. It lets users set max and min:

  • max is the upper bound of resources a user may consume.
  • min is the minimum amount of resources needed to guarantee the user's basic functionality/performance.

These two parameters help users satisfy the following needs:

  1. When min < max, if there is a burst of resource demand, the user can keep creating new Pods for new tasks as long as the total usage of the ElasticQuota has not reached max, even if it already exceeds min.
  2. When a user needs more resources, the user can "borrow" from other ElasticQuotas the part of their guaranteed min that is not yet used.
  3. When an ElasticQuota needs its min resources back, it reclaims them from the borrowers through the preemption mechanism, i.e., by evicting some Pods of other ElasticQuotas that exceed their min usage.

ElasticQuota still has some limitations: fairness is not well guaranteed. If one ElasticQuota creates a large number of Pods, it may consume all the quota that can be borrowed, leaving later Pods unable to obtain any quota; at that point, quota can only be reclaimed through preemption.
In addition, both ElasticQuota and Kubernetes ResourceQuota are Namespace-oriented and do not support a multi-level tree structure, so enterprises/organizations with complex internal structures cannot manage quota well with ElasticQuota/Kubernetes ResourceQuota.
For these quota management problems, Koordinator provides an elastic quota management mechanism that supports multi-level management (multi hierarchy quota management), built on the community ElasticQuota. It has the following characteristics:

  • Compatible with the community ElasticQuota API; users can upgrade to Koordinator seamlessly.
  • Supports managing quota in a tree structure.
  • Supports fairness by shared weight.
  • Lets users decide whether a quota may be lent to other consumers.

How a Pod is associated with an ElasticQuota

Users can adopt this capability very easily. They can follow exactly the original ElasticQuota usage, i.e., one ElasticQuota object per Namespace, or they can add a Label to the Pod to associate it with an ElasticQuota:

apiVersion: v1
kind: Pod
metadata:
labels:
quota.scheduling.koordinator.sh/name: "elastic-quota-a"
name: demo-pod
namespace: default
spec:
...

Tree-structure management and usage

To manage quota with a tree structure, add the Label quota.scheduling.koordinator.sh/is-parent to the ElasticQuota to indicate whether it is a parent node, and quota.scheduling.koordinator.sh/parent to give the name of its parent ElasticQuota. For example:
image.png
We create an ElasticQuota Root as the root node, with a total of 100 CPU cores and 200Gi of memory, and a child node quota-a:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
name: parentA
namespace: default
labels:
quota.scheduling.koordinator.sh/is-parent: "true"
quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
max:
cpu: 100
memory: 200Gi
min:
cpu: 100
memory: 200Gi
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
name: childA1
namespace: default
labels:
quota.scheduling.koordinator.sh/is-parent: "false"
quota.scheduling.koordinator.sh/parent: "parentA"
quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
max:
cpu: 40
memory: 100Gi
min:
cpu: 20
memory: 40Gi

When managing ElasticQuota with a tree structure, some constraints must be followed:

  1. Except for the root node, the sum of min of all child nodes must be less than the min of their parent.
  2. The max of a child node is not limited and may be larger than the max of its parent. Consider the following scenario: the cluster has two ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the whole dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota under it.
  3. A Pod cannot use a parent-node ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism extremely complicated, so this scenario is not supported for now.
  4. Only parent nodes can have children; a child node cannot have its own children.
  5. Changing the quota.scheduling.koordinator.sh/is-parent property of an ElasticQuota is not allowed for now.

We will enforce these constraints through a webhook mechanism in the next release.

Fairness guarantee mechanism

To make the fairness guarantee mechanism easier to read and understand, let us first define a few new concepts:

  • request is the total resource requests of all Pods associated with the same ElasticQuota. If ElasticQuota A's request is smaller than its min and ElasticQuota B's request is larger than its min, then the unused part of A, i.e., the remaining min - request, is lent to ElasticQuota B through the fairness mechanism. When ElasticQuota A needs this borrowed amount back, ElasticQuota B must return it to A, also according to the fairness mechanism.
  • runtime is the actual amount of resources an ElasticQuota can currently use. If request is smaller than min, runtime equals request; this also means that, to honor the min semantics, request should be satisfied unconditionally. If request is larger than min and min is smaller than max, the fairness mechanism assigns a runtime between min and max, i.e., max >= runtime >= min.
  • shared-weight represents the competitiveness of an ElasticQuota; it defaults to the ElasticQuota's max.

Let us walk through the fairness mechanism with a few examples. Assume the cluster has 100 CPU cores in total and 4 ElasticQuotas. As shown in the figure below, the green part is the request: A currently requests 5, B requests 20, C requests 30, and D requests 70.
image.png
We also note that the min of A, B, C, and D sums to 60, leaving 40 cores of free quota; in addition, A can lend 5 cores to B, C, and D, so 45 cores in total are shared by B, C, and D. Based on each ElasticQuota's shared-weight (60, 50, and 80 for B, C, and D respectively), their shares are computed as follows (a general formula follows the list):

  • B gets 14 cores: 45 * 60 / (60 + 50 + 80) = 14
  • C gets 12 cores: 45 * 50 / (60 + 50 + 80) = 12
  • D gets 19 cores: 45 * 80 / (60 + 50 + 80) = 19
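In other words, each round distributes the currently lendable quota proportionally to shared-weight. As a formula (just a restatement of the arithmetic above, not an excerpt of the implementation):

$$\mathrm{share}_i = \mathrm{free} \times \frac{w_i}{\sum_{j \in \mathrm{borrowers}} w_j}$$

where free is the quota still available for lending in the current round (45 in the first round above, 9 in the second), and $w_i$ is the shared-weight of ElasticQuota $i$.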

image.png
But note that C and D need more quota, while B only needs 5 more cores to satisfy its request (B's min is 15), which means we only need to give B 5 cores; the remaining 9 cores are distributed between C and D:

  • C gets 3 more cores: 9 * 50 / (50 + 80) = 3
  • D gets 6 more cores: 9 * 80 / (50 + 80) = 6


The final allocation result is:

  • A runtime = 5
  • B runtime = 20
  • C runtime = 35
  • D runtime = 40


Summarizing the whole process:

  1. When request < min, the quota lends out lent-to-quotas; when request > min, it borrows borrowed-quotas.
  2. The total across all quotas whose runtime < min is the amount that can be lent out next.
  3. Each ElasticQuota's borrowable amount is computed according to shared-weight.
  4. If the new runtime > request, the surplus runtime - request can be lent to quotas that need it more.

There is another situation frequently met in production: the total resources of the cluster decrease over time due to node failures, resource operations, and so on, so that the sum of min of all ElasticQuotas becomes larger than the total resources. When this happens, the min requirements cannot all be guaranteed, so we scale down the min of each ElasticQuota by a certain ratio to ensure that the sum of all mins is less than or equal to the actual total resources.

Preemption mechanism

If the Koordinator ElasticQuota mechanism finds during scheduling that the quota is insufficient, it enters a preemption phase and preempts lower-priority Pods within the same ElasticQuota, in priority order. We do not support preempting Pods across ElasticQuotas, but we do provide another mechanism to reclaim quota from ElasticQuotas that borrowed it.
For example, the cluster has two ElasticQuotas: ElasticQuota A {min = 50, max = 100} and ElasticQuota B {min = 50, max = 100}. At 10 a.m. a user submits a Job with Request = 100 using ElasticQuota A. Because nobody is using ElasticQuota B at that moment, A can borrow 50 quota from B, so Request = 100 is satisfied and Used = 100. At 11 a.m. another user submits a Job with Request = 100 using ElasticQuota B. Because B's min = 50 must be guaranteed, the fairness mechanism sets the runtime of both A and B to 50. For ElasticQuota A, Used = 100 is now greater than runtime = 50, so a Controller is provided that evicts some Pods until A's Used drops to the level of its runtime.
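To make this example concrete, the two quotas could be declared roughly as follows (a sketch: the names and namespaces are illustrative, while the min/max values follow the numbers in the example above):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-a
  namespace: team-a
spec:
  max:
    cpu: 100
  min:
    cpu: 50
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-b
  namespace: team-b
spec:
  max:
    cpu: 100
  min:
    cpu: 50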

2. Fine-grained Resource Scheduling

Device Share Scheduling

The machine learning field relies on large numbers of powerful GPU devices to train models, but GPUs are very expensive. How to make better use of GPU devices, realize their value, and reduce cost is a problem that urgently needs to be solved. In the Kubernetes community's existing GPU allocation mechanism, GPUs are allocated by the kubelet, and only one or more whole GPU instances can be allocated. This approach is simple and reliable, but similar to CPU and memory, GPUs are not always highly utilized, so resources are still wasted. Therefore, Koordinator wants to support sharing a GPU device among multiple workloads to save cost. In addition, GPUs have their own particularities: for example, the NVLink topology of NVIDIA GPUs and overcommitment scenarios shown below both require central decisions by the scheduler to obtain globally optimal allocations.
image.png

From the figure we can see that, although this node has 8 GPU instances of model A100/V100, the data transfer speed between GPU instances differs. When a Pod needs multiple GPU instances, we can assign to the Pod the combination of GPU instances with the highest data transfer speed. Also, when we want the GPU instances of a group of Pods to have the highest combined data transfer speed, the scheduler should batch-allocate the best GPU instances to these Pods and place them on the same node.

GPU resource protocol

Koordinator is compatible with the community's existing nvidia.com/gpu resource protocol and also defines extended resource protocols that allow users to allocate GPU resources at a finer granularity.

  • kubernetes.io/gpu-core represents the GPU's computing power. Similar to Kubernetes MilliCPU, we abstract the total computing power of a GPU as 100, and users can request the corresponding amount of GPU computing power as needed.
  • kubernetes.io/gpu-memory represents the GPU's memory capacity in bytes.
  • kubernetes.io/gpu-memory-ratio represents the percentage of a GPU's memory.

Assume a node has 4 GPU device instances, each with 8Gi of GPU memory. If a user wants to request a whole GPU instance, besides using nvidia.com/gpu, the request can also be made as follows:

apiVersion: v1
kind: Pod
metadata:
name: demo-pod
namespace: default
spec:
containers:
- name: main
resources:
limits:
kubernetes.io/gpu-core: 100
kubernetes.io/gpu-memory: "8Gi"
requests:
kubernetes.io/gpu-core: 100
kubernetes.io/gpu-memory: "8Gi"

If you want to use only half of one GPU instance's resources, you can request it as follows:

apiVersion: v1
kind: Pod
metadata:
name: demo-pod
namespace: default
spec:
containers:
- name: main
resources:
limits:
kubernetes.io/gpu-core: 50
kubernetes.io/gpu-memory: "4Gi"
requests:
kubernetes.io/gpu-core: 50
kubernetes.io/gpu-memory: "4Gi"

Device information and device capacity reporting

In Koordinator v0.7.0, after the node-side koordlet is installed, it automatically detects whether the node has GPU devices; if so, it reports the Minor ID, UUID, computing power, and memory size of these GPU devices to a Device CRD. Each node corresponds to one Device CRD instance. The Device CRD can describe not only GPUs but also device types such as FPGA/RDMA; v0.7.0 currently supports only GPUs and does not yet support those device types.
The Device CRD is consumed by the NodeResource controller inside koord-manager and by koord-scheduler. Based on the information described in the Device CRD, the NodeResource controller converts it into the Koordinator resource protocols kubernetes.io/gpu-core and kubernetes.io/gpu-memory and updates the Node.Status.Allocatable and Node.Status.Capacity fields, helping the scheduler and the kubelet complete resource scheduling. gpu-core represents the computing power of a GPU device instance, with 100 being the full power of one instance; if a node has 8 GPU device instances, its gpu-core capacity is 8 * 100 = 800. gpu-memory represents the memory size of a GPU device instance in bytes, and the total memory a node can allocate is the number of devices * the capacity of each instance; for example, with 8G of GPU memory per device and 8 GPU instances on the node, the total is 8 * 8G = 64G.

apiVersion: v1
kind: Node
metadata:
name: node-a
status:
capacity:
koordinator.sh/gpu-core: 800
koordinator.sh/gpu-memory: "64Gi"
koordinator.sh/gpu-memory-ratio: 800
allocatable:
koordinator.sh/gpu-core: 800
koordinator.sh/gpu-memory: "64Gi"
koordinator.sh/gpu-memory-ratio: 800

Allocating device resources in the central scheduler

In the native Kubernetes device scheduling mechanism, the scheduler is only responsible for verifying that the device capacity satisfies the Pod. This is enough for simple device types, but finer-grained GPU allocation needs support from the central scheduler to achieve a globally optimal result.
The Koordinator scheduler koord-scheduler adds a new scheduling plugin, DeviceShare, responsible for fine-grained device resource scheduling. The DeviceShare plugin consumes the Device CRD and records the device information each node can allocate. During scheduling, DeviceShare converts the Pod's GPU resource requests into Koordinator's resource protocols and filters the unallocated GPU device instances of each node. After confirming that resources are available, it updates its internal state in the Reserve phase and updates the Pod annotations in the PreBind phase, recording which GPU devices the Pod should use.
DeviceShare will support Binpacking and Spread strategies in later releases for better device resource scheduling.

Precisely binding device information on the node side

The Kubernetes community provides the DevicePlugin mechanism in the kubelet, which gives device vendors a chance to obtain device information after the kubelet allocates devices and to fill it into environment variables or update mount paths. However, it cannot support centralized fine-grained GPU scheduling scenarios.
To address this, Koordinator extends koord-runtime-proxy to update environment variables when the kubelet creates containers, injecting the GPU device information allocated by the scheduler.

3. Scheduler Diagnosis and Analysis

When using Kubernetes, you often run into scheduling-related questions:

  1. Why can't this Pod be scheduled?
  2. Why was this Pod scheduled to this node? Shouldn't it have been influenced by another scoring plugin?
  3. I developed a new plugin and the scheduling results are not what I expected, but I don't know where things went wrong.

To diagnose and analyze these problems, besides mastering the basic scheduling and resource allocation mechanisms of Kubernetes, support from the scheduler itself is also needed. However, the diagnostic capabilities provided by Kubernetes kube-scheduler are rather limited, and sometimes there is hardly any log to inspect. kube-scheduler natively supports changing the log level over HTTP to obtain more log information; for example, the following command changes the log level to 5:

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/v --data '5' 
successfully set klog.logging.verbosity to 5

To address these problems, Koordinator implements a set of Restful APIs to help users diagnose and analyze issues more efficiently.

Analyzing Score results

PUT /debug/flags/s lets users turn on the Debug Score switch; after scoring finishes, the scores of each plugin for the top N nodes are printed in Markdown format. For example:

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/s --data '100'
successfully set debugTopNScores to 100

When a new Pod is scheduled, the following information can be seen in the scheduler log:

| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration |
| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 |
| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 |
| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 |
| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 |

With any Markdown tool, this can be converted into the following table:

| # | Pod | Node | Score | LoadAwareScheduling | NodeNUMAResource | NodeResourcesFit | PodTopologySpread |
| --- | --- | --- | ---:| ---:| ---:| ---:| ---:|
| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 87 | 0 | 94 | 200 |
| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 85 | 0 | 93 | 200 |
| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 55 | 0 | 91 | 200 |
| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 15 | 0 | 82 | 200 |

Scheduling plugins exporting internal state

Plugins inside koord-scheduler such as NodeNUMAResource, DeviceShare, and ElasticQuota maintain internal state that assists scheduling. koord-scheduler defines a new plugin extension interface (defined below); after initializing a plugin, it checks whether the plugin implements this interface and, if so, calls it so the plugin can register the Restful APIs it wants to expose. Taking the NodeNUMAResource plugin as an example, it provides the two endpoints /cpuTopologyOptions/:nodeName and /availableCPUs/:nodeName, which show the CPU topology information and the allocation results recorded inside the plugin.

type APIServiceProvider interface {
RegisterEndpoints(group *gin.RouterGroup)
}

To use them, build the URL in the form /apis/v1/plugins/<pluginName>/<pluginEndpoints> to view the data; for example, to view /cpuTopologyOptions/:nodeName:

$ curl schedulerLeaderIP:10252/apis/v1/plugins/NodeNUMAResources/cpuTopologyOptions/node-1
{"cpuTopology":{"numCPUs":32,"numCores":16,"numNodes":1,"numSockets":1,"cpuDetails":....

Viewing the currently supported plugin APIs

For convenience, koord-scheduler provides /apis/v1/__services__ to list the supported API endpoints:

$ curl schedulerLeaderIP:10251/apis/v1/__services__
{
"GET": [
"/apis/v1/__services__",
"/apis/v1/nodes/:nodeName",
"/apis/v1/plugins/Coscheduling/gang/:namespace/:name",
"/apis/v1/plugins/DeviceShare/nodeDeviceSummaries",
"/apis/v1/plugins/DeviceShare/nodeDeviceSummaries/:name",
"/apis/v1/plugins/ElasticQuota/quota/:name",
"/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName",
"/apis/v1/plugins/NodeNUMAResource/cpuTopologyOptions/:nodeName"
]
}

4. Safer Descheduling

In Koordinator v0.6 we released a brand-new koord-descheduler that supports plugging in the required descheduling strategies and custom eviction mechanisms, with a built-in migration controller for PodMigrationJob that reserves resources through the Koordinator Reservation mechanism and only evicts when resources are available, solving the availability problem of Pods having no resources after being evicted.
In Koordinator v0.7, koord-descheduler implements safer descheduling:

  • Eviction rate limiting: users can configure rate-limiting policies as needed, for example how many Pods may be evicted per minute.
  • Namespace-based graying of descheduling, so users can roll it out with more confidence.
  • Eviction limits per Node/Namespace: for example, if a node is configured to evict at most two Pods, then even if a plugin asks to evict more Pods on that node, the extra requests are rejected.
  • Workload awareness: if a Workload is rolling out, scaling in, already has a certain number of Pods being evicted, or has some Pods NotReady, the descheduler rejects new descheduling requests. Native Deployment and StatefulSet as well as Kruise CloneSet and Kruise AdvancedStatefulSet are currently supported.

The descheduler will later also improve fairness, avoiding repeatedly descheduling the same workload and minimizing the impact of descheduling on application availability.

5. Other Changes

  • Koordinator further strengthens fine-grained CPU scheduling and is fully compatible with the kubelet (<= v1.22) CPU Manager static policy. When allocating CPUs, the scheduler avoids the CPUs reserved by the kubelet, and on the node side koordlet fully adapts to the kubelet allocation policies from 1.18 to 1.22, effectively avoiding CPU conflicts.
  • The resource reservation mechanism supports AllocateOnce semantics for one-shot reservation scenarios, and the Reservation status semantics were improved to describe the current state of a Reservation object more accurately.
  • The declaration of Batch resources (Batch CPU/Memory) was improved to support limit greater than request, making it easy to convert tasks that were originally Burstable to run in co-location mode directly.

You can view more changes, together with their authors and commit records, on the Github release[6] page.

Related Links

· 9 min read
Joseph

We are happy to announce the release of Koordinator v0.6.0. Koordinator v0.6.0 brings complete Fine-grained CPU Orchestration, a Resource Reservation mechanism, a safe Pod Migration mechanism, and a Descheduling Framework.

Install or Upgrade to Koordinator v0.6.0

Install with Helm

Koordinator can be easily installed with helm v3.5+, a simple command-line tool that you can get from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.6.0

Upgrade with Helm

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.6.0 [--force]

For more details, please refer to the installation manual.

Fine-grained CPU Orchestration

In Koordinator v0.5.0, we designed and implemented basic CPU orchestration capabilities. The koord-scheduler supports different CPU bind policies to help LSE/LSR Pods achieve better performance.

Now in the v0.6 version, we have basically completed the CPU orchestration capabilities originally designed, such as:

  • Support default CPU bind policy configured by koord-scheduler for LSR/LSE Pods that do not specify a CPU bind policy
  • Support CPU exclusive policy that supports PCPULevel and NUMANodeLevel, which can spread the CPU-bound Pods to different physical cores or NUMA Nodes as much as possible to reduce the interference between Pods.
  • Support Node CPU Orchestration API to help cluster administrators control the CPU orchestration behavior of nodes. The label node.koordinator.sh/cpu-bind-policy constrains how logical CPUs are bound during scheduling. If it is set to FullPCPUsOnly, the scheduler must allocate full physical cores, equivalent to the kubelet CPU manager policy option full-pcpus-only=true. If there is no node.koordinator.sh/cpu-bind-policy in the node's labels, scheduling follows the policy configured by the Pod or by koord-scheduler. The label node.koordinator.sh/numa-allocate-strategy indicates how to choose satisfied NUMA Nodes when scheduling; MostAllocated and LeastAllocated are supported. A short example of these node labels follows this list.
  • koordlet supports the LSE Pods and improve compatibility with existing Guaranteed Pods with static CPU Manager policy.
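A minimal sketch of a Node carrying the two labels described in the list above (the node name is illustrative):

apiVersion: v1
kind: Node
metadata:
  name: node-example
  labels:
    # require the scheduler to allocate full physical cores on this node
    node.koordinator.sh/cpu-bind-policy: FullPCPUsOnly
    # prefer the NUMA Nodes with the most allocated resources when choosing
    node.koordinator.sh/numa-allocate-strategy: MostAllocated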

Please check out our user manual for a detailed introduction and tutorial.

Resource Reservation

We completed the Resource Reservation API design proposal in v0.5, and implemented the basic Reservation mechanism in the current v0.6 version.

When you want to use the Reservation mechanism to reserve resources, you do not need to modify the Pod or the existing workloads (e.g. Deployment, StatefulSet). koord-scheduler provides a simple-to-use API named Reservation, which allows us to reserve node resources for specified pods or workloads even if they haven't been created yet. You only need to write the Pod template and the owner information in the ReservationSpec when creating a Reservation. When koord-scheduler perceives a new Reservation object, it allocates resources to it through the normal Pod scheduling process. After scheduling, koord-scheduler writes the success or failure information into the ReservationStatus. If the reservation succeeds, and the OwnerReference or Labels of a newly created Pod satisfy the owner information declared earlier, then the newly created Pod will directly reuse the resources held by the Reservation. When the Pod is destroyed, the Reservation object can be reused until the Reservation expires. A minimal example of such an object is sketched below.
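A hedged sketch of such a Reservation object, based only on the ReservationSpec fields (template, owners, ttl) shown in the Go API later in these notes; the apiVersion, names, labels, and resource amounts are illustrative assumptions:

apiVersion: scheduling.koordinator.sh/v1alpha1   # assumed group/version, for illustration
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # scheduled like a normal pod; the chosen node reserves these resources
  template:
    spec:
      containers:
        - name: main
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
  # pods matching this owner spec may reuse the reserved resources
  owners:
    - labelSelector:
        matchLabels:
          app: web-demo
  ttl: 1h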

image

The resource reservation mechanism can help solve or optimize the problems in the following scenarios:

  1. Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to "lock" resources preventing from allocation of other pods with the same or higher priority.
  2. Descheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted.
  3. Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas.
  4. Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost.

Pod Migration Job

Migrating Pods is an important capability that many components (such as the descheduler) rely on, and it can be used to optimize scheduling or help resolve workload runtime quality issues. We believe that pod migration is a complex process, involving steps such as auditing, resource allocation, and application startup, and that it is intertwined with application upgrades, scaling scenarios, and resource operation and maintenance by cluster administrators. Therefore, how to manage the stability risk of this process and ensure that the application does not fail because of Pod migration is a very critical issue that must be resolved.

The descheduler in the K8s community evicts pods according to different strategies. However, it does not guarantee that the evicted Pod will have resources available after re-creation. If a large number of newly created Pods are stuck in the Pending state when cluster resources are tight, application availability may suffer.

Koordinator defines a CRD-based Migration/Eviction API named PodMigrationJob, through which the descheduler or other components can evict or delete Pods more safely. With PodMigrationJob we can track the status of each step in the migration and perceive scenarios such as application upgrades and scaling.

It's simple to use the PodMigrationJob API. Create a PodMigrationJob with the YAML file below to migrate pod-demo-0.

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: PodMigrationJob
metadata:
name: migrationjob-demo
spec:
paused: false
ttl: 5m
mode: ReservationFirst
podRef:
namespace: default
name: pod-demo-5f9b977566-c7lvk
status:
phase: Pending
$ kubectl create -f migrationjob-demo.yaml
podmigrationjob.scheduling.koordinator.sh/migrationjob-demo created

Then you can query the migration status and the migration events

$ kubectl get podmigrationjob migrationjob-demo
NAME PHASE STATUS AGE NODE RESERVATION PODNAMESPACE POD NEWPOD TTL
migrationjob-demo Succeed Complete 37s node-1 d56659ab-ba16-47a2-821d-22d6ba49258e default pod-demo-5f9b977566-c7lvk pod-demo-5f9b977566-nxjdf 5m0s

$ kubectl describe podmigrationjob migrationjob-demo
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ReservationCreated 8m33s koord-descheduler Successfully create Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"
Normal ReservationScheduled 8m33s koord-descheduler Assigned Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" to node "node-1"
Normal Evicting 8m33s koord-descheduler Try to evict Pod "default/pod-demo-5f9b977566-c7lvk"
Normal EvictComplete 8m koord-descheduler Pod "default/pod-demo-5f9b977566-c7lvk" has been evicted
Normal Complete 8m koord-descheduler Bind Pod "default/pod-demo-5f9b977566-nxjdf" in Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"

Descheduling Framework

We implemented a brand new Descheduling Framework in v0.6.

The existing descheduler in the community can solve some problems, but we think there are still many aspects that can be improved. For example, it only supports periodic execution and does not support an event-triggered mode; unlike kube-scheduler, custom descheduling strategies cannot be extended and configured without invading its existing code; and it does not support implementing a custom evictor.

We also noticed that the K8s descheduler community also found these problems and proposed corresponding solutions such as #753 Descheduler framework Proposal and PoC #781. The K8s descheduler community tries to implement a descheduler framework similar to the k8s scheduling framework. This coincides with our thinking.

Overall, these solutions solve most of our problems, but we also noticed that the related implementations have not been merged into the main branch. Having reviewed these implementations and discussions, we believe this is the right direction. Considering that Koordinator has clear milestones for descheduler-related features, we will implement Koordinator's own descheduler independently of the upstream community. We try to use some of the designs in the #753 PR proposed by the community, and we will follow Koordinator's compatibility principle with K8s to maintain compatibility with the upstream community descheduler in the implementation. An independent implementation can also drive the evolution of the upstream community's work on the descheduler framework, and when the upstream community makes new changes or switches to an architecture that Koordinator deems appropriate, Koordinator will follow up promptly and actively.

Based on this descheduling framework, it is easy to stay compatible with the existing descheduling strategies in the K8s community, and users can implement and integrate their own descheduling plugins as easily as with the K8s Scheduling Framework. Users can also implement Controllers in the form of plugins to realize event-based descheduling scenarios. In addition, the framework integrates a MigrationController based on the PodMigrationJob API, which serves as the default Evictor plugin and helps safely migrate Pods in various descheduling scenarios.

At present, we have implemented the main body of the framework, including the MigrationController based on PodMigrationJob, and it is usable as a whole. We also provide a demo descheduling plugin. In the future, we will migrate and stay compatible with the existing descheduling policies of the community, and provide a load-balancing descheduling plugin for co-location scenarios.

The current framework is still in the early stage of rapid evolution, and there are still many details that need to be improved. Everyone who is interested is welcome to participate in the construction together. We hope that more people can realize the descheduling capabilities they need more easily and with more confidence.

About GPU Scheduling

There are also some new developments in GPU scheduling capabilities that everyone cares about.

During the iteration of v0.6, we completed the design of GPU Share Scheduling, and also completed the design of Gang Scheduling. Development of these capabilities is ongoing and will be released in v0.7.

In addition, in order to explore the mechanism of GPU overcommitment, we have implemented the ability to report GPU Metric in v0.6.

What’s coming next in Koordinator

Don't forget that Koordinator is developed in the open. You can check out our Github milestone to know more about what is happening and what we have planned. For more details, please refer to our milestone. Hope it helps!

· 8 min read
Jason

In addition to the usual updates to supporting utilities, Koordinator v0.5 adds a couple of new useful features we think you'll like.

Install or Upgrade to Koordinator v0.5.0

Install with Helm

Koordinator can be easily installed with helm v3.5+, a simple command-line tool that you can get from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.5.0

Upgrade with Helm

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.5.0 [--force]

For more details, please refer to the installation manual.

Fine-grained CPU Orchestration

In this version, we introduced fine-grained CPU orchestration. Pods in a Kubernetes cluster may interfere with each other's running when they share the same physical resources and both demand many resources. Sharing CPU resources is almost inevitable, e.g. SMT threads (i.e. logical processors) share the execution units of the same core, and cores in the same chip share one last-level cache. The resource contention can slow down the running of these CPU-sensitive workloads, resulting in high response latency (RT).

To improve the performance of CPU-sensitive workloads, koord-scheduler provides a mechanism of fine-grained CPU orchestration. It enhances the CPU management of Kubernetes and supports detailed NUMA-locality and CPU exclusions.

Please check out our user manual for a detailed introduction and tutorial.

Resource Reservation

Pods are fundamental units for allocating node resources in Kubernetes, which bind resource requirements with business logic. The scheduler is not able to reserve node resources for specific pods or workloads. We may try using a fake pod to prepare resources through the preemption mechanism. However, fake pods can be preempted by any scheduled pods with higher priorities, which means the prepared resources can be taken away unexpectedly.

In Koordinator, a resource reservation mechanism is proposed to enhance scheduling and especially benefits scenarios below:

  1. Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to "lock" resources preventing from allocation of other pods with the same or higher priority.
  2. De-scheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted.
  3. Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas.
  4. Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost.

This feature is still under development. We've finalized the API, feel free to check it out.

type Reservation struct {
metav1.TypeMeta `json:",inline"`
// A Reservation object is non-namespaced.
// It can reserve resources for pods of any namespace. Any affinity/anti-affinity of reservation scheduling can be
// specified in the pod template.
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec ReservationSpec `json:"spec,omitempty"`
Status ReservationStatus `json:"status,omitempty"`
}

type ReservationSpec struct {
// Template defines the scheduling requirements (resources, affinities, images, ...) processed by the scheduler just
// like a normal pod.
// If the `template.spec.nodeName` is specified, the scheduler will not choose another node but reserve resources on
// the specified node.
Template *corev1.PodTemplateSpec `json:"template,omitempty"`
// Specify the owners who can allocate the reserved resources.
// Multiple owner selectors and ANDed.
Owners []ReservationOwner `json:"owners,omitempty"`
// By default, the resources requirements of reservation (specified in `template.spec`) is filtered by whether the
// node has sufficient free resources (i.e. ReservationRequest < NodeFree).
// When `preAllocation` is set, the scheduler will skip this validation and allow overcommitment. The scheduled
// reservation would be waiting to be available until free resources are sufficient.
PreAllocation bool `json:"preAllocation,omitempty"`
// Time-to-Live period for the reservation.
// `expires` and `ttl` are mutually exclusive. If both `ttl` and `expires` are not specified, a very
// long TTL will be picked as default.
TTL *metav1.Duration `json:"ttl,omitempty"`
// Expired timestamp when the reservation expires.
// `expires` and `ttl` are mutually exclusive. Defaults to being set dynamically at runtime based on the `ttl`.
Expires *metav1.Time `json:"expires,omitempty"`
}

type ReservationStatus struct {
// The `phase` indicates whether is reservation is waiting for process (`Pending`), available to allocate
// (`Available`) or expired to get cleanup (Expired).
Phase ReservationPhase `json:"phase,omitempty"`
// The `conditions` indicate the messages of reason why the reservation is still pending.
Conditions []ReservationCondition `json:"conditions,omitempty"`
// Current resource owners which allocated the reservation resources.
CurrentOwners []corev1.ObjectReference `json:"currentOwners,omitempty"`
}

type ReservationOwner struct {
// Multiple field selectors are ORed.
Object *corev1.ObjectReference `json:"object,omitempty"`
Controller *ReservationControllerReference `json:"controller,omitempty"`
LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
}

type ReservationControllerReference struct {
// Extend with a `namespace` field for reference different namespaces.
metav1.OwnerReference `json:",inline"`
Namespace string `json:"namespace,omitempty"`
}

type ReservationPhase string

const (
// ReservationPending indicates the Reservation has not been processed by the scheduler or is unschedulable for
// some reasons (e.g. the resource requirements cannot get satisfied).
ReservationPending ReservationPhase = "Pending"
// ReservationAvailable indicates the Reservation is both scheduled and available for allocation.
ReservationAvailable ReservationPhase = "Available"
// ReservationWaiting indicates the Reservation is scheduled, but the resources to reserve are not ready for
// allocation (e.g. in pre-allocation for running pods).
ReservationWaiting ReservationPhase = "Waiting"
// ReservationExpired indicates the Reservation is expired, which the object is not available to allocate and will
// get cleaned in the future.
ReservationExpired ReservationPhase = "Expired"
)

type ReservationCondition struct {
LastProbeTime metav1.Time `json:"lastProbeTime"`
LastTransitionTime metav1.Time `json:"lastTransitionTime"`
Reason string `json:"reason"`
Message string `json:"message"`
}

QoS Manager

Currently, plugins from resmanager in Koordlet are mixed together; they should be classified into two categories: static and dynamic. Static plugins are called and run only once when a container is created, updated, started, or stopped. Dynamic plugins, however, may be called and run at any time according to the real-time runtime state of the node, such as CPU suppress, CPU burst, etc. This proposal only focuses on refactoring dynamic plugins. Looking at the current plugin implementation, there are many direct calls to resmanager's methods, such as collecting node/pod/container metrics, fetching metadata of node/pod/container, and fetching configurations (NodeSLO, etc.). In the future, we may need a flexible and powerful framework with scalability for special external plugins.

The below is directory tree of qos-manager inside koordlet, all existing dynamic plugins(as built-in plugins) will be moved into sub-directory plugins.

pkg/koordlet/qosmanager/
- manager.go
- context.go // plugin context
- /plugins/ // built-in plugins
- /cpuburst/
- /cpusuppress/
- /cpuevict/
- /memoryevict/

We only have the proposal in this version. Stay tuned, further implementation is coming soon!

Multiple Running Hook Modes

Runtime Hooks includes a set of plugins responsible for injecting resource isolation parameters according to pod attributes. When Koord Runtime Proxy is running as a CRI proxy, Runtime Hooks acts as its backend server. The CRI proxy mechanism can ensure the consistency of resource parameters during the pod lifecycle. However, Koord Runtime Proxy can only hijack CRI requests from the kubelet for pods, so the consistency of resource parameters in the QoS class directory cannot be guaranteed. Besides, modifications of pod parameters by third parties (e.g. made manually) will also break the correctness of hook plugins.

Therefore, a standalone running mode with reconciler for Runtime Hooks is necessary. Under Standalone running mode, resource isolation parameters will be injected asynchronously, keeping eventual consistency of the injected parameters for pod and QoS class even without Runtime Hook Manager.

Some minor works

  1. We fixed the backward-compatibility issues reported by our users here. If you've ever encountered a similar problem, please upgrade to the latest version.
  2. Two more interfaces were added into runtime-proxy. One is PreCreateContainerHook, which can set container resource settings before creation, and the other is PostStopSandboxHook, which can garbage-collect resource settings before pods are deleted.
  3. cpuacct.usage is more precise than cpuacct.stat, and cpuacct.stat is in USER_HZ units while cpuacct.usage is in nanoseconds. After thorough discussion, we agreed to replace cpuacct.stat with cpuacct.usage in koordlet.
  4. Koordlet needs to keep fetching data from kubelet. Before this version, we only supported accessing kubelet via the read-only port over HTTP. Due to security concerns, we've enabled HTTPS access in this version. For more details, please refer to this PR.

What’s coming next in Koordinator

Don't forget that Koordinator is developed in the open. You can check out our Github milestone to know more about what is happening and what we have planned. For more details, please refer to our milestone. Hope it helps!