Blog | Koordinator

What's New in Koordinator v0.3.0?

May 7, 2022 · 12 min read

Koordinator maintainer

We are happy to announce the v0.3.0 release of Koordinator. After starting small and learning what users needed, we are able to adjust its path and develop features needed for a stable community release.

The release of Koordinator v0.3.0 brings in some notable changes that are most wanted by the community while continuing to expand on experimental features.

Install or Upgrade to Koordinator v0.3.0

Install with helms

Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool, and you can get it from here.

# Firstly add koordinator charts repository if you haven't do this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.3.0

Upgrade with helm

# Firstly add koordinator charts repository if you haven't do this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.3.0 [--force]

For more details, please refer to the installation manual.

CPU Burst

CPU Burst is a service level objective (SLO)-aware resource scheduling feature provided by Koordinator. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. Koordlet automatically detects CPU throttling events and automatically adjusts the CPU limit to a proper value. This greatly improves the performance of latency-sensitive applications.

How CPU Burst works

Kubernetes allows you to specify CPU limits, which can be reused based on time-sharing. If you specify a CPU limit for a container, the OS limits the amount of CPU resources that can be used by the container within a specific time period. For example, you set the CPU limit of a container to 2. The OS kernel limits the CPU time slices that the container can use to 200 milliseconds within each 100-millisecond period.

CPU utilization is a key metric that is used to evaluate the performance of a container. In most cases, the CPU limit is specified based on CPU utilization. CPU utilization on a per-millisecond basis shows more spikes than on a per-second basis. If the CPU utilization of a container reaches the limit within a 100-millisecond period, CPU throttling is enforced by the OS kernel and threads in the container are suspended for the rest of the time period.

How to use CPU Burst

Use an annotation to enable CPU Burst
Add the following annotation to the pod configuration to enable CPU Burst:

annotations:
  # Set the value to auto to enable CPU Burst for the pod. 
  koordinator.sh/cpuBurst: '{"policy": "auto"}'
  # To disable CPU Burst for the pod, set the value to none. 
  #koordinator.sh/cpuBurst: '{"policy": "none"}'

Use a ConfigMap to enable CPU Burst for all pods in a cluster
Modify the slo-controller-config ConfigMap based on the following content to enable CPU Burst for all pods in a cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-burst-config: '{"clusterStrategy": {"policy": "auto"}}'
  #cpu-burst-config: '{"clusterStrategy": {"policy": "cpuBurstOnly"}}'
  #cpu-burst-config: '{"clusterStrategy": {"policy": "none"}}'

Advanced configurations
The following code block shows the pod annotations and ConfigMap fields that you can use for advanced configurations:

# Example of the slo-controller-config ConfigMap. 
data:
  cpu-burst-config: |
    {
      "clusterStrategy": {
        "policy": "auto",
        "cpuBurstPercent": 1000,
        "cfsQuotaBurstPercent": 300,
        "sharePoolThresholdPercent": 50,
        "cfsQuotaBurstPeriodSeconds": -1
      }
    }

  # Example of pod annotations. 
  koordinator.sh/cpuBurst: '{"policy": "auto", "cpuBurstPercent": 1000, "cfsQuotaBurstPercent": 300, "cfsQuotaBurstPeriodSeconds": -1}'

The following table describes the ConfigMap fields that you can use for advanced configurations of CPU Burst.

Field	Data type	Description
policy	string	none: disables CPU Burst. If you set the value to none, the related fields are reset to their original values. This is the default value. cpuBurstOnly: enables the CPU Burst feature only for the kernel of Alibaba Cloud Linux 2. cfsQuotaBurstOnly: enables automatic adjustment of CFS quotas of general kernel versions. auto: enables CPU Burst and all the related features.
cpuBurstPercent	int	Default value:`1000`. Unit: %. This field specifies the percentage to which the CPU limit can be increased by CPU Burst. If the CPU limit is set to `1`, CPU Burst can increase the limit to 10 by default.
cfsQuotaBurstPercent	int	Default value: `300`. Unit: %. This field specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased. By default, the value of cfs_quota can be increased to at most three times.
cfsQuotaBurstPeriodSeconds	int	Default value: `-1`. Unit: seconds. This indicates that the time period in which the container can run with an increased CFS quota is unlimited. This field specifies the time period in which the container can run with an increased CFS quota, which cannot exceed the upper limit specified by `cfsQuotaBurstPercent`.
sharePoolThresholdPercent	int	Default value: `50`. Unit: %. This field specifies the CPU utilization threshold of the node. If the CPU utilization of the node exceeds the threshold, the value of cfs_quota in cgroup parameters is reset to the original value.

L3 cache and MBA resource isolation

Pods of different priorities are usually deployed on the same machine. This may cause pods to compete for computing resources. As a result, the quality of service (QoS) of your service cannot be ensured. The Resource Director Technology (RDT) controls the Last Level Cache (L3 cache) that can be used by workloads of different priorities. RDT also uses the Memory Bandwidth Allocation (MBA) feature to control the memory bandwidth that can be used by workloads. This isolates the L3 cache and memory bandwidth used by workloads, ensures the QoS of high-priority workloads, and improves overall resource utilization. This topic describes how to improve the resource isolation of pods with different priorities by controlling the L3 cache and using the MBA feature.

How to use L3 cache and MBA resource isolation

Use a ConfigMap to enable L3 cache and MBA resource isolation for all pods in a cluster

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
           "resctrlQOS": {
             "enable": true,
             "catRangeStartPercent": 0,
             "catRangeEndPercent": 100,
             "MBAPercent": 100
           }
         },
        "beClass": {
           "resctrlQOS": {
             "enable": true,
             "catRangeStartPercent": 0,
             "catRangeEndPercent": 30,
             "MBAPercent": 100
           }
         }
      }
    }

Memory QoS

The Koordlet provides the memory quality of service (QoS) feature for containers. You can use this feature to optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This topic describes how to enable the memory QoS feature for containers.

Background information

The following memory limits apply to containers:

The memory limit of the container. If the amount of memory that a container uses, including the page cache, is about to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. As a result, the application in the container may not be able to request or release memory resources as normal.
The memory limit of the node. If the memory limit of a container is greater than the memory request of the container, the container can overcommit memory resources. In this case, the available memory on the node may become insufficient. This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is downgraded. In extreme cases, the node cannot run as normal.

To improve the performance of applications and the stability of nodes, Koordinator provides the memory QoS feature for containers. We recommend that you use Anolis OS as the node OS. For other OS, we will try our best to adapt, and users can still enable it without side effects. After you enable the memory QoS feature for a container, Koordlet automatically configures the memory control group (memcg) based on the configuration of the container. This helps you optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node.

How to use Memory QoS

When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps:

Add the following annotations to enable memory QoS for the containers in a pod:

annotations:
  # To enable memory QoS for the containers in a pod, set the value to auto. 
  koordinator.sh/memoryQOS: '{"policy": "auto"}'
  # To disable memory QoS for the containers in a pod, set the value to none. 
  #koordinator.sh/memoryQOS: '{"policy": "none"}'

Use a ConfigMap to enable memory QoS for all the containers in a cluster.

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
           "memoryQOS": {
             "enable": true
           }
         },
        "beClass": {
           "memoryQOS": {
             "enable": true
           }
         }
      }
    }

Optional. Configure advanced parameters.
The following table describes the advanced parameters that you can use to configure fine-grained memory QoS configurations at the pod level and cluster level.

Parameter	Data type	Valid value	Description
enable	Boolean	true false	true: enables memory QoS for all the containers in a cluster. The default memory QoS settings for the QoS class of the containers are used. false: disables memory QoS for all the containers in a cluster. The memory QoS settings are restored to the original settings for the QoS class of the containers.
policy	String	auto default none	auto: enables memory QoS for the containers in the pod and uses the recommended memory QoS settings. The recommended memory QoS settings are prioritized over the cluster-wide memory QoS settings. default: specifies that the pod inherits the cluster-wide memory QoS settings. none: disables memory QoS for the pod. The relevant memory QoS settings are restored to the original settings. The original settings are prioritized over the cluster-wide memory QoS settings.
minLimitPercent	Int	0~100	Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the unreclaimable proportion of the memory request of a pod. The amount of unreclaimable memory is calculated based on the following formula: `Value of memory.min = Memory request × Value of minLimitPercent/100`. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For example, if you specify Memory `Request=100MiB` and `minLimitPercent=100` for a container, `the value of memory.min is 104857600`.
lowLimitPercent	Int	0~100	Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. The amount of relatively unreclaimable memory is calculated based on the following formula: `Value of memory.low = Memory request × Value of lowLimitPercent/100`. For example, if you specify `Memory Request=100MiB` and `lowLimitPercent=100` for a container, `the value of memory.low is 104857600`.
throttlingPercent	Int	0~100	Unit: %. Default value:`0`. The default value indicates that this parameter is disabled. This parameter specifies the memory throttling threshold for the ratio of the memory usage of a container to the memory limit of the container. The memory throttling threshold for memory usage is calculated based on the following formula: `Value of memory.high = Memory limit × Value of throttlingPercent/100`. If the memory usage of a container exceeds the memory throttling threshold, the memory used by the container will be reclaimed. This parameter is suitable for container memory overcommitment scenarios. You can use this parameter to cgroups from triggering OOM. For example, if you specify `Memory Limit=100MiB` and `throttlingPercent=80` for a container, `the value of memory.high is 83886080`, which is equal to 80 MiB.
wmarkRatio	Int	0~100	Unit: %. Default value:`95`. A value of `0` indicates that this parameter is disabled. This parameter specifies the threshold of the usage of the memory limit or the value of `memory.high` that triggers asynchronous memory reclaim. If `throttlingPercent` is disabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Memory limit × wmarkRatio/100`. If `throttlingPercent` is enabled, the asynchronous memory reclaim threshold for memory usage is calculated based on the following formula: `Value of memory.wmark_high = Value of memory.high × wmarkRatio/100`. If the usage of the memory limit or the value of memory.high exceeds the threshold, the memcg backend asynchronous reclaim feature is triggered. For example, if you specify `Memory Limit=100MiB`for a container, the memory throttling setting is`memory.high=83886080`, the reclaim ratio setting is `memory.wmark_ratio=95`, and the reclaim threshold setting is `memory.wmark_high=79691776`.
wmarkMinAdj	Int	-25~50	Unit: %. The default value is `-25` for the `LS`/ `LSR` QoS class and `50` for the `BE` QoS class. A value of 0 indicates that this parameter is disabled. This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container. A positive value increases the global minimum watermark and therefore antedates memory reclaim for the container. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is `memory.wmark_min_adj=-25`, which indicates that the minimum watermark is decreased by 25% for the containers in the pod.

What Comes Next

For more details, please refer to our milestone. Hope it helps!

Koordinator v0.2.0 - Enhanced node-side scheduling capabilities

April 19, 2022 · 4 min read

Joseph

Koordinator maintainer

We’re pleased to announce the release of Koordinator v0.2.0.

Overview

Koordinator v0.1.0 implements basic co-location scheduling capabilities, and after the project was released, it has received attention and positive responses from the community. For some issues that everyone cares about, such as how to isolate resources for best-effort workloads, how to ensure the runtime stability of latency-sensitiv applications in co-location scenarios, etc., we have enhanced node-side scheduling capabilities in koordinator v0.2.0 to solve these problems.

Install or Upgrade to Koordinator v0.2.0

Install with helms

Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool and you can get it from here.

# Firstly add koordinator charts repository if you haven't do this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.2.0

Upgrade with helm

# Firstly add koordinator charts repository if you haven't do this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.2.0 [--force]

For more details, please refer to the installation manual.

Isolate resources for best-effort workloads

In Koodinator v0.2.0, we refined the ability to isolate resources for best-effort worklods.

koordlet will set the cgroup parameters according to the resources described in the Pod Spec. Currently supports setting CPU Request/Limit, and Memory Limit.

For CPU resources, only the case of request == limit is supported, and the support for the scenario of request <= limit will be supported in the next version.

Active eviction mechanism based on memory safety thresholds

When latency-sensitiv applications are serving, memory usage may increase due to bursty traffic. Similarly, there may be similar scenarios for best-effort workloads, for example, the current computing load exceeds the expected resource Request/Limit.

These scenarios will lead to an increase in the overall memory usage of the node, which will have an unpredictable impact on the runtime stability of the node side. For example, it can reduce the quality of service of latency-sensitiv applications or even become unavailable. Especially in a co-location environment, it is more challenging.

We implemented an active eviction mechanism based on memory safety thresholds in Koodinator.

koordlet will regularly check the recent memory usage of node and Pods to check whether the safty threshold is exceeded. If it exceeds, it will evict some best-effort Pods to release memory. This mechanism can better ensure the stability of node and latency-sensitiv applications.

koordlet currently only evicts best-effort Pods, sorted according to the Priority specified in the Pod Spec. The lower the priority, the higher the priority to be evicted, the same priority will be sorted according to the memory usage rate (RSS), the higher the memory usage, the higher the priority to be evicted. This eviction selection algorithm is not static. More dimensions will be considered in the future, and more refined implementations will be implemented for more scenarios to achieve more reasonable evictions.

The current memory utilization safety threshold default value is 70%. You can modify the memoryEvictThresholdPercent in ConfigMap slo-controller-config according to the actual situation,

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    {
      "enable": true
    }
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "memoryEvictThresholdPercent": 70
      }
    }

CPU Burst - Improve the performance of latency-sensitive applications

CPU Burst is a service level objective (SLO)-aware resource scheduling feature. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. Koordinator automatically detects CPU throttling events and automatically adjusts the CPU limit to a proper value. This greatly improves the performance of latency-sensitive applications.

The code of CPU Burst has been developed and is still under review and testing. It will be released in the next version. If you want to use this ability early, you are welcome to participate in Koordiantor and improve it together. For more details, please refer to the PR #73.

For more details, please refer to the Documentation. Hope it helps!

Koordinator v0.1.0 - QoS based scheduling system

March 31, 2022 · 5 min read

Joseph

Koordinator maintainer

Fangsong Zeng

Koordinator maintainer

We’re pleased to announce the release of Koordinator v0.1.0.

Overview

Koordinator is a QoS-based scheduling for efficient orchestration of microservices, AI, and big data workloads on Kubernetes. It aims to improve the runtime efficiency and reliability of both latency sensitive workloads and batch jobs, simplify the complexity of resource-related configuration tuning, and increase pod deployment density to improve resource utilizations.

Key Features

Koordinator enhances the kubernetes user experiences in the workload management by providing the following:

Well-designed priority and QoS mechanism to co-locate different types of workloads in a cluster and run different types of pods on a single node. Allowing for resource overcommitments to achieve high resource utilizations but still satisfying the QoS guarantees by leveraging an application profiling mechanism.
Fine-grained resource orchestration and isolation mechanism to improve the efficiency of latency-sensitive workloads and batch jobs.
Flexible job scheduling mechanism to support workloads in specific areas, e.g., big data, AI, audio and video.
A set of tools for monitoring, troubleshooting and operations.

Node Metrics

Koordinator defines the NodeMetrics CRD, which is used to record the resource utilization of a single node and all Pods on the node. koordlet will regularly report and update NodeMetrics. You can view NodeMetrics with the following commands.

$ kubectl get nodemetrics node-1 -o yaml
apiVersion: slo.koordinator.sh/v1alpha1
kind: NodeMetric
metadata:
  creationTimestamp: "2022-03-30T11:50:17Z"
  generation: 1
  name: node-1
  resourceVersion: "2687986"
  uid: 1567bb4b-87a7-4273-a8fd-f44125c62b80
spec: {}
status:
  nodeMetric:
    nodeUsage:
      resources:
        cpu: 138m
        memory: "1815637738"
  podsMetric:
  - name: storage-service-6c7c59f868-k72r5
    namespace: default
    podUsage:
      resources:
        cpu: "300m"
        memory: 17828Ki

Colocation Resources

After the Koordinator is deployed in the K8s cluster, the Koordinator will calculate the CPU and Memory resources that have been allocated but not used according to the data of NodeMetrics. These resources are updated in Node in the form of extended resources.

koordinator.sh/batch-cpu represents the CPU resources for Best Effort workloads, koordinator.sh/batch-memory represents the Memory resources for Best Effort workloads.

You can view these resources with the following commands.

$ kubectl describe node node-1
Name:               node-1
....
Capacity:
  cpu:                          8
  ephemeral-storage:            103080204Ki
  koordinator.sh/batch-cpu:     4541
  koordinator.sh/batch-memory:  17236565027
  memory:                       32611012Ki
  pods:                         64
Allocatable:
  cpu:                          7800m
  ephemeral-storage:            94998715850
  koordinator.sh/batch-cpu:     4541
  koordinator.sh/batch-memory:  17236565027
  memory:                       28629700Ki
  pods:                         64

Cluster-level Colocation Profile

In order to make it easier for everyone to use Koordinator to co-locate different workloads, we defined ClusterColocationProfile to help gray workloads use co-location resources. A ClusterColocationProfile is CRD like the one below. Please do edit each parameter to fit your own use cases.

apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  selector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  qosClass: BE
  priorityClassName: koord-batch
  koordinatorPriority: 1000
  schedulerName: koord-scheduler
  labels:
    koordinator.sh/mutated: "true"
  annotations: 
    koordinator.sh/intercepted: "true"
  patch:
    spec:
      terminationGracePeriodSeconds: 30

Various Koordinator components ensure scheduling and runtime quality through labels koordinator.sh/qosClass, koordinator.sh/priority and kubernetes native priority.

With the webhook mutating mechanism provided by Kubernetes, koord-manager will modify Pod resource requirements to co-located resources, and inject the QoS and Priority defined by Koordinator into Pod.

Taking the above Profile as an example, when the Spark Operator creates a new Pod in the namespace with the koordinator.sh/enable-colocation=true label, the Koordinator QoS label koordinator.sh/qosClass will be injected into the Pod. According to the Profile definition PriorityClassName, modify the Pod's PriorityClassName and the corresponding Priority value. Users can also set the Koordinator Priority according to their needs to achieve more fine-grained priority management, so the Koordinator Priority label koordinator.sh/priority is also injected into the Pod. Koordinator provides an enhanced scheduler koord-scheduler, so you need to modify the Pod's scheduler name koord-scheduler through Profile.

If you expect to integrate Koordinator into your own system, please learn more about the core concepts.

CPU Suppress

In order to ensure the runtime quality of different workloads in co-located scenarios, Koordinator uses the CPU Suppress mechanism provided by koordlet on the node side to suppress workloads of the Best Effort type when the load increases. Or increase the resource quota for Best Effort type workloads when the load decreases.

When installing through the helm chart, the ConfigMap slo-controller-config will be created in the koordinator-system namespace, and the CPU Suppress mechanism is enabled by default. If it needs to be closed, refer to the configuration below, and modify the configuration of the resource-threshold-config section to take effect.

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: {{ .Values.installation.namespace }}
data:
  ...
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": false
      }
    }

Colocation Resources Balance

Koordinator currently adopts a strategy for node co-location resource scheduling, which prioritizes scheduling to machines with more resources remaining in co-location to avoid Best Effort workloads crowding together. More rich scheduling capabilities are on the way.

Tutorial - Colocation of Spark Jobs

Apache Spark is an analysis engine for large-scale data processing, which is widely used in Big Data, SQL Analysis and Machine Learning scenarios. We provide a tutorial to help you how to quickly use Koordinator to run Spark Jobs in colocation mode with other latency sensitive applications. For more details, please refer to the tutorial.

Summary

Fore More details, please refer to the Documentation. Hope it helps!

Install or Upgrade to Koordinator v0.3.0​

Install with helms​

Upgrade with helm​

CPU Burst​

How CPU Burst works​

How to use CPU Burst​

L3 cache and MBA resource isolation​

How to use L3 cache and MBA resource isolation​

Memory QoS​

Background information​

How to use Memory QoS​

What Comes Next​

Overview​

Install or Upgrade to Koordinator v0.2.0​

Install with helms​

Upgrade with helm​

Isolate resources for best-effort workloads​

Active eviction mechanism based on memory safety thresholds​

CPU Burst - Improve the performance of latency-sensitive applications​

More​

Overview​

Key Features​

Node Metrics​

Colocation Resources​

Cluster-level Colocation Profile​

CPU Suppress​

Colocation Resources Balance​

Tutorial - Colocation of Spark Jobs​

Summary​

Install or Upgrade to Koordinator v0.3.0

Install with helms

Upgrade with helm

CPU Burst

How CPU Burst works

How to use CPU Burst

L3 cache and MBA resource isolation

How to use L3 cache and MBA resource isolation

Memory QoS

Background information

How to use Memory QoS

What Comes Next

Overview

Install or Upgrade to Koordinator v0.2.0

Install with helms

Upgrade with helm

Isolate resources for best-effort workloads

Active eviction mechanism based on memory safety thresholds

CPU Burst - Improve the performance of latency-sensitive applications

More

Overview

Key Features

Node Metrics

Colocation Resources

Cluster-level Colocation Profile

CPU Suppress

Colocation Resources Balance

Tutorial - Colocation of Spark Jobs

Summary