
· 7 min read
Joseph

Since Koordinator was open sourced in March this year, it has released seven versions, gradually bringing the core capabilities of the internal colocation system of Alibaba Group and Alibaba Cloud to the open-source community. Along the way it has drawn attention from the Kubernetes, big data, high-performance computing, and machine learning communities, the Koordinator community has gradually gained the support of contributors, and some enterprises have started to use Koordinator in production to solve real problems such as cost and colocation. Thanks to the efforts of the Koordinator community, we are very excited to announce that Koordinator 1.0 has been officially released.

In its early days, the Koordinator project focused on building the core colocation capability: differentiated SLOs. To make colocation easier to adopt, Koordinator provides the ClusterColocationProfile mechanism, which lets users colocate different workloads without modifying existing code and become familiar with colocation technology step by step. Koordinator then strengthened the node-side QoS guarantee mechanisms, providing capabilities including, but not limited to, CPU Suppress, CPU Burst, Memory QoS, L3 Cache/MBA resource isolation, and satisfaction-based eviction, which solve most node-side workload stability problems. Used together with the Koordinator Runtime Proxy component, it is also better compatible with the Kubernetes kubelet's native management mechanisms.

Koordinator also delivered innovations in task scheduling, QoS-aware scheduling, and descheduling: it built fine-grained CPU scheduling that is fully compatible with the Kubernetes CPU management mechanism, as well as load-aware scheduling based on the actual utilization of nodes. To help users manage resources better, Koordinator provides a resource Reservation capability, and it further enhances the Coscheduling and ElasticQuota Scheduling capabilities that already exist in the Kubernetes community, injecting new vitality into the task scheduling area. Koordinator also provides a brand-new descheduler framework that focuses on the extensibility and safety of the Descheduler.

Install or Upgrade to Koordinator v1.0.0

Install with Helm

Koordinator can be easily installed with Helm v3.5+. Helm is a simple command-line tool, and you can get it from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 1.0.0

Feature Highlights

Koordinator v1.0 does not add many new features overall; the main changes are the following.

Standalone API Repository

To make it easier to integrate and use the APIs defined by Koordinator, and to avoid introducing extra dependencies or dependency conflicts by depending on Koordinator itself, we have set up a standalone API repository: koordinator-sh/apis.

New ElasticQuota Webhook

In Koordinator v0.7, we made many enhancements on top of the ElasticQuota provided by the Kubernetes scheduling SIG, such as a tree-based (hierarchical) management mechanism and a fairness guarantee mechanism, which help you solve the problems encountered when using ElasticQuota. In Koordinator v1.0 we additionally provide an ElasticQuota webhook that, when you use the tree-based management mechanism, ensures new ElasticQuota objects follow the conventions and constraints defined by Koordinator (an illustrative example follows the list below):

  1. Except for the root node, the sum of the min of all child nodes must be less than the min of their parent node.
  2. The max of a child node is not restricted and may be larger than the max of its parent. Consider the following scenario: the cluster has two ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the entire dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota under it.
  3. A Pod cannot use a parent ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism far more complex, so this scenario is not supported for now.
  4. Only parent nodes can have child nodes attached; a child node cannot have children of its own.
  5. Changing the quota.scheduling.koordinator.sh/is-parent property of an ElasticQuota is not allowed for now.
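
For instance, assuming a parent ElasticQuota parentB with a min of 100 CPU (names and numbers are hypothetical; the labels are the ones described in the v0.7 notes later in this post), the webhook would reject a child like the following sketch because its min alone already exceeds the parent's min (constraint 1), while its large max would still be allowed (constraint 2):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: childB1                                        # hypothetical child quota
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "parentB"  # hypothetical parent with min = 100 CPU
spec:
  max:
    cpu: 200    # allowed: a child's max may exceed its parent's max
  min:
    cpu: 150    # rejected: the children's min would exceed the parent's min of 100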

Further Improvements to ElasticQuota Scheduling

In Koordinator v0.7, both the leader and the standby koord-scheduler Pods started the ElasticQuota controller and both updated ElasticQuota objects. In Koordinator v1.0 we fixed this issue and made sure that only the leader Pod starts the controller and updates ElasticQuota objects. We also addressed a potential issue where the ElasticQuota controller updated ElasticQuota objects too frequently: an object is now updated only when a change in one of its dimensions is detected, reducing the pressure that frequent updates put on the APIServer.

Further Improvements to Device Share Scheduling

In Koordinator v1.0, koordlet reports the GPU model and driver version to the Device CRD object, and koord-manager syncs them to the Node object by appending the corresponding labels:

apiVersion: v1
kind: Node
metadata:
  labels:
    kubernetes.io/gpu-driver: 460.91.03
    kubernetes.io/gpu-model: Tesla-T4
    ...
  name: cn-hangzhou.10.0.4.164
spec:
  ...
status:
  ...

Better Compatibility for Koordinator Runtime Proxy

In previous Koordinator versions, when koord-runtime-proxy and koordlet were installed together, Pods newly scheduled to a node could fail to create containers if koordlet became abnormal or was uninstalled/reinstalled. To solve this problem, koord-runtime-proxy now recognizes the special Pod label runtimeproxy.koordinator.sh/skip-hookserver=true; if a Pod carries this label, koord-runtime-proxy forwards CRI requests directly to the underlying runtime such as containerd or docker.
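
A hedged sketch of such a Pod (the label key and value come from the text above; the Pod itself, and its image, are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod            # hypothetical Pod used for illustration
  namespace: default
  labels:
    # koord-runtime-proxy forwards CRI requests for this Pod directly to the runtime
    runtimeproxy.koordinator.sh/skip-hookserver: "true"
spec:
  containers:
  - name: main
    image: nginx            # placeholder image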

Other Changes

You can check the GitHub release page for more changes, together with their authors and commit records.

· 34 min read
Joseph

Following the release of v0.6 [2], the Koordinator [1] community has, through its efforts, brought us the milestone v0.7 release. This version focuses on the task scheduling capabilities needed in machine learning and big data scenarios, such as Coscheduling, ElasticQuota, and fine-grained GPU sharing scheduling. It also improves scheduling diagnostics and analysis, and the descheduler has been made much safer, reducing the risk of descheduling.

Feature Highlights

1. Task Scheduling

1.1 Enhanced Coscheduling

Gang scheduling is a strategy used in concurrent systems to schedule multiple related processes onto different processors so that they run at the same time. Its core principle is to guarantee that all related processes start simultaneously, preventing an abnormality in some of them from blocking the whole group. For example, submitting a Job produces multiple tasks, and these tasks are expected to either all be scheduled successfully or all fail. This requirement is called All-or-Nothing, and the corresponding implementation is known as Gang Scheduling (or Coscheduling).
From the very beginning, Koordinator has aimed to support colocated scheduling of various Kubernetes workloads and to improve their runtime efficiency and reliability, which includes the All-or-Nothing jobs that are widespread in machine learning and big data. To meet the All-or-Nothing scheduling requirement, Koordinator v0.7.0 implements Enhanced Coscheduling on top of the community's existing Coscheduling.
Following Koordinator's principle of staying compatible with the community, Enhanced Coscheduling is fully compatible with the community Coscheduling implementation and the PodGroup CRD it depends on. Users who already use PodGroup can upgrade to Koordinator seamlessly.
In addition, Enhanced Coscheduling implements the following enhancements:

Support for both Strict and NonStrict modes

The difference between the two modes is that in Strict mode (the default), a scheduling failure rejects all Pods that have already been allocated resources and are in the Wait state, whereas NonStrict mode does not trigger this rejection. In NonStrict mode, when Pod A and Pod B belong to the same PodGroup, a scheduling failure of Pod A does not affect Pod B, which will continue to be scheduled. NonStrict mode is friendly to large Jobs and lets them finish scheduling faster, but it also increases the risk of resource deadlock. Koordinator will provide a solution to deadlocks under NonStrict mode in a later release.
To enable NonStrict mode, add the annotation gang.scheduling.koordinator.sh/mode=NonStrict to the PodGroup or the Pod.
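
A hedged sketch of a PodGroup with this annotation (the annotation key and value come from the text above; the API group scheduling.sigs.k8s.io/v1alpha1, the PodGroup name, and the minMember value are assumptions for illustration):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pod-group-a            # hypothetical PodGroup
  namespace: default
  annotations:
    # enable NonStrict mode for this gang
    gang.scheduling.koordinator.sh/mode: "NonStrict"
spec:
  minMember: 5                 # illustrative value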

Improved handling of PodGroup scheduling failures for more efficient retries

For example, suppose PodGroup A is associated with five Pods. The first three pass Filter/Score and enter the Wait phase, the fourth fails to schedule, and when the fifth Pod is scheduled it is rejected because the fourth has already failed. In the community Coscheduling implementation, a failed PodGroup is recorded in a cache-based lastDeniedPG object; scheduling is rejected while the cache entry has not expired and allowed again once it expires. The cache expiration time is therefore critical: setting it too long keeps Pods from getting a scheduling opportunity for a long time, while setting it too short causes frequent, wasted scheduling attempts.
Enhanced Coscheduling instead implements a retry mechanism based on a ScheduleCycle. Taking the scenario above as an example, the ScheduleCycle of the five Pods starts at 0 and the ScheduleCycle of the PodGroup starts at 1. Each time a Pod is attempted, its Pod ScheduleCycle is updated to the PodGroup ScheduleCycle. If one of the Pods fails to schedule, the current PodGroup ScheduleCycle is marked invalid, and all Pods whose ScheduleCycle is smaller than the PodGroup ScheduleCycle are subsequently rejected. Once all Pods of the PodGroup have attempted one round of scheduling, their Pod ScheduleCycles have all been updated to the current PodGroup ScheduleCycle, the PodGroup ScheduleCycle is advanced, and scheduling is marked as allowed again. This approach effectively avoids the drawbacks of expiration-time-based retries and makes retry scheduling depend entirely on the scheduling queue configuration.

Support Gang Scheduling across a group of multiple PodGroups

Some complex Jobs have multiple roles. Each role manages a batch of tasks and requires All-or-Nothing semantics, each role has a different MinMember requirement, and the roles themselves must also satisfy All-or-Nothing with respect to each other. As a result each role has its own PodGroup, and even when a PodGroup is satisfied it still has to wait until the PodGroups of the other roles are satisfied. The community Coscheduling cannot handle this scenario. The Enhanced Coscheduling implemented by Koordinator supports linking multiple PodGroups through an annotation, and it works across Namespaces. For example, if a user has two PodGroups named PodGroupA and PodGroupB, they can be linked as follows:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: podGroupA
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/groups: '["namespaceA/podGroupA", "namespaceB/podGroupB"]'
spec:
  ...

Support for a lightweight Gang protocol

If users do not want to create a PodGroup because it feels too cumbersome, they can instead put the same annotation gang.scheduling.koordinator.sh/name=<podGroupName> on a group of Pods to indicate that these Pods should be scheduled with Coscheduling. To set minMember, add the annotation gang.scheduling.koordinator.sh/min-available=<availableNum>. For example:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    gang.scheduling.koordinator.sh/name: "pod-group-a"
    gang.scheduling.koordinator.sh/min-available: "5"
  name: demo-pod
  namespace: default
spec:
  ...

1.2 ElasticQuota Scheduling

A medium-to-large company typically has multiple product and R&D teams sharing several fairly large Kubernetes clusters, whose CPU/Memory/Disk resources are managed centrally by a resource operations team. Before purchasing resources, the operations team usually asks each team in the company to submit a quota budget based on its needs, and business teams generally estimate their budgets according to the current business and future expectations. Ideally every bit of quota would be used, but reality tells us that is not the case. Common problems include:

  1. Team A overestimates its business growth and requests more quota than it can use.
  2. Team B underestimates its business growth and its quota is not enough.
  3. Team C plans a campaign and its remaining quota is insufficient, but the campaign only lasts a few weeks, so requesting a lot of extra quota and resources would also be wasteful.
  4. Team D has its own sub-teams and businesses, and each sub-team runs into the same situations as teams A, B, and C. Moreover, some of them occasionally have urgent computing tasks to deliver to customers but have no quota left, and the budget approval process is not enough to cover it either.
  5. ......

These are scenarios everyone runs into day to day, and in colocation and big data scenarios temporary bursts of demand appear all the time, which makes quota management very challenging. Good quota management must, on the one hand, avoid over-purchasing resources to reduce cost, and avoid purchasing resources (or purchase as few as possible) when quota is only needed temporarily; on the other hand, quota must not limit resource utilization, otherwise even good technology for sharing resources cannot deliver its value. In short, quota management is a long-term problem that most companies and organizations have to face.
Kubernetes ResourceQuota solves part of the quota management problem. The native Kubernetes ResourceQuota API specifies the maximum resource quota of each Namespace and enforces it through the admission mechanism: if the total resources currently allocated in a Namespace exceed the quota specified by the ResourceQuota, new Pods are rejected. ResourceQuota has a design limitation, though: quota usage is aggregated from Pod requests. While this guarantees that actual consumption never exceeds the ResourceQuota limit, it can lead to low utilization, because some Pods may have requested resources but failed to be scheduled.
The Kubernetes scheduling SIG later proposed a design called ElasticQuota, inspired by YARN Capacity Scheduling, together with a concrete implementation. It lets users set max and min:

  • max is the upper limit of resources the user may consume.
  • min is the minimum amount of resources that must be guaranteed for the user's basic functionality/performance.

These two parameters help users meet the following needs:

  1. When min < max is configured and a burst of demand arrives, the user can keep creating new Pods for new tasks as long as the ElasticQuota's total usage has not reached max, even if it already exceeds min.
  2. When more resources are needed, a user can "borrow" from other ElasticQuotas the portion of their guaranteed min that is not yet being used.
  3. When an ElasticQuota needs its min resources back, it reclaims them from the borrowers through preemption, i.e. by evicting Pods of other ElasticQuotas whose usage exceeds their min.

ElasticQuota still has some limitations: fairness is not well guaranteed. If one ElasticQuota creates a large number of new Pods, it may consume all the quota that could otherwise be borrowed, so later Pods of other quotas may get none, and the only remedy is to reclaim some quota through preemption.
In addition, both ElasticQuota and Kubernetes ResourceQuota are Namespace-scoped and do not support multi-level tree structures, so enterprises and organizations with complex internal structures cannot use ElasticQuota or Kubernetes ResourceQuota well for quota management.
To address these quota management problems, Koordinator provides an elastic quota management mechanism that supports multi-level hierarchies (multi hierarchy quota management), built on top of the community ElasticQuota. It has the following characteristics:

  • Compatible with the community ElasticQuota API; users can upgrade to Koordinator seamlessly.
  • Supports managing quotas in a tree structure.
  • Supports fairness guarantees based on shared weight.
  • Lets users decide whether a quota may be lent to other consumers.

Associating Pods with an ElasticQuota

This capability is very easy to use. You can follow the original ElasticQuota usage exactly, i.e. create one ElasticQuota object per Namespace, or you can associate a Pod with an ElasticQuota by adding a label:

apiVersion: v1
kind: Pod
metadata:
  labels:
    quota.scheduling.koordinator.sh/name: "elastic-quota-a"
  name: demo-pod
  namespace: default
spec:
  ...

Hierarchical (tree) management and how to use it

To manage quotas in a tree structure, add the label quota.scheduling.koordinator.sh/is-parent to an ElasticQuota to indicate whether it is a parent node, and quota.scheduling.koordinator.sh/parent to specify the name of its parent ElasticQuota. For example:
We create a root ElasticQuota with total resources of 100 CPU cores and 200Gi of memory, together with a child node:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: parentA
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 100
    memory: 200Gi
  min:
    cpu: 100
    memory: 200Gi
---
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: childA1
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "parentA"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
spec:
  max:
    cpu: 40
    memory: 100Gi
  min:
    cpu: 20
    memory: 40Gi

When managing ElasticQuota in a tree structure, there are several constraints to follow:

  1. Except for the root node, the sum of the min of all child nodes must be less than the min of their parent node.
  2. The max of a child node is not restricted and may be larger than the max of its parent. Consider the following scenario: the cluster has two ElasticQuota subtrees, dev-parent and production-parent, each with several child ElasticQuotas. When production-parent is busy, we can limit the resource usage of the entire dev-parent subtree by lowering only the max of dev-parent, instead of lowering the max of every child ElasticQuota under it.
  3. A Pod cannot use a parent ElasticQuota. Lifting this restriction would make the whole elastic quota mechanism far more complex, so this scenario is not supported for now.
  4. Only parent nodes can have child nodes attached; a child node cannot have children of its own.
  5. Changing the quota.scheduling.koordinator.sh/is-parent property of an ElasticQuota is not allowed for now.

We will enforce these constraints through a webhook mechanism in the next release.

Fairness guarantee mechanism

To make the fairness mechanism described below easier to read and understand, let us first define a few concepts:

  • request is the total resource requests of all Pods associated with the same ElasticQuota. If ElasticQuota A's request is less than its min and ElasticQuota B's request is greater than its min, the unused part of A, i.e. the remaining min - request, is lent to ElasticQuota B through the fairness mechanism. When ElasticQuota A needs this lent amount back, ElasticQuota B is required to return it to A, again according to the fairness mechanism.
  • runtime is the actual amount of resources an ElasticQuota can currently use. If request is less than min, runtime equals request; this also means that, following the min semantics, the request should be satisfied unconditionally. If request is greater than min and min is less than max, the fairness mechanism assigns a runtime between min and max, i.e. max >= runtime >= min.
  • shared-weight represents the competitiveness of an ElasticQuota and defaults to the ElasticQuota's max (see the sketch after this list).
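
As a hedged sketch of overriding the default weight (assuming an annotation key quota.scheduling.koordinator.sh/shared-weight and a ResourceList-style JSON value, neither of which is shown in this post), a quota such as B in the walkthrough that follows could declare:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-b                  # hypothetical quota, loosely matching "B" below
  namespace: default
  annotations:
    # assumed annotation key and value format; overrides the default shared-weight (= max)
    quota.scheduling.koordinator.sh/shared-weight: '{"cpu": 60}'
spec:
  max:
    cpu: 60
  min:
    cpu: 15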

Let us walk through how the fairness mechanism works with a few examples. Assume the current cluster has 100 CPU cores in total and four ElasticQuotas, as shown in the figure below, where the green part is the request: A currently requests 5, B requests 20, C requests 30, and D requests 70.
[figure: the four ElasticQuotas A, B, C, and D with their current requests shown in green]
We also note that the min of A, B, C, and D adds up to 60, leaving 40 idle. In addition, A can lend 5 of its quota to B, C, and D, so a total of 45 is shared among B, C, and D. Based on the shared-weight of each ElasticQuota, which is 60, 50, and 80 for B, C, and D respectively, the amount each of them can get is computed as follows (the general formula follows this list):

  • B can get 14: 45 * 60 / (60 + 50 + 80) ≈ 14
  • C can get 12: 45 * 50 / (60 + 50 + 80) ≈ 12
  • D can get 19: 45 * 80 / (60 + 50 + 80) ≈ 19
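
Reading the calculation above in general terms (our own summary, not a formula given in the original post), each borrowing quota's provisional share of the lendable amount is proportional to its shared-weight:

$$\mathrm{share}_i = \mathrm{free} \times \frac{w_i}{\sum_{j \in \mathrm{borrowers}} w_j}$$

where free is the total lendable amount (45 in this example) and $w_i$ is the shared-weight of quota $i$.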

Note, however, that C and D need more quota, while B only needs 5 more to satisfy its request, and B's min is 15. This means we only need to give B 5; the remaining 9 continue to be distributed to C and D:

  • C can get 3 more: 9 * 50 / (50 + 80) ≈ 3
  • D can get 6 more: 9 * 80 / (50 + 80) ≈ 6


The final allocation result is:

  • A runtime = 5
  • B runtime = 20
  • C runtime = 35
  • D runtime = 40


Summarizing the whole process:

  1. When request < min, the quota lends out the unused part (lent-to-quotas); when request > min, it needs to borrow (borrowed-quotas).
  2. Add up the amounts of all quotas whose runtime is less than min; this total is what can be lent out in the next step.
  3. Compute how much each ElasticQuota can borrow according to its shared-weight.
  4. If the resulting runtime is greater than request, the surplus runtime - request can be lent to quotas that need it more.

There is another situation that occurs in day-to-day production: the total amount of resources in the cluster decreases due to node failures, resource operations, and so on, so that the sum of all ElasticQuotas' min exceeds the total resources. When this happens the min guarantees can no longer all be met, so we scale down the min of each ElasticQuota by a certain ratio to ensure that the sum of all min values is less than or equal to the actual total resources.
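
One plausible way to read "by a certain ratio" is a proportional scale-down (an assumption for illustration; the exact policy is not spelled out here):

$$\mathrm{min}_i' = \mathrm{min}_i \times \frac{\mathrm{totalResources}}{\sum_j \mathrm{min}_j} \quad \text{when} \quad \sum_j \mathrm{min}_j > \mathrm{totalResources}$$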

Preemption mechanism

With the Koordinator ElasticQuota mechanism, if quota is found to be insufficient during scheduling, the scheduler enters a preemption phase and preempts lower-priority Pods within the same ElasticQuota, sorted by priority. Preempting Pods across ElasticQuotas is not supported. However, we also provide a separate mechanism to reclaim quota from ElasticQuotas that have borrowed it.
For example, suppose a cluster has two ElasticQuotas: ElasticQuota A {min = 50, max = 100} and ElasticQuota B {min = 50, max = 100}. At 10 a.m. a user submits a Job with Request = 100 using ElasticQuota A. Because nobody is using ElasticQuota B, A can borrow 50 quota from B, so Request = 100 is satisfied and Used = 100. At 11 a.m. another user submits a Job with Request = 100 using ElasticQuota B. Since B's min = 50 must be guaranteed, the fairness mechanism sets the runtime of both A and B to 50. For ElasticQuota A, Used = 100 is now greater than runtime = 50, so we provide a controller that evicts some Pods until A's Used drops to the level of its runtime.

2. Fine-grained Resource Scheduling

Device Share Scheduling

Machine learning relies on large numbers of powerful GPU devices for model training, but GPUs themselves are very expensive. How to make better use of GPU devices, unlock their value, and reduce cost is a problem that urgently needs to be solved. In the Kubernetes community's existing GPU allocation mechanism, GPUs are allocated by the kubelet, and only one or more whole GPU instances can be allocated. This approach is simple and reliable, but, just like CPU and memory, GPUs are not always highly utilized, so resources are also wasted. Therefore, Koordinator wants to support sharing GPU devices among multiple workloads to save cost. In addition, GPUs have their own particularities: for example, the NVIDIA NVLink topology shown below and GPU overcommitment scenarios both require central decisions by the scheduler to obtain a globally optimal allocation.
[figure: an 8-GPU node showing the NVLink connections between GPU instances]

As the figure shows, although the node has eight GPU instances of model A100/V100, the data transfer speed between GPU instances differs. When a Pod needs multiple GPU instances, we can allocate to the Pod the combination of GPU instances with the maximum data transfer speed. Furthermore, when we want the GPU instances used by a group of Pods to have the maximum data transfer speed between them, the scheduler should batch-allocate the best GPU instances to these Pods and place them on the same node.

GPU Resource Protocol

Koordinator is compatible with the community's existing nvidia.com/gpu resource protocol, and it also defines an extended resource protocol that lets users allocate GPU resources at a finer granularity:

  • kubernetes.io/gpu-core represents the compute capacity of a GPU. Similar to Kubernetes MilliCPU, the total compute capacity of a GPU is abstracted as 100, and users can request the corresponding amount of GPU compute capacity as needed.
  • kubernetes.io/gpu-memory represents the memory capacity of the GPU, in bytes.
  • kubernetes.io/gpu-memory-ratio represents a percentage of the GPU memory.

Assume a node has four GPU device instances, each with 8Gi of memory. If a user wants to request a whole GPU instance, besides using nvidia.com/gpu, the request can also be written as follows:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"
      requests:
        kubernetes.io/gpu-core: 100
        kubernetes.io/gpu-memory: "8Gi"

If you only want to use half of the resources of one GPU instance, you can request:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"
      requests:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory: "4Gi"
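
A hypothetical variant of the same half-instance request written with kubernetes.io/gpu-memory-ratio (a percentage, as defined above) instead of an absolute memory size might look like the following sketch; it is not an example from the original post:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod               # hypothetical Pod for illustration
  namespace: default
spec:
  containers:
  - name: main
    resources:
      limits:
        kubernetes.io/gpu-core: 50
        # 50% of a single GPU instance's memory (4Gi on the 8Gi instances assumed above)
        kubernetes.io/gpu-memory-ratio: 50
      requests:
        kubernetes.io/gpu-core: 50
        kubernetes.io/gpu-memory-ratio: 50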

Device Information and Capacity Reporting

In Koordinator v0.7.0, after koordlet is installed on a node it automatically detects whether the node has GPU devices and, if so, reports the minor ID, UUID, compute capacity, and memory size of these GPU devices to a Device CRD object. Each node corresponds to one Device CRD instance. The Device CRD is designed to describe not only GPUs but also device types such as FPGA and RDMA; v0.7.0 only supports GPUs, and the other device types are not supported yet.
The Device CRD is consumed by the NodeResource controller in koord-manager and by koord-scheduler. Based on the information described in the Device CRD, the NodeResource controller converts it into the resource protocols kubernetes.io/gpu-core and kubernetes.io/gpu-memory supported by Koordinator and updates the Node.Status.Allocatable and Node.Status.Capacity fields, helping the scheduler and kubelet complete resource scheduling. gpu-core represents the compute capacity of a GPU device instance, where one full instance equals 100; if a node has 8 GPU device instances, the node's gpu-core capacity is 8 * 100 = 800. gpu-memory represents the memory size of a GPU device instance in bytes, and the total allocatable memory of the node is the number of devices multiplied by the capacity of each instance; for example, if a GPU device has 8G of memory and the node has 8 GPU instances, the total is 8 * 8G = 64G.

apiVersion: v1
kind: Node
metadata:
  name: node-a
status:
  capacity:
    koordinator.sh/gpu-core: 800
    koordinator.sh/gpu-memory: "64Gi"
    koordinator.sh/gpu-memory-ratio: 800
  allocatable:
    koordinator.sh/gpu-core: 800
    koordinator.sh/gpu-memory: "64Gi"
    koordinator.sh/gpu-memory-ratio: 800

Central Scheduling of Device Resources

In the native device scheduling mechanism provided by the Kubernetes community, the scheduler only checks whether device capacity can satisfy the Pod, which is enough for simple device types; but when GPUs need to be allocated at a finer granularity, support from the central scheduler is required to achieve a globally optimal result.
The Koordinator scheduler koord-scheduler adds a new scheduling plugin, DeviceShare, which is responsible for fine-grained device resource scheduling. The DeviceShare plugin consumes the Device CRD and records the device information that each node can allocate. During scheduling, DeviceShare converts the Pod's GPU resource requests into Koordinator's resource protocol and filters the unallocated GPU device instances on each node. After confirming that resources are available, it updates its internal state in the Reserve phase and updates the Pod annotation in the PreBind phase to record which GPU devices the Pod should use.
DeviceShare will support Binpacking and Spread policies in later releases to provide better device resource scheduling.

Precise Device Binding on the Node Side

The Kubernetes community provides the DevicePlugin mechanism in the kubelet, which gives device vendors a chance to obtain device information after the kubelet allocates devices and to populate environment variables or update mount paths. However, it cannot support centralized fine-grained GPU scheduling scenarios.
To address this problem, Koordinator extends koord-runtime-proxy so that environment variables are updated when the kubelet creates containers, injecting the GPU device information allocated by the scheduler.

3. Scheduler Diagnostics and Analysis

People often run into scheduling-related questions when using Kubernetes:

  1. Why can't this Pod be scheduled?
  2. Why was this Pod scheduled to this node? Shouldn't it have been influenced by another scoring plugin?
  3. I developed a new plugin and found that the scheduling results are not as expected, but I don't know what went wrong.

Diagnosing and analyzing these problems requires not only mastering the basic scheduling and resource allocation mechanisms of Kubernetes, but also support from the scheduler itself. However, the diagnostic capabilities provided by Kubernetes kube-scheduler are rather limited, and sometimes there is hardly any log to look at. kube-scheduler natively supports changing the log level via HTTP to obtain more log information; for example, the following command changes the log level to 5:

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/v --data '5' 
successfully set klog.logging.verbosity to 5

To address these problems, Koordinator implements a set of RESTful APIs that help users diagnose and analyze issues more efficiently.

Analyzing Score Results

PUT /debug/flags/s lets users turn on the Debug Score switch; after scoring finishes, the scores of each plugin for the top N nodes are printed in Markdown format. For example:

$ curl -X PUT schedulerLeaderIP:10251/debug/flags/s --data '100'
successfully set debugTopNScores to 100

When a new Pod is scheduled, the scheduler log contains information like the following:

| # | Pod | Node | Score | ImageLocality | InterPodAffinity | LoadAwareScheduling | NodeAffinity | NodeNUMAResource | NodeResourcesBalancedAllocation | NodeResourcesFit | PodTopologySpread | Reservation | TaintToleration |
| --- | --- | --- | ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 0 | 0 | 87 | 0 | 0 | 96 | 94 | 200 | 0 | 100 |
| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 0 | 0 | 85 | 0 | 0 | 96 | 93 | 200 | 0 | 100 |
| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 0 | 0 | 55 | 0 | 0 | 95 | 91 | 200 | 0 | 100 |
| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 0 | 0 | 15 | 0 | 0 | 90 | 82 | 200 | 0 | 100 |

Rendered with a Markdown tool, this becomes the following table:

| # | Pod | Node | Score | LoadAwareScheduling | NodeNUMAResource | NodeResourcesFit | PodTopologySpread |
| --- | --- | --- | ---:| ---:| ---:| ---:| ---:|
| 0 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.51 | 577 | 87 | 0 | 94 | 200 |
| 1 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.50 | 574 | 85 | 0 | 93 | 200 |
| 2 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.19 | 541 | 55 | 0 | 91 | 200 |
| 3 | default/curlimage-545745d8f8-rngp7 | cn-hangzhou.10.0.4.18 | 487 | 15 | 0 | 82 | 200 |

Exporting the Internal State of Scheduling Plugins

Plugins inside koord-scheduler such as NodeNUMAResource, DeviceShare, and ElasticQuota all maintain some internal state to assist scheduling. koord-scheduler defines a new plugin extension interface (see below) and, after initializing a plugin, checks whether the plugin implements this interface; if so, it calls the interface so that the plugin can register the RESTful APIs it wants to expose. Taking the NodeNUMAResource plugin as an example, it provides the two endpoints /cpuTopologyOptions/:nodeName and /availableCPUs/:nodeName, which expose the CPU topology information and allocation results recorded inside the plugin.

type APIServiceProvider interface {
    RegisterEndpoints(group *gin.RouterGroup)
}

To view the data, build a URL of the form /apis/v1/plugins/<pluginName>/<pluginEndpoints>. For example, to view /cpuTopologyOptions/:nodeName:

$ curl schedulerLeaderIP:10252/apis/v1/plugins/NodeNUMAResources/cpuTopologyOptions/node-1
{"cpuTopology":{"numCPUs":32,"numCores":16,"numNodes":1,"numSockets":1,"cpuDetails":....

Viewing the Currently Supported Plugin APIs

For convenience, koord-scheduler provides /apis/v1/__services__ to list the supported API endpoints:

$ curl schedulerLeaderIP:10251/apis/v1/__services__
{
  "GET": [
    "/apis/v1/__services__",
    "/apis/v1/nodes/:nodeName",
    "/apis/v1/plugins/Coscheduling/gang/:namespace/:name",
    "/apis/v1/plugins/DeviceShare/nodeDeviceSummaries",
    "/apis/v1/plugins/DeviceShare/nodeDeviceSummaries/:name",
    "/apis/v1/plugins/ElasticQuota/quota/:name",
    "/apis/v1/plugins/NodeNUMAResource/availableCPUs/:nodeName",
    "/apis/v1/plugins/NodeNUMAResource/cpuTopologyOptions/:nodeName"
  ]
}

4. Safer Descheduling

In Koordinator v0.6 we released a brand-new koord-descheduler that supports implementing the required descheduling strategies and custom eviction mechanisms as plugins, and it has a built-in migration controller for PodMigrationJob that reserves resources through the Koordinator Reservation mechanism, so eviction is only initiated when resources are available. This solves the availability problem where an application is affected because an evicted Pod has no resources to run on.
In Koordinator v0.7, koord-descheduler implements safer descheduling:

  • Eviction rate limiting: users can configure rate-limiting policies as needed, for example how many Pods may be evicted per minute.
  • Namespace-scoped gray rollout: descheduling can be enabled for selected Namespaces so that users can roll it out gradually with more confidence.
  • Eviction limits per Node/Namespace: for example, if the node-level limit is configured as at most two evictions, requests from plugins to evict more Pods on that node will be rejected.
  • Workload awareness: if a workload is rolling out or scaling in, already has a certain number of Pods being evicted, or has some Pods that are NotReady, the descheduler rejects new descheduling requests for it. Native Deployment and StatefulSet, as well as Kruise CloneSet and Kruise AdvancedStatefulSet, are currently supported.

In the future, the descheduler will also improve fairness to avoid repeatedly descheduling the same workload and to minimize the impact of descheduling on application availability.

5. Other Changes

  • Koordinator further enhances fine-grained CPU scheduling and is fully compatible with the kubelet (<= v1.22) CPU Manager static policy. When allocating CPUs, the scheduler avoids the CPUs reserved by the kubelet, and on the node side koordlet fully adapts to the kubelet allocation policies from v1.18 to v1.22, effectively avoiding CPU conflicts.
  • The resource reservation mechanism supports AllocateOnce semantics to cover one-shot reservation scenarios, and the Reservation status semantics were improved to describe the current state of a Reservation object more accurately.
  • The declaration of Batch resources (Batch CPU/Memory) for colocated workloads was improved to support resource descriptions where limit is greater than request, making it easy to convert originally Burstable workloads to run in colocation mode.

You can check the GitHub release [6] page for more changes, together with their authors and commit records.

Related Links

· 9 min read
Joseph

We are happy to announce the release of Koordinator v0.6.0. Koordinator v0.6.0 brings complete fine-grained CPU orchestration, a Resource Reservation mechanism, a safe Pod Migration mechanism, and a Descheduling Framework.

Install or Upgrade to Koordinator v0.6.0

Install with Helm

Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool, and you can get it from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.6.0

Upgrade with helm

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.6.0 [--force]

For more details, please refer to the installation manual.

Fine-grained CPU Orchestration

In Koordinator v0.5.0, we designed and implemented basic CPU orchestration capabilities. The koord-scheduler supports different CPU bind policies to help LSE/LSR Pods achieve better performance.

Now in the v0.6 version, we have basically completed the CPU orchestration capabilities originally designed, such as:

  • Support default CPU bind policy configured by koord-scheduler for LSR/LSE Pods that do not specify a CPU bind policy
  • Support CPU exclusive policy that supports PCPULevel and NUMANodeLevel, which can spread the CPU-bound Pods to different physical cores or NUMA Nodes as much as possible to reduce the interference between Pods.
  • Support Node CPU Orchestration API to help cluster administrators control the CPU orchestration behavior of nodes. The label node.koordinator.sh/cpu-bind-policy constrains how logical CPUs are bound during scheduling. If it is set to FullPCPUsOnly, the scheduler must allocate full physical cores, which is equivalent to the kubelet CPU manager policy option full-pcpus-only=true. If the node does not have the node.koordinator.sh/cpu-bind-policy label, the policy configured on the Pod or in koord-scheduler is used. The label node.koordinator.sh/numa-allocate-strategy indicates how to choose satisfied NUMA Nodes when scheduling; MostAllocated and LeastAllocated are supported (see the labeling sketch after this list).
  • koordlet supports LSE Pods and improves compatibility with existing Guaranteed Pods that use the static CPU Manager policy.
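
A hedged sketch of the Node CPU Orchestration API described above (the label keys and values come from the list; the node name is hypothetical):

apiVersion: v1
kind: Node
metadata:
  name: node-example                                    # hypothetical node
  labels:
    # require the scheduler to allocate full physical cores on this node
    node.koordinator.sh/cpu-bind-policy: "FullPCPUsOnly"
    # prefer NUMA Nodes that are already most allocated when choosing where to bind
    node.koordinator.sh/numa-allocate-strategy: "MostAllocated"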

Please check out our user manual for a detailed introduction and tutorial.

Resource Reservation

We completed the Resource Reservation API design proposal in v0.5, and implemented the basic Reservation mechanism in the current v0.6 version.

When you want to use the Reservation mechanism to reserve resources, you do not need to modify the Pod or the existing workloads (e.g. Deployment, StatefulSet). koord-scheduler provides a simple API named Reservation, which allows us to reserve node resources for specified pods or workloads even if they haven't been created yet. You only need to write the Pod template and the owner information in the ReservationSpec when creating a Reservation. When koord-scheduler perceives a new Reservation object, it allocates resources to it through the normal Pod scheduling process. After scheduling, koord-scheduler updates the success or failure information in the ReservationStatus. If the reservation succeeds and the OwnerReference or labels of a newly created Pod satisfy the owner information declared earlier, the new Pod directly reuses the resources held by the Reservation. When that Pod is destroyed, the Reservation object can be reused until it expires.
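
A minimal sketch of such a Reservation, assuming the scheduling.koordinator.sh/v1alpha1 API group (the same group used by PodMigrationJob later in this post) and using the spec fields described above and in the v0.5 API definition further down (template, owners, ttl); all concrete names and values are illustrative:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo          # hypothetical name; Reservation objects are non-namespaced
spec:
  ttl: 1h                         # how long the reservation stays available once created
  template:                       # scheduled just like a normal Pod
    spec:
      containers:
      - name: main
        image: nginx              # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
  owners:                         # Pods matching this selector may reuse the reserved resources
  - labelSelector:
      matchLabels:
        app: reservation-demo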

image

The resource reservation mechanism can help solve or optimize the problems in the following scenarios:

  1. Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to "lock" resources preventing from allocation of other pods with the same or higher priority.
  2. Descheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted.
  3. Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas.
  4. Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost.

Pod Migration Job

Migrating Pods is an important capability that many components (such as descheduler) rely on, and can be used to optimize scheduling or help resolve workload runtime quality issues. We believe that pod migration is a complex process, involving steps such as auditing, resource allocation, and application startup, and is mixed with application upgrading, scaling scenarios, resource operation and maintenance operations by cluster administrators. Therefore, how to manage the stability risk of this process to ensure that the application does not fail due to the migration of Pods is a very critical issue that must be resolved.

The descheduler in the K8s community evicts pods according to different strategies. However, it does not guarantee that an evicted Pod will have resources available after re-creation. If the resources in the cluster are tight and a large number of newly created Pods end up in the Pending state, application availability may suffer.

Koordinator defines a CRD-based Migration/Eviction API named PodMigrationJob, through which the descheduler or other components can evict or delete Pods more safely. With PodMigrationJob we can track the status of each step of the migration and perceive scenarios such as application upgrades and scaling.

It's simple to use the PodMigrationJob API. Create a PodMigrationJob with the YAML file below to migrate pod-demo-0.

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: PodMigrationJob
metadata:
  name: migrationjob-demo
spec:
  paused: false
  ttl: 5m
  mode: ReservationFirst
  podRef:
    namespace: default
    name: pod-demo-5f9b977566-c7lvk
status:
  phase: Pending
$ kubectl create -f migrationjob-demo.yaml
podmigrationjob.scheduling.koordinator.sh/migrationjob-demo created

Then you can query the migration status and the migration events:

$ kubectl get podmigrationjob migrationjob-demo
NAME PHASE STATUS AGE NODE RESERVATION PODNAMESPACE POD NEWPOD TTL
migrationjob-demo Succeed Complete 37s node-1 d56659ab-ba16-47a2-821d-22d6ba49258e default pod-demo-5f9b977566-c7lvk pod-demo-5f9b977566-nxjdf 5m0s

$ kubectl describe podmigrationjob migrationjob-demo
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ReservationCreated 8m33s koord-descheduler Successfully create Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"
Normal ReservationScheduled 8m33s koord-descheduler Assigned Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" to node "node-1"
Normal Evicting 8m33s koord-descheduler Try to evict Pod "default/pod-demo-5f9b977566-c7lvk"
Normal EvictComplete 8m koord-descheduler Pod "default/pod-demo-5f9b977566-c7lvk" has been evicted
Normal Complete 8m koord-descheduler Bind Pod "default/pod-demo-5f9b977566-nxjdf" in Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"

Descheduling Framework

We implemented a brand new Descheduling Framework in v0.6.

The existing descheduler in the community can solve some problems, but we think there are still many aspects that can be improved. For example, it only supports periodic execution and has no event-triggered mode; it is not possible to extend and configure custom descheduling strategies without modifying the existing descheduler code, the way kube-scheduler's framework allows; and it does not support implementing a custom evictor.

We also noticed that the K8s descheduler community also found these problems and proposed corresponding solutions such as #753 Descheduler framework Proposal and PoC #781. The K8s descheduler community tries to implement a descheduler framework similar to the k8s scheduling framework. This coincides with our thinking.

Overall, these solutions solve most of our problems, but we noticed that the related implementations have not been merged into the main branch. We reviewed these implementations and discussions, and we believe this is the right direction. Considering that Koordinator has clear milestones for descheduler-related features, we will implement Koordinator's own descheduler independently of the upstream community. We try to reuse some of the designs in the #753 PR proposed by the community, and we will follow Koordinator's compatibility principle with K8s to stay compatible with the upstream community descheduler. Such an independent implementation can also drive the evolution of the upstream community's work on the descheduler framework, and when the upstream community makes new changes or switches to an architecture that Koordinator deems appropriate, Koordinator will follow up promptly and actively.

Based on this descheduling framework, it is very easy to stay compatible with the existing descheduling strategies in the K8s community, and users can implement and integrate their own descheduling plugins as easily as with the K8s Scheduling Framework. Users can also implement Controllers in the form of plugins to realize event-based descheduling scenarios. In addition, the framework integrates the MigrationController based on the PodMigrationJob API and uses it as the default Evictor plugin to help safely migrate Pods in various descheduling scenarios.

At present, we have implemented the main body of the framework, including the MigrationController based on PodMigrationJob, and it is usable as a whole. We also provide a demo descheduling plugin. In the future, we will migrate and stay compatible with the existing descheduling policies of the community, and provide a load-aware balancing descheduling plugin for co-location scenarios.

The current framework is still in the early stage of rapid evolution, and there are many details that need to be improved. Everyone who is interested is welcome to participate in building it. We hope that more people can implement the descheduling capabilities they need with more confidence and less effort.

About GPU Scheduling

There are also some new developments in GPU scheduling capabilities that everyone cares about.

During the iteration of v0.6, we completed the design of GPU Share Scheduling, and also completed the design of Gang Scheduling. Development of these capabilities is ongoing and will be released in v0.7.

In addition, in order to explore the mechanism of GPU overcommitment, we have implemented the ability to report GPU Metric in v0.6.

What’s coming next in Koordinator

Don't forget that Koordinator is developed in the open. You can check out our Github milestone to know more about what is happening and what we have planned. Hope it helps!

· 8 min read
Jason

In addition to the usual updates to supporting utilities, Koordinator v0.5 adds a couple of new useful features we think you'll like.

Install or Upgrade to Koordinator v0.5.0

Install with Helm

Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool, and you can get it from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.5.0

Upgrade with helm

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.5.0 [--force]

For more details, please refer to the installation manual.

Fine-grained CPU Orchestration

In this version, we introduced a fine-grained CPU orchestration. Pods in the Kubernetes cluster may interfere with others' running when they share the same physical resources and both demand many resources. The sharing of CPU resources is almost inevitable. e.g. SMT threads (i.e. logical processors) share execution units of the same core, and cores in the same chip share one last-level cache. The resource contention can slow down the running of these CPU-sensitive workloads, resulting in high response latency (RT).

To improve the performance of CPU-sensitive workloads, koord-scheduler provides a mechanism of fine-grained CPU orchestration. It enhances the CPU management of Kubernetes and supports detailed NUMA-locality and CPU exclusions.

Please check out our user manual for a detailed introduction and tutorial.

Resource Reservation

Pods are fundamental units for allocating node resources in Kubernetes, which bind resource requirements with business logic. The scheduler is not able to reserve node resources for specific pods or workloads. We may try using a fake pod to prepare resources by the preemption mechanism. However, fake pods can be preempted by any scheduled pods with higher priorities, which make resources get scrambled unexpectedly.

In Koordinator, a resource reservation mechanism is proposed to enhance scheduling and especially benefits scenarios below:

  1. Preemption: Existing preemption does not guarantee that only preempting pods can allocate preempted resources. With a reservation, the scheduler should be able to "lock" resources preventing from allocation of other pods with the same or higher priority.
  2. De-scheduling: For the descheduler, it is better to ensure sufficient resources with the reservation before pods get rescheduled. Otherwise, rescheduled pods may not be runnable anymore and make the belonging application disrupted.
  3. Horizontal scaling: Using reservation to achieve more deterministic horizontal scaling. e.g. Submit a reservation and make sure it is available before scaling up replicas.
  4. Resource Pre-allocation: Sometimes we want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable. Reservation can help with this and it should make no physical cost.

This feature is still under development. We've finalized the API, feel free to check it out.

type Reservation struct {
    metav1.TypeMeta `json:",inline"`
    // A Reservation object is non-namespaced.
    // It can reserve resources for pods of any namespace. Any affinity/anti-affinity of reservation scheduling can be
    // specified in the pod template.
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec              ReservationSpec   `json:"spec,omitempty"`
    Status            ReservationStatus `json:"status,omitempty"`
}

type ReservationSpec struct {
    // Template defines the scheduling requirements (resources, affinities, images, ...) processed by the scheduler just
    // like a normal pod.
    // If the `template.spec.nodeName` is specified, the scheduler will not choose another node but reserve resources on
    // the specified node.
    Template *corev1.PodTemplateSpec `json:"template,omitempty"`
    // Specify the owners who can allocate the reserved resources.
    // Multiple owner selectors are ANDed.
    Owners []ReservationOwner `json:"owners,omitempty"`
    // By default, the resources requirements of reservation (specified in `template.spec`) is filtered by whether the
    // node has sufficient free resources (i.e. ReservationRequest < NodeFree).
    // When `preAllocation` is set, the scheduler will skip this validation and allow overcommitment. The scheduled
    // reservation would be waiting to be available until free resources are sufficient.
    PreAllocation bool `json:"preAllocation,omitempty"`
    // Time-to-Live period for the reservation.
    // `expires` and `ttl` are mutually exclusive. If both `ttl` and `expires` are not specified, a very
    // long TTL will be picked as default.
    TTL *metav1.Duration `json:"ttl,omitempty"`
    // Expired timestamp when the reservation expires.
    // `expires` and `ttl` are mutually exclusive. Defaults to being set dynamically at runtime based on the `ttl`.
    Expires *metav1.Time `json:"expires,omitempty"`
}

type ReservationStatus struct {
    // The `phase` indicates whether the reservation is waiting to be processed (`Pending`), available to allocate
    // (`Available`) or expired to get cleaned up (`Expired`).
    Phase ReservationPhase `json:"phase,omitempty"`
    // The `conditions` indicate the messages of reason why the reservation is still pending.
    Conditions []ReservationCondition `json:"conditions,omitempty"`
    // Current resource owners which allocated the reservation resources.
    CurrentOwners []corev1.ObjectReference `json:"currentOwners,omitempty"`
}

type ReservationOwner struct {
    // Multiple field selectors are ORed.
    Object        *corev1.ObjectReference         `json:"object,omitempty"`
    Controller    *ReservationControllerReference `json:"controller,omitempty"`
    LabelSelector *metav1.LabelSelector           `json:"labelSelector,omitempty"`
}

type ReservationControllerReference struct {
    // Extend with a `namespace` field for referencing different namespaces.
    metav1.OwnerReference `json:",inline"`
    Namespace             string `json:"namespace,omitempty"`
}

type ReservationPhase string

const (
    // ReservationPending indicates the Reservation has not been processed by the scheduler or is unschedulable for
    // some reasons (e.g. the resource requirements cannot get satisfied).
    ReservationPending ReservationPhase = "Pending"
    // ReservationAvailable indicates the Reservation is both scheduled and available for allocation.
    ReservationAvailable ReservationPhase = "Available"
    // ReservationWaiting indicates the Reservation is scheduled, but the resources to reserve are not ready for
    // allocation (e.g. in pre-allocation for running pods).
    ReservationWaiting ReservationPhase = "Waiting"
    // ReservationExpired indicates the Reservation is expired, which the object is not available to allocate and will
    // get cleaned in the future.
    ReservationExpired ReservationPhase = "Expired"
)

type ReservationCondition struct {
    LastProbeTime      metav1.Time `json:"lastProbeTime"`
    LastTransitionTime metav1.Time `json:"lastTransitionTime"`
    Reason             string      `json:"reason"`
    Message            string      `json:"message"`
}

QoS Manager

Currently, the plugins from resmanager in Koordlet are mixed together; they should be classified into two categories: static and dynamic. Static plugins are called and run only once when a container is created, updated, started, or stopped. Dynamic plugins, however, may be called and run at any time according to the real-time runtime state of the node, such as CPU suppress, CPU burst, etc. This proposal only focuses on refactoring the dynamic plugins. Looking at the current plugin implementation, there are many direct function calls to resmanager's methods, such as collecting node/pod/container metrics, fetching metadata of node/pod/container, and fetching configurations (NodeSLO, etc.). In the future, we may need a flexible and powerful framework with scalability for special external plugins.

Below is the directory tree of qos-manager inside koordlet; all existing dynamic plugins (as built-in plugins) will be moved into the sub-directory plugins.

pkg/koordlet/qosmanager/
- manager.go
- context.go        // plugin context
- /plugins/         // built-in plugins
  - /cpuburst/
  - /cpusuppress/
  - /cpuevict/
  - /memoryevict/

We only have the proposal in this version. Stay tuned, further implementation is coming soon!

Multiple Running Hook Modes

Runtime Hooks include a set of plugins that are responsible for injecting resource isolation parameters according to Pod attributes. When Koord Runtime Proxy runs as a CRI proxy, Runtime Hooks act as its backend server. The CRI proxy mechanism can ensure the consistency of resource parameters during the pod lifecycle. However, Koord Runtime Proxy can only hijack CRI requests from the kubelet for pods, so the consistency of resource parameters in the QoS class directory cannot be guaranteed. Besides, modifications of pod parameters by third parties (e.g. manual changes) will also break the correctness of the hook plugins.

Therefore, a standalone running mode with a reconciler for Runtime Hooks is necessary. Under the standalone running mode, resource isolation parameters are injected asynchronously, keeping eventual consistency of the injected parameters for the pod and QoS class directories even without the Runtime Hook Manager.

Some minor works

  1. We fixed the backward compatibility issues reported by our users here. If you've ever encountered a similar problem, please upgrade to the latest version.
  2. Two more hook points were added to runtime-proxy. One is PreCreateContainerHook, which can set container resource parameters before the container is created, and the other is PostStopSandboxHook, which can garbage-collect resource settings before the Pod is deleted.
  3. cpuacct.usage is more precise than cpuacct.stat: cpuacct.stat is measured in USER_HZ units while cpuacct.usage is in nanoseconds. After thorough discussion, we agreed to replace cpuacct.stat with cpuacct.usage in koordlet.
  4. Koordlet needs to keep fetching data from the kubelet. Before this version, we only supported accessing the kubelet via the read-only port over HTTP. Due to security concerns, we've enabled HTTPS access in this version. For more details, please refer to this PR.

What’s coming next in Koordinator

Don't forget that Koordinator is developed in the open. You can check out our Github milestone to know more about what is happening and what we have planned. Hope it helps!

· 8 min read
Joseph

We are happy to announce the release of Koordinator v0.4.0. Koordinator v0.4.0 brings in some notable changes that are most wanted by the community while continuing to expand on experimental features. And in this version, we started to gradually enhance the capabilities of the scheduler.

Install or Upgrade to Koordinator v0.4.0

Install with Helm

Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool, and you can get it from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.4.0

Upgrade with helm

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.4.0 [--force]

For more details, please refer to the installation manual.

Enhanced node-side scheduling capabilities

Custom memory evict threshold

In Koordinator v0.2.0, an ability to improve node-side stability in the co-location scenario was introduced: an active eviction mechanism based on memory safety thresholds. The default memory utilization safety threshold is 70%; in v0.4.0 you can modify memoryEvictThresholdPercent (for example, to 60%) in the ConfigMap slo-controller-config according to your actual situation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    {
      "enable": true
    }
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "memoryEvictThresholdPercent": 60
      }
    }

BE Pods eviction based on satisfaction

In order to ensure the runtime quality of different workloads in co-location scenarios, Koordinator uses the CPU Suppress mechanism provided by koordlet on the node side to suppress workloads of the best effort type when the load increases. Or increase the resource quota for best effort type workloads when the load decreases.

However, this is not suitable if there are many best effort Pods on the node and they are frequently suppressed. Therefore, in v0.4.0, Koordinator provides an eviction mechanism based on the satisfaction of the best effort Pods' requests. If the best effort Pods are frequently suppressed, their requests cannot be satisfied and the satisfaction is generally less than 1; if they are not suppressed and obtain more CPU resources when the node is idle, their requests can be satisfied and the satisfaction is greater than or equal to 1. If the satisfaction is below the configured threshold and the CPU utilization of the best effort Pods is close to 100%, koordlet evicts some best effort Pods to improve the runtime quality of the node. Pods with lower priority are evicted first; among Pods of the same priority, those with higher CPU utilization are evicted first.
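
A rough reading of "satisfaction" as used above (our interpretation, not a formula given in the post) is the ratio between the CPU the best effort Pods can actually obtain and the CPU they request:

$$\mathrm{satisfaction} \approx \frac{\mathrm{CPU\ actually\ obtainable\ by\ BE\ Pods}}{\sum \mathrm{CPU\ requests\ of\ BE\ Pods}}$$

A value below 1 means the requests cannot be met; a value of 1 or more means they can.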

You can modify the ConfigMap slo-controller-config according to the actual situation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    {
      "enable": true
    }
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "cpuEvictBESatisfactionUpperPercent": 80,
        "cpuEvictBESatisfactionLowerPercent": 60
      }
    }

Group identity

When latency-sensitive applications and best effort workloads are deployed on the same node, the Linux kernel scheduler must provide more scheduling opportunities to high-priority applications to minimize scheduling latency and the impacts of low-priority workloads on kernel scheduling. For this scenario, Koordinator integrated with the group identity allowing users to configure scheduling priorities to CPU cgroups.

Alibaba Cloud Linux 2 with a kernel of the kernel-4.19.91-24.al7 version or later supports the group identity feature. The group identity feature relies on a dual red-black tree architecture. A low-priority red-black tree is added based on the red-black tree of the Completely Fair Scheduler (CFS) scheduling queue to store low-priority workloads. When the kernel schedules the workloads that have identities, the kernel processes the workloads based on their priorities. For more details, please refer to the doc.

Koordinator defines group identity default values for Pods of different QoS types:

| QoS | Default Value |
| --- | --- |
| LSR | 2 |
| LS | 2 |
| BE | -1 |

You can modify the ConfigMap slo-controller-config to set group identity values according to the actual situation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    {
      "enable": true
    }
  resource-qos-config: |
    {
      "clusterStrategy": {
        "lsrClass": {
          "cpuQOS": {
            "enable": true,
            "groupIdentity": 2
          }
        },
        "lsClass": {
          "cpuQOS": {
            "enable": true,
            "groupIdentity": 2
          }
        },
        "beClass": {
          "cpuQOS": {
            "enable": true,
            "groupIdentity": -1
          }
        },
        "systemClass": {
          "cpuQOS": {
            "enable": true,
            "groupIdentity": 2
          }
        }
      }
    }

To enable this feature, you need to update the kernel and the configuration file, and then install Koordinator's new koord-runtime-proxy component.

koord-runtime-proxy (experimental)

koord-runtime-proxy acts as a proxy between the kubelet and containerd (dockerd under the dockershim scenario). It is designed to intercept CRI requests and apply resource management policies, such as setting different cgroup parameters by pod priority in hybrid workload orchestration scenarios, or applying new isolation policies for the latest Linux kernels and CPU architectures.

There are two components involved, koord-runtime-proxy and RuntimePlugins.

image

koord-runtime-proxy

koord-runtime-proxy is in charge of intercepting requests during the pod lifecycle, such as RunPodSandbox and CreateContainer, and calling RuntimePlugins to apply resource isolation policies before forwarding the request to the backend containerd (dockerd) and before returning the response to the kubelet. koord-runtime-proxy provides an isolation-policy-execution framework that allows customized plugins to register and perform specific isolation policies; these plugins are called RuntimePlugins. koord-runtime-proxy itself does NOT apply any isolation policies.

RuntimePlugins

RuntimePlugins register events (RunPodSandbox, etc.) with koord-runtime-proxy and receive notifications when those events happen. A RuntimePlugin should apply its resource isolation policies based on the notification message and then respond to koord-runtime-proxy, which decides whether to forward the request to the backend containerd or discard it according to the plugin's response.

If no RuntimePlugins are registered, koord-runtime-proxy becomes a transparent proxy between the kubelet and containerd.

For more details, please refer to the design doc.

Installation

When installing koord-runtime-proxy, you need to change the kubelet's startup parameters so that its CRI endpoint points to koord-runtime-proxy, and configure koord-runtime-proxy with the CRI endpoint of the underlying container runtime.

koord-runtime-proxy is in the Alpha experimental version stage. Currently, it provides a minimum set of extension points. At the same time, there may be some bugs. You are welcome to try it and give feedback.

For detailed installation process, please refer to the manual.

Load-Aware Scheduling

Although Koordinator provides the co-location mechanism to improve cluster resource utilization and reduce costs, it did not yet have the ability to control utilization at the cluster level, and Best Effort workloads may also interfere with latency-sensitive applications. The load-aware scheduling plugin helps Koordinator achieve this capability.

The scheduling plugin filters abnormal nodes and scores them according to resource usage. This scheduling plugin extends the Filter/Score/Reserve/Unreserve extension points defined in the Kubernetes scheduling framework.

By default, abnormal nodes are filtered, and users can decide whether to enable or not by configuring as needed.

  • Filter nodes where koordlet fails to update NodeMetric.
  • Filter nodes by utilization thresholds. If enabled in the configuration, the plugin excludes nodes with latestUsageUtilization >= utilizationThreshold.

This plugin is dependent on NodeMetric's reporting period. Different reporting periods need to be set according to different scenarios and workloads. Therefore, NodeMetricSpec has been extended to support user-defined reporting period and aggregation period. Users can modify slo-controller-config to complete the corresponding configuration, and the controller in koord-manager will be responsible for updating the reporting period and aggregation period fields of NodeMetrics of related nodes.

Currently, the resource utilization thresholds of nodes are configured based on experience to ensure the runtime quality of nodes. But there are also ways to evaluate the workload running on the node to arrive at a more appropriate threshold for resource utilization. For example, in a time-sharing scenario, a higher threshold can be set to allow scheduling to run more best effort workloads during the valley of latency-sensitive applications. When the peak of latency-sensitive applications comes up, lower the threshold and evict some best effort workloads. In addition, 3-sigma can be used to analyze the utilization level in the cluster to obtain a more appropriate threshold.
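
As a sketch of the 3-sigma idea mentioned above (an assumption about how such a threshold could be derived, not a formula from the post), with $\mu$ and $\sigma$ denoting the mean and standard deviation of observed node utilization in the cluster, the threshold could be set to:

$$T = \mu + 3\sigma$$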

The core logic of the scoring algorithm is to select the node with the smallest resource usage. However, considering the delay of resource usage reporting and the delay of Pod startup time, the resource requests of the Pods that have been scheduled and the Pods currently being scheduled within the time window will also be estimated, and the estimated values will be involved in the calculation.

At present, Koordinator does not have the ability to profile workloads. Different types of workloads have different ways of building profiles. For example, long-running pods need to be scheduled with long-period profiling, while short-period pods should be scheduled with short-period profiling.

For more details, please refer to the proposal.

What Comes Next

For more details, please refer to our milestone. Hope it helps!

· 12 min read
Jason

We are happy to announce the v0.3.0 release of Koordinator. After starting small and learning what users needed, we are able to adjust its path and develop features needed for a stable community release.

The release of Koordinator v0.3.0 brings in some notable changes that are most wanted by the community while continuing to expand on experimental features.

Install or Upgrade to Koordinator v0.3.0

Install with Helm

Koordinator can be simply installed by helm v3.5+, which is a simple command-line tool, and you can get it from here.

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.3.0

Upgrade with helm

# Firstly add koordinator charts repository if you haven't done this.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.3.0 [--force]

For more details, please refer to the installation manual.

CPU Burst

CPU Burst is a service level objective (SLO)-aware resource scheduling feature provided by Koordinator. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. The koordlet automatically detects CPU throttling events and adjusts the CPU limit to a proper value. This greatly improves the performance of latency-sensitive applications.

How CPU Burst works

Kubernetes allows you to specify CPU limits, which can be reused based on time-sharing. If you specify a CPU limit for a container, the OS limits the amount of CPU resources that can be used by the container within a specific time period. For example, you set the CPU limit of a container to 2. The OS kernel limits the CPU time slices that the container can use to 200 milliseconds within each 100-millisecond period.

CPU utilization is a key metric that is used to evaluate the performance of a container. In most cases, the CPU limit is specified based on CPU utilization. CPU utilization on a per-millisecond basis shows more spikes than on a per-second basis. If the CPU utilization of a container reaches the limit within a 100-millisecond period, CPU throttling is enforced by the OS kernel and threads in the container are suspended for the rest of the time period.

How to use CPU Burst

  • Use an annotation to enable CPU Burst

    Add the following annotation to the pod configuration to enable CPU Burst:

annotations:
  # Set the value to auto to enable CPU Burst for the pod.
  koordinator.sh/cpuBurst: '{"policy": "auto"}'
  # To disable CPU Burst for the pod, set the value to none.
  #koordinator.sh/cpuBurst: '{"policy": "none"}'
  • Use a ConfigMap to enable CPU Burst for all pods in a cluster

    Modify the slo-controller-config ConfigMap based on the following content to enable CPU Burst for all pods in a cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-burst-config: '{"clusterStrategy": {"policy": "auto"}}'
  #cpu-burst-config: '{"clusterStrategy": {"policy": "cpuBurstOnly"}}'
  #cpu-burst-config: '{"clusterStrategy": {"policy": "none"}}'
  • Advanced configurations

    The following code block shows the pod annotations and ConfigMap fields that you can use for advanced configurations:

# Example of the slo-controller-config ConfigMap.
data:
  cpu-burst-config: |
    {
      "clusterStrategy": {
        "policy": "auto",
        "cpuBurstPercent": 1000,
        "cfsQuotaBurstPercent": 300,
        "sharePoolThresholdPercent": 50,
        "cfsQuotaBurstPeriodSeconds": -1
      }
    }

# Example of pod annotations.
koordinator.sh/cpuBurst: '{"policy": "auto", "cpuBurstPercent": 1000, "cfsQuotaBurstPercent": 300, "cfsQuotaBurstPeriodSeconds": -1}'

The following fields in the slo-controller-config ConfigMap can be used for advanced configurations of CPU Burst.

  • policy (string)
    • none: disables CPU Burst. If you set the value to none, the related fields are reset to their original values. This is the default value.
    • cpuBurstOnly: enables the CPU Burst feature only for the kernel of Alibaba Cloud Linux 2.
    • cfsQuotaBurstOnly: enables automatic adjustment of CFS quotas for general kernel versions.
    • auto: enables CPU Burst and all the related features.
  • cpuBurstPercent (int): Default value: 1000. Unit: %. Specifies the percentage to which the CPU limit can be increased by CPU Burst. If the CPU limit is set to 1, CPU Burst can increase the limit to 10 by default.
  • cfsQuotaBurstPercent (int): Default value: 300. Unit: %. Specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased. By default, the value of cfs_quota can be increased to at most three times the original value.
  • cfsQuotaBurstPeriodSeconds (int): Default value: -1. Unit: seconds. Specifies how long the container can keep running with an increased CFS quota, which cannot exceed the upper limit specified by cfsQuotaBurstPercent. The default value of -1 means this period is unlimited.
  • sharePoolThresholdPercent (int): Default value: 50. Unit: %. Specifies the CPU utilization threshold of the node. If the CPU utilization of the node exceeds the threshold, the value of cfs_quota in the cgroup parameters is reset to the original value.
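
If you want to try the per-pod annotation described above without editing the pod manifest, one option is kubectl annotate. This is just a sketch; the pod name and values are hypothetical.

$ kubectl annotate pod web-demo --overwrite \
    koordinator.sh/cpuBurst='{"policy": "auto", "cfsQuotaBurstPercent": 300}'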

L3 cache and MBA resource isolation

Pods of different priorities are usually deployed on the same machine. This may cause pods to compete for computing resources. As a result, the quality of service (QoS) of your service cannot be ensured. The Resource Director Technology (RDT) controls the Last Level Cache (L3 cache) that can be used by workloads of different priorities. RDT also uses the Memory Bandwidth Allocation (MBA) feature to control the memory bandwidth that can be used by workloads. This isolates the L3 cache and memory bandwidth used by workloads, ensures the QoS of high-priority workloads, and improves overall resource utilization. This topic describes how to improve the resource isolation of pods with different priorities by controlling the L3 cache and using the MBA feature.
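
Under the hood, this capability builds on the kernel resctrl interface. The commands below are only a sketch of what that interface looks like on a node whose CPU and kernel support RDT; the group name and the mask values are hypothetical and do not describe the exact layout that koordlet creates.

# Check whether the CPU exposes L3 CAT and MBA, then mount resctrl (as root).
$ grep -E -o 'cat_l3|mba' /proc/cpuinfo | sort -u
cat_l3
mba
$ mount -t resctrl resctrl /sys/fs/resctrl

# Each resctrl group has a schemata file describing its L3 cache ways and
# memory-bandwidth percentage, e.g. a hypothetical group limited to part of the
# L3 ways on cache domain 0:
$ cat /sys/fs/resctrl/BE/schemata
L3:0=3f
MB:0=100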

How to use L3 cache and MBA resource isolation

  • Use a ConfigMap to enable L3 cache and MBA resource isolation for all pods in a cluster
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
          "resctrlQOS": {
            "enable": true,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 100,
            "MBAPercent": 100
          }
        },
        "beClass": {
          "resctrlQOS": {
            "enable": true,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 30,
            "MBAPercent": 100
          }
        }
      }
    }

Memory QoS

The Koordlet provides the memory quality of service (QoS) feature for containers. You can use this feature to optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This topic describes how to enable the memory QoS feature for containers.

Background information

The following memory limits apply to containers:

  • The memory limit of the container. If the amount of memory that a container uses, including the page cache, is about to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. As a result, the application in the container may not be able to request or release memory resources as normal.
  • The memory limit of the node. If the memory limit of a container is greater than the memory request of the container, the container can overcommit memory resources. In this case, the available memory on the node may become insufficient. This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is downgraded. In extreme cases, the node cannot run as normal.

To improve the performance of applications and the stability of nodes, Koordinator provides the memory QoS feature for containers. We recommend Anolis OS as the node OS; for other operating systems, we do our best to adapt, and you can still enable the feature without side effects. After you enable the memory QoS feature for a container, Koordlet automatically configures the memory control group (memcg) based on the configuration of the container. This helps you optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node.

How to use Memory QoS

When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps:

  1. Add the following annotations to enable memory QoS for the containers in a pod:
annotations:
  # To enable memory QoS for the containers in a pod, set the value to auto.
  koordinator.sh/memoryQOS: '{"policy": "auto"}'
  # To disable memory QoS for the containers in a pod, set the value to none.
  # koordinator.sh/memoryQOS: '{"policy": "none"}'
  2. Use a ConfigMap to enable memory QoS for all the containers in a cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
          "memoryQOS": {
            "enable": true
          }
        },
        "beClass": {
          "memoryQOS": {
            "enable": true
          }
        }
      }
    }
  3. Optional. Configure advanced parameters.

    The following advanced parameters can be used for fine-grained memory QoS configuration at the pod level and the cluster level.

  • enable (Boolean; valid values: true, false)
    • true: enables memory QoS for all the containers in a cluster. The default memory QoS settings for the QoS class of the containers are used.
    • false: disables memory QoS for all the containers in a cluster. The memory QoS settings are restored to the original settings for the QoS class of the containers.
  • policy (String; valid values: auto, default, none)
    • auto: enables memory QoS for the containers in the pod and uses the recommended memory QoS settings. The recommended memory QoS settings are prioritized over the cluster-wide memory QoS settings.
    • default: specifies that the pod inherits the cluster-wide memory QoS settings.
    • none: disables memory QoS for the pod. The relevant memory QoS settings are restored to the original settings. The original settings are prioritized over the cluster-wide memory QoS settings.
  • minLimitPercent (Int; valid values: 0~100): Unit: %. Default value: 0, which disables this parameter. Specifies the unreclaimable proportion of the memory request of a pod, calculated as memory.min = memory request × minLimitPercent/100. This parameter is suitable for scenarios where applications are sensitive to the page cache; you can use it to cache files and optimize read and write performance. For example, if you specify Memory Request=100MiB and minLimitPercent=100 for a container, memory.min is set to 104857600.
  • lowLimitPercent (Int; valid values: 0~100): Unit: %. Default value: 0, which disables this parameter. Specifies the relatively unreclaimable proportion of the memory request of a pod, calculated as memory.low = memory request × lowLimitPercent/100. For example, if you specify Memory Request=100MiB and lowLimitPercent=100 for a container, memory.low is set to 104857600.
  • throttlingPercent (Int; valid values: 0~100): Unit: %. Default value: 0, which disables this parameter. Specifies the memory throttling threshold as the ratio of the container's memory usage to its memory limit, calculated as memory.high = memory limit × throttlingPercent/100. If the memory usage of a container exceeds this threshold, memory used by the container is reclaimed. This parameter is suitable for memory overcommitment scenarios and helps prevent cgroups from triggering OOM. For example, if you specify Memory Limit=100MiB and throttlingPercent=80 for a container, memory.high is set to 83886080, which is equal to 80 MiB.
  • wmarkRatio (Int; valid values: 0~100): Unit: %. Default value: 95; a value of 0 disables this parameter. Specifies the threshold of the usage of the memory limit, or of memory.high, that triggers asynchronous memory reclaim. If throttlingPercent is disabled, the threshold is calculated as memory.wmark_high = memory limit × wmarkRatio/100; if throttlingPercent is enabled, it is calculated as memory.wmark_high = memory.high × wmarkRatio/100. If the usage exceeds the threshold, the memcg backend asynchronous reclaim feature is triggered. For example, if you specify Memory Limit=100MiB for a container, the memory throttling setting is memory.high=83886080, the reclaim ratio setting is memory.wmark_ratio=95, and the reclaim threshold setting is memory.wmark_high=79691776.
  • wmarkMinAdj (Int; valid values: -25~50): Unit: %. The default value is -25 for the LS/LSR QoS classes and 50 for the BE QoS class; a value of 0 disables this parameter. Specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container; a positive value increases the global minimum watermark and therefore brings forward memory reclaim for the container. For example, for a pod whose QoS class is LS, the default setting memory.wmark_min_adj=-25 decreases the minimum watermark by 25% for the containers in the pod.
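
As an illustration of the pod-level override described above, the annotation below combines several of these parameters, assuming the pod annotation accepts the same field names as the cluster-level configuration; the values are only an example, not a recommendation.

annotations:
  # Hypothetical per-pod override: keep the full memory request unreclaimable
  # (memory.min = request) and start throttling at 80% of the memory limit.
  koordinator.sh/memoryQOS: '{"policy": "auto", "minLimitPercent": 100, "throttlingPercent": 80}'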

What Comes Next

For more details, please refer to our milestone. Hope it helps!

· 4 min read
Joseph

We’re pleased to announce the release of Koordinator v0.2.0.

Overview

Koordinator v0.1.0 implemented basic co-location scheduling capabilities, and since its release the project has received attention and positive responses from the community. To address the issues the community cares about most, such as how to isolate resources for best-effort workloads and how to ensure the runtime stability of latency-sensitive applications in co-location scenarios, we have enhanced the node-side scheduling capabilities in Koordinator v0.2.0 to solve these problems.

Install or Upgrade to Koordinator v0.2.0

Install with helm

Koordinator can be installed with helm v3.5+, a simple command-line tool that you can get from here.

# First, add the koordinator charts repository if you haven't done so.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Install the latest version.
$ helm install koordinator koordinator-sh/koordinator --version 0.2.0

Upgrade with helm

# First, add the koordinator charts repository if you haven't done so.
$ helm repo add koordinator-sh https://koordinator-sh.github.io/charts/

# [Optional]
$ helm repo update

# Upgrade to the latest version.
$ helm upgrade koordinator koordinator-sh/koordinator --version 0.2.0 [--force]

For more details, please refer to the installation manual.

Isolate resources for best-effort workloads

In Koordinator v0.2.0, we refined the ability to isolate resources for best-effort workloads.

koordlet sets the cgroup parameters according to the resources described in the Pod Spec. It currently supports setting the CPU Request/Limit and the Memory Limit.

For CPU resources, only the case of request == limit is currently supported; support for the request <= limit scenario will be added in the next version.

Active eviction mechanism based on memory safety thresholds

When latency-sensitive applications are serving traffic, memory usage may increase due to bursty requests. Best-effort workloads can run into similar situations, for example when the current computing load exceeds the expected resource Request/Limit.

These scenarios increase the overall memory usage of the node, which has an unpredictable impact on runtime stability on the node side. For example, the quality of service of latency-sensitive applications may degrade, or the applications may even become unavailable. This is especially challenging in a co-location environment.

We implemented an active eviction mechanism based on memory safety thresholds in Koordinator.

koordlet regularly checks the recent memory usage of the node and its Pods to determine whether the safety threshold is exceeded. If it is, koordlet evicts some best-effort Pods to release memory. This mechanism better ensures the stability of the node and of latency-sensitive applications.

koordlet currently evicts only best-effort Pods, sorted according to the Priority specified in the Pod Spec: the lower the priority, the earlier the Pod is evicted. Pods with the same priority are sorted by memory usage (RSS), and the Pod with the higher memory usage is evicted first. This eviction selection algorithm is not static; more dimensions will be considered in the future, with more refined implementations for more scenarios to achieve more reasonable evictions.

The default memory utilization safety threshold is 70%. You can modify memoryEvictThresholdPercent in the ConfigMap slo-controller-config according to your actual situation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    {
      "enable": true
    }
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "memoryEvictThresholdPercent": 70
      }
    }
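
One way to apply this change, assuming the ConfigMap was created by the helm installation, is to edit it in place or re-apply a saved manifest; the file name below is just whatever you saved the manifest above as.

# Edit the ConfigMap in place ...
$ kubectl -n koordinator-system edit configmap slo-controller-config
# ... or apply your saved manifest.
$ kubectl apply -f slo-controller-config.yaml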

CPU Burst - Improve the performance of latency-sensitive applications

CPU Burst is a service level objective (SLO)-aware resource scheduling feature. You can use CPU Burst to improve the performance of latency-sensitive applications. CPU scheduling for a container may be throttled by the kernel due to the CPU limit, which downgrades the performance of the application. Koordinator automatically detects CPU throttling events and adjusts the CPU limit to a proper value, greatly improving the performance of latency-sensitive applications.

The code of CPU Burst has been developed and is still under review and testing. It will be released in the next version. If you want to use this capability early, you are welcome to participate in Koordinator and help improve it. For more details, please refer to the PR #73.

More

For more details, please refer to the Documentation. Hope it helps!

· 5 min read
Joseph
Fangsong Zeng

We’re pleased to announce the release of Koordinator v0.1.0.

Overview

Koordinator is a QoS based scheduling system for hybrid workloads orchestration on Kubernetes. It aims to improve the runtime efficiency and reliability of both latency sensitive workloads and batch jobs, simplify the complexity of resource-related configuration tuning, and increase pod deployment density to improve resource utilizations.

Key Features

Koordinator enhances the Kubernetes user experience in workload management by providing the following:

  • Well-designed priority and QoS mechanisms to co-locate different types of workloads in a cluster and run different types of pods on a single node, allowing for resource overcommitment to achieve high resource utilization while still satisfying QoS guarantees by leveraging an application profiling mechanism.
  • Fine-grained resource orchestration and isolation mechanism to improve the efficiency of latency-sensitive workloads and batch jobs.
  • Flexible job scheduling mechanism to support workloads in specific areas, e.g., big data, AI, audio and video.
  • A set of tools for monitoring, troubleshooting and operations.

Node Metrics

Koordinator defines the NodeMetrics CRD, which is used to record the resource utilization of a single node and all Pods on the node. koordlet will regularly report and update NodeMetrics. You can view NodeMetrics with the following commands.

$ kubectl get nodemetrics node-1 -o yaml
apiVersion: slo.koordinator.sh/v1alpha1
kind: NodeMetric
metadata:
  creationTimestamp: "2022-03-30T11:50:17Z"
  generation: 1
  name: node-1
  resourceVersion: "2687986"
  uid: 1567bb4b-87a7-4273-a8fd-f44125c62b80
spec: {}
status:
  nodeMetric:
    nodeUsage:
      resources:
        cpu: 138m
        memory: "1815637738"
  podsMetric:
  - name: storage-service-6c7c59f868-k72r5
    namespace: default
    podUsage:
      resources:
        cpu: "300m"
        memory: 17828Ki

Colocation Resources

After Koordinator is deployed in the K8s cluster, it calculates the CPU and Memory resources that have been allocated but not used, based on the data in NodeMetrics. These resources are updated on the Node in the form of extended resources.

koordinator.sh/batch-cpu represents the CPU resources for Best Effort workloads, and koordinator.sh/batch-memory represents the Memory resources for Best Effort workloads.

You can view these resources with the following commands.

$ kubectl describe node node-1
Name: node-1
....
Capacity:
  cpu:                         8
  ephemeral-storage:           103080204Ki
  koordinator.sh/batch-cpu:    4541
  koordinator.sh/batch-memory: 17236565027
  memory:                      32611012Ki
  pods:                        64
Allocatable:
  cpu:                         7800m
  ephemeral-storage:           94998715850
  koordinator.sh/batch-cpu:    4541
  koordinator.sh/batch-memory: 17236565027
  memory:                      28629700Ki
  pods:                        64
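
For illustration, a Best Effort pod could consume these extended resources roughly as in the sketch below. The pod name and amounts are hypothetical, batch-cpu is expressed in milli-cores as in the node output above, and request is set equal to limit. In practice, the ClusterColocationProfile described next can inject these settings for you.

apiVersion: v1
kind: Pod
metadata:
  name: be-demo                      # hypothetical name
  labels:
    koordinator.sh/qosClass: BE
spec:
  schedulerName: koord-scheduler
  priorityClassName: koord-batch
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        koordinator.sh/batch-cpu: "1000"       # 1 core of batch CPU
        koordinator.sh/batch-memory: 2Gi
      limits:
        koordinator.sh/batch-cpu: "1000"
        koordinator.sh/batch-memory: 2Gi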

Cluster-level Colocation Profile

In order to make it easier for everyone to use Koordinator to co-locate different workloads, we defined ClusterColocationProfile to help workloads use co-location resources in a gradual, grayscale rollout. A ClusterColocationProfile is a CRD like the one below. Please edit each parameter to fit your own use cases.

apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  selector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  qosClass: BE
  priorityClassName: koord-batch
  koordinatorPriority: 1000
  schedulerName: koord-scheduler
  labels:
    koordinator.sh/mutated: "true"
  annotations:
    koordinator.sh/intercepted: "true"
  patch:
    spec:
      terminationGracePeriodSeconds: 30

Various Koordinator components ensure scheduling and runtime quality through the labels koordinator.sh/qosClass and koordinator.sh/priority, together with the Kubernetes native priority.

Using the mutating webhook mechanism provided by Kubernetes, koord-manager modifies Pod resource requirements to use co-location resources and injects the QoS and Priority defined by Koordinator into the Pod.

Taking the above Profile as an example, when the Spark Operator creates a new Pod in a namespace with the koordinator.sh/enable-colocation=true label, the Koordinator QoS label koordinator.sh/qosClass is injected into the Pod. According to the PriorityClassName defined in the Profile, the Pod's PriorityClassName and the corresponding Priority value are modified. Users can also set the Koordinator Priority according to their needs to achieve more fine-grained priority management, so the Koordinator Priority label koordinator.sh/priority is injected into the Pod as well. Since Koordinator provides the enhanced scheduler koord-scheduler, the Profile also changes the Pod's scheduler name to koord-scheduler.
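
For example, to opt an existing namespace into the Profile above, you could label it as follows; the namespace name is hypothetical.

$ kubectl label namespace spark-jobs koordinator.sh/enable-colocation=true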

If you expect to integrate Koordinator into your own system, please learn more about the core concepts.

CPU Suppress

In order to ensure the runtime quality of different workloads in co-located scenarios, Koordinator uses the CPU Suppress mechanism provided by koordlet on the node side to suppress Best Effort workloads when the load increases, and to increase the resource quota for Best Effort workloads when the load decreases.

When installing through the helm chart, the ConfigMap slo-controller-config is created in the koordinator-system namespace, and the CPU Suppress mechanism is enabled by default. If you need to disable it, modify the resource-threshold-config section as shown below.

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: {{ .Values.installation.namespace }}
data:
  ...
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": false
      }
    }

Colocation Resources Balance

For co-location resource scheduling, Koordinator currently adopts a strategy that prefers nodes with more remaining co-location resources, so that Best Effort workloads do not crowd together. Richer scheduling capabilities are on the way.

Tutorial - Colocation of Spark Jobs

Apache Spark is an analysis engine for large-scale data processing, which is widely used in Big Data, SQL Analysis and Machine Learning scenarios. We provide a tutorial to help you quickly use Koordinator to run Spark Jobs in colocation mode alongside other latency-sensitive applications. For more details, please refer to the tutorial.

Summary

For more details, please refer to the Documentation. Hope it helps!