
SLO Configuration

Introduction

Koordinator manages the SLO configurations with a ConfigMap. The ConfigMap is used by the slo-controller; its name and namespace can be specified via the startup arguments of the koord-manager (koordinator-system/slo-controller-config by default). It contains the following keys:

  • colocation-config: the configuration for colocation. For example, whether to enable the colocated Batch resources, and the colocation watermark.
  • resource-threshold-config: the configuration for threshold-based suppression and eviction strategies. For example, the threshold for CPU suppression and the threshold for memory eviction.
  • resource-qos-config: the configuration for QoS features. For example, Group Identity for BE pods, Memory QoS for LS pods, and Last-Level Cache partitioning for BE pods.
  • cpu-burst-config: the configuration for the CPU Burst feature. For example, the maximum burst ratio of a pod.
  • system-config: the configuration for system-level settings. For example, the global minimum memory free watermark factor min_free_kbytes.
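
For reference, a minimal skeleton of this ConfigMap containing only the keys above (data values elided; the full template with all fields is given in the Configurations section below) might look like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    { }
  resource-threshold-config: |
    { }
  resource-qos-config: |
    { }
  cpu-burst-config: |
    { }
  system-config: |
    { }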

Configuration Levels

Each config is defined in the form of a cluster-level strategy plus node-level strategies.

For example,

type ColocationCfg struct {
    ColocationStrategy `json:",inline"`
    NodeConfigs        []NodeColocationCfg `json:"nodeConfigs,omitempty"`
}

type ResourceQOSCfg struct {
    ClusterStrategy *slov1alpha1.ResourceQOSStrategy `json:"clusterStrategy,omitempty"`
    NodeStrategies  []NodeResourceQOSStrategy        `json:"nodeStrategies,omitempty"`
}

The cluster-level config sets the global configuration, while the node-level config lets users adjust the configuration of specific nodes, which is especially useful for canary (gray-scale) deployment.

Please note that most configurable fields have default values inside the components (koordlet, koord-manager), so it is usually enough to edit only the parameters you want to change.
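
For example, a colocation-config sketch that keeps the cluster-level defaults and only overrides resourceDiffThreshold for a group of canary nodes could look like the following (the node group name and label are purely illustrative; see the full template below for all fields):

colocation-config: |
  {
    "enable": true,
    "nodeConfigs": [
      {
        "name": "canary-nodes",
        "nodeSelector": {
          "matchLabels": {
            "node-pool": "canary"
          }
        },
        "resourceDiffThreshold": 0.2
      }
    ]
  }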

NodeSLO

The data field of the SLO Config is parsed by the koord-manager. Koord-manager checks whether the config data is valid and then updates the parsed configs into the NodeSLO object of each node. If the parsing fails, koord-manager records Events on the ConfigMap object to warn about the unmarshal errors. The agent component koordlet watches the Spec of the NodeSLO and reconciles the QoS features on the node.

apiVersion: slo.koordinator.sh/v1alpha1
kind: NodeSLO
metadata:
  name: test-node
spec:
  cpuBurstStrategy: {}
  extensions: {}
  resourceQOSStrategy: {}
  systemStrategy: {}
  # parsed from the `resource-threshold-config` data
  resourceUsedThresholdWithBE:
    cpuSuppressPolicy: cpuset
    cpuSuppressThresholdPercent: 65
    enable: true
    memoryEvictThresholdPercent: 70
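
If a NodeSLO does not reflect the expected changes, the Events recorded on the ConfigMap can be inspected for unmarshal errors, for example (assuming the default ConfigMap name and namespace):

$ kubectl get events -n koordinator-system --field-selector involvedObject.name=slo-controller-config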

Configurations

Referred version: Koordinator v1.2

The template of the SLO Config is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  # colocation-config is the configuration for colocation.
  # Related features: Dynamic resource over-commitment, Load-aware scheduling, Load-aware descheduling.
  # - enable: whether to enable the colocation. If false, the reclaimed resources of the node allocatable (e.g. `kubernetes.io/batch-cpu`) will be removed.
  # - metricAggregateDurationSeconds: the aggregated duration of node metrics reporting.
  # - metricReportIntervalSeconds: the reporting interval of the node metrics.
  # - metricAggregatePolicy: policies of reporting node metrics in different durations.
  # - cpuReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed cpu resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold.
  # - memoryReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed memory resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold.
  # - memoryCalculatePolicy: the policy for calculating the reclaimable memory resource. If set to `request`, only the unallocated memory resource of high-priority pods is reclaimable, and no allocated memory can be reclaimed.
  # - degradeTimeMinutes: the threshold duration after which the colocation is degraded if the node metrics have not been updated.
  # - updateTimeThresholdSeconds: the threshold duration to force updating the reclaimed resources with the latest calculated result.
  # - resourceDiffThreshold: the threshold of the difference between the calculated reclaimed resources and the current ones, above which the reclaimed resources get updated.
  # - nodeConfigs: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
  colocation-config: |
    {
      "enable": false,
      "metricAggregateDurationSeconds": 300,
      "metricReportIntervalSeconds": 60,
      "metricAggregatePolicy": {
        "durations": [
          "5m",
          "10m",
          "15m"
        ]
      },
      "cpuReclaimThresholdPercent": 60,
      "memoryReclaimThresholdPercent": 65,
      "memoryCalculatePolicy": "usage",
      "degradeTimeMinutes": 15,
      "updateTimeThresholdSeconds": 300,
      "resourceDiffThreshold": 0.1,
      "nodeConfigs": [
        {
          "name": "anolis",
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/kernel": "anolis"
            }
          },
          "updateTimeThresholdSeconds": 360,
          "resourceDiffThreshold": 0.2
        }
      ]
    }
  # The configuration for threshold-based strategies.
  # Related features: BECPUSuppress, BEMemoryEvict, BECPUEvict.
  # - clusterStrategy: the cluster-level configuration.
  # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
  # - enable: whether to enable the threshold-based strategies or not. If false, all threshold-based strategies are disabled. If set to true, CPU Suppress and Memory Evict are enabled by default.
  # - cpuSuppressThresholdPercent: the node cpu utilization threshold to suppress BE pods' usage.
  # - cpuSuppressPolicy: the policy of cpu suppression. If set to `cpuset`, the BE pods' `cpuset.cpus` will be reconciled during suppression. If set to `cfsQuota`, the BE pods' `cpu.cfs_quota_us` will be reconciled.
  # - memoryEvictThresholdPercent: the node memory utilization threshold to evict BE pods.
  # - memoryEvictLowerPercent: the node memory utilization threshold to stop the memory eviction. By default, `lowerPercent = thresholdPercent - 2`.
  # - cpuEvictBESatisfactionLowerPercent: the cpu satisfaction threshold to start the cpu eviction (it also requires the BE utilization threshold to be met).
  # - cpuEvictBEUsageThresholdPercent: the BE utilization (BEUsage / BERealLimit) threshold to start the cpu eviction (it also requires the cpu satisfaction threshold to be met).
  # - cpuEvictBESatisfactionUpperPercent: the cpu satisfaction threshold to stop the cpu eviction.
  # - cpuEvictTimeWindowSeconds: the time window of the cpu metrics for the cpu eviction.
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": false,
        "cpuSuppressThresholdPercent": 65,
        "cpuSuppressPolicy": "cpuset",
        "memoryEvictThresholdPercent": 70,
        "memoryEvictLowerPercent": 65,
        "cpuEvictBESatisfactionUpperPercent": 90,
        "cpuEvictBESatisfactionLowerPercent": 60,
        "cpuEvictBEUsageThresholdPercent": 90
      },
      "nodeStrategies": [
        {
          "name": "anolis",
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/kernel": "anolis"
            }
          },
          "cpuEvictBEUsageThresholdPercent": 80
        }
      ]
    }
  # The configuration for QoS-based features.
  # Related features: CPUQoS (GroupIdentity), MemoryQoS (CgroupReconcile), ResctrlQoS.
  # - clusterStrategy: the cluster-level configuration.
  # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
  # - lsrClass/lsClass/beClass: the configuration for pods of QoS LSR/LS/BE respectively.
  # - cpuQOS: the configuration of CPU QoS.
  #   - enable: whether to enable CPU QoS. If set to `false`, the related cgroup configs will be reset to the system default.
  #   - groupIdentity: the priority level of the Group Identity ([-1, 2]). `2` means the highest priority, while `-1` means the lowest priority. Anolis OS required.
  # - memoryQOS: the configuration of Memory QoS.
  #   - enable: whether to enable Memory QoS. If set to `false`, the related cgroup configs will be reset to the system default.
  #   - minLimitPercent: the scale percentage for setting the `memory.min` based on the container's request. It enables the memory protection from the Linux memory reclaim.
  #   - lowLimitPercent: the scale percentage for setting the `memory.low` based on the container's request. It enables the memory soft protection from the Linux memory reclaim.
  #   - throttlingPercent: the scale percentage for setting the `memory.high` based on the container's limit. It enables the memory throttling at the cgroup level.
  #   - wmarkRatio: the ratio of container-level asynchronous memory reclaim based on the container's limit. Anolis OS required.
  #   - wmarkScalePermill: the per-mill of container memory to reclaim in one asynchronous memory reclaim. Anolis OS required.
  #   - wmarkMinAdj: the adjustment percentage of the global memory min watermark. It affects the reclaim priority when the node free memory is rather low. Anolis OS required.
  # - resctrlQOS: the configuration of Resctrl (Intel RDT) QoS.
  #   - enable: whether to enable Resctrl QoS.
  #   - catRangeStartPercent: the starting percentage of the L3 Cache way partitioning. L3 CAT required.
  #   - catRangeEndPercent: the ending percentage of the L3 Cache way partitioning. L3 CAT required.
  #   - mbaPercent: the allocation percentage of the memory bandwidth. MBA required.
  resource-qos-config: |
    {
      "clusterStrategy": {
        "lsrClass": {
          "cpuQOS": {
            "enable": false,
            "groupIdentity": 2
          },
          "memoryQOS": {
            "enable": false,
            "minLimitPercent": 0,
            "lowLimitPercent": 0,
            "throttlingPercent": 0,
            "wmarkRatio": 95,
            "wmarkScalePermill": 20,
            "wmarkMinAdj": -25,
            "priorityEnable": 0,
            "priority": 0,
            "oomKillGroup": 0
          },
          "resctrlQOS": {
            "enable": false,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 100,
            "mbaPercent": 100
          }
        },
        "lsClass": {
          "cpuQOS": {
            "enable": false,
            "groupIdentity": 2
          },
          "memoryQOS": {
            "enable": false,
            "minLimitPercent": 0,
            "lowLimitPercent": 0,
            "throttlingPercent": 0,
            "wmarkRatio": 95,
            "wmarkScalePermill": 20,
            "wmarkMinAdj": -25,
            "priorityEnable": 0,
            "priority": 0,
            "oomKillGroup": 0
          },
          "resctrlQOS": {
            "enable": false,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 100,
            "mbaPercent": 100
          }
        },
        "beClass": {
          "cpuQOS": {
            "enable": false,
            "groupIdentity": -1
          },
          "memoryQOS": {
            "enable": false,
            "minLimitPercent": 0,
            "lowLimitPercent": 0,
            "throttlingPercent": 0,
            "wmarkRatio": 95,
            "wmarkScalePermill": 20,
            "wmarkMinAdj": 50,
            "priorityEnable": 0,
            "priority": 0,
            "oomKillGroup": 0
          },
          "resctrlQOS": {
            "enable": false,
            "catRangeStartPercent": 0,
            "catRangeEndPercent": 30,
            "mbaPercent": 100
          }
        }
      },
      "nodeStrategies": [
        {
          "name": "anolis",
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/kernel": "anolis"
            }
          },
          "beClass": {
            "memoryQOS": {
              "wmarkRatio": 90
            }
          }
        }
      ]
    }
  # The configuration for the CPU Burst.
  # Related features: CPUBurst.
  # - clusterStrategy: the cluster-level configuration.
  # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
  # - policy: the policy of CPU Burst. If set to `none`, the CPU Burst is disabled. If set to `auto`, the CPU Burst is fully enabled. If set to `cpuBurstOnly`, only the Linux CFS Burst feature is enabled. If set to `cfsQuotaBurstOnly`, only the CFS quota burst is enabled.
  # - cpuBurstPercent: the percentage of Linux CFS Burst. It affects the value of `cpu.cfs_burst_us` of pod/container cgroups. It specifies the percentage to which the CPU limit can be increased by CPU Burst.
  # - cfsQuotaBurstPercent: the percentage of cfs quota burst. It affects the scaled ratio of `cpu.cfs_quota_us` of pod/container cgroups. It specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased.
  # - cfsQuotaBurstPeriodSeconds: the maximum period of a single cfs quota burst. If set to `-1`, the time period in which the container can run with an increased CFS quota is unlimited.
  # - sharePoolThresholdPercent: the threshold of share pool utilization. If the share pool utilization is too high, CPU Burst will be stopped and reset to avoid machine overload.
  cpu-burst-config: |
    {
      "clusterStrategy": {
        "policy": "none",
        "cpuBurstPercent": 1000,
        "cfsQuotaBurstPercent": 300,
        "cfsQuotaBurstPeriodSeconds": -1,
        "sharePoolThresholdPercent": 50
      },
      "nodeStrategies": [
        {
          "name": "anolis",
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/kernel": "anolis"
            }
          },
          "policy": "cfsQuotaBurstOnly",
          "cfsQuotaBurstPercent": 400
        }
      ]
    }
  # The configuration for system-level settings.
  # Related features: SystemConfig.
  # - clusterStrategy: the cluster-level configuration.
  # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
  # - minFreeKbytesFactor: the factor for calculating the global minimum memory free watermark `/proc/sys/vm/min_free_kbytes`. `min_free_kbytes = minFreeKbytesFactor * nodeTotalMemory / 10000`.
  # - watermarkScaleFactor: the reclaim factor `/proc/sys/vm/watermark_scale_factor` in one global memory reclaim.
  # - memcgReapBackGround: whether to enable the reaper for orphan memory cgroups.
  system-config: |-
    {
      "clusterStrategy": {
        "minFreeKbytesFactor": 100,
        "watermarkScaleFactor": 150,
        "memcgReapBackGround": 0
      },
      "nodeStrategies": [
        {
          "name": "anolis",
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/kernel": "anolis"
            }
          },
          "minFreeKbytesFactor": 100,
          "watermarkScaleFactor": 150
        }
      ]
    }
  # The configuration for host application settings.
  # - name: name of the host application.
  # - qos: QoS class of the application.
  # - cgroupPath: cgroup path of the application, the directory equals to `${base}/${parentDir}/${relativePath}`.
  #   - cgroupPath.base: cgroup base dir of the application, the format varies across cgroup drivers.
  #   - cgroupPath.parentDir: cgroup parent path under the base dir. By default it is "host-latency-sensitive/" for LS and "host-latency-sensitive/" for BE.
  #   - cgroupPath.relativePath: cgroup relative path under the parent dir.
  host-application-config: |
    {
      "applications": [
        {
          "name": "nginx",
          "qos": "LS",
          "cgroupPath": {
            "base": "CgroupRoot",
            "parentDir": "host-latency-sensitive/",
            "relativePath": "nginx/"
          }
        }
      ]
    }
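
To illustrate the system-config above: following the stated formula `min_free_kbytes = minFreeKbytesFactor * nodeTotalMemory / 10000`, a hypothetical node with 256 GiB (268435456 KiB) of memory and a minFreeKbytesFactor of 100 would get min_free_kbytes = 100 * 268435456 / 10000 ≈ 2684354 KiB (about 2.6 GiB), i.e. roughly 1% of the node's total memory is reserved as the global minimum free watermark.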

For more information, please check the user manuals and design docs of the related features.

Quick Start

  1. Check the current SLO configurations via the ConfigMap koordinator-system/slo-controller-config.
$ kubectl get configmap -n koordinator-system slo-controller-config -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: koordinator
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/managed-by: Helm
  name: slo-controller-config
  namespace: koordinator-system
data:
  colocation-config: |
    {
      "enable": false,
      "metricAggregateDurationSeconds": 300,
      "metricReportIntervalSeconds": 60,
      "cpuReclaimThresholdPercent": 60,
      "memoryReclaimThresholdPercent": 65,
      "memoryCalculatePolicy": "usage",
      "degradeTimeMinutes": 15,
      "updateTimeThresholdSeconds": 300,
      "resourceDiffThreshold": 0.1
    }
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": false
      }
    }
  2. Edit the ConfigMap koordinator-system/slo-controller-config to update the SLO configurations.
$ kubectl edit configmap -n koordinator-system slo-controller-config

For example, the ConfigMap is edited as follows:

data:
  # ...
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true,
        "cpuSuppressThresholdPercent": 60,
        "cpuSuppressPolicy": "cpuset",
        "memoryEvictThresholdPercent": 60
      }
    }
  3. Verify that the NodeSLO is dispatched successfully.

NOTE: The default values are omitted in the NodeSLO.

$ kubectl get nodeslo.slo.koordinator.sh test-node -o yaml
apiVersion: slo.koordinator.sh/v1alpha1
kind: NodeSLO
metadata:
  name: test-node
spec:
  # ...
  extensions: {}
  resourceUsedThresholdWithBE:
    cpuSuppressPolicy: cpuset
    cpuSuppressThresholdPercent: 60
    enable: true
    memoryEvictThresholdPercent: 60