Batch Workload Colocation Quick Start Guide

This guide helps community newcomers quickly understand and deploy Koordinator for batch workload colocation. It covers the core concepts, the deployment process, and important considerations in an easy-to-understand way.

What is Batch Colocation?

Batch colocation is a technique that allows running batch processing workloads (like data analysis, machine learning training, offline jobs) alongside latency-sensitive applications (like web services, microservices) on the same Kubernetes cluster. By utilizing idle resources from online services, you can significantly improve cluster resource utilization while maintaining service quality.

Why Batch Colocation?

In a typical Kubernetes cluster:

  • Online services request resources (CPU, memory) based on peak traffic, but actual usage is often much lower
  • Idle resources are allocated but unused most of the time
  • Cluster utilization is typically low (20-40%)

Koordinator enables you to:

  • Reclaim idle resources from online services
  • Run batch jobs using these reclaimed resources
  • Improve utilization to 50-80% while maintaining service quality

Core Concepts

1. QoS Classes

Koordinator defines five QoS (Quality of Service) classes for different workload types:

| QoS Class | Use Case | Resource Guarantee | Typical Workload |
|-----------|----------|--------------------|------------------|
| SYSTEM | System services | Limited but guaranteed | DaemonSets, system processes |
| LSE | Exclusive latency-sensitive | Reserved, isolated | Middleware (rarely used) |
| LSR | Reserved latency-sensitive | CPU cores reserved | Critical online services |
| LS | Shared latency-sensitive | Shared with burst capability | Typical microservices |
| BE | Best Effort | No guarantee, can be throttled/evicted | Batch jobs ⭐ |

For batch workloads, you'll primarily use the BE (Best Effort) QoS class.
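Once Koordinator is installed (covered below), the QoS class is carried as a pod label, so you can list all best-effort colocation pods with a simple label selector. A minimal check, assuming the koordinator.sh/qosClass label shown later in this guide:

# List every pod running with the BE QoS class across the cluster
kubectl get pods --all-namespaces -l koordinator.sh/qosClass=BE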

2. Priority Classes

Koordinator extends Kubernetes PriorityClass with four levels:

| PriorityClass | Priority Range | Description | Use for Batch? |
|---------------|----------------|-------------|----------------|
| koord-prod | [9000, 9999] | Production, guaranteed quota | ❌ No |
| koord-mid | [7000, 7999] | Medium priority, guaranteed quota | ❌ No |
| koord-batch | [5000, 5999] | Batch workloads, allows borrowing | ✅ Yes |
| koord-free | [3000, 3999] | Free resources, no guarantee | ✅ Optional |

For most batch workloads, use the koord-batch priority class.
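After installation (next section), you can inspect the PriorityClass objects directly to confirm their values, for example:

# Show the value, preemption policy, and description of the batch priority class
kubectl describe priorityclass koord-batch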

3. Resource Model

Koordinator's colocation model works as follows:

[Resource Model diagram]

  • Limit: Resources requested by high-priority pods (LS/LSR)
  • Usage: Actual resources used (varies over time)
  • Reclaimable: Resources between usage and limit - available for BE pods
  • BE Pods: Run using reclaimable resources

Key Point: Batch jobs (BE) use idle resources that would otherwise be wasted, without affecting online service performance.

4. Resource Types

Koordinator introduces special resource types for batch workloads:

| Resource Type | Description | Use in Pod Spec |
|---------------|-------------|-----------------|
| kubernetes.io/batch-cpu | CPU for batch workloads | ✅ Required |
| kubernetes.io/batch-memory | Memory for batch workloads | ✅ Required |

These resources are allocated from the cluster's reclaimable pool.
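A quick way to confirm that a node advertises these extended resources once Koordinator is running (<node-name> is a placeholder for one of your nodes):

# Capacity and Allocatable entries for batch resources appear in the node description
kubectl describe node <node-name> | grep batch-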

Prerequisites

Before getting started, ensure you have:

  1. Kubernetes cluster (version >= 1.18)
  2. kubectl configured to access your cluster
  3. Helm (version >= 3.5) - Install Helm
  4. (Recommended) Linux kernel version >= 4.19 for best performance

Installation

Step 1: Install Koordinator

Add the Koordinator Helm repository:

helm repo add koordinator-sh https://koordinator-sh.github.io/charts/
helm repo update

Install Koordinator (latest stable version):

helm install koordinator koordinator-sh/koordinator --version 1.6.0

Verify the installation:

kubectl get pod -n koordinator-system

Expected output (all pods should be Running):

NAME                    READY   STATUS    RESTARTS   AGE
koord-descheduler-xxx   1/1     Running   0          2m
koord-manager-xxx       1/1     Running   0          2m
koord-manager-xxx       1/1     Running   0          2m
koord-scheduler-xxx     1/1     Running   0          2m
koord-scheduler-xxx     1/1     Running   0          2m
koordlet-xxx            1/1     Running   0          2m
koordlet-xxx            1/1     Running   0          2m

Step 2: Verify Priority Classes

Check that Koordinator PriorityClasses are created:

kubectl get priorityclass | grep koord

Expected output:

koord-batch   5000   false   10m
koord-free    3000   false   10m
koord-mid     7000   false   10m
koord-prod    9000   false   10m

Running Your First Batch Workload

Method 1: Using ClusterColocationProfile (Recommended)

A ClusterColocationProfile automatically injects colocation configuration into pods based on labels. This is the easiest way to onboard batch workloads.

Step 1: Create a Namespace

kubectl create namespace batch-demo
kubectl label namespace batch-demo koordinator.sh/enable-colocation=true
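You can verify that the label was applied:

# The namespace should show koordinator.sh/enable-colocation=true
kubectl get namespace batch-demo --show-labels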

Step 2: Create ClusterColocationProfile

Create batch-colocation-profile.yaml:

apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: batch-workload-profile
spec:
  # Match namespace with label
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  # Match pods with label
  selector:
    matchLabels:
      app-type: batch
  # Set QoS to BE for batch workloads
  qosClass: BE
  # Set priority class
  priorityClassName: koord-batch
  # Use Koordinator scheduler
  schedulerName: koord-scheduler
  # Add labels for tracking
  labels:
    koordinator.sh/mutated: "true"

Apply the profile:

kubectl apply -f batch-colocation-profile.yaml
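Confirm the profile was created (the resource name is just the lowercase form of the ClusterColocationProfile kind):

# The profile is cluster-scoped, so no namespace is needed
kubectl get clustercolocationprofile batch-workload-profile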

Step 3: Create a Batch Job

Create batch-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
  namespace: batch-demo
spec:
  completions: 1
  template:
    metadata:
      labels:
        app-type: batch   # This label triggers the profile
    spec:
      containers:
      - name: worker
        image: python:3.9
        command:
        - python
        - -c
        - |
          import time
          print("Starting data processing...")
          for i in range(60):
              # Simulate data processing
              time.sleep(1)
              print(f"Processing batch {i+1}/60...")
          print("Job completed!")
        resources:
          requests:
            cpu: "2"        # Will be converted to batch-cpu
            memory: "4Gi"   # Will be converted to batch-memory
          limits:
            cpu: "2"
            memory: "4Gi"
      restartPolicy: Never

Apply the job:

kubectl apply -f batch-job.yaml
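You can watch the pod start and follow the job's output:

# Watch the batch pod get scheduled and run
kubectl get pods -n batch-demo -l app-type=batch -w

# Follow the logs of the job's pod
kubectl logs -n batch-demo job/data-processing-job -f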

Step 4: Verify the Configuration

Check that the pod has been configured correctly:

kubectl get pod -n batch-demo -l app-type=batch -o yaml

You should see the colocation configurations automatically injected:

metadata:
  labels:
    koordinator.sh/qosClass: BE        # ✅ QoS injected
    koordinator.sh/mutated: "true"     # ✅ Profile applied
spec:
  priorityClassName: koord-batch       # ✅ Priority set
  schedulerName: koord-scheduler       # ✅ Using Koordinator scheduler
  containers:
  - name: worker
    resources:
      limits:
        kubernetes.io/batch-cpu: "2000"    # ✅ Converted to batch resources
        kubernetes.io/batch-memory: "4Gi"
      requests:
        kubernetes.io/batch-cpu: "2000"
        kubernetes.io/batch-memory: "4Gi"

Method 2: Manual Configuration

If you prefer explicit configuration without ClusterColocationProfile:

Create manual-batch-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: manual-batch-job
  namespace: batch-demo
spec:
  completions: 1
  template:
    metadata:
      labels:
        koordinator.sh/qosClass: BE     # Explicitly set QoS
    spec:
      priorityClassName: koord-batch    # Explicitly set priority
      schedulerName: koord-scheduler    # Use Koordinator scheduler
      containers:
      - name: worker
        image: python:3.9
        command: ["python", "-c", "print('Hello from batch job'); import time; time.sleep(30)"]
        resources:
          requests:
            kubernetes.io/batch-cpu: "1000"    # Use batch resources
            kubernetes.io/batch-memory: "2Gi"
          limits:
            kubernetes.io/batch-cpu: "1000"
            kubernetes.io/batch-memory: "2Gi"
      restartPolicy: Never

Apply the job:

kubectl apply -f manual-batch-job.yaml
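As with the profile-based job, you can check its progress and output:

# Check job completion status
kubectl get job manual-batch-job -n batch-demo

# View the job's output
kubectl logs -n batch-demo job/manual-batch-job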

Monitoring and Verification

Check Node Resources

View node resource allocation:

kubectl get node -o yaml | grep -A 10 "allocatable:"

You should see batch resources available:

allocatable:
  cpu: "8"
  memory: "16Gi"
  kubernetes.io/batch-cpu: "15000"     # Batch CPU available
  kubernetes.io/batch-memory: "20Gi"   # Batch memory available
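For a more compact view across all nodes, a custom-columns query works as well (note the escaped dots in the resource names):

# Print batch-cpu (millicores) and batch-memory allocatable per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,BATCH-CPU:.status.allocatable.kubernetes\.io/batch-cpu,BATCH-MEMORY:.status.allocatable.kubernetes\.io/batch-memory'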

Monitor Resource Usage

Check actual resource usage:

kubectl top nodes
kubectl top pods -n batch-demo

Check Node Metrics

Koordinator creates NodeMetric resources with detailed metrics:

kubectl get nodemetric -o yaml

This shows real-time resource usage, helping Koordinator make scheduling decisions.

Important Considerations

1. Resource Limits

DO:

  • ✅ Always set both requests and limits for batch workloads
  • ✅ Use realistic resource estimates
  • ✅ Set requests == limits for predictable behavior

DON'T:

  • ❌ Don't over-request resources you don't need
  • ❌ Don't omit resource specifications

2. QoS Guarantees

Understand the BE QoS behavior:

  • CPU: BE pods get remaining CPU cycles; may be throttled when LS pods need resources
  • Memory: BE pods can be evicted if memory pressure occurs
  • Priority: BE pods are scheduled after higher-priority pods

3. Workload Suitability

Good for Batch Colocation:

  • ✅ Data processing jobs
  • ✅ Machine learning training
  • ✅ Batch analytics
  • ✅ Video transcoding
  • ✅ Log processing
  • ✅ ETL jobs

Not Suitable:

  • ❌ Latency-sensitive services
  • ❌ Real-time processing
  • ❌ Jobs requiring guaranteed completion time
  • ❌ Stateful services with strict SLA

4. Failure Handling

Batch jobs may be:

  • Throttled: When high-priority pods need CPU
  • Evicted: During memory pressure

Design your batch workloads to handle:

  • Checkpointing: Save progress periodically
  • Retry logic: Use Job backoffLimit and restartPolicy
  • Idempotency: Ensure jobs can safely restart

Example with retry:

apiVersion: batch/v1
kind: Job
metadata:
  name: resilient-batch-job
  namespace: batch-demo
spec:
  backoffLimit: 3               # Retry up to 3 times
  completions: 1
  template:
    metadata:
      labels:
        app-type: batch
    spec:
      restartPolicy: OnFailure  # Retry on failure
      containers:
      - name: worker
        image: your-batch-image
        # ... rest of configuration

5. Scheduler Configuration

For batch workloads, ensure you're using the Koordinator scheduler:

spec:
  schedulerName: koord-scheduler   # Required for batch resource scheduling

Without this, the pod will use the default Kubernetes scheduler and won't benefit from colocation features.
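You can confirm which scheduler handled a pod after the fact (<pod-name> is a placeholder):

# Should print "koord-scheduler" for colocated batch pods
kubectl get pod <pod-name> -n batch-demo -o jsonpath='{.spec.schedulerName}'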

6. Namespace Isolation (Optional)

For better organization, dedicate namespaces to batch workloads:

# Create batch namespace
kubectl create namespace batch-workloads

# Label for colocation
kubectl label namespace batch-workloads koordinator.sh/enable-colocation=true

# Create profile for this namespace
kubectl apply -f batch-colocation-profile.yaml

Common Patterns

Pattern 1: Data Processing Pipeline

apiVersion: batch/v1
kind: Job
metadata:
  name: data-pipeline
  namespace: batch-demo
spec:
  completions: 5    # Process 5 batches
  parallelism: 2    # Run 2 at a time
  template:
    metadata:
      labels:
        app-type: batch
    spec:
      containers:
      - name: processor
        image: data-processor:latest
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
      restartPolicy: OnFailure

Pattern 2: CronJob for Scheduled Batch

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: batch-demo
spec:
  schedule: "0 2 * * *"   # Run at 2 AM daily
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app-type: batch
        spec:
          containers:
          - name: report-generator
            image: report-gen:latest
            resources:
              requests:
                cpu: "2"
                memory: "4Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
          restartPolicy: OnFailure
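To test the CronJob without waiting for the 2 AM schedule, you can trigger a one-off run from it (the job name nightly-report-test below is just an example):

# Create an immediate Job from the CronJob template
kubectl create job nightly-report-test --from=cronjob/nightly-report -n batch-demo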

Troubleshooting

Issue 1: Pod Stuck in Pending

Symptom: Batch pod remains in Pending state

Check:

kubectl describe pod <pod-name> -n batch-demo

Common causes:

  • Insufficient batch resources available
  • Node selector constraints
  • Resource requests too high

Solution: Check node allocatable resources and reduce requests if needed.
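To compare what is already allocated against a node's batch capacity, the node description is a good starting point (<node-name> is a placeholder):

# The "Allocated resources" section includes batch-cpu and batch-memory requests
kubectl describe node <node-name> | grep -A 15 "Allocated resources"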

Issue 2: Pod Evicted Frequently

Symptom: Batch pods are evicted often

Check:

kubectl get events -n batch-demo --sort-by='.lastTimestamp'

Common causes:

  • Memory pressure on nodes
  • High-priority pods need resources
  • Resource overcommitment too aggressive

Solution:

  • Reduce memory requests
  • Use checkpointing to handle evictions
  • Tune Koordinator resource reservation settings (advanced)

Issue 3: Batch Resources Not Available

Symptom: No kubernetes.io/batch-cpu resources on nodes

Check:

kubectl get nodemetric -o yaml
kubectl get pod -n koordinator-system

Solution:

  • Ensure the koordlet DaemonSet is running on all nodes (see the check below)
  • Check koordlet logs: kubectl logs -n koordinator-system koordlet-xxx
  • Verify that nodes report allocatable batch resources
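For example, a quick way to confirm the koordlet agent is deployed on every node, assuming the koordinator-system namespace used earlier in this guide:

# The koordlet DaemonSet should show as many READY pods as you have schedulable nodes
kubectl get daemonset -n koordinator-system

# Show which node each koordlet pod runs on
kubectl get pods -n koordinator-system -o wide | grep koordlet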

Next Steps

After successfully running batch workloads, you can explore:

  1. Advanced Scheduling
  2. Resource Management
  3. Monitoring
  4. Other Batch Frameworks

Summary

In this guide, you learned:

  • ✅ Core concepts: QoS classes, Priority, Resource Model
  • ✅ How to install Koordinator
  • ✅ Two methods to run batch workloads (ClusterColocationProfile and manual)
  • ✅ Important considerations for production use
  • ✅ Common patterns and troubleshooting

Key Takeaways:

  • Use BE QoS and koord-batch priority for batch workloads
  • Leverage ClusterColocationProfile for easy configuration
  • Design for eviction and throttling with retries and checkpointing
  • Monitor resource usage and adjust as needed

Start small with simple batch jobs and gradually increase complexity as you become familiar with Koordinator's behavior!
