Run LeaderWorkerSet

Run a LeaderWorkerSet as a Kueue-managed workload.

This page shows how to leverage Kueue’s scheduling and resource management capabilities when running LeaderWorkerSet.

We demonstrate how to schedule LeaderWorkerSets where a group of Pods constitutes a unit of admission represented by a Workload. This allows you to scale LeaderWorkerSets up and down group by group.

This integration is based on the Plain Pod Group integration.

This guide is for serving users who have a basic understanding of Kueue. For more information, see Kueue’s overview.

Before you begin

  1. The leaderworkerset.x-k8s.io/leaderworkerset integration is enabled by default.

  2. For Kueue v0.15 and earlier, learn how to install Kueue with a custom manager configuration and ensure that you have the leaderworkerset.x-k8s.io/leaderworkerset integration enabled, for example:

    apiVersion: config.kueue.x-k8s.io/v1beta2
    kind: Configuration
    integrations:
      frameworks:
      - "leaderworkerset.x-k8s.io/leaderworkerset"
    
  3. Check Administer cluster quotas for details on the initial Kueue setup.

Running a LeaderWorkerSet admitted by Kueue

When running a LeaderWorkerSet on Kueue, take into consideration the following aspects:

a. Queue selection

The target local queue should be specified in the metadata.labels section of the LeaderWorkerSet configuration.

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue

b. Configure the resource needs

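The resource needs of the workload can be configured in spec.leaderWorkerTemplate.leaderTemplate.spec.containers for the leader Pod and in spec.leaderWorkerTemplate.workerTemplate.spec.containers for the worker Pods.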

spec:
  leaderWorkerTemplate:
    leaderTemplate:
      spec:
        containers:
          - resources:
              requests:
                cpu: "100m"
    workerTemplate:
      spec:
        containers:
          - resources:
              requests:
                cpu: "100m"

c. Scaling

You can scale a LeaderWorkerSet up or down by changing its .spec.replicas field.

The unit of scaling is an LWS group. By changing the number of replicas in the LWS, you create or delete entire groups of Pods. After a scale-up, each newly created group of Pods is held back by a scheduling gate until the corresponding Workload is admitted.
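As a rough illustration of group-by-group scaling, the fragment below shows the only field that would change when adding one group to the nginx-leaderworkerset sample on this page. It is a sketch, not a complete manifest; the value 3 is an arbitrary choice for illustration.

```yaml
# Fragment of the LeaderWorkerSet spec; all other fields stay unchanged.
spec:
  # Previously 2. Raising it to 3 requests one additional LWS group.
  # With size: 3 in the sample, that group has 1 leader Pod and 2 worker Pods,
  # which stay gated until the group's Workload is admitted.
  replicas: 3
```

Setting the value back to 2 would delete that entire group again. One way to apply the change is to edit the manifest and update the object, for example with kubectl apply or kubectl edit.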

Example

Here is a sample LeaderWorkerSet:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: nginx-leaderworkerset
  labels:
    app: nginx
    kueue.x-k8s.io/queue-name: user-queue
spec:
  replicas: 2
  leaderWorkerTemplate:
    leaderTemplate:
      spec:
        containers:
          - name: nginx-leader
            image: registry.k8s.io/nginx-slim:0.27
            resources:
              requests:
                cpu: "100m"
            ports:
              - containerPort: 80
    size: 3
    workerTemplate:
      spec:
        containers:
          - name: nginx-worker
            image: registry.k8s.io/nginx-slim:0.27
            resources:
              requests:
                cpu: "200m"
            ports:
              - containerPort: 80

You can create the LeaderWorkerSet using the following command:

kubectl create -f sample-leaderworkerset.yaml

Configure Topology Aware Scheduling

For performance-sensitive workloads like large-scale inference or distributed training, you may require the Leader and Worker pods to be co-located within a specific network topology domain (e.g., a rack or a data center block) to minimize latency.

Kueue supports Topology Aware Scheduling (TAS) for LeaderWorkerSet by reading annotations from the Pod templates. To enable this:

  • Configure the cluster for Topology Aware Scheduling.
  • Add the kueue.x-k8s.io/podset-required-topology annotation to both the leaderTemplate and the workerTemplate.
  • Add the kueue.x-k8s.io/podset-group-name annotation to both the leaderTemplate and the workerTemplate with the same value. This ensures that the Leader and Workers are scheduled in the same topology domain.
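For the first step, the cluster needs a Topology object whose levels include the node label referenced by the annotations. The following is a minimal sketch under stated assumptions: the apiVersion may differ depending on your Kueue release, and the object name "default" and the block-level label are illustrative rather than required. See the Topology Aware Scheduling documentation for the authoritative setup.

```yaml
# Assumption: the apiVersion depends on the installed Kueue version.
apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
  name: "default"            # illustrative name
spec:
  levels:
  # Levels are ordered from the widest domain to the narrowest.
  - nodeLabel: "cloud.provider.com/topology-block"   # illustrative level
  - nodeLabel: "cloud.provider.com/topology-rack"    # level used by the annotations
  - nodeLabel: "kubernetes.io/hostname"
```

The ResourceFlavor backing the target queue would then reference this object through its spec.topologyName field.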

Example: Rack-Level Co-location

The following example uses the podset-group-name annotation to ensure that the Leader and all Workers are scheduled within the same rack (represented by the cloud.provider.com/topology-rack label).

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: nginx-leaderworkerset
  labels:
    app: nginx
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  replicas: 2
  leaderWorkerTemplate:
    leaderTemplate:
      metadata:
        annotations:
          # Require leader to be in the topology domain
          kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
          # Identify the group to ensure co-location with workers
          kueue.x-k8s.io/podset-group-name: "lws-group"
      spec:
        containers:
        - name: nginx-leader
          image: registry.k8s.io/nginx-slim:0.27
          resources:
            requests:
              cpu: "100m"
          ports:
          - containerPort: 80
    size: 3
    workerTemplate:
      metadata:
        annotations:
          # Require workers to be in the same topology domain
          kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rack"
          # Identify the group to ensure co-location with leader
          kueue.x-k8s.io/podset-group-name: "lws-group"
      spec:
        containers:
        - name: nginx-worker
          image: registry.k8s.io/nginx-slim:0.27
          resources:
            requests:
              cpu: "200m"
              nvidia.com/gpu: "1"
            limits:
              # Extended resources such as GPUs need a limit equal to the request
              nvidia.com/gpu: "1"
          ports:
          - containerPort: 80

When replicas is greater than 1 (as in the example above where replicas: 2), the topology constraints apply to each replica individually. This means that for each replica, the Leader and its Workers will be co-located in the same topology domain (e.g., rack), but different replicas may be assigned to different topology domains.

MultiKueue

Check MultiKueue for details on running LeaderWorkerSets in a MultiKueue environment.

Troubleshooting

For general troubleshooting guidance, see the Kueue troubleshooting guide.

Long LeaderWorkerSet names

By default, Kueue stores the pod-group identifier in the kueue.x-k8s.io/pod-group-name label, which inherits Kubernetes' 63-character label value limit. Because Kueue derives this value by appending a group suffix to the LeaderWorkerSet name, the effective LWS name limit is 39 characters.

Enable the alpha WorkloadIdentifierAnnotations feature gate to store the identifier in an annotation instead, removing Kueue’s label-length constraint. With the feature gate enabled, the effective limit shifts to the upstream LWS constraint of 51 characters (see Unable to Create LWS Object with a Name Exceeding 51 Characters in the LWS troubleshooting guide).
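One possible way to turn the gate on is through the manager configuration. This is a sketch that assumes your Kueue version accepts a featureGates map in the Configuration object; if it does not, the --feature-gates argument of the Kueue manager is the alternative. Alpha gates can change or disappear between releases, so check the feature gate list for your version first.

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta2
kind: Configuration
# Assumption: this Kueue version supports the featureGates field.
featureGates:
  WorkloadIdentifierAnnotations: true
```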