Descheduling
Overview
Descheduling involves evicting pods based on specific policies so that the pods can be rescheduled onto more appropriate nodes.
Your cluster can benefit from descheduling and rescheduling already-running pods for various reasons:
-
Nodes are under- or over-utilized.
-
Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
-
Node failure requires pods to be moved.
-
New nodes are added to clusters.
The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.
It is important to note that there are a number of core components, such as DNS, that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster may stop working properly if the component is evicted. To prevent the descheduler from removing these pods, configure the pod as a critical pod by adding the scheduler.alpha.kubernetes.io/critical-pod
annotation to the pod specification.
The descheduler job is considered a critical pod, which prevents the descheduler pod from being evicted by the descheduler. |
The descheduler job and descheduler pod are created in the kube-system
project, which is created by default.
The descheduler is a Technology Preview feature only. |
The descheduler does not evict the following types of pods:
-
Critical pods (with the
scheduler.alpha.kubernetes.io/critical-pod
annotation). -
Pods (static and mirror pods or pods in standalone mode) not associated with a Replica Set, Replication Controller, Deployment, or Job (because these pods are not recreated).
-
Pods associated with DaemonSets.
-
Pods with local storage.
-
Pods subject to Pod Disruption Budget (PDB) are not evicted if descheduling violates the PDB. The pods can be evicted using an eviction policy.
Best efforts pods are evicted before Burstable and Guaranteed pods. |
The following sections describe the process to configure and run the descheduler:
-
Define the descheduling behavior in a policy file.
-
Create the descheduler job configuration.
Creating a Cluster Role
To configure the necessary permissions for the descheduler to work in a pod:
-
Create a cluster role with the following rules:
kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1beta1 metadata: name: descheduler-cluster-role rules: - apiGroups: [""] resources: ["nodes"] verbs: ["get", "watch", "list"] (1) - apiGroups: [""] resources: ["pods"] verbs: ["get", "watch", "list", "delete"] (2) - apiGroups: [""] resources: ["pods/eviction"] (3) verbs: ["create"]
1 Configures the role to allow viewing nodes. 2 Configures the role to allow viewing and deleting pods. 3 Allows a node to evict pods bound to itself. -
Create the service account which will be used to run the job:
# oc create sa <file-name>.yaml -n kube-system
For example:
# oc create sa descheduler-sa.yaml -n kube-system
-
Bind the cluster role to the service account:
# oc create clusterrolebinding descheduler-cluster-role-binding \ --clusterrole=<cluster-role-name> \ --serviceaccount=kube-system:<service-account-name>
For example:
# oc create clusterrolebinding descheduler-cluster-role-binding \ --clusterrole=descheduler-cluster-role \ --serviceaccount=kube-system:descheduler-sa
Creating Descheduler Policies
You can configure the descheduler to remove pods from nodes that violate rules defined by strategies in a YAML policy file. Include a path to the policy file in the job specification to apply the specific descheduling strategy.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveDuplicates":
enabled: false
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 20
"memory": 20
"pods": 20
targetThresholds:
"cpu" : 50
"memory": 50
"pods": 50
numberOfNodes: 3
"RemovePodsViolatingInterPodAntiAffinity":
enabled: true
There are three default strategies that can be used with the descheduler:
-
Remove duplicate pods (
RemoveDuplicates
) -
Move pods to underutilized nodes (
LowNodeUtilization
) -
Remove pods that violate anti-affinity rules (
RemovePodsViolatingInterPodAntiAffinity
).
You can configure and disable parameters associated with strategies as needed.
Removing Duplicate Pods
The RemoveDuplicates
strategy ensures that there is only one pod associated with a Replica Set, Replication Controller, Deployment Configuration, or Job running on same node. If there are other pods associated with those objects, the duplicate pods are evicted. Removing duplicate pods results in better spreading of pods in a cluster.
For example, duplicate pods could happen if a node fails and the pods on the node are moved to another node, leading to more than one pod associated with an Replica Set or Replication Controller, running on same node. After the failed node is ready again, this strategy could be used to evict those duplicate pods.
There are no parameters associated with this strategy.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveDuplicates":
enabled: false (1)
1 | Set this value to enabled: true to use this policy. Set to false to disable this policy. |
Creating a Low Node Utilization Policy
The LowNodeUtilization
strategy finds nodes that are underutilized and evicts pods from other nodes so that the evicted pods can be scheduled on these underutilized nodes.
The underutilization of nodes is determined by a configurable threshold, thresholds
, for CPU, memory, or number of pods (based on percentage). If a node usage is below all these thresholds, the node is considered underutilized and the descheduler can evict pods from other nodes. Pods request resource requirements are considered when computing node resource utilization.
A high threshold value, targetThresholds
is used to determine properly utilized nodes. Any node that is between the thresholds and targetThresholds is considered properly utilized and is not considered for eviction. The threshold, targetThresholds
, can be configured for CPU, memory, and number of pods (based on percentage).
These thresholds could be tuned for your cluster requirements.
The numberOfNodes
parameter can be configured to activate the strategy only when number of underutilized nodes is above the configured value. Set this parameter if it is acceptable for a few nodes to go underutilized. By default, numberOfNodes
is set to zero.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds: (1)
"cpu" : 20
"memory": 20
"pods": 20
targetThresholds: (2)
"cpu" : 50
"memory": 50
"pods": 50
numberOfNodes: 3 (3)
1 | Set the low-end threshold. If the node is below all three values, the descheduler considers the node underutilized. |
2 | Set the high-end threshold. If the node is below these values and above the threshold values, the descheduler considers the node properly utilized. |
3 | Set the number of nodes that can be underutilized before the descheduler will evict pods from underutilized nodes. |
Remove Pods Violating Inter-Pod Anti-Affinity
The RemovePodsViolatingInterPodAntiAffinity
strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.
For example, Node1 has podA, podB, and podC. podB and podC have anti-affinity rules that prohibit them from running on the same node as podA. podA will be evicted from the node so that podB and podC can run on that node. This situation could happen if the anti-affinity rule was applied when podB and podC were running on the node.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingInterPodAntiAffinity": (1)
enabled: true
1 | Set this value to enabled: true to use this policy. Set to false to disable this policy. |
Create a Configuration Map for the Descheduler Policy
Create a configuration map for the descheduler policy file in the kube-system
project, so that it can be referenced by the descheduler job.
# oc create configmap descheduler-policy-configmap \ -n kube-system --from-file=<path-to-policy-dir/policy.yaml> (1)
1 | The path to the policy file you created. |
Create the Job Specification
Create a job configuration for the descheduler.
apiVersion: batch/v1
kind: Job
metadata:
name: descheduler-job
namespace: kube-system
spec:
parallelism: 1
completions: 1
template:
metadata:
name: descheduler-pod (1)
annotations:
scheduler.alpha.kubernetes.io/critical-pod: "true" (2)
spec:
containers:
- name: descheduler
image: descheduler
volumeMounts: (3)
- mountPath: /policy-dir
name: policy-volume
command:
- "/bin/sh"
- "-ec"
- |
/bin/descheduler --policy-config-file /policy-dir/policy.yaml (4)
restartPolicy: "Never"
serviceAccountName: descheduler-sa (5)
volumes:
- name: policy-volume
configMap:
name: descheduler-policy-configmap
1 | Specify a name for the job. |
2 | Configures the pod so that it will not be descheduled. |
3 | The volume name and mount path in the container where the job should be mounted. |
4 | Path in the container where the policy file you created will be stored. |
5 | Specify the name of the service account you created. |
The policy file is mounted as a volume from the configuration map.