Descheduling involves evicting pods based on specific policies so that the pods can be rescheduled onto more appropriate nodes.
Your cluster can benefit from descheduling and rescheduling already-running pods for various reasons:
Nodes are under- or over-utilized.
Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
Node failure requires pods to be moved.
New nodes are added to clusters.
The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.
It is important to note that there are a number of core components, such as DNS, that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster may stop working properly if the component is evicted. To prevent the descheduler from removing these pods, configure the pod as a critical pod by adding the
scheduler.alpha.kubernetes.io/critical-pod annotation to the pod specification.
The descheduler job is considered a critical pod, which prevents the descheduler pod from being evicted by the descheduler.
The descheduler job and descheduler pod are created in the
kube-system project, which is created by default.
The descheduler is a Technology Preview feature only.
The descheduler does not evict the following types of pods:
Critical pods (with the
Pods (static and mirror pods or pods in standalone mode) not associated with a Replica Set, Replication Controller, Deployment, or Job (because these pods are not recreated).
Pods associated with DaemonSets.
Pods with local storage.
Best efforts pods are evicted before Burstable and Guaranteed pods.
The following sections describe the process to configure and run the descheduler:
To configure the necessary permissions for the descheduler to work in a pod:
Create a cluster role with the following rules:
kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1beta1 metadata: name: descheduler-cluster-role rules: - apiGroups: [""] resources: ["nodes"] verbs: ["get", "watch", "list"] (1) - apiGroups: [""] resources: ["pods"] verbs: ["get", "watch", "list", "delete"] (2) - apiGroups: [""] resources: ["pods/eviction"] (3) verbs: ["create"]
1 Configures the role to allow viewing nodes. 2 Configures the role to allow viewing and deleting pods. 3 Allows a node to evict pods bound to itself.
Create the service account which will be used to run the job:
# oc create sa <file-name>.yaml -n kube-system
# oc create sa descheduler-sa.yaml -n kube-system
Bind the cluster role to the service account:
# oc create clusterrolebinding descheduler-cluster-role-binding \ --clusterrole=<cluster-role-name> \ --serviceaccount=kube-system:<service-account-name>
# oc create clusterrolebinding descheduler-cluster-role-binding \ --clusterrole=descheduler-cluster-role \ --serviceaccount=kube-system:descheduler-sa
You can configure the descheduler to remove pods from nodes that violate rules defined by strategies in a YAML policy file. Include a path to the policy file in the job specification to apply the specific descheduling strategy.
apiVersion: "descheduler/v1alpha1" kind: "DeschedulerPolicy" strategies: "RemoveDuplicates": enabled: false "LowNodeUtilization": enabled: true params: nodeResourceUtilizationThresholds: thresholds: "cpu" : 20 "memory": 20 "pods": 20 targetThresholds: "cpu" : 50 "memory": 50 "pods": 50 numberOfNodes: 3 "RemovePodsViolatingInterPodAntiAffinity": enabled: true
There are three default strategies that can be used with the descheduler:
Remove duplicate pods (
Move pods to underutilized nodes (
Remove pods that violate anti-affinity rules (
You can configure and disable parameters associated with strategies as needed.
RemoveDuplicates strategy ensures that there is only one pod associated with a Replica Set, Replication Controller, Deployment Configuration, or Job running on same node. If there are other pods associated with those objects, the duplicate pods are evicted. Removing duplicate pods results in better spreading of pods in a cluster.
For example, duplicate pods could happen if a node fails and the pods on the node are moved to another node, leading to more than one pod associated with an Replica Set or Replication Controller, running on same node. After the failed node is ready again, this strategy could be used to evict those duplicate pods.
There are no parameters associated with this strategy.
apiVersion: "descheduler/v1alpha1" kind: "DeschedulerPolicy" strategies: "RemoveDuplicates": enabled: false (1)
|1||Set this value to
LowNodeUtilization strategy finds nodes that are underutilized and evicts pods from other nodes so that the evicted pods can be scheduled on these underutilized nodes.
The underutilization of nodes is determined by a configurable threshold,
thresholds, for CPU, memory, or number of pods (based on percentage). If a node usage is below all these thresholds, the node is considered underutilized and the descheduler can evict pods from other nodes. Pods request resource requirements are considered when computing node resource utilization.
A high threshold value,
targetThresholds is used to determine properly utilized nodes. Any node that is between the thresholds and targetThresholds is considered properly utilized and is not considered for eviction. The threshold,
targetThresholds, can be configured for CPU, memory, and number of pods (based on percentage).
These thresholds could be tuned for your cluster requirements.
numberOfNodes parameter can be configured to activate the strategy only when number of underutilized nodes is above the configured value. Set this parameter if it is acceptable for a few nodes to go underutilized. By default,
numberOfNodes is set to zero.
apiVersion: "descheduler/v1alpha1" kind: "DeschedulerPolicy" strategies: "LowNodeUtilization": enabled: true params: nodeResourceUtilizationThresholds: thresholds: (1) "cpu" : 20 "memory": 20 "pods": 20 targetThresholds: (2) "cpu" : 50 "memory": 50 "pods": 50 numberOfNodes: 3 (3)
|1||Set the low-end threshold. If the node is below all three values, the descheduler considers the node underutilized.|
|2||Set the high-end threshold. If the node is below these values and above the
|3||Set the number of nodes that can be underutilized before the descheduler will evict pods from underutilized nodes.|
RemovePodsViolatingInterPodAntiAffinity strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.
For example, Node1 has podA, podB, and podC. podB and podC have anti-affinity rules that prohibit them from running on the same node as podA. podA will be evicted from the node so that podB and podC can run on that node. This situation could happen if the anti-affinity rule was applied when podB and podC were running on the node.
apiVersion: "descheduler/v1alpha1" kind: "DeschedulerPolicy" strategies: "RemovePodsViolatingInterPodAntiAffinity": (1) enabled: true
|1||Set this value to
Create a configuration map for the descheduler policy file in the
kube-system project, so that it can be referenced by the descheduler job.
# oc create configmap descheduler-policy-configmap \ -n kube-system --from-file=<path-to-policy-dir/policy.yaml> (1)
|1||The path to the policy file you created.|
Create a job configuration for the descheduler.
apiVersion: batch/v1 kind: Job metadata: name: descheduler-job namespace: kube-system spec: parallelism: 1 completions: 1 template: metadata: name: descheduler-pod (1) annotations: scheduler.alpha.kubernetes.io/critical-pod: "true" (2) spec: containers: - name: descheduler image: descheduler volumeMounts: (3) - mountPath: /policy-dir name: policy-volume command: - "/bin/sh" - "-ec" - | /bin/descheduler --policy-config-file /policy-dir/policy.yaml (4) restartPolicy: "Never" serviceAccountName: descheduler-sa (5) volumes: - name: policy-volume configMap: name: descheduler-policy-configmap
|1||Specify a name for the job.|
|2||Configures the pod so that it will not be descheduled.|
|3||The volume name and mount path in the container where the job should be mounted.|
|4||Path in the container where the policy file you created will be stored.|
|5||Specify the name of the service account you created.|
The policy file is mounted as a volume from the configuration map.