Clean Shutdown and Safe Startup of an OpenShift Cluster
This SOP describes a controlled, repeatable procedure to cleanly shut down an OpenShift cluster and bring it back online with minimal risk to etcd integrity, workloads, and cluster operators.
Scope
Applies to planned maintenance windows for a self-managed OpenShift cluster (UPI or similar), including:
- 3 Control Plane (master) nodes
- 8 Worker nodes
- External services required for cluster operation (DNS, HAProxy/LB, time sync, storage endpoints, etc.)
Assumptions
- You have cluster-admin access (`oc`) and console access (iDRAC/iLO/IPMI/VM console) for all nodes.
- The cluster is healthy before shutdown (or you accept additional risk if it is not).
- DNS and Load Balancer (HAProxy) remain reachable during shutdown and startup.
- Time synchronization (NTP/Chrony) remains stable across all nodes.
Roles and Responsibilities
| Role | Responsibility | Notes |
| --- | --- | --- |
| Cluster Admin / SRE | Execute shutdown/startup steps; validate cluster health | Owns oc commands and sequencing |
| Infrastructure/Hardware | Power operations on servers/VMs; console access | Coordinates iDRAC/iLO/VM platform |
| Network Team | Ensure DNS/LB routes remain stable; verify VIP reachability | Must not reboot critical LB/DNS during window |
| Storage Team (if applicable) | Confirm storage backends are available (e.g., CSI) | Avoid storage endpoint outages during startup |
Safety and Risk Notes
Important: etcd safety
The OpenShift control plane relies on etcd. Improper shutdown (or partial startup) can risk etcd corruption or split-brain-like symptoms. Always preserve quorum (at least 2 of 3 control-plane nodes) while the cluster is running, and shut down control-plane nodes only after workloads are stopped and workers are down.
Important: keep these services online
Do NOT shut down DNS or the Load Balancer (HAProxy) if you expect a normal cluster restart. If DNS/LB must also be restarted, restart them in a controlled order and validate both before powering on cluster nodes.
Pre-Shutdown Checklist
Complete the following checks before initiating shutdown:
- Confirm maintenance window approval / change ticket.
- Take a snapshot/backup of critical cluster state if available (etcd backup via `cluster-backup.sh` on a control-plane host, if configured; see the sketch after the health commands below).
- Confirm cluster health (operators, nodes).
- Confirm storage endpoints, DNS, and HAProxy are stable and will remain available.
- Notify application owners; stop/scale down stateful workloads cleanly (databases, message queues) and confirm application-level backups.
Recommended health commands (run from an admin host):
```
oc get nodes -o wide
oc get co
oc get clusterversion
oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
oc -n openshift-etcd get pods -o wide
oc -n openshift-etcd exec -it $(oc -n openshift-etcd get pods -l app=etcd -o name | head -n1) -- etcdctl endpoint health
```
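For the etcd backup item in the checklist, the sketch below shows one way to trigger the backup script from an admin host via a debug pod. The script path and output directory are the OpenShift defaults (verify them on your cluster), and `hpc-ctrln1.example.com` is a placeholder for one of your control-plane FQDNs.

```
# Placeholder control-plane FQDN; substitute one of your own nodes.
CONTROL_PLANE_NODE=hpc-ctrln1.example.com

# Run the backup on the node itself through a debug pod. The resulting etcd
# snapshot and static pod resources land on the node under /home/core/assets/backup.
oc debug node/"${CONTROL_PLANE_NODE}" -- chroot /host \
  /usr/local/bin/cluster-backup.sh /home/core/assets/backup
```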
Clean Shutdown Procedure (Planned)
Use this order: (1) Applications → (2) Workers → (3) Control Plane → (4) Optional infrastructure.
Stop or Quiesce Applications
Goal: ensure application data consistency and avoid noisy recovery on startup.
- Scale down non-critical deployments and cronjobs (optional but recommended).
- For StatefulSets (databases, queues), follow vendor/app runbooks to stop services gracefully.
- Confirm no active writes or critical batch jobs are running.
Helpful commands:
```
oc get projects
oc get pods -n <namespace> -o wide
oc scale deploy/<name> -n <namespace> --replicas=0
oc scale statefulset/<name> -n <namespace> --replicas=0
```
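If you want to scale down a whole namespace at once and restore the exact replica counts later, a minimal sketch follows; the namespace `my-app-namespace` and the local file name are placeholders, and writing a local state file is an assumption of this sketch, not a requirement of the SOP.

```
# Placeholder namespace; repeat for each application namespace.
NS=my-app-namespace

# Record current replica counts so they can be restored during startup.
{
  oc get deploy -n "$NS" -o jsonpath='{range .items[*]}deployment/{.metadata.name} {.spec.replicas}{"\n"}{end}'
  oc get statefulset -n "$NS" -o jsonpath='{range .items[*]}statefulset/{.metadata.name} {.spec.replicas}{"\n"}{end}'
} > "replicas-${NS}.txt"

# Scale Deployments first, then StatefulSets, so clients stop before their backends.
# Follow vendor runbooks for databases/queues where a specific order is required.
oc scale deployment --all -n "$NS" --replicas=0
oc scale statefulset --all -n "$NS" --replicas=0
```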
Drain and Shut Down Worker Nodes
Drain moves pods off nodes safely. Shut down workers first to reduce workload churn and keep etcd/control-plane stable.
For each worker node (hpc-workn1 … hpc-workn8):
- Cordon the node (prevent new scheduling).
- Drain the node (evict pods).
- Shut down the node via console/SSH.
Commands (repeat per worker):
```
oc adm cordon <worker-node-fqdn>
oc adm drain <worker-node-fqdn> --ignore-daemonsets --delete-emptydir-data --force --timeout=20m
# After drain completes, shut down the node (on the node console):
sudo shutdown -h now
```
Validation: ensure workers are NotReady/Powered Off and no longer host workload pods.
```
oc get nodes
oc get pods -A -o wide | egrep '<worker-node-name>|<worker-node-fqdn>' || true
```
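The per-worker commands above can be wrapped in a loop. The sketch below assumes the `hpc-workn1` … `hpc-workn8` naming used in this SOP and a placeholder domain suffix.

```
# Placeholder domain; adjust to your environment.
DOMAIN=example.com

for i in $(seq 1 8); do
  node="hpc-workn${i}.${DOMAIN}"
  oc adm cordon "$node"
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force --timeout=20m
done
# Then power off each drained worker from its console (or via SSH) as in the step above.
```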
Shut Down Control Plane Nodes (Masters)
After ALL workers are shut down and applications are quiesced, shut down control-plane nodes one-by-one. Avoid taking down multiple control-plane nodes simultaneously while the cluster is still actively serving workloads.
Recommended sequence: hpc-ctrln3 → hpc-ctrln2 → hpc-ctrln1 (or any consistent order).
For each control-plane node:
- Optionally cordon and drain (drain may hang if the cluster is already degraded; do not force if it risks etcd).
- Shut down the node from the node console.
Commands (if cluster is still responsive):
```
oc adm cordon <master-node-fqdn>
# Drain with care; masters host critical pods. Use --ignore-daemonsets.
# Avoid --force unless you understand the impact.
oc adm drain <master-node-fqdn> --ignore-daemonsets --delete-emptydir-data --timeout=20m
# Then on the node:
sudo shutdown -h now
```
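If console access is inconvenient, one alternative (a sketch, not a requirement of this SOP) is to schedule a delayed halt on each control-plane node through a debug pod while the API is still reachable. The FQDNs below are placeholders in the recommended shutdown order.

```
# Placeholder FQDNs; substitute your own control-plane nodes.
for node in hpc-ctrln3.example.com hpc-ctrln2.example.com hpc-ctrln1.example.com; do
  # "shutdown -h 1" halts the node after one minute, so the command can be
  # issued to all three nodes before the API becomes unavailable.
  oc debug node/"$node" -- chroot /host shutdown -h 1
done
```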
Optional: Shut Down Non-Cluster Hosts
If required by maintenance, you may shut down supporting systems after the cluster is fully powered off. If you plan to bring the cluster back quickly, keep at least DNS and HAProxy running.
Safe Startup Procedure (Bring Cluster Back Online)
Use this order: (1) DNS/LB/Time/Storage → (2) Control Plane → (3) Workers → (4) Applications.
Validate Prerequisites Before Powering On Nodes
- DNS records resolve for api, api-int, and *.apps (if applicable).
- HAProxy is up and listening on API ports (typically 6443) and any required ingress ports (80/443).
- NTP/Chrony is reachable and time is correct.
- Storage backends are reachable (SAN/NFS/CSI endpoints).
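A quick preflight check from the admin host might look like the following sketch. The cluster domain, load-balancer host, and the availability of `dig`, `nc`, and `chronyc` are assumptions; depending on its configuration, HAProxy may refuse connections on 6443 until backends exist.

```
# Placeholder values; substitute your cluster domain and load-balancer host.
CLUSTER_DOMAIN=mycluster.example.com
LB_HOST=haproxy.example.com

# DNS: api, api-int, and a sample *.apps record should resolve to the expected VIPs.
dig +short "api.${CLUSTER_DOMAIN}"
dig +short "api-int.${CLUSTER_DOMAIN}"
dig +short "test.apps.${CLUSTER_DOMAIN}"

# Load balancer: the API and ingress frontends should accept TCP connections.
for port in 6443 443 80; do
  nc -vz -w 3 "$LB_HOST" "$port"
done

# Time sync on the admin host (repeat on node consoles if in doubt).
chronyc tracking
```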
Power On Control Plane Nodes
Power on all three control-plane nodes first and wait for etcd to form quorum and the API to become responsive.
Power on sequence (example):
- Power on hpc-ctrln1 and wait for it to boot (reach login prompt).
- Power on hpc-ctrln2 and wait for it to boot (reach login prompt).
- Power on hpc-ctrln3 and wait for it to boot (reach login prompt).
Monitor (from your admin host):
```
watch -n 5 'oc get nodes -o wide'
watch -n 5 'oc get co'
oc -n openshift-etcd get pods -o wide
```
Expectations:
- Nodes may be NotReady initially until the network operator and kubelet networking are healthy.
- etcd pods should become Running on each control-plane node.
- The kube-apiserver and openshift-apiserver components should recover as operators converge.
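Once the control-plane nodes are booting, you can wait for the API and then check for pending certificate signing requests; if the cluster was down long enough for kubelet certificates to expire, nodes will not report Ready until their CSRs are approved. The loop and go-template filter below are a sketch of that check.

```
# Wait until the API answers before running further checks.
until oc get nodes >/dev/null 2>&1; do
  echo "API not reachable yet, retrying in 15s..."
  sleep 15
done

# List CSRs and approve any that are still pending
# (review the list first on production clusters).
oc get csr
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve
```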
Power On Worker Nodes
After the control plane is stable (API responsive, etcd healthy), power on workers in batches (e.g., 2-3 at a time).
After each batch, wait for nodes to become Ready before continuing.
```
oc get nodes
oc get pods -n openshift-ovn-kubernetes -o wide
oc get pods -n openshift-multus -o wide
```
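To track each batch, the sketch below counts worker Ready states using the default worker role label and checks for new kubelet CSRs, which can appear as freshly booted workers rejoin; approve them as in the control-plane step above.

```
# Ready/NotReady counts for workers (assumes the default worker role label).
oc get nodes -l node-role.kubernetes.io/worker --no-headers | awk '{print $2}' | sort | uniq -c

# New kubelet CSRs may appear as each batch boots.
oc get csr | grep Pending || echo "no pending CSRs"
```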
Uncordon Nodes
If you cordoned nodes during shutdown, uncordon them after they are Ready:
```
oc adm uncordon <node-fqdn>
```
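To uncordon everything that is still cordoned in one pass, here is a sketch that keys off the `.spec.unschedulable` field that `cordon` sets:

```
# Uncordon every node still marked SchedulingDisabled.
for node in $(oc get nodes -o jsonpath='{range .items[?(@.spec.unschedulable==true)]}{.metadata.name}{"\n"}{end}'); do
  oc adm uncordon "$node"
done
```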
Start Applications
Start applications in dependency order (databases → middleware → stateless apps).
```
oc scale statefulset/<name> -n <namespace> --replicas=<n>
oc scale deploy/<name> -n <namespace> --replicas=<n>
oc get pods -n <namespace> -o wide
```
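If you recorded replica counts with the scale-down sketch earlier, they can be replayed per namespace as follows (placeholder namespace and file name as before):

```
# Placeholder namespace; the file was written by the scale-down sketch.
NS=my-app-namespace

# Each line is "<kind>/<name> <replicas>".
while read -r ref replicas; do
  oc scale "$ref" -n "$NS" --replicas="$replicas"
done < "replicas-${NS}.txt"
```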
Post-Startup Validation
Confirm the cluster is fully healthy:
- All nodes are Ready (or expected nodes are Ready).
- Cluster Operators report Available=True, Progressing=False, and Degraded=False.
- Ingress and DNS are functional; routes resolve.
- Workloads are running and application health checks pass.
Validation commands:
```
oc get nodes -o wide
oc get co
oc get clusterversion
oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
oc get routes -A
```
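To surface only the cluster operators that are not yet healthy, the one-liner below filters the default `oc get co` columns (NAME VERSION AVAILABLE PROGRESSING DEGRADED ...); the column positions are an assumption about the default output format.

```
# Print operators that are not Available, or are Progressing or Degraded.
oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'
```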
Troubleshooting Quick Reference
API Not Reachable After Startup
- Verify HAProxy VIP and backend health (6443).
- Verify DNS for api/api-int points to the correct VIP.
- Check control-plane node status via console and confirm kubelet/crio are running.
- Inspect operator status: `oc get co` (once API is back).
Nodes Stuck NotReady – ‘No CNI configuration file’
This typically indicates the network operator/OVN-Kubernetes pods are not healthy yet, or the node cannot write CNI config to /etc/kubernetes/cni/net.d/. Ensure the control plane is stable first, then verify ovnkube and multus pods.
```
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes -l app=ovnkube-node -c ovnkube-controller --tail=200
oc get pods -n openshift-multus -o wide
```
etcd Health Issues
If etcd is unhealthy, do not power-cycle nodes repeatedly. Stabilize quorum first.
- Confirm at least 2 of 3 etcd members are Running.
- Check etcd endpoint health from an etcd pod.
- Review etcd pod logs for disk or certificate errors.
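These checks can be run from inside one of the etcd pods, where `etcdctl` is already configured with the correct endpoints and certificates. The sketch below assumes the standard `openshift-etcd` namespace and `app=etcd` label used earlier in this SOP; if the default container selected by `oc rsh` does not provide `etcdctl`, add an explicit `-c` container flag.

```
# Pick one running etcd pod.
ETCD_POD=$(oc -n openshift-etcd get pods -l app=etcd -o name | head -n1)

# Member list, per-endpoint health, and endpoint status (leader, DB size, errors).
oc -n openshift-etcd rsh "$ETCD_POD" etcdctl member list -w table
oc -n openshift-etcd rsh "$ETCD_POD" etcdctl endpoint health --cluster
oc -n openshift-etcd rsh "$ETCD_POD" etcdctl endpoint status --cluster -w table
```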
Standard Command Snippets
Cluster health
```
oc get co
oc get nodes -o wide
oc get clusterversion
```
Drain a worker
```
oc adm cordon <worker>
oc adm drain <worker> --ignore-daemonsets --delete-emptydir-data --force --timeout=20m
```
Uncordon
```
oc adm uncordon <node>
```
Thanks for your support and encouragement, and happy clustering!