
Cluster Node Management

General things to look at when rebooting a node

  1. Does the node have any rook-ceph-osd-* pods? Check the corresponding Ceph cluster for health, and make sure you bring down only one node of that cluster at a time.
  2. Does the node have any haproxy-ingress-* pods? If the node will be down for a long time, disable its record in Constellix DNS.
  3. Does the node have any seaweedfs/[*volume*|seaweed-master|filer-db] pods? Make sure SeaweedFS is not being actively used.
  4. Does the node have the label? If so, this node is a LINSTOR server. Some LINSTOR servers are redundant, some are not.
  5. Does the node have the label? There are 2 nodes in the cluster used for MetalLB IPs; keep at least one of them alive.
  6. Does the node have the label and is not an Admiralty virtual node? Rebooting this node will make the cluster inaccessible.
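One way to work through the pod checks above is to list every pod scheduled on the node and scan for the services mentioned. A minimal sketch, assuming kubectl access; <nodename> is a placeholder, and the grep pattern simply mirrors the pod names from the checklist:

```shell
# Pod-name patterns from the pre-reboot checklist above.
CHECKLIST_PATTERN='rook-ceph-osd-|haproxy-ingress-|seaweed'

# flag_checklist_pods: filter a pod listing down to the services that need
# special handling before a reboot.
flag_checklist_pods() {
  grep -E "$CHECKLIST_PATTERN"
}

# On the cluster (requires kubectl; <nodename> is a placeholder):
# kubectl get pods --all-namespaces -o wide \
#   --field-selector spec.nodeName=<nodename> | flag_checklist_pods
```

The label checks (items 4-6) still need a separate look, e.g. kubectl get node <nodename> --show-labels, since the exact label names differ per cluster.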


Set up Ansible

  1. Install Ansible on a local computer.

  2. Clone the repo of ansible playbooks:

       git clone

  3. Pull the latest updates from the playbook repo:

       cd nautilus-ansible
       git pull
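Before running any playbook, it can help to confirm Ansible can actually reach the target host. A sketch using the standard ad-hoc ping module and the inventory path from this repo; it prints the command rather than running it, so it can be reviewed without cluster access:

```shell
# ping_cmd: print the ad-hoc Ansible command that checks SSH connectivity
# to a node using the repo's inventory. Run the printed command once the
# node name (as it appears in the inventory) looks right.
ping_cmd() {
  echo "ansible -i nautilus-ansible/nautilus-hosts.yaml $1 -m ping"
}

# Example:
# ping_cmd node-1.example.edu
```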

Reboot a node due to GPU failure

ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>
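If the node will be unavailable for a while, cordoning and draining it first lets pods reschedule cleanly. This is general Kubernetes practice, not something the reboot playbook is documented to do; the sketch below only prints the commands so they can be reviewed first (the --delete-emptydir-data flag assumes a recent kubectl):

```shell
# drain_cmd: print the kubectl commands to cordon and drain a node before
# a long reboot. Printed rather than executed so this stays a dry run;
# run the output (and 'kubectl uncordon' afterwards) once verified.
drain_cmd() {
  node="$1"
  echo "kubectl cordon $node"
  echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
}

# Example:
# drain_cmd node-1.example.edu
```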

Special instruction to reboot Ceph nodes

If multiple nodes within a Ceph cluster need to be rebooted, in order to maintain enough OSDs for redundancy, only one node is allowed to be rebooted at a time.

Run this command to enter the rook-ceph-tools pod shell, where <namespace> is the namespace of the corresponding Ceph cluster (one of rook, rook-east, rook-pacific, rook-haosu, rook-suncave):

kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath='{.items[0].metadata.name}') -- bash

In the pod shell, run

watch ceph health detail

Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
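Watching the output manually works; the same wait can also be scripted. A minimal sketch of the check, assuming the health output format shown above; the commented loop is what you would actually run inside the tools pod:

```shell
# osds_down: succeed (exit 0) when the given 'ceph health detail' output
# still reports down OSDs, i.e. it is NOT yet safe to reboot the next node.
osds_down() {
  printf '%s\n' "$1" | grep -q 'OSD_DOWN'
}

# Inside the rook-ceph-tools pod (uncomment to use):
# while osds_down "$(ceph health detail)"; do
#   echo "OSDs still down; waiting..."
#   sleep 30
# done
```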