Cluster Node Management


  1. Install Ansible on a local computer.

  2. Clone the repo of Ansible playbooks:

      git clone

  3. Pull the latest updates from the playbook repo:

      cd nautilus-ansible
      git pull

Reboot a node due to GPU failure

ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>

Special instruction to reboot Ceph nodes

If multiple nodes within a Ceph cluster need to be rebooted, reboot only one node at a time so that enough OSDs stay up to maintain redundancy.

Run this command to open a shell in the rook-ceph-tools pod, where <namespace> is the namespace of the corresponding Ceph cluster (one of rook, rook-east, rook-pacific, rook-haosu, rook-suncave):

kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath='{.items[0].metadata.name}') -- bash

In the pod shell, run

watch ceph health detail

Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
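
The one-node-at-a-time procedure above could be scripted roughly as follows. This is only a sketch under stated assumptions, not part of the playbook repo: the function names (ceph_health, wait_for_osds_up, rolling_reboot) and the POLL_SECONDS variable are hypothetical, the jsonpath expression used to pick the tools pod is an assumption, and the sketch runs ceph health detail through a non-interactive kubectl exec from the local machine instead of watching it in a pod shell. It also assumes the reboot playbook returns only after the affected OSDs have been marked down.

```shell
# Sketch: reboot Ceph nodes one at a time, waiting for OSDs to recover in between.
# Hypothetical helper names; POLL_SECONDS is an illustrative setting.
POLL_SECONDS="${POLL_SECONDS:-30}"

# Print `ceph health detail` from the rook-ceph-tools pod in namespace $1.
ceph_health() {
  local ns="$1" pod
  # Assumption: the first pod matching the selector is the tools pod.
  pod="$(kubectl get pods -n "$ns" --selector=app=rook-ceph-tools \
    --output=jsonpath='{.items[0].metadata.name}')" || return 1
  kubectl exec -n "$ns" "$pod" -- ceph health detail
}

# Block until the health output can be read and no longer mentions OSD_DOWN.
wait_for_osds_up() {
  local ns="$1" out
  while true; do
    if out="$(ceph_health "$ns")" && ! printf '%s\n' "$out" | grep -q 'OSD_DOWN'; then
      return 0
    fi
    sleep "$POLL_SECONDS"
  done
}

# Reboot each node in turn; stop if a playbook run fails, and never start
# the next reboot while an OSD is still reported down.
rolling_reboot() {
  local ns="$1"; shift
  local node
  for node in "$@"; do
    ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l "$node" || return 1
    wait_for_osds_up "$ns"
  done
}
```

After loading these functions into a shell, a call such as rolling_reboot <namespace> <nodename1> <nodename2> would reboot the listed nodes sequentially. If the cluster already reports OSD_DOWN for an unrelated reason, the loop waits indefinitely, so check ceph health detail by hand before starting.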