Cluster Node Management
Prerequisites
- Install Ansible on a local computer.
- Clone the repo of Ansible playbooks:
- Pull the latest updates from the playbook repo:
Reboot a node due to GPU failure
Special instruction to reboot Ceph nodes
If multiple nodes in a Ceph cluster need to be rebooted, reboot only one node at a time so that enough OSDs stay up to maintain redundancy.
Run this command to enter the rook-ceph-tools pod shell, where <namespace> is the namespace of the Ceph cluster (rook, rook-east, rook-pacific, rook-haosu, or rook-suncave):
kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath={.items..metadata.name}) -- bash
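The command above can also be split into two steps, which makes it easier to notice when no tools pod is found. The following is a minimal sketch, not part of the playbooks: the function name ceph_tools_shell is hypothetical, it assumes kubectl is already configured for the cluster, and it selects the first matching pod if there is more than one.

```shell
# Hypothetical helper: open a shell in the rook-ceph-tools pod of a namespace.
# Assumes kubectl is already configured for the target cluster.
ceph_tools_shell() {
  ns="$1"
  if [ -z "$ns" ]; then
    echo "usage: ceph_tools_shell <namespace>" >&2
    return 2
  fi
  # Look up the tools pod by label; take the first match.
  pod="$(kubectl get pods -n "$ns" --selector=app=rook-ceph-tools \
    --output=jsonpath='{.items[0].metadata.name}')"
  if [ -z "$pod" ]; then
    echo "no rook-ceph-tools pod found in namespace $ns" >&2
    return 1
  fi
  # Same exec call as the one-liner, with the pod name resolved beforehand.
  kubectl exec -it -n "$ns" "$pod" -- bash
}
```

For example, ceph_tools_shell rook-east opens the tools shell for the rook-east cluster.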
In the pod shell, run ceph health detail.
Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
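Instead of re-running the health check by hand, the wait can be scripted inside the pod shell. The sketch below polls ceph health detail until OSD_DOWN no longer appears in its output. The function name wait_for_osds_up and the HEALTH_CMD and POLL_SECONDS variables are hypothetical conveniences for this example, not Ceph or Rook settings, and the 30-second default interval is an arbitrary choice.

```shell
# Hypothetical helper: block until OSD_DOWN no longer appears in the
# output of `ceph health detail`. Intended for the rook-ceph-tools pod shell.
# HEALTH_CMD and POLL_SECONDS are illustrative knobs, not Ceph settings.
wait_for_osds_up() {
  health_cmd="${HEALTH_CMD:-ceph health detail}"
  poll_seconds="${POLL_SECONDS:-30}"
  while :; do
    # Stop rather than guess if the health command itself fails.
    if ! output="$($health_cmd)"; then
      echo "health command failed: $health_cmd" >&2
      return 1
    fi
    case "$output" in
      *OSD_DOWN*)
        echo "OSD_DOWN still reported; retrying in ${poll_seconds}s"
        ;;
      *)
        echo "OSD_DOWN no longer reported"
        return 0
        ;;
    esac
    sleep "$poll_seconds"
  done
}
```

Run wait_for_osds_up after each reboot and start the next one only once it returns. The script checks for OSD_DOWN only, so it is still worth reading the full ceph health detail output for other warnings before proceeding.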