Cluster Node Management
Install Ansible on a local computer.
Clone the Ansible playbook repo:
Pull the latest updates from the playbook repo:
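The clone and pull steps above can be combined into one helper. This is a minimal sketch, not the project's own tooling: the function name sync_playbooks, the PLAYBOOK_REPO_URL variable, and the destination directory are placeholders to be replaced with the real repo location, and the pip command in the comments is only one common way to install Ansible.

```shell
#!/usr/bin/env bash
# Sketch: keep a local checkout of the playbook repo up to date.
# PLAYBOOK_REPO_URL and the destination path are hypothetical placeholders.

# Clone the repo on first use; otherwise fast-forward the existing checkout.
sync_playbooks() {
    local repo_url="$1" dest_dir="$2"
    if [ -d "$dest_dir/.git" ]; then
        # Refuse to create merge commits if the local copy has diverged.
        git -C "$dest_dir" pull --ff-only
    else
        git clone "$repo_url" "$dest_dir"
    fi
}

# Example usage (not run automatically):
#   python3 -m pip install --user ansible    # one common way to install Ansible
#   ansible --version                        # confirm the install
#   sync_playbooks "$PLAYBOOK_REPO_URL" "$HOME/playbooks"
```

Running the helper again later only pulls, so the same line covers both the first-time setup and routine updates.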
Reboot a node due to GPU failure
Special instructions for rebooting Ceph nodes
If multiple nodes in a Ceph cluster need to be rebooted, reboot only one node at a time so that enough OSDs stay up to maintain redundancy.
Run this command to enter the rook-ceph-tools pod shell, where
In the pod shell, run ceph health detail and wait for the warning
[WRN] OSD_DOWN: 1 osds down
to disappear from the ceph health detail output before rebooting the next node.
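The wait between reboots can also be scripted from outside the pod. The sketch below assumes the Rook defaults, namely a rook-ceph namespace and a rook-ceph-tools deployment; adjust both if this cluster uses different names. The function names and the ROOK_NAMESPACE variable are illustrative, not part of Rook or Ceph.

```shell
#!/usr/bin/env bash
# Sketch: block until "ceph health detail" no longer reports OSD_DOWN.
# Assumes namespace "rook-ceph" and deployment "rook-ceph-tools" (Rook defaults).

# Print the current health report by running ceph inside the toolbox pod.
ceph_health_detail() {
    kubectl -n "${ROOK_NAMESPACE:-rook-ceph}" exec deploy/rook-ceph-tools -- ceph health detail
}

# Poll every <interval> seconds (default 30) until OSD_DOWN is gone.
# A failed kubectl call is treated as "not healthy yet", never as success.
wait_for_osds_up() {
    local interval="${1:-30}"
    local report
    while true; do
        if report="$(ceph_health_detail)" && ! printf '%s\n' "$report" | grep -q 'OSD_DOWN'; then
            return 0
        fi
        echo "OSDs still down or cluster unreachable; retrying in ${interval}s" >&2
        sleep "$interval"
    done
}
```

Call wait_for_osds_up after a rebooted node has rejoined the cluster and move on to the next node only once it returns. It checks for OSD_DOWN only, so other health warnings still need a manual look.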