I told colleagues to remove their data from /nfs2 to some other disks with redundancy protection this morning and right in the afternoon host “nfs2.aqua-core.tll” crashes. I suspect it should be because of huge data migration via NFS. Always OSX NFS got problems.
Anyway, I recorded here every steps I did in this to “restart” the cluster remotely without brutely restarting every machine by pushing the power button. And I finally decided to remove all my RemoteMount startup script on each host and mount all the nfs mount after the entire cluster has been fully up. Current situation is:
- nfs.aqua-core.tll hangs and reboots itself.
- All NFS clients of nfs2 hangs and my previous solution of applying “unmount” at program level can not help.
- Any access to /nfs2 hangs and can not be interrupted.
Below is what I did in this situation (all shells are in adm_lsf account)
- Login nfs2 to check up file system with
dfanduptime - As most the machine needs to be reboot, LSF head should be reboot first
- Login libra and check up if there is existing connections
- Arrange clients to disconnect from libra
- Issue
sudo rebooton libra - Re-login libra and check LSF with
bqueuesandbhosts - Restart the nfs1 server with
dsh -w nfs1 "sudo reboot" - Once nfs1 is up, check it
- On libra, issue
dsh -N dmz "sudo reboot"to reboot all the servers in DMZ - On libra, issue
dsh -N farm "sudo reboot"to reboot all the LSF nodes - On libra run
bin/nfs_mountsto mount all the nfs shares for the cluster - On aries, startup the mysql server with
"cd /opt/mysql4; sudo mysqld_safe --user=mysql&" - On taurus, startup optional webservers in
/opt/ensembl15and “/opt/vega” - On taurus, startup proftpd in
~adm_taur/bin/start_ftp