An important network switch just failed and took down the GPFS filesystems on TRE. We will borrow a replacement switch from the IT department ASAP.
09:50
New switch in place. Rebooting the nodes to get everything back in shape.
10:12
Everything on node TRE is up. Rebooting node TO.
10:26
TO is all up. Rebooting node EN.
10:48
All nodes are up. /migrate and /net/bcmhsm are also resolved.
Total downtime:
09:10-10:48 = 1:38 on en, to, tre and fire.
Fimm was mostly unaffected; only jobs accessing /home/parallab were hit.