12 ports on one of the switches in the cluster stopped working at 02:00 last night, so we lost the connection to 12 of the nodes for ~7 hours.
Affected nodes:
compute-0-18 compute-0-16 compute-0-11 compute-0-8 compute-0-7 compute-0-6 compute-0-5 compute-0-4 compute-0-3 compute-0-2 compute-0-1 compute-0-0
To resolve the problem, the failing switch had to be rebooted. This led to a short (~30 s) failure/unmount of the /work* and /home/fimm filesystems on all nodes. It is uncertain how this affected running jobs; most seem to have handled it without problems.
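For the record, a check along the lines of the sketch below is roughly what it takes to confirm that the nodes answer again and have /work and /home/fimm mounted. This is only an illustration; the node list is copied from above, but the use of Python, passwordless ssh to the nodes and the exact mount point names on the nodes are assumptions, not something taken from this incident.

#!/usr/bin/env python
# Rough sketch: ping each affected node and check its mounts over ssh.
import os
import subprocess

DEVNULL = open(os.devnull, "w")

NODES = ["compute-0-%d" % i for i in list(range(0, 9)) + [11, 16, 18]]
MOUNTS = ["/work", "/home/fimm"]

def reachable(node):
    # One ICMP echo request with a two-second timeout; exit code 0 means a reply.
    return subprocess.call(["ping", "-c", "1", "-W", "2", node],
                           stdout=DEVNULL, stderr=DEVNULL) == 0

def has_mount(node, path):
    # 'mountpoint -q' on the node exits 0 only if the path is an active mount point.
    return subprocess.call(["ssh", node, "mountpoint", "-q", path]) == 0

for node in NODES:
    if not reachable(node):
        print("%s: no answer to ping" % node)
        continue
    missing = [m for m in MOUNTS if not has_mount(node, m)]
    if missing:
        print("%s: up, but missing %s" % (node, " and ".join(missing)))
    else:
        print("%s: up, filesystems mounted" % node)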
Downtime
FIMM downtime
FIMM was down for scheduled maintenance 2005/05/09 08:00-10:00 = 2 hours of downtime for the full cluster.
The work that was done was:
o upgraded firmware on SATABlades
o moved /local from the local disk of each node to a shared disk, to save precious space for local /scratch usage (see the sketch below).
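As a rough sketch of what the /local change means on a node (the nfs mount type, the Python check and the exact paths below are assumptions, not part of the maintenance notes): after the move, /local should show up as a network mount, and the local disk should only have to carry /scratch.

#!/usr/bin/env python
# Hypothetical per-node check: is /local now served from the shared disk,
# and how much local space is left for /scratch?
import os

def mount_info(path):
    # /proc/mounts has one "<device> <mountpoint> <fstype> ..." line per mount.
    for line in open("/proc/mounts"):
        device, mountpoint, fstype = line.split()[:3]
        if mountpoint == path:
            return device, fstype
    return None, None

device, fstype = mount_info("/local")
if fstype and fstype.startswith("nfs"):
    print("/local is served over the network from %s" % device)
else:
    print("/local still lives on a local device: %s (%s)" % (device, fstype))

st = os.statvfs("/scratch")
print("/scratch has %.1f GB free" % (st.f_bavail * st.f_frsize / 1024.0 ** 3))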
NOTUR 2005 conference, http://www.notur.no/notur2005
The 5th annual gathering on High Performance Computing in Norway will be held in Trondheim, May 30-31, 2005. Please see http://www.notur.no/notur2005 for details.
Scheduled downtime on fimm
Fimm will be down Monday May 9th, 08:00-12:00, for firmware upgrades on the SATABlade disk solution, and possibly other minor changes. This is to fix the bug that triggered the disk crashes on March 30th.
http://www.parallaw.uib.no/syslog/56
Linux cluster ‘fire’ reinstalled
Fire is now back online. All nodes have been re-installed with the Rocks Linux cluster distribution. Unfortunately, the installation took a bit more time than expected because the front-end node for some reason refused to install the Rocks distribution. Everything worked well when we used node 5 as the front-end node.
Total downtime: 52 hours, 30 minutes
Disks on fimm failed
2 out of 3 RAID arrays on fimm have failed, so /home/fimm and /work* are gone at the moment. This is a major fault, and it may take some time before it's fixed.
Will update this entry when I know more.
Wed Mar 30 04:32 Multiple disks and raid-controllers failed on two separate storage units.
10:15 Started restore of /home/fimm from backup, just in case we're unable to recover the filesystems on disk.
10:35 Got confirmation from Nexsan support.
13:20 Chatted with Nexsan support. They'll call me back ASAP.
15:43 Called Nexsan again... Where's the support?
16:23 Got a procedure to reset the drives from the serial menu. This seems to make the system functional again. Haven't tested accessing the volumes from Linux yet.
18:39 Tried accessing the volumes from Linux, and noticed that the third SATABlade has now also failed. Won't be able to reset this one before tomorrow morning. Hope Nexsan has some idea by then as to what has triggered this problem.
Thu Mar 31 11:50 All disks and filesystems are up! Still got no idea why this error occurred, so we might have to take the filesystems down again if Nexsan engineering has a firmware upgrade that fixes the problem.
Total downtime: 31 hours, 30 minutes
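For reference, the "accessing the volumes from Linux" step in the timeline above can be as simple as checking that the expected block devices are visible to the kernel again. The sketch below is only an illustration; the device names are placeholders, since the actual /dev names of the SATABlade volumes aren't recorded in this log.

#!/usr/bin/env python
# Hypothetical check: are the expected block devices visible to the kernel again?
EXPECTED = ["sdb", "sdc", "sdd"]  # placeholder names for the SATABlade volumes

def visible_devices():
    # /proc/partitions: two header lines, then "major minor #blocks name".
    names = set()
    for line in open("/proc/partitions").readlines()[2:]:
        fields = line.split()
        if fields:
            names.add(fields[-1])
    return names

seen = visible_devices()
for dev in EXPECTED:
    if dev in seen:
        print("/dev/%s is visible" % dev)
    else:
        print("/dev/%s is MISSING" % dev)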
Linux cluster “FIRE” is being reinstalled
The Linux cluster fire.bccs.uib.no will be down Tuesday/Wednesday March 29/30. All nodes will be re-installed with the Rocks Linux cluster distribution. This is a major system upgrade, so most software will have to be rebuilt to run on fire after the upgrade.
/home/parallab and /work failure, regatta rebooted
An important network switch just failed and took down the GPFS filesystems on TRE. We will borrow a new switch from the IT department ASAP.
09:50
New switch in place. Rebooting the nodes to get everything back up in shape.
10:12
Everything on node TRE is up. Rebooting node TO.
10:26
TO is all up. Rebooting node EN.
10:48
All nodes are up. /migrate and /net/bcmhsm is also resolved.
Total downtime:
09:10-10:48 = 1:38 on en, to, tre and fire.
Fimm was mostly unhurt; only jobs accessing /home/parallab were affected.
NFS problems on tre, fire, fimm
There's a hang on an NFS server causing problems on fire (all down), fimm (some hanging nodes) and tre (/net/bcmhsm and /migrate are hanging). We are working on resolving this.
Will update this log whenever there's progress.
Crash on regatta node EN
The regatta node EN crashed around 08:00 this morning. This caused filesystems to hang on fimm, and also made /work disappear from TO and TRE for a couple of hours.
Everything should be back up again now.
Downtime: ~4 hours on EN.
-------------
Update 20050121: The crash was caused by an uncorrectable memory error.
-------------
Update 20050124: IBM wants to replace a 32 GB memory module. They also want to upgrade several firmware versions, so we should schedule a stop on all nodes soon.