The regatta node EN hung from ca. 13:00 to 15:10. The cause is unknown, but possibly excessive paging or memory use: the node answered ping but did not give a login prompt within a reasonable time. The node was restarted.
Maintenance summary (fimm)
fimm was down Aug. 3 from 08:00 to 12:45 for scheduled maintenance.
The kernel and GPFS updates, the switch firmware update and the SATABlade (disk) firmware update were completed.
Scheduled downtime on fimm
fimm will be down Wednesday Aug. 3, 08:00-14:00, for kernel upgrades, firmware updates on the switches, a GPFS update and some minor fixes.
Kernel panic on frontend for fimm
The fimm frontend had a kernel panic and was unavailable from ca. 15:13 to 15:39. Jobs continued running on the nodes.
Totalview and Matlab upgrade
Totalview upgraded to version 7.0 on tre and fimm (version 6.7 still available)
Matlab upgraded to R14sp2 (7.0.4) on fimm and fire (with updated Bioinformatics toolbox-2.1 and Simulink-6.2.1)
Tape robot and /migrate filesystem down
A failure on the robot is being fixed. All backup operations (backup and restore) as well as the /migrate filesystem on tre, to and en are unavailable.
Update, 15:09: Some parts need to be replaced; they should arrive tomorrow.
Update, 2005-07-12 16:54: Replaced a cable in the robot. Backup/restore and /migrate are available again.
fimm filesystems down
The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.
http://www.parallaw.uib.no/syslog/56
We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed ASAP.
Downtime started Monday June 27 00:42:51.
fimm got back on-line at 11:32:00
Downtime: approx. 10 hours 50 minutes
Firmware on the SATABlades upgraded to 'firmware 9037'. This will hopefully fix the 'failing disks' problem.
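The downtime figure above can be checked from the two timestamps. The following is only a sketch: the year (2005) and the assumption that fimm came back on-line on the same day are not stated in this entry, and the variable names are illustrative.

```python
from datetime import datetime

# Assumption: year 2005, and recovery on the same day as the outage began.
start = datetime(2005, 6, 27, 0, 42, 51)   # downtime started
end = datetime(2005, 6, 27, 11, 32, 0)     # fimm back on-line

downtime = end - start

# Round to whole minutes, then split into hours and minutes.
total_minutes = round(downtime.total_seconds() / 60)
hours, minutes = divmod(total_minutes, 60)

print(f"Downtime: {hours} hours {minutes} minutes")
```

Under these assumptions the exact interval is 10 hours 49 minutes 9 seconds, so the logged figure of 10 hours 50 minutes is this value rounded up.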
Maintenance summary
Regatta nodes TO and TRE had downtime from 08:00 to 12:45
for a firmware update.
Regatta node EN had downtime from 08:00 to 16:00
for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (unclear why).
Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.
Scheduled downtime on TRE+FIRE
The regatta cluster TRE will be down Tuesday June 14 08:00-14:00 for firmware upgrades, and replacement of a failed memory module on one of the nodes. Running jobs will be killed, and will have to be resubmitted after the maintenance stop.
The Linux cluster FIRE will also be down during this period, because it depends on the regatta as its file server.
Switch problems on fimm
12 ports on one of the switches in the cluster stopped working at 02:00 last night, so we lost connection to 12 of the nodes for ~7 hours.
Affected nodes:
compute-0-18 compute-0-16 compute-0-11 compute-0-8 compute-0-7 compute-0-6 compute-0-5 compute-0-4 compute-0-3 compute-0-2 compute-0-1 compute-0-0
To resolve the problem, the failing switch had to be rebooted. This led to a short (~30s) failure/unmount of the /work* and /home/fimm filesystems on all nodes. It is uncertain how this affected running jobs; most seem to have handled it without problems.
