The regatta node EN hung from ca. 13:00 to 15:10. The reason is unknown; it was possibly caused by excessive paging / memory use, as the node responded to ping but didn't give a login prompt within a reasonable time. The node was restarted.
fimm was down Aug. 3. from 08:00 to 12:45 for scheduled maintenance.
Kernel and gpfs update, switch firmware update and satablade (disk) firmware update completed.
Totalview upgraded to version 7.0 on tre and fimm (version 6.7 still available)
Matlab upgraded to R14sp2 (7.0.4) on fimm and fire (with updated Bioinformatics toolbox-2.1 and Simulink-6.2.1)
Regatta nodes TO and TRE had downtime from 08:00 to 12:45
for a firmware update.
Regatta node EN had downtime from 08:00 to 16:00
for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (unclear why).
Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.
The regatta cluster TRE will be down Tuesday June 14 08:00-14:00 for firmware upgrades, and replacement of a failed memory module on one of the nodes. Running jobs will be killed, and will have to be resubmitted after the maintenance stop.
The Linux cluster FIRE will also be down during this period, because it depends on the regatta as its file server.
To resolve the problem, the failing switch had to be rebooted. This led to a short (~30s) failure/unmount of the /work* and /home/fimm filesystems on all nodes. It is uncertain how this affected running jobs. Most seem to have handled it without problems.