Regatta node EN hang

lsz075 • August 8, 2005

The regatta node EN hung from ca. 13:00 to 15:10. The reason is unknown; it was possibly caused by excessive paging / memory use, as the node responded to ping but didn't give a login prompt within a reasonable time. The node was restarted.

Maintenance summary (fimm)

lsz075 • August 3, 2005

fimm was down Aug. 3 from 08:00 to 12:45 for scheduled maintenance.
Kernel and GPFS update, switch firmware update and SATABlade (disk) firmware update completed.

Scheduled downtime on fimm

lsz075 • July 25, 2005

Fimm will be down Wednesday Aug. 3 08:00-14:00 for kernel upgrades, firmware update on switches, GPFS update and some minor fixes.

Kernel panic on frontend for fimm

lsz075 • July 18, 2005

The fimm frontend had a kernel panic and was unavailable from ca. 15:13 to 15:39. Jobs continued running on the nodes.

Totalview and Matlab upgrade

lsz075 • July 15, 2005

Totalview upgraded to version 7.0 on tre and fimm (version 6.7 still available)
Matlab upgraded to R14sp2 (7.0.4) on fimm and fire (with updated Bioinformatics toolbox-2.1 and Simulink-6.2.1)

Tape robot and /migrate filesystem down

lsz075 • July 11, 2005

A failure is being fixed on the robot. All backup operations (backup and restore) as well as the /migrate filesystem on tre, to and en are unavailable.

Update, 15:09: Some parts need to be replaced; they should arrive tomorrow.

Update, 2005-07-12 16:54: Replaced a cable in the robot. Backup/restore and /migrate are available again.

fimm filesystems down

lsz075 • June 27, 2005

The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.

http://www.parallaw.uib.no/syslog/56

We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed ASAP.

Downtime started Monday June 27 00:42:51.

fimm got back on-line at 11:32:00
Downtime: 10 hours 50 minutes

Firmware on SATABlades upgraded to 'firmware 9037'. This will hopefully fix this 'failing disks' problem.
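
The downtime figure in this entry can be checked from the two timestamps. A minimal sketch, assuming fimm came back on-line the same day (June 27, 2005) as the downtime started; the variable names are only for illustration:

```python
from datetime import datetime

# Timestamps from this entry. Assumption: both are on June 27, 2005.
start = datetime(2005, 6, 27, 0, 42, 51)   # downtime started
end = datetime(2005, 6, 27, 11, 32, 0)     # fimm back on-line

downtime = end - start
print(downtime)  # 10:49:09
```

The exact difference is 10 hours 49 minutes 9 seconds, which the entry reports rounded to 10 hours 50 minutes.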

Maintenance summary

lsz075 • June 14, 2005

Regatta nodes TO and TRE had downtime from 08:00 to 12:45
for update of firmware.

Regatta node EN had downtime from 08:00 to 16:00
for update of firmware and change of 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (unclear why).

Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.

Scheduled downtime on TRE+FIRE

lsz075 • June 7, 2005

The regatta cluster TRE will be down Tuesday June 14 08:00-14:00 for firmware upgrades, and replacement of a failed memory module on one of the nodes. Running jobs will be killed, and will have to be resubmitted after the maintenance stop.

The linux cluster FIRE will also be down during this period, because it depends on the regatta as file server.

Switch-problems on fimm

lsz075 • May 13, 2005

12 ports on one of the switches in the cluster stopped working at 02:00 this morning, so we lost connection to 12 of the nodes for ~7 hours.

Affected nodes:

compute-0-18 compute-0-16 compute-0-11 compute-0-8 compute-0-7 compute-0-6 compute-0-5 compute-0-4 compute-0-3 compute-0-2 compute-0-1 compute-0-0

To resolve the problem, the failing switch had to be rebooted. This led to a short (~30s) failure/unmount of the /work* and /home/fimm filesystems on all nodes. It is uncertain how this affected running jobs; most seem to have handled it without problems.

HPC Syslog

Log of changes and events on UiB's HPC systems