Downtime

Backup and HSM problems

lsz075 • August 26, 2005

The Tivoli Storage Manager database recovery log filled up, after which the server could no longer process backup or HSM requests. The problem was noticed at about 09:30 and resolved by 10:20.

Fimm frontend hang

lsz075 • August 15, 2005

The fimm frontend was unresponsive from 19:56 to 20:40 due to excessive memory usage by an interactive user process, which caused a swap storm and OOM kills. The frontend was rebooted.

Tape robot got new gripper 1

lsz075 • August 11, 2005

The tape robot was offline for 30 minutes while a faulty gripper was replaced.

Regatta node EN hang

lsz075 • August 8, 2005

The regatta node EN hung from ca. 13:00 to 15:10. The reason is unknown; it was possibly caused by excessive paging / memory use, as the node answered ping but did not give a login prompt within a reasonable time. The node was restarted.

Maintenance summary (fimm)

lsz075 • August 3, 2005

fimm was down Aug. 3. from 08:00 to 12:45 for scheduled maintenance.
Kernel and gpfs update, switch firmware update and satablade (disk) firmware update completed.

Scheduled downtime on fimm

lsz075 • July 25, 2005

Fimm will be down Wednesday Aug. 3. 08:00-14:00 for kernel upgrades, firmware update on switches, gpfs update and some minor fixes.

Kernel panic on frontend for fimm

lsz075 • July 18, 2005

The fimm frontend had a kernel panic and was unavailable from ca. 15:13 to 15:39. Jobs continued running on the nodes.

Tape robot and /migrate filesystem down

lsz075 • July 11, 2005

A failure is being fixed on the robot. All backup operations (backup and restore) as well as the /migrate filesystem on tre, to and en are unavailable.

Update, 15:09: Some parts need to be replaced; they should arrive tomorrow.

Update, 2005-07-12 16:54: Replaced a cable in the robot. Backup/restore and /migrate are available again.

fimm filesystems down

lsz075 • June 27, 2005

The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.

http://www.parallaw.uib.no/syslog/56

We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed ASAP.

Downtime started Monday June 27 00:42:51.

fimm got back on-line at 11:32:00
Downtime: 10 hours 50 minutes
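
The downtime figure above can be checked with a short calculation. This is a minimal sketch, assuming both timestamps fall on June 27, 2005, and assuming the duration is rounded up to the next whole minute (the exact difference is 10 hours 49 minutes 9 seconds); the variable names are illustrative only.

```python
import math
from datetime import datetime

# Timestamps taken from the log entry; the same-day date is an assumption.
start = datetime(2005, 6, 27, 0, 42, 51)   # downtime started
end = datetime(2005, 6, 27, 11, 32, 0)     # fimm back on-line

downtime = end - start

# Round up to whole minutes (assumed rounding convention).
total_minutes = math.ceil(downtime.total_seconds() / 60)
hours, minutes = divmod(total_minutes, 60)

print(f"Downtime: {hours} hours {minutes} minutes")
```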

Firmware on SATABlades upgraded to 'firmware 9037'. This will hopefully fix this 'failing disks' problem.

Maintenance summary

lsz075 • June 14, 2005

Regatta nodes TO and TRE had downtime from 08:00 to 12:45
for a firmware update.

Regatta node EN had downtime from 08:00 to 16:00
for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (unclear why).

Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.

HPC Syslog

Log of changes and events on UiB's HPC systems