The Tivoli Storage Manager database recovery log filled up, after which TSM could no longer process backup or HSM requests. The problem was noticed at about 09:30 and resolved by 10:20.
Downtime
Fimm frontend hang
The fimm frontend was non-responsive from 19:56 to 20:40 due to excessive memory usage by an interactive user process, which caused a swap storm and OOM kills. The frontend was rebooted.
Tape robot got new gripper 1
The tape robot was offline for 30 minutes while a faulty gripper was replaced.
Regatta node EN hang
The regatta node EN hung from ca. 13:00 to 15:10. The reason is unknown; it was possibly caused by excessive paging or memory use, as the node responded to ping but did not give a login prompt within a reasonable time. The node was restarted.
Maintenance summary (fimm)
fimm was down Aug. 3 from 08:00 to 12:45 for scheduled maintenance.
The kernel and GPFS updates, the switch firmware update and the SATABlade (disk) firmware update were completed.
Scheduled downtime on fimm
Fimm will be down Wednesday Aug. 3 from 08:00 to 14:00 for kernel upgrades, firmware updates on the switches, a GPFS update and some minor fixes.
Kernel panic on frontend for fimm
The fimm frontend had a kernel panic and was unavailable from ca. 15:13 to 15:39. Jobs continued running on the nodes.
Tape robot and /migrate filesystem down
A failure is being fixed on the robot. All backup operations (backup and restore) as well as the /migrate filesystem on tre, to and en are unavailable.
Update, 15:09: Some parts need to be replaced; they should arrive tomorrow.
Update, 2005-07-12 16:54: Replaced a cable in the robot. Backup/restore and /migrate are available again.
fimm filesystems down
The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.
http://www.parallaw.uib.no/syslog/56
We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed ASAP.
Downtime started Monday June 27 00:42:51.
fimm came back online at 11:32:00.
Downtime: approximately 10 hours 50 minutes
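For reference, the downtime figure can be checked with a short calculation. This is only a sketch: the function name `outage_duration` is made up for illustration, and the year 2005 is an assumption (June 27 fell on a Monday that year, and other entries in this log are dated 2005). The exact difference is 10:49:09, which the entry rounds to 10 hours 50 minutes.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the time elapsed between two timestamps (hypothetical helper)."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    return ended - started


# Year 2005 is assumed; the log entry itself only gives "Monday June 27".
duration = outage_duration("2005-06-27 00:42:51", "2005-06-27 11:32:00")
print(duration)
```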
Firmware on SATABlades upgraded to 'firmware 9037'. This will hopefully fix this 'failing disks' problem.
Maintenance summary
Regatta nodes TO and TRE had downtime from 08:00 to 12:45
for a firmware update.
Regatta node EN had downtime from 08:00 to 16:00
for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (unclear why).
Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.
