Because of many recoverable errors, pdisk34 was replaced.
Downtime
Crash on node EN
Node EN got stuck at about 18:00 today and was back up again at 20:00.
For the first hour there were also some hangs of the /work filesystem on TO and TRE, caused by the hang on EN.
Will add more to this log later.
Maintenance summary
Fast Write Cache batteries on ssa0, ssa1 and ssa2 on node TO were replaced while a plumber worked on the cooling water. No problems.
Downtime on the regatta and Linux cluster:
20040701 08:00-10:10 = 2 hours, 10 minutes
Power outage – machines down
On Friday, May 28 at 11:45 both tre.ii.uib.no and fire.ii.uib.no crashed due to a power outage (a mistake by an electrician in a neighbouring machine room triggered an emergency power stop).
tre.ii.uib.no came back on-line at 17:11; downtime 5 hours, 26 minutes.
After the reboot, 6 nodes on the fire.ii.uib.no cluster were down, and the queueing system was down as well. On Monday, May 31 at 20:50 fire.ii.uib.no came back on-line (3 nodes were still down); downtime 81 hours, 5 minutes.
Power outage – machines down
There was a power outage in the machine room, and since the UPS was not working, all machines crashed.
Downtime:
20040420 18:47-20:19 = 1 hour, 32 minutes
norgrid – SCSI backplane replaced
The norgrid.bccs.no node was having trouble scanning the
SCSI buses. This seems to have been a problem with the SCSI
backplane. IBM replaced it today, and the problem disappeared.
disk crash on backup server – bcmhsm unavailable
A disk crashed on the backup and file server for /net/bcmhsm,
so the /net/bcmhsm filesystem was unavailable for a couple
of minutes while the server crashed and rebooted.
Power outage
The power failed in large parts of Bergen, and the UPS
did not last through the outage.
Downtime: 20040213 14:01 -> 15:12 = 1 hour 11 minutes on all nodes.
UPS battery undervoltage
The UPS failed at some point during the last couple of days.
The display said "ALARM: BATTERY UV", so it looks like the
battery voltage was too low (UV = undervoltage), and
therefore the UPS went into bypass mode.
I switched it back to battery-backed power, and it
immediately went into normal operation.
node14 and node15 rebooted
The networking failed on node14 and node15 of the Linux
cluster. Both nodes were complaining about:
eth0: card reports no resources.
Not sure if this is a hardware or software bug, but it has
happened once before. This time we lost 4 of flikka's jobs. They were most likely putting a heavy load on NFS; maybe that triggered the crash?
Will upgrade all nodes to the latest kernel from Red Hat to
see if this fixes the problem.
Node14 downtime: 20:25 20040123 - 07:45 20040126 = 2 days,
11 hours, 20 minutes
Node15 downtime: 12:26 20040124 - 07:45 20040126 = 1 day,
19 hours, 19 minutes
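
For reference, the downtime figures in these entries are simply the elapsed time between the start and end timestamps in the log. A minimal sketch in Python of how they can be computed and checked, using the node14 and node15 timestamps above (the helper name "downtime" is just for illustration):

from datetime import datetime

def downtime(start, end):
    # Elapsed time between two "YYYYMMDD HH:MM" timestamps from the log.
    fmt = "%Y%m%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    hours, rest = divmod(delta.seconds, 3600)
    return "%d days, %d hours, %d minutes" % (delta.days, hours, rest // 60)

# node14: down from 20:25 on 20040123 until 07:45 on 20040126
print(downtime("20040123 20:25", "20040126 07:45"))   # 2 days, 11 hours, 20 minutes
# node15: down from 12:26 on 20040124 until 07:45 on 20040126
print(downtime("20040124 12:26", "20040126 07:45"))   # 1 days, 19 hours, 19 minutes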
