Node EN got stuck at about 18:00 today and was back up again at 20:00.
For the first hour there were also some hangs on the /work filesystem
on TO and TRE, caused by the hang on EN.
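For reference, a minimal sketch of how a hung /work mount could be
probed from a client node such as TO or TRE. A stat() against a hung
NFS server blocks indefinitely, so the stat() is done in a forked
child that the parent kills after a timeout. The mount point and the
timeout value are assumptions for illustration, not part of this
report.

#!/usr/bin/env python3
import os
import signal
import sys
import time

MOUNT = "/work"   # assumed NFS mount point on the client nodes
TIMEOUT = 10      # seconds before the mount is considered hung

def probe(path):
    """Return True if stat() on path answers within TIMEOUT seconds."""
    pid = os.fork()
    if pid == 0:
        # Child: this stat() blocks for as long as the NFS server hangs.
        try:
            os.stat(path)
            os._exit(0)
        except OSError:
            os._exit(1)
    # Parent: poll for the child, kill it if it does not answer in time.
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        time.sleep(0.5)
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return False

if __name__ == "__main__":
    if probe(MOUNT):
        print("%s responds" % MOUNT)
    else:
        print("%s appears hung" % MOUNT)
        sys.exit(1)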
On Friday, May 28 at 11:45 both tre.ii.uib.no and fire.ii.uib.no
crashed due to a power outage (a mistake by an electrician in the
neighbouring machine room triggered an emergency power stop).
tre.ii.uib.no got back on-line at 17:11; downtime: 5 hours, 26 minutes.
After the reboot, 6 nodes on the fire.ii.uib.no cluster were down, and
the queueing system was down as well. On Monday, May 31 at 20:50
fire.ii.uib.no got back on-line (3 nodes were still down); downtime:
81 hours, 5 minutes.
The norgrid.bccs.no node was having trouble scanning the
SCSI buses. This seems to have been a problem with the SCSI
backplane; IBM replaced it today, and the problem disappeared.
A disk crashed on the backup and fileserver for /net/bcmhsm,
so the /net/bcmhsm filesystem was unavailable for a couple
of minutes while the server crashed and rebooted.
The UPS failed sometime during the last couple of days.
The display said "ALARM: BATTERY UV", so it looks like the
batteries had too low a voltage (UV = undervoltage), and the
UPS therefore went into bypass mode.
I switched it back to battery-backed power, and it
immediately went into normal operation.
The networking failed on node14 and node15 of the Linux
cluster. Both nodes were complaining about:
eth0: card reports no resources.
Not sure whether this is a hardware or software bug, but it has
happened once before. This time we lost 4 of flikka's jobs. They
were most likely working hard against NFS; maybe that triggered
the crash?
Will upgrade all nodes to the latest kernel from Red Hat to
see if this fixes the problem.
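Once the nodes are running the new kernel, something like the sketch
below could confirm the running kernel version on each node and count
recurrences of the eth0 error in syslog. The node names and the error
string are taken from this entry; the log path and passwordless ssh
from the admin host are assumptions.

#!/usr/bin/env python3
import subprocess

NODES = ["node14", "node15"]             # the two nodes from this entry
ERROR = "card reports no resources"      # the eth0 driver message above
LOG = "/var/log/messages"                # assumed syslog location

def on_node(node, command):
    """Run a command on a node over ssh and return its stdout."""
    result = subprocess.run(["ssh", node, command],
                            capture_output=True, text=True)
    return result.stdout.strip()

for node in NODES:
    kernel = on_node(node, "uname -r")
    # grep -c prints the match count; its exit status is ignored here,
    # so a node with no matches simply reports 0.
    hits = on_node(node, "grep -c '%s' %s" % (ERROR, LOG))
    print("%s: running kernel %s, error seen %s time(s) in %s"
          % (node, kernel, hits, LOG))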