Downtime

On Friday, May 28 at 11:45, both tre.ii.uib.no and fire.ii.uib.no crashed due to a power outage (an electrician working in the neighbouring machine room mistakenly triggered an emergency power stop).

tre.ii.uib.no got back on-line at 17:11. Downtime: 5 hours, 26 minutes.

After the reboot, 6 nodes on the cluster fire.ii.uib.no were down, and the queueing system was down as well. On Monday, May 31 at 20:50, fire.ii.uib.no got back on-line (3 nodes were still down). Downtime: 81 hours, 5 minutes.

The networking failed on node14 and node15 of the Linux
cluster. Both nodes were complaining about::

    eth0: card reports no resources.

Not sure if this is a hardware or software bug, but it has
happened once before. This time we lost 4 of flikka's jobs.
They were most likely working hard against NFS; maybe this
triggered the crash? Will upgrade all nodes to the latest
kernel from Red Hat to see if this fixes the problem.

Node14 downtime: 20:25 20040123 - 07:45 20040126 = 2 days,
11 hours, 20 minutes

Node15 downtime: 12:26 20040124 - 07:45 20040126 = 1 day,
19 hours, 19 minutes
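
The downtime figures in this log are the difference between two timestamps written as HH:MM YYYYMMDD. A minimal sketch of how such a figure could be computed, assuming Python's standard datetime module; the helper names downtime and describe are hypothetical and not part of any tooling used on the cluster:

```python
from datetime import datetime, timedelta

# Timestamp layout used in the node entries, e.g. "20:25 20040123".
STAMP_FORMAT = "%H:%M %Y%m%d"


def downtime(start: str, end: str) -> timedelta:
    """Return the elapsed time between two log timestamps."""
    began = datetime.strptime(start, STAMP_FORMAT)
    ended = datetime.strptime(end, STAMP_FORMAT)
    return ended - began


def _unit(count: int, name: str) -> str:
    """Format a count with a singular or plural unit name."""
    return f"{count} {name}" if count == 1 else f"{count} {name}s"


def describe(delta: timedelta) -> str:
    """Render a timedelta as 'D days, H hours, M minutes'."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return ", ".join(
        [_unit(days, "day"), _unit(hours, "hour"), _unit(minutes, "minute")]
    )


# Start and end times taken from the node entries in this log.
outages = [
    ("Node14", "20:25 20040123", "07:45 20040126"),
    ("Node15", "12:26 20040124", "07:45 20040126"),
]

for node, start, end in outages:
    print(f"{node} downtime: {describe(downtime(start, end))}")
```

Computing the figures from the recorded start and end times, rather than by hand, makes it easy to recheck an entry when a timestamp is corrected later.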