Node EN got stuck at about 18:00 today and was back up again at 20:00.
For the first hour there were also some hangs on the /work filesystem
on TO and TRE, caused by the hang on EN.
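For reference, a minimal sketch of how a hung /work mount could be
probed from a client node such as TO or TRE. A stat() against a hung
NFS server blocks indefinitely, so the stat() is done in a forked
child that the parent kills after a timeout. The mount point and the
timeout value are assumptions for illustration, not part of this
report.

#!/usr/bin/env python3
import os
import signal
import sys
import time

MOUNT = "/work"   # assumed NFS mount point on the client nodes
TIMEOUT = 10      # seconds before the mount is considered hung

def probe(path):
    """Return True if stat() on path answers within TIMEOUT seconds."""
    pid = os.fork()
    if pid == 0:
        # Child: this stat() blocks for as long as the NFS server hangs.
        try:
            os.stat(path)
            os._exit(0)
        except OSError:
            os._exit(1)
    # Parent: poll for the child, kill it if it does not answer in time.
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        time.sleep(0.5)
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return False

if __name__ == "__main__":
    if probe(MOUNT):
        print("%s responds" % MOUNT)
    else:
        print("%s appears hung" % MOUNT)
        sys.exit(1)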
On Friday, May 28 at 11:45 both tre.ii.uib.no and fire.ii.uib.no
crashed due to a power outage (a mistake by an electrician in the
neighbouring machine room triggered an emergency power stop).
tre.ii.uib.no got back on-line at 17:11; downtime: 5 hours, 26 minutes.
After the reboot, 6 nodes on the fire.ii.uib.no cluster were down, and
the queueing system was down as well. On Monday, May 31 at 20:50
fire.ii.uib.no got back on-line (3 nodes were still down); downtime:
81 hours, 5 minutes.
The norgrid.bccs.no node was having trouble scanning the
SCSI buses. This seems to have been a problem with the SCSI
backplane; IBM replaced it today, and the problem disappeared.
A disk crashed on the backup and fileserver for /net/bcmhsm,
so the /net/bcmhsm filesystem was unavailable for a couple
of minutes while the server crashed and rebooted.
The UPS failed sometime during the last couple of days.
The display said "ALARM: BATTERY UV", so it looks like the
batteries had too low a voltage (UV = undervoltage), and the
UPS therefore went into bypass mode.
I switched it back to battery-backed power, and it
immediately went into normal operation.
The networking failed on node14 and node15 of the Linux
cluster. Both nodes were complaining about:
eth0: card reports no resources.
Not sure whether this is a hardware or software bug, but it has
happened once before. This time we lost 4 of flikka's jobs. They
were most likely working hard against NFS; maybe that triggered
the crash?
Will upgrade all nodes to the latest kernel from Red Hat to
see if this fixes the problem.
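Once the nodes are running the new kernel, something like the sketch
below could confirm the running kernel version on each node and count
recurrences of the eth0 error in syslog. The node names and the error
string are taken from this entry; the log path and passwordless ssh
from the admin host are assumptions.

#!/usr/bin/env python3
import subprocess

NODES = ["node14", "node15"]             # the two nodes from this entry
ERROR = "card reports no resources"      # the eth0 driver message above
LOG = "/var/log/messages"                # assumed syslog location

def on_node(node, command):
    """Run a command on a node over ssh and return its stdout."""
    result = subprocess.run(["ssh", node, command],
                            capture_output=True, text=True)
    return result.stdout.strip()

for node in NODES:
    kernel = on_node(node, "uname -r")
    # grep -c prints the match count; its exit status is ignored here,
    # so a node with no matches simply reports 0.
    hits = on_node(node, "grep -c '%s' %s" % (ERROR, LOG))
    print("%s: running kernel %s, error seen %s time(s) in %s"
          % (node, kernel, hits, LOG))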