On Friday, May 28 at 11:45, both tre.ii.uib.no and fire.ii.uib.no crashed due to a power outage (an electrician working in the neighbouring machine room mistakenly triggered an emergency power stop).
tre.ii.uib.no came back online at 17:11. Downtime: 5 hours, 26 minutes.
After the reboot, 6 nodes on the fire.ii.uib.no cluster were down, and the queueing system was down as well. fire.ii.uib.no came back online on Monday, May 31 at 20:50 (3 nodes were still down). Downtime: 81 hours, 5 minutes.
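The downtime figures above can be re-checked with a few lines of Python. This is only a minimal sketch: the helper name `downtime` is made up for illustration, and the year 2004 is an assumption (the entry gives only weekday, day and month; May 28 did fall on a Friday in 2004).

```python
from datetime import datetime

def downtime(start: str, end: str) -> str:
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours, {minutes} minutes"

# Assumption: the year is 2004; the log entry itself does not state it.
print(downtime("2004-05-28 11:45", "2004-05-28 17:11"))  # tre.ii.uib.no
print(downtime("2004-05-28 11:45", "2004-05-31 20:50"))  # fire.ii.uib.no
```

Using full timestamps rather than clock times alone keeps the calculation correct for outages that span several days, as the fire.ii.uib.no one did.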
Hardware
UPS batteries replaced
The UPS batteries and the DC capacitors on the UPS have been replaced. We should now have a fully functional UPS again, giving us 15-30 minutes of battery-backed power.
Power outage – machines down
There was a power outage in the machine room, and no working
UPS, so all machines crashed.
Downtime:
20040420 18:47-20:19 = 1 hour, 32 minutes
norgrid – SCSI backplane replaced
The norgrid.bccs.no node was having trouble scanning the
SCSI buses. It seems to have been a problem with the SCSI
backplane. IBM replaced it today, and the problem disappeared.
disk crash on backupserver – hsmbcm unavailable
A disk crashed on the backup and fileserver for /net/bcmhsm,
so the /net/bcmhsm filesystem was unavailable for a couple
of minutes while the server crashed and rebooted.
Replaced failing LTO drive
IBM replaced a failing LTO-1 drive in the tape robot.
Was: tape1/rmt3/serial=6811092091
Is now: tape1/rmt3/serial=6811147790
UPS battery undervoltage
The UPS failed at some point during the last couple of days.
The display said "ALARM: BATTERY UV", so it looks like the
battery voltage dropped too low (UV=UnderVoltage), and
the UPS therefore went into bypass mode.
I switched it back to battery-backed power, and it
immediately went into normal operation.
/migrate available again
The /migrate filesystem is now back online.
Downtime: 20040122 08:28-12:30 = 4 hours, 2 minutes.
Maintenance stop for /migrate
The /migrate filesystem will be unavailable on Thursday, January
22, because of maintenance on the tape storage system.
We will unmount /migrate around 08:00 Thursday morning, and
bring it back online as soon as we're finished, but it might
take all day. Any processes accessing /migrate this morning
will be killed.