At 08:30 this morning all compute nodes on fimm were shut down due to a cooling failure in the machine room.
Cooling is now back to normal, and we will bring all compute nodes back up within the next 20 minutes.
Downtime
Hexagon: down due to cooling failure
Hexagon was shut down at 06:30 due to a cooling failure in the building.
We are investigating.
Update: 08:00 We will perform some already planned maintenance while the machine is down.
Update: 13:05 The machine is back online. Maintenance included a disk-system firmware update and some Lustre configuration checks, as well as a couple of hardware replacements.
FIMM: file system crash
Yesterday, 18 November 2010, around 14:00 the GPFS file system on the fimm cluster crashed. We were replacing a switch, which should have been possible without taking down GPFS, but unfortunately the file system crashed.
The problem was resolved around 15:30 the same day; hopefully this will also fix the recurring GPFS file system crashes on fimm.
Sorry for the inconvenience.
FIMM: file system crash
We are still experiencing problems with our new 10 Gb internal network. Today around 18:30 the GPFS file system crashed, and all running jobs were killed.
We brought the file system back up at 21:15, and you can now submit jobs again.
Sorry for the inconvenience.
FIMM: file system crash
We are still experiencing problems with our new 10 Gb internal network. Yesterday around 21:30 the GPFS file system crashed, and all running jobs were killed.
We brought the file system back up at 22:15, but placed a reservation on most of the cluster. This morning the reservation was removed, and you can submit jobs again.
Sorry for the inconvenience.
Fimm: GPFS file system on fimm cluster crashed
The GPFS file system on the fimm cluster crashed and the file systems are unavailable. We are working on it; meanwhile, the login node is blocked for maintenance.
10:21 fimm.bccs.uib.no is back online and the user SSH block has been removed. We are investigating the issue.
12:17 We are still having problems with the GPFS file system. All user connections are blocked.
14:00 The issue is resolved, but we are running tests. The cluster will be accessible soon.
Hexagon: reboot, mds problems
At 20:40 we had to restart hexagon because of MDS failover issues.
Fimm: login node crashed during kernel update
The fimm login node crashed while we were performing a kernel update on it. We are working on the issue. All jobs submitted before the crash are unaffected. We will keep you updated, and sorry for the inconvenience.
16:38 The fimm login node is up again, with the new kernel installed and the GPFS file system updated. There were some unexpected problems while performing the update, but everything now works fine. Again, we are sorry for the inconvenience.
Hexagon: power spike causes power off
There was a power spike/drop in the building, causing hexagon to power off.
We are looking into it.
Update 16:30: The power spike also caused cooling issues, which means we have to keep the machine off until the cooling is fixed.
Update 21:00: The second cooling machine has been started again and hexagon is now running.
Hexagon: failed SeaStar in one module
A SeaStar failed in one module on hexagon and the machine went down. We are working on the problem.
Update: 08:45 The machine is up again.
Update: 10:15 There was a problem with ALPS after the reboot, so jobs could not start. It is now fixed.