Downtime

Hexagon was shut down due to a cooling failure in the building at 06:30.
We are investigating.

Update: 08:00 We will carry out some previously planned maintenance while the machine is down.

Update: 13:05 The machine is back online. Maintenance covered the disk-system firmware, some Lustre configuration checks, and a couple of hardware replacements.

Yesterday, 18th November 2010, around 14:00, the GPFS file system on the fimm cluster crashed. We were replacing a switch, which should have been possible without taking down the GPFS file system, but unfortunately the file system crashed.

The problem was resolved around 15:30 the same day. Hopefully this will fix the recurring GPFS file system crashes on fimm.

Sorry for the inconvenience.

We are still experiencing problems with our new 10GB internal
network. Yesterday around 21:30 the GPFS file system crashed, and all
running jobs were killed.

We brought the file system back up at 22:15, but placed a reservation
on most of the cluster. This morning the reservation was removed, and
you can submit your jobs again.

Sorry for the inconvenience.

The GPFS file system on the fimm cluster has crashed and the entire file system is unavailable. We are working on it; meanwhile, the login node is blocked for maintenance.


10:21 fimm.bccs.uib.no is back online and the user SSH block has been removed. We are investigating the issue.

12:17 We are still having problems with the GPFS file system. All user connections are blocked.

14:00 The issue is resolved, but we are still running tests. The cluster will be accessible soon.

The fimm login node crashed while we were performing a kernel update on it. We are working on the issue. Jobs submitted before the crash are not affected. We will keep you updated, and we are sorry for the inconvenience.

16:38 The fimm login node is up again. The new kernel is installed and the GPFS file system is updated. There were some unexpected problems while we were performing the update, but everything now works fine. Again, we are sorry for the inconvenience.

There was a power spike/drop in the building, which caused Hexagon to power off.

We are looking into it.

Update 16:30: The power spike also caused cooling issues. This means that we have to keep the machine off until the cooling can be fixed.

Update 21:00: The second cooling machine has been restarted and the machine is now running.