At 08:30 this morning all compute nodes on fimm were shut down due to a cooling failure in the machine room.
Cooling is now back to normal, and we will bring all compute nodes back up within the next 20 minutes.
Downtime
Hexagon: down due to cooling failure
Hexagon was shut down at 06:30 due to a cooling failure in the building.
We are investigating.
Update: 08:00 We will perform some already planned maintenance while the machine is down.
Update: 13:05 The machine is back online. Maintenance included a disk-system firmware update and some Lustre configuration checks, as well as a couple of hardware replacements.
FIMM: file system crash
Yesterday, 18 November 2010, around 14:00 the GPFS file system on the fimm cluster crashed. We were replacing a switch, which should have been possible without taking down GPFS, but unfortunately the file system crashed.
The problem was resolved around 15:30 the same day; hopefully this will also fix the recurring GPFS file system crashes on fimm.
Sorry for the inconvenience.
FIMM: file system crash
We are still experiencing problems with our new 10 Gb internal network. Today around 18:30 the GPFS file system crashed, and all running jobs were killed.
We brought the file system back up at 21:15, and you can now submit jobs again.
Sorry for the inconvenience.
FIMM: file system crash
We are still experiencing problems with our new 10 Gb internal network. Yesterday around 21:30 the GPFS file system crashed, and all running jobs were killed.
We brought the file system back up at 22:15, but placed a reservation on most of the cluster. This morning the reservation was removed, and you can submit jobs again.
Sorry for the inconvenience.
Fimm: GPFS file system on fimm cluster crashed
The GPFS file system on the fimm cluster crashed and the file systems are unavailable. We are working on it; meanwhile, the login node is blocked for maintenance.
10:21 fimm.bccs.uib.no is back online and the user SSH block has been removed. We are investigating the issue.
12:17 We are still having problems with the GPFS file system. All user connections are blocked.
14:00 The issue is resolved, but we are running tests. The cluster will be accessible soon.
Hexagon: reboot, mds problems
At 20:40 we had to restart hexagon because of MDS failover issues.
Fimm: login node crashed during kernel update
The fimm login node crashed while we were performing a kernel update on it. We are working on the issue. All jobs submitted before the crash are unaffected. We will keep you updated, and sorry for the inconvenience.
16:38 The fimm login node is up again, with the new kernel installed and the GPFS file system updated. There were some unexpected problems while performing the update, but everything now works fine. Again, we are sorry for the inconvenience.
Hexagon: power spike causes power off
There was a power spike/drop in the building, causing hexagon to power off.
We are looking into it.
Update 16:30: The power spike also caused cooling issues, which means we have to keep the machine off until the cooling is fixed.
Update 21:00: The second cooling machine has been started again and hexagon is now running.
Hexagon: failed SeaStar in one module
A SeaStar failed in one module on hexagon and the machine went down. We are working on the problem.
Update: 08:45 The machine is up again.
Update: 10:15 There was a problem with ALPS after the reboot, so jobs could not start. It is now fixed.