New power spike on Hexagon. 8 cabinets went down. We are working to bring the machine up ASAP.
Update: 15:33 Machine is up.
Hexagon had a failure on 4 cabinets and went down. We are working to restart the machine ASAP.
The possible cause is a power spike or power outage.
Update: 10:09 Hexagon is up.
The Fimm login node crashed because of a GPFS filesystem hang. The login node is up and running again. All open sessions were killed.
Due to excessive memory usage, compute-1-32 crashed last night. We are trying to reinstall the node, but are currently running into a hardware failure.
We are still working on it and hope compute-1-32 will be up tomorrow. We will keep you updated.
We are sorry for the inconvenience.
UPDATE 26th May 11:47 compute-1-32 is up and running.
One of the file servers serving the work and home file systems on the Fimm cluster crashed at 14:10. All jobs using the work file system crashed with a "Stale NFS file handle" error.
The file server has been rebooted, and the home and work file systems have been remounted on all compute nodes.
Failure in an HSN link; Hexagon is down. We are working on the problem.
Update: 20:50 Machine is back online.
Hexagon went down at 12:35 due to a power problem in cabinet 7.
Update: 15:10 The machine is up without cabinet 7. We will have downtime on Thursday May 06 at 12:00 to fix cabinet 7. The queue has a reservation in place, so only jobs that can complete before the maintenance (according to their requested walltime) will start.
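To check whether a job can still start under the reservation, compare its requested walltime with the time left before the maintenance. The following is only a minimal sketch of that arithmetic, not the scheduler's actual logic; the function name, the year, and the current time used here are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def fits_before_maintenance(now, walltime, maintenance_start):
    """Return True if a job started at `now` would end by `maintenance_start`.

    Hypothetical helper: it only compares start time plus requested
    walltime against the start of the maintenance reservation.
    """
    return now + walltime <= maintenance_start


# Maintenance on May 06 at 12:00 (the year is an assumption; the notice does not state it).
maintenance_start = datetime(2010, 5, 6, 12, 0)
# Assumed "current" time, chosen only for illustration.
now = datetime(2010, 5, 4, 15, 10)

short_job = fits_before_maintenance(now, timedelta(hours=24), maintenance_start)
long_job = fits_before_maintenance(now, timedelta(hours=48), maintenance_start)
print(short_job, long_job)
```

Under these assumed times, a job asking for 24 hours would end before the maintenance, while one asking for 48 hours would not and would wait until the reservation is over. Requesting a shorter walltime, where the job allows it, is the way to get scheduled before the downtime.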
To reconfigure the home file system setup on the Fimm cluster and fix the missing home folder issue on the compute nodes, we will have downtime for the whole Fimm cluster on 6th of April.
The whole Fimm cluster is reserved for maintenance from 11:00 on 6th of April. Newly submitted jobs that cannot finish before that time will not be started. Jobs that are already running and cannot finish before that time will be killed.
We will send more information about the new home file system configuration on the Fimm cluster and keep you updated on the maintenance.
If you have any questions, please contact hpc-support@hpc.uib.no or support-uib@notur.no.
We got an HSN (High Speed Network) link error between 2 cabinets and the machine crashed. We are working to bring the machine up.
Update: 17:30 Machine is now running again. Jobs that were running must be resubmitted.
The Fimm login node crashed at 12:46 today and all user sessions were killed. The login node is now up and running again.
We are investigating the possible cause.