New power spike on Hexagon. 8 cabinets went down. We are working to bring the machine up ASAP.
Update: 15:33 Machine is up
Downtime
Hexagon: failure on 4 cabinets
Hexagon had a failure on 4 cabinets and went down. We are working to start the machine ASAP.
The possible cause is a power spike or power outage.
Update: 10:09 Hexagon is up.
Fimm: login node crash
The Fimm login node crashed because the GPFS file system hung. The login node is up and running again. All open sessions were killed.
Fimm: bjerknode/Compute-1-32 is down
Due to excessive memory usage, compute-1-32 crashed last night. We are trying to reinstall the node, but are currently facing a hardware failure.
We are still working on it and hope that compute-1-32 will be up tomorrow. We will keep you updated.
We are sorry for the inconvenience.
UPDATE 26th May 11:47 compute-1-32 is up and running.
Fimm: file server crashed
One of the file servers serving the work and home file systems on the Fimm cluster crashed at 14:10. All jobs using the work file system crashed with a "Stale NFS file handle" error.
The file server has been rebooted, and the home and work file systems have been remounted on all compute nodes.
Hexagon: down, HSN link problem
A failure in an HSN link has brought Hexagon down. We are working on the problem.
Update: 20:50 Machine is back online
Hexagon: down due to cabinet power problem
Hexagon went down at 12:35 due to a power problem in cabinet 7.
Update: 15:10 The machine is up without cabinet 7. We will have downtime on Thursday, May 06 at 12:00 to fix cabinet 7. The queue has a reservation in place so that only jobs that can complete (according to their requested walltime) before the maintenance will start.
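To illustrate the reservation rule described above, here is a minimal sketch of the check it implies: a job may start only if its requested walltime ends before the maintenance begins. This is not the actual scheduler configuration. The function name `can_start`, the example walltimes, and the year in the dates are assumptions made for the example (the notice gives only the day and time).

```python
from datetime import datetime, timedelta


def can_start(now, walltime, maintenance_start):
    """Return True if a job started at `now` with the requested
    `walltime` would finish no later than `maintenance_start`."""
    return now + walltime <= maintenance_start


# Maintenance on May 06 at 12:00 (the year is an assumption for illustration).
maintenance = datetime(2010, 5, 6, 12, 0)
# Hypothetical submission time, two days before the maintenance.
now = datetime(2010, 5, 4, 15, 10)

short_job = timedelta(hours=24)  # ends before the maintenance, so it may start
long_job = timedelta(hours=72)   # would overlap the maintenance, so it waits

print(can_start(now, short_job, maintenance))
print(can_start(now, long_job, maintenance))
```

In practice this means that requesting a shorter, realistic walltime increases the chance that a job is started before the downtime.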
Fimm: downtime for whole cluster
To reconfigure the home file system setup on the Fimm cluster and to avoid the missing home folder issue on the compute nodes, we will have downtime for the whole Fimm cluster on the 6th of April.
The whole Fimm cluster is reserved for maintenance from 11:00 on the 6th of April. Newly submitted jobs that cannot finish before that time will not start. Jobs that are already running and cannot finish before that time will be killed.
We will provide more information about the new configuration of the home file system on the Fimm cluster and keep you updated on the maintenance.
If you have any questions, please contact hpc-support@hpc.uib.no or support-uib@notur.no.
Hexagon: crash of HSN, March 11th
We got an HSN (High Speed Network) link error between 2 cabinets and the machine crashed. We are working to bring the machine up.
Update: 17:30 Machine is now running again. Jobs which were running must be resubmitted.
Fimm: fimm login node crashed
The Fimm login node crashed at 12:46 today and all user sessions were killed. The login node is now up and running again.
We are investigating the possible cause.