A three-hour scheduled maintenance is planned for the storage serving /migrate and /bcmhsm.
It will take place on 28 March, starting at 13:00.
Hexagon: login5 OOM
login5 ran out of memory yesterday (27.02.2017) around 18:16 and took about 15 minutes to recover.
During this time the compute nodes were unable to contact the application scheduler running on login5 and some jobs might have crashed.
A typical error message for this case is: "aprun: Apid nnnnnnn: close of the compute node connection after app startup barrier".
We apologise for any inconvenience caused.
Hexagon: decommissioned end of June 2017
All local CPU quotas will cease after 01.04.2017.
Login will be closed after 30.06.2017, so please make sure that all your data is transferred before then. Please plan this well in advance so that we avoid overloading the filesystem.
Hexagon: login4 rebooted
login4 had to be rebooted because deadlocks prevented three OSTs of the /work filesystem from being mounted and three finished jobs from being released.
Jobs terminated by the reboot are: 1944657, 1944664, 1944208 and 1944204.
Hexagon: power issues
Four cabinets went down due to power issues caused by the storm. Storage controllers for /work-common are also affected.
Hexagon was started without the /work-common filesystem.
We are working to fix the issues with the filesystem controllers and bring the filesystem back into production as soon as possible.
Update 2016-12-27 14:50: The problems with the /work-common storage controllers have been mitigated and the filesystem is back online. Hexagon had to be rebooted today at 14:15. All systems are up and functional again.
Hexagon: power issues
Hexagon is down due to power and cooling issues until further notice.
We will bring the machine back up as soon as possible.
Update 14:40: Hexagon is up again.
Hexagon: HSN link problems
We are having problems with the high speed network on Hexagon. We are working on the problem.
Update 11:31: Hexagon is up again. We had to disable one of the compute nodes due to hardware issues.
Fimm cluster /home and /work filesystem crash
This morning the /fimm filesystem crashed on fimm.hpc.uib.no. This made the /fimm and /work filesystems inaccessible to users and caused the fimm login node to hang.
We were able to bring it back online quickly, and we are investigating the cause of the problem. All jobs that were running during the crash were killed.
We are sorry for the inconvenience.
Software Developer Course for the new HPC-system
UNINETT Sigma2 is organizing a Software Developer Course for the new HPC-system.
We are pleased to inform you that a second HPC course will be held this autumn in Trondheim on 30 November - 1 December.
Registration is open at https://response.questback.com/uninett/hpctrainingseminar
Please refer to the announcement on www.sigma2.no for further details.
Hexagon: rebooted because of important security update
We need to carry out an emergency reboot after 13:00 today. Please save your work and log out from Hexagon by 13:00.
More information to come later.
Update 13:59 2016-11-04: Access to the system has been stopped and jobs have been terminated. Please accept our apologies for the inconvenience caused by the system reboot.
Update 16:05 2016-11-04: Security patches have been applied. Hexagon is back online again.