Both operating system disks in Grunch failed within a short time of each other, making the system inoperable. We are trying to recover from the failure as soon as possible.
Update 14:00 06.10.2017: The Grunch server is up again. Both OS disks have been replaced and the server has been reinstalled.
Downtime
Hexagon: power blink
Lightning has again crashed the high speed network on Hexagon and several nodes.
Hexagon has been up again since 11:15.
Hexagon: HSN problems, rebooted
The high speed network stopped working today around 14:30 due to power spikes.
The machine had to be rebooted and has been up again since 16:40.
Hexagon: emergency reboot
We will reboot Hexagon at 11:40 to apply important security updates.
Please accept our apologies for the short notice.
Update 12:50 - Access to the machine has been re-opened.
Hexagon: emergency reboot
Hexagon had to be rebooted to apply important security updates.
The machine is up and login has been enabled since 10:30.
Please accept our apologies for the short notice.
Hexagon: login5 OOM
login5 ran out of memory yesterday (27.02.2017) around 18:16 and took about 15 minutes to recover.
During this time the compute nodes were unable to contact the application scheduler running on login5, and some jobs may have crashed.
A typical error message for this case is: "aprun: Apid nnnnnnn: close of the compute node connection after app startup barrier".
We apologise for any inconvenience caused.
Hexagon: login4 rebooted
login4 had to be rebooted due to deadlocks that prevented three OSTs of the /work filesystem from being mounted and three finished jobs from being released.
Jobs terminated by the reboot are: 1944657, 1944664, 1944208 and 1944204.
Hexagon: power issues
Four cabinets went down due to power issues caused by the storm. Storage controllers for /work-common are also affected.
Hexagon was started without the /work-common filesystem.
We are trying to fix the issues with the filesystem controllers and bring the filesystem back into production as soon as possible.
Update 2016-12-27 14:50: The problems with the /work-common storage controllers have been mitigated and the filesystem is back online. Hexagon had to be rebooted today at 14:15. All systems are up and functional again.
Hexagon: power issues
Hexagon is down due to power and cooling issues until further notice.
We will bring the machine back up as soon as possible.
Update 14:40: Hexagon is up again.
Hexagon: HSN link problems
We are having problems with the high speed network on Hexagon. We are working on the problem.
Update 11:31: Hexagon is up again. We had to disable one of the compute nodes due to hardware issues.