Downtime

Grunch: down

Lóránd Szentannai • October 5, 2017

Both operating system disks in Grunch failed within a short time of each other, leaving the system inoperable. We are working to recover from the failure as soon as possible.

Update 14:00 06.10.2017: The Grunch server is up again. Both OS disks have been replaced and the server has been reinstalled.

Hexagon: power blink

Lóránd Szentannai • August 2, 2017

Lightning has again crashed the high speed network on Hexagon, along with several nodes.

Hexagon has been up again since 11:15.

Hexagon: HSN problems, rebooted

Lóránd Szentannai • July 26, 2017

The high speed network stopped working today around 14:30 due to power spikes.
The machine had to be rebooted and has been up again since 16:40.

Hexagon: emergency reboot

Lóránd Szentannai • July 5, 2017

We will reboot Hexagon at 11:40 to apply important security updates.

Please accept our apologies for the short notice.

Update 12:50: Access to the machine has been re-opened.

Hexagon: emergency reboot

Lóránd Szentannai • June 30, 2017

Hexagon had to be rebooted to apply important security updates.
The machine is up and login has been enabled since 10:30.

Please accept our apologies for the short notice.

Hexagon: login5 OOM

Lóránd Szentannai • February 28, 2017

login5 ran out of memory yesterday (27.02.2017) around 18:16 and took about 15 minutes to recover.

During this time, the compute nodes were unable to contact the application scheduler running on login5, and some jobs may have crashed.
A typical error message for this case is: "aprun: Apid nnnnnnn: close of the compute node connection after app startup barrier".
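
To check whether a job was hit, its output can be searched for this message. The following is a minimal sketch, assuming the job output is available as text; the function name `find_affected_apids` and the sample Apid are hypothetical and used for illustration only.

```python
import re

# Pattern built from the error message quoted above.
# "nnnnnnn" in the message is assumed to stand for a numeric Apid.
APRUN_DISCONNECT = re.compile(
    r"aprun: Apid (\d+): close of the compute node connection "
    r"after app startup barrier"
)


def find_affected_apids(log_text):
    """Return the Apids (as strings) found in error lines of this kind."""
    return APRUN_DISCONNECT.findall(log_text)


# Made-up sample output; the Apid below is not from a real job.
sample = (
    "starting run\n"
    "aprun: Apid 1234567: close of the compute node connection "
    "after app startup barrier\n"
)
print(find_affected_apids(sample))
```

An empty list means the message was not found in the given text; it does not by itself prove that the job was unaffected.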

We apologise for any inconvenience caused.

Hexagon: login4 rebooted

Lóránd Szentannai • December 29, 2016

login4 had to be rebooted due to deadlocks that prevented three OSTs of the /work filesystem from being mounted and three finished jobs from being released.

The jobs terminated by the reboot are 1944657, 1944664, 1944208 and 1944204.

Hexagon: power issues

Lóránd Szentannai • December 26, 2016

Four cabinets went down due to power issues caused by the storm. The storage controllers for /work-common are also affected.
Hexagon was started without the /work-common filesystem.

We are trying to fix the issues with the filesystem controllers and bring the filesystem back into production as soon as possible.

Update 2016-12-27 14:50: The problems with the /work-common storage controllers have been mitigated and the filesystem is back online. Hexagon had to be rebooted today at 14:15. All systems are up and functional again.

Hexagon: power issues

Lóránd Szentannai • December 20, 2016

Hexagon is down due to power and cooling issues until further notice.
We will bring the machine back up as soon as possible.

Update 14:40: Hexagon is up again.

Hexagon: HSN link problems

Lóránd Szentannai • December 9, 2016

We are having problems with the high speed network on Hexagon. We are working on the problem.
Update 11:31: Hexagon is up again. We had to disable one of the compute nodes due to hardware issues.

HPC Syslog

Log of changes and events on UiB's HPC systems
