We had High Speed Network link errors caused by cabinet fall-outs.
The cabinets fell out around 02:10 on 14-11-2015, most likely due to power spikes. We are still investigating the issue.
Update: the system has been up again since 04:45 on 14-11-2015.
Downtime
Hexagon: system crashed
Hexagon crashed and had to be restarted. We will come back with more information later on.
Update 2015-10-21 19:30: System and job submission were recovered.
Update 2015-10-23 14:48: We got confirmation from building maintenance that the system crashed due to an electricity failure around 16:45.
Hexagon: MDS server crash
The MDS server has crashed again. Some jobs may fail.
Hopefully the MDS crashes will be eliminated after the maintenance we are planning later this year (a separate announcement will come).
/work filesystem hanging
We have problems with the /work filesystem. We are looking into the problem.
Update 13:05: The issues were remediated and the filesystem is available again. Please contact us if you still encounter issues accessing it.
Hexagon: MDS crash
The metadata server for the /work filesystem crashed on Friday evening.
Some users might have encountered filesystem errors at that time.
Hexagon: rebooted login1
Login node 1 hung and we had to reboot it.
Affected jobs are: 1689188, 1691986, 1693190, 1688272, 1688273, 1688264, 1688265, 1691903, 1693054, 1693083, 1693214, 1693209, 1693084, 1693203, 1693204, 1693056, 1692989, 1693499.
Hexagon: NFS timeouts on login nodes
We occasionally see NFS timeouts on different login nodes; logged-in users experience them as short hangs. This has been going on for some weeks, but not that often. During the last week the timeouts became very frequent and affected almost all nodes.
We have applied a patch that is supposed to fix this issue. For the changes to take effect, we need to restart Hexagon.
Update 15:30: Hexagon is up again.
Hexagon: rebooted login5
We had to reboot login5 due to a serious routing issue.
Our apologies for any inconvenience this could cause.
Hexagon: cooling failure forced machine to shut down
Due to a cooling failure, Hexagon was forced to shut down. We are investigating the issue and will keep you updated.
Update:
08:45 - Service personnel are on-site trying to fix the cooling system. We will get back to you as soon as the issue is remediated.
10:50 - Machine is up again.
Reboot of Hexagon, Fimm, Grunch
Due to an important security update, we will shortly reboot the above-mentioned systems.
Our apologies for any inconvenience caused by this.
Update: Hexagon and Grunch were stopped at 11:45 and were available again at 12:35. Fimm login nodes were rebooted in the background.