The metadata server for the /work filesystem crashed on Friday evening.
Some users may have encountered filesystem errors at that time.
Someone managed to kill login3 and login4 by oversubscribing memory. As a result, the jobs started from these login nodes were killed.
We've restarted these login nodes and will investigate the cause tomorrow.
Login node 1 hung and we had to reboot it.
Affected jobs are: 1689188, 1691986, 1693190, 1688272, 1688273, 1688264, 1688265, 1691903, 1693054, 1693083, 1693214, 1693209, 1693084, 1693203, 1693204, 1693056, 1692989, 1693499.
We occasionally see NFS timeouts on various login nodes; logged-in users experience them as short hangs. This has been going on for some weeks, but not that often. In the last week the timeouts have become very frequent and now occur on almost all nodes.
We've applied a patch which is supposed to fix this issue. For the changes to take effect, we need to restart Hexagon.
Update 15:30: Hexagon is up again.