Some of the OSTs serving the /work filesystem have become full, causing a few jobs to fail. We are working on rebalancing the usage between the OSTs, but this is fairly difficult since /work is 87% full at the moment. We have notified the top users of the /work filesystem and asked them to clean up unnecessary files.
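For reference, per-OST usage can be inspected with Lustre's standard `lfs df` command. The snippet below is only a minimal sketch of how one might flag nearly full OSTs on /work, not an official tool; the 85% threshold is an arbitrary example value.

```python
#!/usr/bin/env python3
"""Sketch: list per-OST usage of /work via `lfs df` and flag nearly full OSTs.

Assumes a Lustre client with `lfs` in PATH; the threshold is an example value.
"""
import subprocess

THRESHOLD = 85  # example: warn when an OST is more than 85% used


def ost_usage(mountpoint="/work"):
    # `lfs df` prints one line per target: UUID, 1K-blocks, Used, Available, Use%, Mounted on
    out = subprocess.run(["lfs", "df", mountpoint],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 6 and "OST" in fields[-1]:
            pct = fields[4].rstrip("%")
            if pct.isdigit():
                yield fields[-1], int(pct)


if __name__ == "__main__":
    for target, pct in ost_usage():
        flag = "  <-- nearly full" if pct >= THRESHOLD else ""
        print(f"{target}: {pct}% used{flag}")
```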
Today at 8:23 the primary MDS serving /work crashed. As a result, all I/O to /work was suspended. The failover MDS has been up since 10:50 and is serving the /work fs, so all I/O should have recovered. We will investigate the cause of the primary MDS crash on Monday.
It looks like after the reboot of the machine on Friday, not all data nodes hosting /work picked up the proper settings, and the /work fs is temporarily slow. We are working to resolve this issue ASAP.
We have to unmount /work-common from Hexagon because of HW issues with the /work-common MDS server.
We are working to fix this problem ASAP.
Update 21:35: We have moved the MDS to one of the OST nodes until the MDS HW is fixed. The file system may have slightly degraded performance. It should be available on all compute nodes and login nodes except login5. We are working to make it available on login5 as well.
Update 14.05 13:00: /work-common is back on login5.
One of the OST nodes of the /work FS crashed. We are working on it. The /work fs is currently unavailable.
Update 13:12: The OST was recovered; the /work FS should be back online.
Update 25.11 15:44: New crash of the same node in the filesystem. We are working to fix the FS ASAP.
Update 25.11 16:15: /work is back alive. We had to disable quota.
Update 26.11 5:30: This time another OST crashed. The fs is online; we are investigating the root cause of the OST crashes.
Due to a needed security update that requires a reboot, we are forced to do the next maintenance of Hexagon earlier than planned. We will therefore have a scheduled maintenance starting on Thursday, Sep. 10th at 13:00.
A job-scheduler reservation is now in place, so only jobs that can finish (according to their requested walltime) before the scheduled maintenance will be allowed to start.
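In other words, a job submitted now will only start if its requested walltime fits before the maintenance window. The sketch below merely illustrates that check; the dates and times are example values, not the scheduler's actual implementation.

```python
from datetime import datetime, timedelta

# Example value only: maintenance starts at 13:00 on the announced Thursday.
maintenance_start = datetime(2015, 9, 10, 13, 0)


def can_start_now(requested_walltime: timedelta, now: datetime) -> bool:
    """A job may start only if it can finish before the maintenance begins."""
    return now + requested_walltime <= maintenance_start


print(can_start_now(timedelta(hours=4), datetime(2015, 9, 10, 8, 0)))  # True: ends 12:00
print(can_start_now(timedelta(hours=6), datetime(2015, 9, 10, 8, 0)))  # False: would end 14:00
```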
During the maintenance we will install a security update as well as replace a few faulty hardware components.
We will update this note when we have more information about the expected duration of the maintenance or its progress.
As usual, send any questions to support-uib@notur.no.
Update 16:30: Machine is now up again and ready for use.