Some of the OSTs serving the /work filesystem have become full, which caused a few jobs to fail. We are working on rebalancing usage across the OSTs, but this is difficult because /work is currently 87% full. We have asked the top users of /work to clean up unnecessary files.
Today at 8:23 the primary MDS serving /work crashed. As a result, all I/O to /work was suspended.
The failover MDS has been up since 10:50 and is serving /work. All I/O should have recovered.
We will investigate the cause of the primary MDS crash on Monday.
It looks like not all data nodes hosting /work picked up the proper settings after the machine was rebooted on Friday, so /work is temporarily slow. We are working to resolve this issue ASAP.
Update 10:00: This issue is now resolved.
Since last night there has been a problem with one of our storage systems serving /work. We are looking into it.
Update 02.12.2014 13:00: Issue has been remediated, /work should be OK now.
We have had to unmount /work-common from Hexagon because of hardware issues with the /work-common MDS server.
We are working to fix this problem ASAP.
Update 21:35: We have moved the MDS to one of the OSTs until the MDS hardware is fixed. The filesystem may have slightly degraded performance. It should be available on all compute nodes and login nodes except login5. We are working to make it available on login5 as well.
Update 14.05 13:00: /work-common is back on login5.
There is ongoing maintenance on the tape library to resolve several issues.
Update 12:30: Tape services are back online.
We are experiencing problems with the tape robot. Until they are resolved, the /bcmhsm and /migrate filesystems will be unavailable.
Update: The filesystems are available now. /bcmhsm is almost full; please let it drain to the tape robot until it is down to about 50% full before copying more data to it.
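For users who want to check the fill level before copying, a minimal Python sketch follows. It assumes that /bcmhsm is mounted on the node where the script runs and that the usage reported by the operating system reflects the disk area that drains to the tape robot. The helper names and the default threshold are illustrative only and are not a site-provided tool.

```python
import shutil


def usage_percent(path):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some special filesystems report a size of zero; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def safe_to_copy(path, threshold=50.0):
    """Return True when usage is at or below `threshold` percent.

    The 50% default mirrors the request in the notice above; it is an
    assumption of this sketch, not an enforced limit.
    """
    return usage_percent(path) <= threshold


# Example use on a login node (path taken from the notice above):
#     if safe_to_copy("/bcmhsm"):
#         print("OK to copy")
```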
One of the OST nodes of the /work filesystem crashed. We are working on it. /work is currently unavailable.
Update 13:12: The OST has been recovered; /work should be back online.
Update 25.11 15:44: The same node in the filesystem has crashed again. We are working to fix the filesystem ASAP.
Update 25.11 16:15: /work is back online. We had to disable quota.
Update 26.11 5:30: This time another OST crashed. The filesystem is online, and we are investigating the root cause of the OST crashes.
Because of a security update that requires a reboot, we have to carry out the next maintenance of Hexagon earlier than planned. The scheduled maintenance will therefore start on Thursday Sep. 10th at 13:00.
A job-scheduler reservation is now in place so that only jobs that can finish (according to their requested walltime) before the scheduled maintenance will be allowed to start.
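The effect of the reservation can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the scheduler's actual logic: the function name is hypothetical, the year is not given in this notice and 2015 is used purely as an example, and whether a job ending exactly at the maintenance start is accepted is an assumption.

```python
from datetime import datetime, timedelta


def can_start(now, walltime, maintenance_start):
    """Return True if a job started at `now` would end by `maintenance_start`.

    `walltime` is the requested walltime, not the time the job actually needs.
    """
    return now + walltime <= maintenance_start


# Assumed example dates; the year is not stated in the notice.
maintenance = datetime(2015, 9, 10, 13, 0)
now = datetime(2015, 9, 9, 9, 0)

print(can_start(now, timedelta(hours=24), maintenance))  # True: would end Sep. 10 at 09:00
print(can_start(now, timedelta(hours=48), maintenance))  # False: would end Sep. 11 at 09:00
```

In practice this means a job that is waiting in the queue may start sooner if it is resubmitted with a shorter requested walltime that fits before the maintenance window.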
During the maintenance we will install a security update and replace a few faulty hardware components.
We will update this note when we have more information about the expected duration or the progress of the maintenance.
As usual, send any questions to firstname.lastname@example.org.
Update 16:30: Machine is now up again and ready for use.
Hexagon has high-speed network problems between a few nodes; the whole machine is therefore reserved and not available for submitting jobs.
Update 12:00: The machine has to be restarted.
Update 13:20: Hexagon is back online.