Hardware

Hexagon: issues with /work storage

Lóránd Szentannai • April 22, 2016

Some of the OSTs serving the /work filesystem have become full, causing a few jobs to fail. We are working on rebalancing usage between the OSTs, but this is fairly difficult since /work is 87% used at the moment. We have notified the top users of the /work filesystem and asked them to clean up unnecessary files.
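For users asked to clean up, a minimal sketch along these lines may help locate the largest files in a directory tree. The function name `largest_files` and its parameters are illustrative assumptions, not a tool provided on Hexagon, and it only reports files; it does not delete anything.

```python
import heapq
import os
import stat


def largest_files(root, count=10):
    """Return up to `count` (size_bytes, path) pairs for the largest regular files under `root`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so link targets are not counted.
                info = os.lstat(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if stat.S_ISREG(info.st_mode):
                sizes.append((info.st_size, path))
    # Largest first.
    return heapq.nlargest(count, sizes)
```

Calling `largest_files` on your own directory under /work returns the candidates worth reviewing first. Walking a large tree generates metadata load of its own, so it is best run sparingly.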

Hexagon: MDS server crash

Alexander Oltu • August 1, 2015

Today at 8:23 the primary MDS serving /work crashed. As a result, all IO to /work was suspended.
The failover MDS has been up since 10:50 and is serving the /work fs. All IO should have recovered.
We will investigate the cause of the primary MDS crash on Monday.

Hexagon: temporary issues with /work

lsz075 • December 8, 2014

It looks like not all data nodes hosting /work picked up the proper settings after the reboot of the machine on Friday, so the /work fs is temporarily slow. We are working to resolve this issue ASAP.

Update 10:00: This issue is now resolved.

Hexagon: issues with /work storage

lsz075 • December 2, 2014

Since last night there has been a problem with one of our storage systems serving /work. We are looking into it.

Update 02.12.2014 13:00: Issue has been remediated, /work should be OK now.

/work-common issues

lsz075 • May 13, 2013

We have to unmount /work-common from Hexagon because of HW issues with the /work-common MDS server.

We are working to fix this problem ASAP.

Update 21:35: We have moved MDS to one of the OSTs until MDS HW is fixed. The file system can have slightly degraded performance. It should be available on all compute nodes and login nodes, except login5. We are working to make it available on login5 as well.

Update 14.05 13:00: /work-common is back on login5.

Tape services will be unavailable for ~1-2 hours

lsz075 • February 28, 2013

There is ongoing maintenance on the tape library to resolve several issues.

Update 12:30: Tape services are back online.

Problems with /bcmhsm and /migrate

lsz075 • July 14, 2010

We are experiencing problems with the tape robot. Until these are resolved, the /bcmhsm and /migrate filesystems will be unavailable.

Update: The filesystems are now available. /bcmhsm is almost full; please allow it to drain to the tape robot, down to 50% usage, before copying to it.

/work fs hang on hexagon

lsz075 • November 24, 2009

One of the OST nodes serving the /work FS crashed. We are working on it. The /work fs is currently unavailable.

Update 13:12: The OST was recovered; the /work FS should be back online.
Update 25.11 15:44: The same node in the filesystem has crashed again. We are working to fix the FS ASAP.
Update 25.11 16:15: /work is back up. We had to disable quota.
Update 26.11 5:30: This time another OST crashed. The fs is online, and we are investigating the root cause of the OST crashes.

Scheduled maintenance for hexagon, Thu Sep. 10th

lsz075 • September 6, 2009

Because a needed security update requires a reboot, we are forced to do the next maintenance of hexagon earlier than planned. The scheduled maintenance will therefore start on Thursday, Sep. 10th at 13:00.

Job-scheduler reservation is now in place so that only jobs that can finish (according to requested walltime) before the scheduled maintenance will be allowed to start.
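The effect of the reservation can be sketched as a simple time comparison. This is an illustration of the rule described above, not the scheduler's actual implementation; the function name and the example "current time" are assumptions, and treating a job that ends exactly at the start of the maintenance as allowed is also an assumption.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now, walltime, maintenance_start):
    """Return True if a job started at `now` with the requested `walltime`
    would finish no later than `maintenance_start`."""
    return now + walltime <= maintenance_start


# Start of the maintenance from this announcement: Thursday Sep. 10th 2009 at 13:00.
maintenance_start = datetime(2009, 9, 10, 13, 0)
# Hypothetical current time, 24 hours before the maintenance.
now = datetime(2009, 9, 9, 13, 0)

print(can_start_before_maintenance(now, timedelta(hours=12), maintenance_start))  # True
print(can_start_before_maintenance(now, timedelta(hours=48), maintenance_start))  # False
```

In practice this means that requesting a realistic, shorter walltime may let a job start before the maintenance instead of waiting until the machine is back.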

During the maintenance we will install a security update and replace a few faulty hardware components.

We will update this note when we have more information about expected length or ongoing progress for the maintenance.

As usual, send any questions to support-uib@notur.no.

Update 16:30: Machine is now up again and ready for use.

Hexagon HSN network problems, July 6th

lsz075 • July 6, 2009

Hexagon has high-speed network problems between a few nodes. The whole machine is therefore reserved and not available for submitting jobs.

Update 12:00: The machine has to be restarted.
Update 13:20: Hexagon is back online.

HPC Syslog

Log of changes and events on UiB's HPC systems