Downtime

Fimm: down time for work file system

lsz075 • February 4, 2010

We will have a very short downtime for the work file system on fimm on Tuesday at 12:00. We have to unmount the work file system from all cluster nodes, which means that all running jobs that use the work file system have to be stopped.

We estimate that the downtime will take about 10-15 minutes. We will keep you updated.

All running jobs on the cluster will be checked, and individual notice will be given.

09/02 12:11 Update

The downtime is finished; the work file system is mounted back on all cluster nodes.

Fimm: hpcmaster crash, Jan. 4

lsz075 • January 4, 2010

Hpcmaster crashed, which affected the job submission system. It was not possible to submit jobs between 12:00 and 12:45.
The problem is fixed now.

/bcmhsm file system unavailable

lsz075 • December 16, 2009

The /bcmhsm file system is experiencing a software problem. To avoid data corruption or unexpected results, we have to stop it temporarily. The case is under investigation with high priority. Updates will be posted.

NB: Users who urgently need files from that file system should send a service request to the support email address, with a list of the files to restore and the location to restore them to.

Update: 17/12 20:55 The file system is back online.

Fimm login node crashed

lsz075 • December 16, 2009

The fimm login node crashed at 12:40 today because of a kernel panic. We are looking into the issue.

Fimm short down time

lsz075 • December 1, 2009

Due to a hardware update on the fimm login node and master node, we will have a short downtime on the fimm cluster this coming Wednesday, 9th of December. The fimm login node will not be available from 13:00 to 16:00. All running jobs that cannot finish by that time will crash and have to be resubmitted. A reservation is set on the fimm cluster, so jobs that would not finish before the downtime will not be able to run.

We will keep this information updated.

Work file system on fimm will be down Monday

lsz075 • November 20, 2009

Hi,
Due to a firmware update on the storage system, we have to take down the work file system on fimm.

We will start updating the firmware at 12:00 on Monday (23rd Nov). The update will last 3-4 hours; during that time fimm will be accessible without the work file system. All compute nodes are reserved from now on for the update, so jobs that cannot finish before the update will not run.

We will keep this information updated as the work proceeds.

12:30 UPDATE Work file system unmounted from the cluster, preparing for the firmware update.

18:00 UPDATE The firmware update on the storage system failed for some of the disks; we are working on it.

20:45 UPDATE Firmware update finished. Work file system mounted back on the cluster.

20:50 UPDATE Reservation is canceled, all jobs will start to run.

Scheduled maintenance for hexagon, Mon. Nov. 23rd

lsz075 • November 15, 2009

Hexagon will have a scheduled maintenance on Monday Nov. 23rd from 13:00 to approx. 19:00. Some software updates and hardware replacements will be made. The queue has a reservation in place so that only jobs that can complete (according to the requested walltime) before the maintenance will start.
This note will be updated when we have more information.

Update: 19:08 Maintenance finished, system is up and open for users.

Hexagon Lustre file system hang, Oct. 18th

lsz075 • October 18, 2009

Some of the Lustre IO-nodes have hung. We are working on diagnostics.

The cause of the hang was an HSN hardware failure between two nodes.

Update 13:25, Hexagon is now running again after a reboot.

Work file system crashed on fimm, Sep. 11th

lsz075 • September 13, 2009

The work file system crashed on fimm Friday night, and all jobs using the work file system crashed as well. We have blocked the login node for maintenance and are working on the problem. We will keep you updated.

Update 2009-09-13 16:19

Some disks have failed on the work file system. We are investigating the issue.

Update 13:00 2009-09-14

The work file system is mounted back. All jobs that were using the work file system before the crash have to be resubmitted. The fimm login node has been updated to the new kernel and the latest version of GPFS.

Sorry for the inconvenience.

Scheduled maintenance for hexagon, Thu Sep. 10th

lsz075 • September 6, 2009

Due to a needed security update that requires a reboot we will be forced to do the next maintenance of hexagon earlier than planned. We will therefore have a scheduled maintenance starting on Thursday Sep. 10th at 13:00.

Job-scheduler reservation is now in place so that only jobs that can finish (according to requested walltime) before the scheduled maintenance will be allowed to start.
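The reservation rule described above can be illustrated with a small sketch. This is not the scheduler's actual implementation; the function name `can_start`, the example submission time, and the choice to allow a job that ends exactly at the maintenance start are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def can_start(now, requested_walltime, maintenance_start):
    """Return True if a job started at `now` would end before the maintenance.

    Hypothetical helper: the job's end time is estimated from the walltime
    the user requested, not from how long the job actually runs.
    Assumption: a job ending exactly at the maintenance start is allowed.
    """
    return now + requested_walltime <= maintenance_start


# Maintenance start taken from the notice: Thursday Sep. 10th, 13:00.
maintenance_start = datetime(2009, 9, 10, 13, 0)

# Assumed example: the scheduler considers the jobs one day earlier.
now = datetime(2009, 9, 9, 13, 0)

print(can_start(now, timedelta(hours=12), maintenance_start))  # 12 h request
print(can_start(now, timedelta(hours=48), maintenance_start))  # 48 h request
```

In practice this means that requesting a shorter, realistic walltime can let a job start before the maintenance, while a job with a long requested walltime stays queued until the reservation is released.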

During the maintenance we will install a security update and replace a few faulty hardware components.

We will update this note when we have more information about expected length or ongoing progress for the maintenance.

As usual, send any questions to support-uib@notur.no.

Update 16:30: Machine is now up again and ready for use.

HPC Syslog

Log of changes and events on UiB's HPC systems