Downtime

Module hw failure on hexagon

lsz075 • July 16, 2008

hexagon has had a module voltage failure. We are investigating and fixing the problem.

Update, 23:30: hexagon is booted again. Jobs will need to be re-submitted.

Node panic and HSN network hang on hexagon

lsz075 • July 12, 2008

hexagon suffered a node panic and a subsequent HSN network hang during the night.
We are investigating and will reboot.

Update, 11:10: hexagon taken down for diagnostics and reboot.
Update, 12:10: hexagon is now running again. Jobs that were running will need to be re-submitted.

Fimm frontend file system maintenance

lsz075 • July 9, 2008

Due to a GPFS file system hang on the fimm frontend, the frontend will be unavailable for a short period (hopefully 10-15 minutes). All users will need to log in again afterwards.

Update, 10:12: fimm frontend needs to be rebooted to clear the hang.
Update, 10:26: fimm frontend is now rebooted and up again.

Scheduled maintenance for hexagon on Jul. 8th

lsz075 • July 8, 2008

As previously noted, we will have a scheduled downtime from 16:00 Tuesday July 8th. We will replace a faulty module and do some I/O-benchmarking which requires a reserved system. It is estimated that the machine will be available for login at 19:00.

Update, 16:00: hexagon is shut down for hw replacement.
Update, 16:45: hexagon is up.
Update, 19:10: hexagon is up and allowing users to log in.

Module hw failure on hexagon

lsz075 • July 2, 2008

4 compute nodes (1 module) on hexagon have stopped responding, and as a result so have some of the login nodes and the lustre filesystem. Unfortunately, we need to reboot hexagon to clear the issue; jobs will need to be re-submitted.

Update, 20:30: hexagon is now up again. Note that 4 high-mem nodes are now unavailable due to hardware errors.

Unresponsive login nodes and lustre filesystem on hexagon

lsz075 • July 1, 2008

Three login nodes as well as the lustre filesystem (/work) on hexagon are unresponsive. Attempts to restart only these login nodes have failed and hexagon needs to be rebooted.

Update, 09:00: hexagon is now booted. The problem has been traced to a memory error.

Module hw failure on hexagon

lsz075 • June 19, 2008

Thursday June 19th 21:10: A module (4 nodes) on hexagon crashed with hardware errors, which impacted the routing and the global file system. We are working on solving the problem.

Update, Friday June 20th 02:20: Replaced and re-flashed firmware on the module and ran diagnostics. The machine is now up again.

Brief power outage (blink) for hexagon

lsz075 • June 6, 2008

Hexagon experienced a short blink in external power. Since only part of the machine is on UPS, the machine went down.

The machine was down from 07:45 to 08:30 but is now up and running again. All running jobs were regrettably lost and will have to be submitted again.

Scheduled maintenance for hexagon, software upgrade, June 16th

lsz075 • June 5, 2008

There will be planned maintenance on hexagon for a software upgrade on Monday June 16th, starting at 14:00 and expected to last approximately 3 hours.

The Cray software release will be upgraded from 2.0.44 to 2.0.53.
This release will have more quad-core optimizations as well as a new version of the MPI library. We therefore recommend that you recompile your programs and libraries after the upgrade. We will notify you when we have re-compiled the libraries/modules installed by us.

Update, 16th, 14:40: System taken down.
Update, 16th, 19:30: System back online with version 2.0.53 and MPT 3.0.

Check here for updates on which libraries have been re-compiled:

All compute-node (cnl) software has been re-compiled.
Most login node software has been recompiled, except GNUPLOT.
UPC is not re-compiled yet.

Lustre IO-node crash on hexagon

lsz075 • May 29, 2008

An IO-node for the Lustre filesystem (/work) on hexagon has crashed. We are doing a debugging dump and will restart hexagon.
The machine will be unavailable for about 30 min during the dump and restart.

Update 09:40: hexagon is now up again. Jobs will need to be resubmitted.

HPC Syslog

Log of changes and events on UiB's HPC systems
