hexagon has had a module voltage failure. We are investigating and fixing.
Update, 23:30: hexagon is booted again. Jobs will need to be re-submitted.
hexagon suffered a node panic and a subsequent HSN hang during the night.
We are investigating and rebooting.
Update, 11:10: hexagon has been taken down for diagnostics and reboot.
Update, 12:10: hexagon is now running again. Jobs that were running will need to be re-submitted.
Due to a GPFS file system hang on the fimm frontend, the frontend will be unavailable for a short period (hopefully 10-15 minutes). All users will need to log in again afterwards.
Update, 10:12: fimm frontend needs to be rebooted to clear the hang.
Update, 10:26: fimm frontend is now rebooted and up again.
As previously noted, we will have a scheduled downtime from 16:00 Tuesday July 8th. We will replace a faulty module and do some I/O benchmarking, which requires a reserved system. We estimate that the machine will be available for login at 19:00.
Update, 16:00: hexagon is shut down for hardware replacement.
Update, 16:45: hexagon is up.
Update, 19:10: hexagon is up and allowing users to log in.
Four compute nodes (one module) on hexagon have stopped responding and, as a result, so have some of the login nodes and the Lustre filesystem. We will unfortunately need to reboot hexagon to clear the issue; jobs will need to be re-submitted.
Update, 20:30: hexagon is now up again. Note that 4 high-mem nodes are now unavailable due to hardware errors.
Three login nodes as well as the Lustre filesystem (/work) on hexagon are unresponsive. Attempts to restart only these login nodes have failed and hexagon needs to be rebooted.
Update, 09:00: hexagon is now booted. The problem has been traced to a memory error.
Thursday June 19th 21:10: A module (4 nodes) on hexagon crashed with hardware errors, which impacted the routing and the global file system. We are working on solving the problem.
Update Friday June 20th 02:20: We replaced and re-flashed the firmware on the module and ran diagnostics. The machine is now up again.
Hexagon experienced a brief blink in external power. Since only part of the machine is on UPS, the machine went down.
The machine was down from 07:45 to 08:30 but is now up and running again. All running jobs were regrettably lost and will have to be submitted again.
There will be planned maintenance on hexagon for a software upgrade on Monday June 16th, starting at 14:00 and expected to last approximately 3 hours.
The Cray software release will be upgraded from 2.0.44 to 2.0.53.
This release will have more quad-core optimizations as well as a new version of the MPI library. We therefore recommend that you recompile your programs and libraries after the upgrade. We will notify you when we have re-compiled the libraries/modules installed by us.
Update 16th, 14:40: System taken down.
Update 16th, 19:30: System back online with version 2.0.53 and MPT 3.0.
Look here for updates on which libraries we have re-compiled:
All compute-node (cnl) software has been re-compiled.
Most login node software has been re-compiled, except GNUPLOT.
UPC has not been re-compiled yet.
An I/O node for the Lustre filesystem (/work) on hexagon has crashed. We are taking a debugging dump and will restart hexagon.
The machine will be unavailable for about 30 minutes during the dump and restart.
Update, 09:40: hexagon is now up again. Jobs will need to be re-submitted.