hexagon has had a module voltage failure. We are investigating and fixing.
Update, 23:30: hexagon is booted again. Jobs will need to be re-submitted.
Downtime
Node panic and HSN network hang on hexagon
hexagon suffered a node panic followed by an HSN hang during the night.
We are investigating and rebooting.
Update, 11:10: hexagon has been taken down for diagnostics and reboot.
Update, 12:10: hexagon is now running again. Jobs that were running will need to be re-submitted.
Fimm frontend file system maintenance
Due to a GPFS file system hang on the fimm frontend, the frontend will be unavailable for a short period (hopefully 10-15 minutes). All users will need to log in again afterwards.
Update, 10:12: fimm frontend needs to be rebooted to clear the hang.
Update, 10:26: fimm frontend is now rebooted and up again.
Scheduled maintenance for hexagon on Jul. 8th
As previously noted, we will have a scheduled downtime from 16:00 on Tuesday July 8th. We will replace a faulty module and do some I/O benchmarking, which requires a reserved system. We estimate that the machine will be available for login at 19:00.
Update, 16:00: hexagon is shut down for hardware replacement.
Update, 16:45: hexagon is up.
Update, 19:10: hexagon is up and allowing users to log in.
Module hw failure on hexagon
Four compute nodes (one module) on hexagon have stopped responding, and as a result some of the login nodes and the Lustre filesystem are also unresponsive. Unfortunately, we need to reboot hexagon to clear the issue; jobs will need to be re-submitted.
Update, 20:30: hexagon is now up again. Note that 4 high-mem nodes are now unavailable due to hardware errors.
Unresponsive login nodes and Lustre filesystem on hexagon
Three login nodes as well as the Lustre filesystem (/work) on hexagon are unresponsive. Attempts to restart only these login nodes have failed, so hexagon needs to be rebooted.
Update, 09:00: hexagon is now booted. The problem has been traced to a memory error.
Module hw failure on hexagon
Thursday June 19th 21:10: A module (4 nodes) on hexagon crashed with hardware errors, which impacted the routing and the global file system. We are working on solving the problem.
Update Friday June 20th 02:20: Replaced and re-flashed the firmware on the module and ran diagnostics. The machine is now up again.
Brief power outage (blink) for hexagon
hexagon experienced a brief blink in external power. Since only part of the machine is on UPS, the machine went down.
The machine was down from 07:45 to 08:30 but is now up and running again. All running jobs were regrettably lost and will have to be submitted again.
Scheduled maintenance for hexagon, software upgrade, June 16th
There will be planned maintenance on hexagon for a software upgrade on Monday June 16th, starting at 14:00 and expected to last approximately 3 hours.
The Cray software release will be upgraded from 2.0.44 to 2.0.53.
This release will have more quad-core optimizations as well as a new version of the MPI library. We therefore recommend that you recompile your programs and libraries after the upgrade. We will notify users when we have re-compiled the libraries/modules installed by us.
Update 16th, 14:40: System taken down.
Update 16th, 19:30: System back online with version 2.0.53 and MPT 3.0.
Status of the re-compiled libraries (updates will be posted here):
All compute-node (cnl) software has been re-compiled.
Most login node software has been recompiled, except GNUPLOT.
UPC is not re-compiled yet.
Lustre IO-node crash on hexagon
An IO-node for the Lustre filesystem (/work) on hexagon has crashed. We are doing a debugging dump and will restart hexagon.
The machine will be unavailable for about 30 minutes during the dump and restart.
Update 09:40: hexagon is now up again. Jobs will need to be resubmitted.