hexagon has had a module voltage failure. We are investigating and fixing.
Update, 23:30: hexagon is booted again. Jobs will need to be re-submitted.
Node panic and HSN network hang on hexagon
hexagon suffered a node panic and a subsequent HSN network hang during the night.
We are investigating and rebooting.
Update, 11:10: hexagon has been taken down for diagnostics and reboot.
Update, 12:10: hexagon is now running again. Jobs that were running will need to be re-submitted.
Fimm frontend file system maintenance
Due to a GPFS file system hang on the fimm frontend, the frontend will be unavailable for a short period (hopefully 10-15 minutes). All users will need to log in again afterwards.
Update, 10:12: fimm frontend needs to be rebooted to clear the hang.
Update, 10:26: fimm frontend is now rebooted and up again.
Scheduled maintenance for hexagon on Jul. 8th
As previously noted, we will have a scheduled downtime from 16:00 on Tuesday July 8th. We will replace a faulty module and do some I/O benchmarking, which requires a reserved system. The machine is expected to be available for login at 19:00.
Update, 16:00: hexagon is shut down for hardware replacement.
Update, 16:45: hexagon is up.
Update, 19:10: hexagon is up and allowing users to log in.
Module hw failure on hexagon
Four compute nodes (one module) on hexagon have stopped responding, and as a result so have some of the login nodes and the Lustre file system. Unfortunately, we need to reboot hexagon to clear the issue; jobs will need to be re-submitted.
Update, 20:30: hexagon is now up again. Note that 4 high-mem nodes are now unavailable due to hardware errors.
Unresponsive login nodes and lustre filesystem on hexagon
Three login nodes, as well as the Lustre file system (/work), on hexagon are unresponsive. Attempts to restart only these login nodes have failed, so hexagon needs to be rebooted.
Update, 09:00: hexagon is now booted. The problem has been traced to a memory error.
Fimm file system upgrade
Fimm will be unavailable from Wednesday July 16th at 08:00 while the file system and the queuing system are upgraded. The upgrade will most likely last until 17:00.
Please note that a reservation has been set on the system. Jobs must finish before July 16th; otherwise they will stay in the queue until the upgrade has been completed.
Update, July 16th 08:00: The upgrade has started. The machine will be unavailable until the upgrade is complete.
Update, July 16th 15:30: Starting to reinstall the compute nodes. Hopefully the upgrade will be completed within a few hours.
Update, July 16th 20:30: Fimm is now available. All global file systems have been upgraded. The queuing system has not been upgraded.
Module hw failure on hexagon
Thursday June 19th 21:10: A module (4 nodes) on hexagon crashed with hardware errors, which impacted the routing and the global file system. We are working on solving the problem.
Update, Friday June 20th 02:20: Replaced and re-flashed the firmware on the module and ran diagnostics. The machine is now up again.
Final quad-core upgrade of Hexagon, June 24th
Hexagon will be taken down on Tuesday June 24th for the final quad-core upgrade.
During the upgrade we will bring up parts of the machine so that short jobs can be run.
Updates:
Tuesday 24th, 08:00: hexagon is shut down for upgrading.
Tuesday 24th, 09:00: Half of hexagon is started while the other half is upgraded. The rest of the machine will be turned off tomorrow morning (Wednesday) at 08:00 for upgrading. The last two racks will be turned on and made available until 14:00, then the entire machine will be taken down for the final upgrade. From then on hexagon, including the file system, will be unavailable until the diagnostics and checkout procedures have been completed.
Wednesday 25th, 08:00: Only the last two racks are now running.
Wednesday 25th, 14:00: The entire machine is now down for the upgrade. We will update this page when the diagnostics are completed.
Wednesday 25th, 20:00: The machine is now booted with final hardware configuration, but not available to users due to diagnostics and checkout procedures.
Thursday 26th, 23:00: The machine is still going through checkout procedures and will start benchmarking tomorrow for the acceptance test of the system. More information on when the system will be available for users will come Friday at 11:00.
Friday 27th, 11:00: Hexagon is currently running benchmarks. These are scheduled to complete by 18:00 today, at which point users will be allowed to log in.
Friday 27th, 18:00: Hexagon is now available for users. Note that it has a scheduled slot for further benchmarking on Tuesday July 8th starting at 16:00. Jobs need to request a walltime that ends before then.
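The walltime limit implied by the July 8th benchmarking slot is simple date arithmetic. The following is a minimal sketch and not part of the original notice: the function name max_walltime, the 5-minute safety margin, the year used in the example and the HH:MM:SS output format are all assumptions for illustration. Check the batch system documentation for the walltime format it actually accepts.

```python
from datetime import datetime, timedelta


def max_walltime(submit_time, reservation_start, margin=timedelta(minutes=5)):
    """Return the longest walltime (as HH:MM:SS) that still ends before
    reservation_start, or None if no time is left.

    margin is a hypothetical safety buffer, not a value from the notice.
    """
    available = reservation_start - submit_time - margin
    if available <= timedelta(0):
        return None
    total_seconds = int(available.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    # Hours are not wrapped at 24, so multi-day limits stay in one field.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: a job submitted when the machine reopened (Friday June 27th, 18:00)
# must end before Tuesday July 8th, 16:00. The year is assumed for illustration.
reopened = datetime(2008, 6, 27, 18, 0)
benchmark_slot = datetime(2008, 7, 8, 16, 0)
print(max_walltime(reopened, benchmark_slot))  # 261:55:00
```

A job submitted later has correspondingly less time available, so the value should be recomputed at submission time.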
Brief power outage (blink) for hexagon
Hexagon experienced a short blink in external power. Since only part of the machine is on UPS, the machine went down.
The machine was down from 07:45 to 08:30 but is now up and running again. Regrettably, all running jobs were lost and will have to be submitted again.