hexagon is scheduled for maintenance on Monday, August 11th at 14:00. The currently failed nodes and a module will be replaced. The machine will be unavailable for approximately two hours.
Update: August 11th 14:05, hexagon is shut down for maintenance.
Update: 15:20, the hardware part is finished; firmware update, diagnostics and checking are starting.
Update: 17:35, hexagon is now up and running. Note that due to time reserved for benchmarking (the final part of the acceptance test), it will take some hours before jobs start (but the queue will accept new jobs).
Author Archives: lsz075
fimm file system crash
The global file systems on fimm crashed today at 14:45. We are working on solving the problem.
Update, 16:40: File systems are now up again. All jobs running at the time of the crash have to be resubmitted.
Batch-system problem on hexagon
The batch system on hexagon has some problems. We are investigating.
Update, 11:50: hexagon has problems with nodes mistakenly shown as down.
Update, 12:20: hexagon is now OK again.
Module hw failure on hexagon
hexagon has had a module voltage failure. We are investigating and fixing.
Update, 23:30: hexagon is booted again. Jobs will need to be re-submitted.
Node panic and HSN network hang on hexagon
hexagon got a node panic and subsequent HSN network hang during the night.
We are investigating and rebooting.
Update, 11:10. hexagon has been taken down for diagnostics and reboot.
Update, 12:10. hexagon is now running again. Jobs that were running will need to be re-submitted.
Fimm frontend file system maintenance
Due to a GPFS file system hang on the fimm frontend, the frontend will be unavailable for a short period of time (hopefully 10-15 minutes). All users will need to log in again afterwards.
Update, 10:12: fimm frontend needs to be rebooted to clear the hang.
Update, 10:26: fimm frontend is now rebooted and up again.
Scheduled maintenance for hexagon on Jul. 8th
As previously noted, we will have a scheduled downtime from 16:00 on Tuesday July 8th. We will replace a faulty module and do some I/O benchmarking, which requires a reserved system. The machine is estimated to be available for login at 19:00.
Update, 16:00: hexagon is shut down for hardware replacement.
Update, 16:45: hexagon is up.
Update, 19:10: hexagon is up and allowing users to log in.
Module hw failure on hexagon
4 compute nodes (1 module) on hexagon have stopped responding, and as a result so have some of the login nodes and the lustre filesystem. We will unfortunately need to reboot hexagon to clear the issue; jobs will need to be re-submitted.
Update 20:30, hexagon is now up again. Note that 4 high-mem nodes are now unavailable due to hardware errors.
Unresponsive login nodes and lustre filesystem on hexagon
Three login nodes as well as the lustre filesystem (/work) on hexagon are unresponsive. Attempts to restart only these login nodes have failed and hexagon needs to be rebooted.
Update, 09:00: hexagon is now booted. The problem has been traced to a memory error.
Fimm file system upgrade
Fimm will be unavailable from Wednesday July 16th at 08:00 while the file system and the queuing system are upgraded. The upgrade will most likely last until 17:00.
Please note that a reservation has been set on the system. Jobs must finish before July 16th; otherwise they will stay in the queue until the upgrade has been completed.
Update, July 16th 08:00: The upgrade has started. The machine will be unavailable until the upgrade is complete.
Update, July 16th 15:30: Starting to reinstall the compute nodes. Hopefully the upgrade will be completed within a few hours.
Update, July 16th 20:30: Fimm is now available. All global file systems have been upgraded. The queuing system has not been upgraded.