The /work file system on hexagon is hanging. We are taking debug dumps and will restart the system. Existing jobs will have to be resubmitted.
Update 15:00, one of the disk controllers has problems, so the downtime will be longer than anticipated. We will update this note when we have more information.
Update 20:00, we will need to wait for support on Monday before continuing the work to fix the controller.
Update Monday Dec 29th, 12:00, we are currently waiting for a new controller.
Update Monday Dec 29th, 17:15, the shipment with the controller is expected to arrive on Wednesday 31st. We are sorry for this delay.
Update Wednesday Dec 31st, 14:50, we have been notified that delivery of the replacement controller is delayed even further, to Monday Jan. 5th. We are looking into other ways to get the file system working.
Update Thursday Jan 1st, 04:00, the system is running again with a workaround. We will have to reboot the system again when the replacement controller arrives (so long-running jobs will have to be resubmitted).
Update Monday Jan 5th, 13:50, the new controller has now arrived. We have scheduled the replacement for Monday the 12th at 13:30.
Downtime
File system crash on Fimm
The /home/fimm file system on the fimm cluster crashed this morning. We are working on solving the problem.
12:48 Update: The file system is up again. All jobs that were running before the file system crash have to be resubmitted. If you experience any file loss, please contact support-uib@notur.no.
We are sorry for the inconvenience.
Disk-controller failure on hexagon, Sat. Nov. 29th
One of the disk-controllers for hexagon has failed, forcing us to shut down the machine. We are investigating possible workarounds.
Update Sat., 23:00, unfortunately no workaround was found. We are waiting for the replacement hardware to arrive.
Update Mon., 14:30, we expect the new hardware to arrive tomorrow (Tue).
Update Tue., 09:00, we have a workaround in place and have done the scheduled maintenance work that was planned for Thursday. The machine will have to be shut down again when the replacement disk-controller arrives today. Therefore only short jobs will be allowed, and users should expect to be logged out of the login nodes on short notice.
Update Tue., 14:15, the controller has arrived. We are shutting down the machine to replace it.
Update Tue., 16:00, we are currently running a file-system check to make sure that all is OK.
Update Tue., 16:50, the machine is running again. Thank you for your patience.
Fimm global file system crash. Nov. 25th
At 15:30 the global file system on fimm crashed. All file systems are down. We are working on solving the problem.
Update 16:55: The file system on fimm is now up again. Unfortunately, all running jobs crashed and have to be resubmitted.
We are sorry for the inconvenience.
Hexagon crash on October 21st
A voltage regulator failed on one of hexagon's modules at 16:20, which in turn caused several of the io-nodes responsible for /work to crash.
We are collecting debug information and rebooting (the hardware will be replaced at the next scheduled maintenance).
Update 17:30, hexagon is running again. All jobs must be resubmitted.
Hexagon crash on October 14th
Hexagon crashed today at 07:20 due to HSN panic. We are working on getting the system up again.
Update 09:25: Hexagon is now booted. All jobs running at the time of the crash have to be resubmitted.
Scheduled maintenance for hexagon on Aug. 18th
On Monday August 18th at 14:00, Hexagon will be unavailable for approx. two hours while a firmware upgrade is installed on the /home file system. This re-flash is necessary because a firmware flash failed during the last maintenance window.
We are sorry about any inconvenience, and the short notice.
Update: The upgrade has been postponed to 16:00.
Update: After another delay, the machine was taken down at 16:50.
Update, 18:50, hexagon is now up again, but unavailable for users while we check the system.
Update, 19:15, hexagon is now available for users.
Scheduled maintenance for hexagon on Aug. 11th
Hexagon is scheduled for maintenance on Monday, August 11th at 14:00. The currently failed nodes and a module will be replaced. The machine will be unavailable for approximately two hours.
Update: August 11th 14:05, hexagon is shut down for maintenance.
Update: 15:20, the hardware part is finished. Firmware update, diagnostics and checking are starting.
Update: 17:35, hexagon is now up and running. Note that due to reserved time for benchmarking (final part of Acceptance test) it will take some hours before jobs will start (but the queue will accept new jobs).
fimm file system crash
The global file systems on fimm crashed today at 14:45. We are working on solving the problem.
Update, 16:40: File systems are now up again. All jobs running at the time of the crash have to be resubmitted.
Batch-system problem on hexagon
The batch system on hexagon has some problems. We are investigating.
Update, 11:50: hexagon has problems with nodes mistakenly shown as down.
Update, 12:20: hexagon is now OK again.