Downtime

The /work file system on hexagon is hanging. We are collecting debug dumps and will restart the system. Existing jobs will have to be resubmitted.

Update 15:00, one of the disk controllers has problems, so the downtime will be longer than anticipated. We will update this note when we have more information.

Update 20:00, we will need to wait for support on Monday before continuing the work to fix the controller.

Update Monday Dec 29th, 12:00, we are currently waiting for a new controller.

Update Monday Dec 29th, 17:15, the shipment with the controller is expected to arrive on Wednesday the 31st. We are sorry for this delay.

Update Wednesday Dec 31st, 14:50, we have been notified that delivery of the replacement controller is delayed even further, to Monday Jan. 5th. We are looking into other ways to get the file system working.

Update Thursday Jan 1st, 04:00, the system is running again with a workaround. We will have to reboot the system again when the replacement controller arrives (so long-running jobs will have to be resubmitted).

Update Monday Jan 5th, 13:50, the new controller has now arrived. We have scheduled the replacement for Monday the 12th at 13:30.

The /home/fimm file system on the fimm cluster crashed this morning. We are working on solving the problem.

12:48 Update: The file system is up again. All jobs that were running before the file system crash have to be resubmitted. If you experience any file loss, please contact support-uib@notur.no.

We are sorry for the inconvenience.

One of the disk controllers for hexagon has failed, forcing us to shut down the machine. We are investigating possible workarounds.

Update Sat., 23:00, unfortunately no workaround was found. We are waiting for the replacement hardware to arrive.

Update Mon., 14:30, we expect the new hardware to arrive tomorrow (Tue).

Update Tue., 09:00, we have a workaround in place and have done the scheduled maintenance work that was planned for Thursday. The machine will have to be shut down again when the replacement disk controller arrives today. Therefore only short jobs will be allowed, and users should expect to be logged out of the login nodes on short notice.

Update Tue., 14:15, the controller has arrived, and we are shutting down the machine to replace it.

Update Tue., 16:00, we are currently running a file system check to be sure that all is OK.

Update Tue., 16:50, the machine is running again. Thank you for your patience.

A voltage regulator on one of hexagon's modules failed at 16:20, which in turn caused several of the I/O nodes responsible for /work to crash.
We are collecting debug information and rebooting (the hardware will be replaced at the next scheduled maintenance).

Update 17:30, hexagon is running again. All jobs must be resubmitted.

On Monday August 18th at 14:00, hexagon will be unavailable for approx. two hours while a firmware upgrade is installed on the /home file system. This re-flash is necessary because a firmware flash failed during the last maintenance window.

We are sorry for any inconvenience and for the short notice.

Update: The upgrade has been postponed to 16:00.
Update: After another delay, the machine was taken down at 16:50.
Update, 18:50, hexagon is now up again, but unavailable to users while we check the system.
Update, 19:15, hexagon is now available for users.

hexagon is scheduled for maintenance on Monday, August 11th at 14:00. The currently failed nodes and a module will be replaced. The machine will be unavailable for approximately two hours.

Update: August 11th 14:05, hexagon is shut down for maintenance.

Update: 15:20, the hardware work is finished. The firmware update, diagnostics and checks are starting.

Update: 17:35, hexagon is now up and running. Note that because time is reserved for benchmarking (the final part of the acceptance test), it will take some hours before jobs start, although the queue will accept new jobs.