This morning the /fimm filesystem crashed on fimm.hpc.uib.no. As a result, the /fimm and /work filesystems were inaccessible to users and the fimm login node hung.
We were able to bring it back online quickly, and we are investigating the cause of the problem.
All jobs that were running during the crash were killed.
We are sorry for the inconvenience.
Downtime
Hexagon: rebooted because of important security update
We will need to carry out an emergency reboot after 13:00 today. Please save your work and log out from Hexagon by 13:00.
More information to come later.
Update 13:59 2016-11-04: Access to the system has been stopped and jobs have been terminated. Please accept our apologies for the inconvenience caused by the system reboot.
Update 16:05 2016-11-04: Security patches have been applied. Hexagon is back online again.
Hexagon: high speed network down
Hexagon is down because the high speed network went down. We are working to get the issue fixed and boot the machine.
Update: 2016-09-28 22:45 System is up again.
Hexagon: login1 rebooted
Login node login1 ran out of memory and had to be rebooted.
The following jobs have been affected: 1890515, 1890671, 1891136, 1891264, 1891269, 1891328, 1891105, 1891385.
Grunch server will be rebooted Friday 12:00 noon
The Grunch server will be rebooted for a kernel update and some library updates. All users are advised to log out before 12:00 noon.
Fimm Lustre file system crash
We are currently having problems with the /fimm Lustre file system. We are working to resolve them and will keep you updated.
During this time the queue system and /fimm will not be stable.
Thank you for your understanding, and sorry for the inconvenience.
Hexagon: rebooted
Both metadata servers and all OSSes serving the /work filesystem crashed.
We had to stop the machine and power cycle Hexagon.
Hexagon: reboot needed
All OSTs for the /work filesystem are in read-only mode and we need to reboot Hexagon. We will come back with more information later.
Update:
15:20 25-04-2016 OST 8 has corrupted data and was marked read-only by the system. There are 379 inodes containing multiply-claimed blocks. We are trying to recover from this and identify the corrupted files. Owners of identified corrupted files will be notified.
If you have corrupted data on /work, please contact us at support-uib@notur.no.
15:45 25-04-2016 Users were logged out and access was closed so that maintenance could be performed on the system.
16:35 26-04-2016 Corrupted files were identified and the /work filesystem is usable again. Hexagon was rebooted and access has been reopened.
We will run further checks on the /work filesystem while keeping it online. Once this last check is finished, the owners of corrupted files will be notified, as mentioned earlier.
Grunch was rebooted
Grunch had issues with its network connection. It was rebooted and should be available very soon.
Hexagon: down, power blink
Hexagon went down due to a power blink.
Update: 2015-12-10 20:47 Machine is up again.