Hexagon has a hang on some of the nodes including the scheduler node.
We are investigating.
Update 11:30: The machine has been rebooted and is up again.
Downtime
Hexagon: login5 rebooted
The Hexagon login5 node had a hang and was rebooted. Jobs that were started from that node need to be resubmitted.
Hexagon: HSN problems
We are experiencing High Speed Network problems on Hexagon. We are working to fix them ASAP.
Update: 17:15 Machine is up.
Hexagon: login2 rebooted
Login2 on Hexagon ran out of memory due to a bad user script. All jobs running on the node crashed and need to be restarted.
Please make sure that you use "aprun" to run your programs, so that they do not impact the login node.
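As an illustration of the "aprun" advice, here is a minimal sketch of a batch job script. The job name, core count, walltime, and program name (./my_program) are placeholders, and the "mppwidth" resource request is an assumption about the local PBS/Torque setup; check the Hexagon user documentation for the exact options required (for example an account option).

```shell
#!/bin/bash
# Hypothetical example job script; all values below are placeholders.
#PBS -N example_job
#PBS -l mppwidth=32          # assumed resource syntax for requesting 32 cores
#PBS -l walltime=00:10:00

# Start from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Launch the program on the compute nodes through aprun.
# Running ./my_program without aprun would execute it on the login node instead.
aprun -n 32 ./my_program
```

The point is that the program is started through aprun, so its memory and CPU load lands on the compute nodes rather than on the shared login node.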
Hexagon: power spike in building causes reboot
Hexagon needs to be rebooted after a power-spike left too many nodes offline.
Update 18:40, machine is now up again after a mezzanine replacement.
Fimm cluster down time 11th September
Dear fimm users,
We will update Maui and Torque on fimm.bccs.uib.no and will also upgrade the qbank accounting system to Gold. For that reason fimm.bccs.uib.no will be down for 8-9 hours on 11th of September 2012. The downtime starts at 08:00 in the morning.
Running jobs that cannot finish by 08:00 on 11th September will be killed. Since the cluster is reserved for maintenance, queued jobs that would not be able to finish by that time will not start.
We will post progress updates on this page.
Updates:
Maintenance is extended to 12:00 on 12 September 2012 due to software problems.
Updates: 10:55 12/09/2012
We have just completed the upgrade on fimm; the resource manager and scheduler are now updated on fimm.bccs.uib.no.
We are running:
Maui 3.3.1
Torque 4.1.0
Hexagon: thunderstorm causes reboot
Hexagon needs a reboot after a thunderstorm caused a power blink in the building's power supply.
Update 21:10: Hexagon is up again without cabinet 8 (needs manual intervention).
Hexagon: reboot due to high speed network problems
Hexagon is being restarted due to high speed network problems.
Update 20:00: Hexagon is now up again without cabinet c12. We will do maintenance on this cabinet soon, likely next week.
Hexagon: cabinet power issue
Hexagon lost one cabinet on May 30th because of a power failure; thanks to its high resiliency it continued to run. On June 7th at about 08:00 another cabinet also had a power failure. Currently the machine continues to operate, but two login nodes are down, causing some connection attempts to fail (depending on DNS round-robin), and some of the nodes have communication problems. We are investigating possible solutions.
Update 12:30: We need to restart the machine to be able to bring it back up.
Update 13:00: Machine is now up again.
Hexagon: system crash
Hexagon is down due to a power issue. We are investigating.
Update 13:00: Hexagon is back online. We found that a panel breaker had tripped, causing loss of power and a crash of the system. All jobs that were running need to be resubmitted.