Hexagon has a hang on some of the nodes including the scheduler node.
We are investigating.
Update 11:30: The machine has been rebooted and is up again.
The Hexagon login5 node had a hang and was rebooted. Jobs that were started from that node need to be resubmitted.
We are experiencing High Speed Network problems on Hexagon. We are working to fix them ASAP.
Update 17:15: The machine is up.
Login2 on Hexagon ran out of memory because of a faulty user script. All jobs running on the node crashed and need to be restarted.
Please make sure that you use "aprun" to run your programs, so that they do not affect the login node.
Hexagon needs to be rebooted after a power-spike left too many nodes offline.
Update 18:40, machine is now up again after a mezzanine replacement.
Dear fimm users,
We will update Maui and Torque on fimm.bccs.uib.no and will also upgrade the Qbank accounting system to Gold. For that reason, fimm.bccs.uib.no will be down for 8 to 9 hours on 11 September 2012. The downtime starts at 08:00 in the morning.
Jobs that are already running but cannot finish by 08:00 on 11 September will be killed. Since the cluster is reserved for maintenance, jobs that cannot finish by that time will not be started.
We will post progress updates on this page.
Updates:
The maintenance is extended until 12:00 on 12 September 2012 due to software problems.
Updates: 10:55 12/09/2012
We have just completed the upgrade on fimm. The resource manager and scheduler have been updated on fimm.bccs.uib.no.
We are now running:
Maui 3.3.1
Torque 4.1.0
Hexagon needs a reboot after a thunderstorm caused a power blink in the building's power supply.
Update 21:10: Hexagon is up again without cabinet 8 (needs manual intervention).
Hexagon is being restarted due to high speed network problems.
Update 20:00: Hexagon is now up again without cabinet c12. We will do maintenance on this cabinet soon, likely next week.
Hexagon lost one cabinet on May 30th because of a power failure. Thanks to the system's high resiliency, it continued to run. On June 7th at about 08:00, another cabinet also had a power failure. The machine is still operating, but two login nodes are down, which causes some connection attempts to fail (depending on which node the round-robin DNS returns), and some of the nodes have communication problems. We are investigating possible solutions.
Update 12:30: We need to restart the machine to be able to bring it back up.
Update 13:00: Machine is now up again.
Hexagon is down due to a power issue. We are investigating.
Update 13:00: Hexagon is back online. We found that a panel breaker had tripped, causing a loss of power and a crash of the system. All jobs that were running need to be resubmitted.