Hexagon has been down since 17:55 due to a power spike caused by a thunderstorm. We are diagnosing the problem and restarting the machine.
Update 20:25: Machine is back up after disabling a failed module.
Downtime
Hexagon: power down crash due to power-spike
Hexagon crashed and powered off at 08:10 due to a lightning strike and power spike.
Update 10:00: Cooling is also affected.
Update 12:47: Machine is up.
We used this downtime to carry out maintenance that had been planned for a later date.
Fimm file system crashed
The Fimm internal network crashed yesterday at around 04:00. We also lost all file systems, and all running jobs crashed.
We are sorry for the inconvenience.
Hexagon: system crash
Hexagon crashed due to a power spike related to a thunderstorm, which left cabinets in a fault state.
Update 15:12: System is up after 1 hour of downtime.
Fimm maintenance 30th May 2011
Dear Fimm cluster users,
The Fimm cluster will undergo maintenance from 08:00 to 16:00 on Monday, 30th May 2011.
The following will be performed:
* Add an extra login node for Fimm
* Reinstall the cluster compute nodes
* Cable rearrangement
* Firmware update
During that time you will be able to access the login node to perform basic operations, but you will not be able to submit jobs or check the queue status. Some of the file systems will be inaccessible or unstable during the maintenance.
The entire cluster is reserved for maintenance. All running jobs that cannot finish before the maintenance starts will be killed. Users have to resubmit all killed jobs.
All submitted jobs that would not be able to finish before the maintenance starts will stay queued and will start running when the maintenance is over.
If you have any further questions, please contact us at
hpc-support@hpc.uib.no
We are sorry for the inconvenience.
Support team.
Fimm queueing system crashed after reboot
The master node for fimm.bccs.uib.no crashed this morning at around 04:00 because it ran out of memory. Since the reboot this morning, the queuing system, which runs on the master node, has been down.
We are working on it. In the meantime, you cannot submit jobs or monitor anything related to the queuing system.
Update 14:20:
The Fimm queuing system is back online. Sorry for the inconvenience.
Hexagon: crash due to HSN Seastar chip failure
Hexagon had a hardware failure in one of the HSN Seastar chips, which caused the machine to crash.
The machine was back up at 14:30, after 60 minutes of downtime.
Fimm: work file system crashed
The work file system crashed due to a disk failure. We are working on it and will keep you updated. All jobs running on the work file system were killed.
Update 18:00:
We are still working on recovering the work file system. There are some issues with the backbone storage system. Estimated downtime is until lunchtime tomorrow.
Sorry for the inconvenience.
Update 18:30, 21/01/2011:
We have to inform you that the work file system on fimm.bccs.uib.no crashed
yesterday (20/01/2011) at 14:50, and we lost all data on it. After the file
system crash we did our best to rescue it, but we were not able to get
anything back.
Since /work is designed to be a *temporary* file system to increase the
efficiency of running jobs, it was not backed up. Therefore, all your data
on /work is unfortunately lost.
Sorry for the inconvenience.
We have created a new work file system, and we will create a directory for you
upon request. You can send mail to support-uib@notur.no or contact me
directly by phone at (+47) 55 58 40 43 or by mail.
Saerda Halifu
Hexagon: crash due to power issue (EPO) on cab 3
Hexagon crashed at 20:20 due to an EPO (emergency power off) problem on cabinet 3. We are running diagnostics and will then restart the machine.
Update 22:50: Machine is back up.
Hexagon: failed seastar in one module
Hexagon has a problem in the high-speed network. We are working to fix it. All running jobs failed.
Update 22:55: Machine is back online.