Downtime

Hexagon: crash due to power-spike

lsz075 • July 24, 2011

Hexagon has been down since 17:55 due to a power spike caused by a thunderstorm. We are diagnosing the problem and restarting the machine.

Update 20:25: Machine is back up after we disabled a failed module.

Hexagon: power down crash due to power-spike

lsz075 • June 28, 2011

Hexagon crashed and powered off at 08:10 due to a lightning strike and power spike.

Update 10:00: Cooling is also affected.
Update 12:47: Machine is up.

We used this downtime to carry out maintenance that had been planned for a later date.

Fimm file system crashed

lsz075 • June 21, 2011

The Fimm internal network crashed yesterday at around 4:00. We also lost all file systems, and all running jobs crashed.

We are sorry for the inconvenience.

Hexagon: system crash

lsz075 • June 19, 2011

Hexagon crashed due to a power spike related to a thunderstorm which left cabinets in a fault state.

Update 15:12: System is up after 1 hour of downtime.

Fimm maintenance 30th May 2011

lsz075 • May 23, 2011

Dear Fimm cluster users,

The Fimm cluster will undergo maintenance from 08:00 to 16:00 on Monday, 30th May 2011.

The following will be performed:

* Add an extra login node for Fimm
* Reinstall the cluster compute nodes
* Cable rearrangement
* Firmware update

During that time you will be able to access the login node to perform basic operations, but you will not be able to submit jobs or check queue status. Some of the file systems will be inaccessible or unstable during the maintenance.

The entire cluster is reserved for maintenance. All running jobs that cannot finish by the time the maintenance starts will be killed. Users will have to resubmit all killed jobs.

All submitted jobs that would not be able to finish before the maintenance starts will remain queued until the end of the maintenance and will start running when it is over.

If you have any further questions, please contact us at
hpc-support@hpc.uib.no

We are sorry for the inconvenience.

Support team.

Fimm queueing system crashed after reboot

lsz075 • March 21, 2011

The master node for fimm.bccs.uib.no crashed this morning at around 4:00 because it ran out of memory. After the reboot this morning, the master node, on which the queuing system runs, crashed again.

We are working on it. In the meantime, you cannot submit jobs or monitor anything related to the queuing system.

Update 14:20:
The Fimm queuing system is back online. Sorry for the inconvenience.

Hexagon: crash due to HSN Seastar chip failure

lsz075 • February 6, 2011

Hexagon had a hardware failure in one of the HSN Seastar chips, which caused the machine to crash.

The machine was back up at 14:30, after 60 minutes of downtime.

Fimm: work file system crashed

lsz075 • January 20, 2011

The work file system crashed due to a disk failure. We are working on it and will keep you updated. All jobs running on the work file system were killed.

Update 18:00:
We are still working on restoring the work file system. There are some issues with the backbone storage system. The estimated downtime lasts until tomorrow at lunchtime.

Sorry for the inconvenience.

Update 18:30, 21/01/2011:

We have to inform you that the work file system on fimm.bccs.uib.no crashed
yesterday (20/01/2011) at 14:50, and we lost all data on it. After the file
system crash we tried our best to rescue it, but we were not able to get
anything back.

Since /work is designed to be a *temporary* file system for increasing the
efficiency of running jobs, it was not backed up. Therefore, all your data
on /work is unfortunately lost.

Sorry for the inconvenience.

We have created a new work file system, and we will create a directory for you
upon request. You can send mail to support-uib@notur.no or contact me,
Saerda Halifu, directly by phone at (+47) 55 58 40 43 or by mail.

Hexagon: crash due to power issue (EPO) on cab 3

lsz075 • January 10, 2011

Hexagon crashed at 20:20 due to an EPO (emergency power off) problem on cabinet 3. We are running diagnostics and will then restart the machine.

Update 22:50: Machine is back up.

Hexagon: failed seastar in one module

lsz075 • December 7, 2010

Hexagon has a problem in its high-speed network. We are working to fix it. All running jobs failed.

Update 22:55: Machine is back online.

HPC Syslog

Log of changes and events on UiB's HPC systems