Downtime

Dear fimm cluster users,

We have scheduled downtime for the cluster fimm.bccs.uib.no on the first
of April at 08:00. The cluster has been reserved for this downtime as of today at 13:30.

The reservation will last 24 hours, until 08:00 on 04/02/2012 (2 April).

We will enforce quotas on the home file system during the maintenance.
We ask all users to check their home file system usage (repquota.sh),
compare the actual usage against the quota (hardquota), and
remove files accordingly.

If you do not, your home file system will be "locked" and you will not
be able to do anything, even if you manage to log in.


We will also perform hardware and software maintenance, which
includes upgrading firmware, reinstalling all compute nodes, and some
cable and switch changes.

All jobs that have not finished by 08:00 on 04/01/2012 (1 April)
* WILL BE KILLED *. We kindly ask you to save, remove, or otherwise take care of your
jobs if they will not finish in time.

If you submit a job after the reservation was set (today at 13:30),
the system will check whether your job can finish before the downtime. If it
cannot, it will be queued until the maintenance is over; if it can,
it will simply run.
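
The check described above can be illustrated with a small sketch. This is not the scheduler's actual code; it only assumes that the decision is based on the walltime requested at submission and that the job could start immediately. The function name and the submission time are hypothetical; the downtime start is taken from this notice.

```python
from datetime import datetime, timedelta

def can_finish_before_downtime(start_time, requested_walltime, downtime_start):
    """Return True if a job starting at start_time with the requested
    walltime would end no later than downtime_start (hypothetical helper)."""
    return start_time + requested_walltime <= downtime_start

# Start of the maintenance window, as stated in this notice.
downtime_start = datetime(2012, 4, 1, 8, 0)

# Hypothetical submission time, chosen only for illustration.
submitted = datetime(2012, 3, 31, 20, 0)

# A 10-hour job would end at 06:00 and could run right away.
short_job_runs = can_finish_before_downtime(submitted, timedelta(hours=10), downtime_start)

# A 24-hour job would overlap the downtime and would stay queued.
long_job_runs = can_finish_before_downtime(submitted, timedelta(hours=24), downtime_start)

print(short_job_runs, long_job_runs)
```

In practice this means that requesting a realistic (shorter) walltime increases the chance that a job is started before the maintenance.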

We will post any updates here.


Let us know if you have any further questions.

Update: Downtime extended until 18:00 on 02/04/2012 (2 April).

Update 15:05/02: Maintenance is finished. Due to a network driver issue, we have reserved some of the nodes for further maintenance. The reservation on the cluster has been removed, but fewer nodes are available.


Hi,

Update: 11:00

The Maui job scheduler on fimm has been taken down due to a problem.
We are working on resolving it and will keep you updated.

Update: 13:20

We have restarted Maui and some other processes. Because of the restart, some of your jobs were killed. Please check your job status and resubmit if necessary.

We are sorry for the inconvenience.

Hexagon cabinets c1 and c8 experienced an Emergency Power Off failure on Dec 2 at 23:41. We are investigating.

Because of the cabinets involved (and the topology of the interconnect), we cannot simply start the machine without the two cabinets. We are looking into possibilities.

Update: 2011-12-05 12:45 Two cabinets cannot be started because of PDU failures. We have now started the machine without two cabinets (c6 and c8).

There is an issue with part of the /work filesystem on Hexagon. We are investigating.

Update Tuesday 09:30, Still diagnosing the issue. No known fix-time as of now.

Update Tuesday 10:00, Machine goes down for maintenance.

Update Tuesday 13:30, Part of the filesystem has been checked with e2fsck.

Update Tuesday 14:00, Machine up again after maintenance.

Due to a core switch firmware update on the fimm.bccs.uib.no cluster, we will take down both the internal and external core switches for maintenance tomorrow from 13:00 to 15:00. The actual downtime may be shorter than this.

All running jobs will be killed.

We are sorry for the inconvenience and the short notice.

We will keep you updated.

10:30 Fimm login node is blocked.

16:00 Both the internal and external switches have been updated to the new firmware.

17:10 Maintenance is finished. The fimm cluster is operational.