Dear fimm cluster users,
We have scheduled downtime for the cluster fimm.bccs.uib.no on
1 April 2012 at 08:00. The cluster is reserved for this downtime as of today, 13:30.
The reservation will last 24 hours, until 08:00 on 2 April 2012.
We will enforce quotas on the home file system during the maintenance. We
ask all users to check their home file system usage (repquota.sh),
compare the actual usage with their quota (hardquota), and
remove files accordingly.
If you do not, your home file system will be "locked" and you won't
be able to do anything, even if you manage to log in.
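For users who want a rough cross-check next to repquota.sh, the comparison of usage against hard quota can be sketched in Python. This is only an illustration: the function names directory_usage and bytes_to_free are hypothetical, the output format of repquota.sh is not assumed, and summed file sizes can differ from the block usage the file system actually accounts for.

```python
import os

def directory_usage(path):
    """Sum the sizes in bytes of all non-directory entries below path.

    Symlinks are counted by their own size, not by the size of their target.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # entry vanished or is unreadable; skip it
    return total

def bytes_to_free(usage, hard_quota):
    """Return how many bytes must be removed to get under the hard quota."""
    return max(0, usage - hard_quota)
```

A call such as bytes_to_free(directory_usage(os.path.expanduser("~")), hard_quota) returns 0 when nothing needs to be removed; the hard_quota value in bytes has to be taken from the quota report.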
We will also perform hardware and software maintenance, which
includes upgrading firmware, reinstalling all compute nodes, and some
cable and switch changes.
All jobs that have not finished by 08:00 on 1 April 2012
* WILL BE KILLED *. We kindly ask you to save, remove, or otherwise take care of any
job that will not finish in time.
If you submit a job after the reservation is set (today, 13:30),
the system will check whether your job can finish before the downtime. If it cannot,
it will be queued until the maintenance is over; if it can,
it will simply run.
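The rule above can be illustrated with a small Python sketch. It is not the scheduler's actual implementation: the function name can_start_before_downtime is hypothetical, and it assumes the decision is based on the walltime requested for the job.

```python
from datetime import datetime, timedelta

def can_start_before_downtime(now, walltime, downtime_start):
    """Return True if a job started now would end no later than the downtime."""
    return now + walltime <= downtime_start

# Downtime starts on 1 April 2012 at 08:00; the reservation was set at 13:30 the day before.
downtime_start = datetime(2012, 4, 1, 8, 0)
submitted = datetime(2012, 3, 31, 13, 30)

short_job = can_start_before_downtime(submitted, timedelta(hours=12), downtime_start)
long_job = can_start_before_downtime(submitted, timedelta(hours=24), downtime_start)
```

Under these assumptions the 12-hour job would run right away, while the 24-hour job would stay queued until the maintenance is over. Requesting a realistic walltime therefore improves the chance that a job starts before the downtime.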
We will post any updates here.
Let us know if you have any further questions.
Update: Downtime extended until 18:00 on 2 April 2012.
Update 15:05/02: Maintenance is finished. Due to a network driver issue we have reserved some of the nodes for further maintenance. The reservation on the cluster is removed, but fewer nodes are available in the cluster.
Downtime
Fimm: filesystem glitch on login node
There was a temporary filesystem failure on the login node. It seems OK after a reboot.
Hexagon: system crash 25.12.2011
We had to restart Hexagon due to multiple SeaStar heartbeat failures in the c10 and c12 cabinets, probably related to power problems and the extreme weather we had.
This happened on 25.12.2011 23:30.
Fimm: maui down
Hi,
Update: 11:00
The Maui job scheduler on fimm has been taken down due to a problem.
We are working on resolving it and will keep you updated.
Update: 13:20
We restarted Maui and some other processes. Because of the restart, some of your jobs were killed. Please check your job status and resubmit if necessary.
We are sorry for the inconvenience.
Hexagon: thunderstorm power failure
Hexagon has shut down automatically due to a power blink during a thunderstorm. We are diagnosing.
Update: 22:00 Machine is up again.
Hexagon: cabinet power failure
Hexagon cabinets c1 and c8 experienced an Emergency Power Off failure on Dec 2 at 23:41. We are investigating.
Due to the cabinets involved (and the topology of the interconnect) we cannot simply start the machine without the two cabinets. We are looking into possibilities.
Update: 2011-12-05 12:45 Two cabinets cannot be started because of PDU failures. We have now started the machine without 2 cabinets (c6 and c8).
Hexagon: Part of /work has problems
There is an issue with part of the /work filesystem on Hexagon. We are investigating.
Update Tuesday 09:30, Still diagnosing the issue. No known fix-time as of now.
Update Tuesday 10:00, Machine goes down for maintenance.
Update Tuesday 13:30, Part of the filesystem has been checked with e2fsck.
Update Tuesday 14:00, Machine up again after maintenance.
Fimm network down
Due to a firmware update of the fimm.bccs.uib.no cluster core switches, we will take down both the internal and the external core switch for maintenance tomorrow from 13:00 to 15:00. The actual downtime may be shorter than this.
All running jobs will be killed.
We are sorry for the inconvenience and the short notice.
We will keep you updated.
10:30 Fimm login node is blocked.
16:00 Both the internal and the external switch are updated to the new firmware.
17:10 Maintenance is finished. The fimm cluster is operational.
Fimm: backend node crashed. queueing system not available.
The backend machine of Fimm crashed and has ongoing problems.
This means the queueing system and most other services are not available.
13.08.2011, 10:00, service is back.
Fimm: login node update. downtime until 12:00
Because of an urgent security update, the login node will be down for 1 hour.