Downtime

Fimm.bccs.uib.no maintenance

lsz075 • March 21, 2012

Dear fimm cluster user,

We have scheduled downtime for the cluster fimm.bccs.uib.no on the first
of April at 08:00 am. The cluster is reserved for this downtime as of
today at 13:30.

The reservation will last 24 hours, until 08:00 on 04/02/2012 (April 2).

We will enforce quotas on the home file system during the maintenance.
We ask all users to check their home file system usage (repquota.sh),
compare their actual usage against their quota (hardquota), and
remove files accordingly.

If you do not, your home file system will be "locked" and you won't
be able to do anything, even if you manage to log in.
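
As a rough illustration of the comparison described above, the sketch below computes how much data would need to be removed to get back under a hard quota. The function name and the numbers are hypothetical; the output format of repquota.sh is not shown here, so the values are entered by hand, and treating usage equal to the quota as acceptable is an assumption.

```python
def kb_to_free(used_kb: int, hard_quota_kb: int) -> int:
    """Return how many KB must be removed to get under the hard quota.

    Returns 0 when usage is already at or below the quota (assumed to be fine).
    """
    return max(0, used_kb - hard_quota_kb)


# Hypothetical values, not real repquota.sh output.
used_kb = 12_500_000
hard_quota_kb = 10_000_000

excess = kb_to_free(used_kb, hard_quota_kb)
if excess > 0:
    print(f"Over hard quota: remove at least {excess} KB")
else:
    print("Within hard quota")
```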

We will also perform hardware and software maintenance, which
includes upgrading firmware, reinstalling all compute nodes, and some
cable and switch changes.

All jobs that have not finished by 08:00 am on 04/01/2012
* WILL BE KILLED *. We kindly ask you to save, remove, or otherwise take
care of your job if it will not finish in time.

If you submit a job after the reservation (set today at 13:30), the
system will check whether your job can finish before the downtime. If it
cannot, it will be queued until the maintenance is over; if it can,
it will simply run.
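
The check described above can be sketched as follows. This is only an illustration of the rule, not the scheduler's actual implementation; the function name is hypothetical, and it assumes the decision is based on the walltime requested for the job.

```python
from datetime import datetime, timedelta


def runs_before_downtime(now: datetime, walltime: timedelta,
                         downtime_start: datetime) -> bool:
    """Return True if a job started now would finish before the downtime."""
    return now + walltime <= downtime_start


# Dates taken from this announcement: reservation set March 21 at 13:30,
# downtime starting April 1 at 08:00.
downtime_start = datetime(2012, 4, 1, 8, 0)
now = datetime(2012, 3, 21, 13, 30)

print(runs_before_downtime(now, timedelta(days=7), downtime_start))   # True: job can run
print(runs_before_downtime(now, timedelta(days=14), downtime_start))  # False: job stays queued
```

In practice this means that requesting a shorter, realistic walltime makes it more likely that a job can start before the maintenance.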

We will post any updates here.

Let us know if you have any further questions.

Update: Downtime extended until 18:00 on 02/04/2012 (April 2).

Update 15:05/02: Maintenance is finished. Due to a network driver issue, we have reserved some of the nodes for further maintenance. The reservation on the cluster has been removed, but fewer nodes are available in the cluster.

Fimm: filesystem glitch on login node

lsz075 • January 10, 2012

There was a temporary filesystem failure on the login node. Seems OK after reboot.

Hexagon: system crash 25.12.2011

lsz075 • January 3, 2012

We had to restart hexagon due to multiple seastar heartbeat failures in the c10 and c12 cabinets. This was probably related to power problems and the extreme weather we had.
This happened on 25.12.2011 23:30.

Fimm: maui down

lsz075 • December 20, 2011

Hi,

Update: 11:00

The Maui job scheduler on fimm has been taken down due to a problem.
We are working on resolving it and will keep you updated.

Update: 13:20

We restarted maui and some other processes. Due to the restart, some of your jobs were killed. Please check your job status and resubmit if necessary.

We are sorry for the inconvenience.

Hexagon: thunderstorm power failure

lsz075 • December 12, 2011

Hexagon has shut down automatically due to a power blink caused by a thunderstorm. We are diagnosing.

Update: 22:00 Machine is up again.

Hexagon: cabinet power failure

lsz075 • December 3, 2011

Hexagon cabinets c1 and c8 experienced an Emergency Power Off failure on Dec 2 at 23:41. We are investigating.

Due to the cabinets involved (and the topology of the interconnect), we cannot simply start the machine without the two cabinets. We are looking into possibilities.

Update: 2011-12-05 12:45 Two cabinets cannot be started because of PDU failures. We have now started the machine without the two cabinets (c6 and c8).

Hexagon: Part of /work has problems

lsz075 • October 17, 2011

There is an issue with part of the /work filesystem on Hexagon. We are investigating.

Update Tuesday 09:30, Still diagnosing the issue. No known fix-time as of now.

Update Tuesday 10:00, Machine goes down for maintenance.

Update Tuesday 13:30, Part of the filesystem has been checked with e2fsck.

Update Tuesday 14:00, Machine up again after maintenance.

Fimm network down

lsz075 • September 29, 2011

Due to a core switch firmware update on the fimm.bccs.uib.no cluster, we will take down both the internal and external core switches for maintenance tomorrow from 13:00 to 15:00. The actual downtime may be shorter than this.

All running jobs will be killed.

We are sorry for the inconvenience and the short notice.

We will keep you updated.

10:30 Fimm login node is blocked.

16:00 Both the internal and external switches are updated to the new firmware.

17:10 Maintenance is finished. The fimm cluster is operational.

Fimm: backend node crashed. queueing system not available.

lsz075 • August 12, 2011

The backend machine of Fimm crashed and has ongoing problems.
This means the queueing system and most other services are not available.

13.08.2011, 10:00, service is back.

Fimm: login node update. downtime until 12:00

lsz075 • August 11, 2011

Because of an urgent security update the login node will be down for 1 hour.

HPC Syslog

Log of changes and events on UiB's HPC systems