Downtime

The fimm frontend hung; this was discovered at 00:10 on Sunday 12.11. The exact time the frontend went down is uncertain, but it could have been some time on Saturday evening. The cause seems to be extreme load from a large number of httpd processes (reason unknown). The frontend was rebooted and at the same time given some hardware and software maintenance, including a kernel upgrade that had been planned for a later time.

No jobs were affected. The frontend was up again at 13:00 on Sunday 12.11.

Due to the NFS hang (see entry for Sep. 16), all the nodes (tre, to, en) had to be rebooted. All running jobs were lost.

Please check all jobs, in particular any jobs that should have copied data to or from /migrate or /net/bcmhsm!


TO: down from 07:55 to 08:20
TRE: down from 08:35 to 08:55
EN: down from 09:15 to 08:35

Total downtime: 34 cpuhours.

The HSM filesystem for Bjerknes, /net/bcmhsm (mounted from jambu; NB: symlinked for some from /migrate), is currently not accessible on tre, to and en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the hang. For urgent access to files, contact support-uib@notur.no and we will retrieve the files from backup.

The scheduler on fimm has some problems at the moment. The passwd distribution system to the nodes does not work properly. Some jobs will fail to start, or fail to stop properly after starting. We are working on a fix.

Update: The problem is fixed now. Some of the nodes will be re-installed (fixed) after the currently running jobs finish.