09:20 The GPFS daemon hung on node "en" of the regattas. The node had to be rebooted, so the jobs running on "en" were lost. The login node "tre" was unreachable from 09:20 to 10:00 (no jobs lost).
Update: 11:00 node "en" is up again.
Downtime
Reboot of fimm frontend
Reboot of fimm frontend to clear filesystem hang.
Hang of fimm frontend
The fimm frontend had a hang that was discovered at 00:10 Sunday 12.11. The actual time the frontend went down is uncertain, but it was probably sometime on Saturday evening. The cause seems to be an extreme load from a large number of httpd processes (reason unknown). The frontend was rebooted and, at the same time, given some hardware and software maintenance, including a kernel upgrade that had been planned for a later time.
No jobs were affected, frontend up again at 13:00 Sunday 12.11.
Reboot of fimm frontend
Reboot of fimm frontend due to excessive memory usage by an interactive process. No jobs affected. Down: 10min.
Reboot of tre,to,en
Due to the NFS hang (see entry for Sep. 16), all the nodes (tre, to, en) had to be rebooted, and all running jobs were lost.
Please check all jobs, and in particular any jobs that should have copied data to or from /migrate or /net/bcmhsm!!
TO: down from 07:55 to 08:20
TRE: down from 08:35 to 08:55
EN: down from 09:15 to 09:35
Total downtime: 34 cpuhours.
NFS problem accessing /net/bcmhsm on tre,to,en
The HSM filesystem for Bjerknes, /net/bcmhsm (mounted from jambu; NB: symlinked for some users from /migrate), is currently not accessible on tre, to and en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the hang. For urgent access to files, contact support-uib@notur.no and we will retrieve the files from backup.
Reboot of fimm frontend
Fimm frontend had a hang due to excessive memory-swapping. Rebooted (downtime 5 minutes).
fimm maintenance upgrade to Rocks 4.1 (CentOS 4.3)
fimm will be upgraded to Rocks 4.1 (CentOS Linux 4.3).
More updates to follow.
Update 11:30: fimm is now back online. We had some trouble with the cpu-accounting "qbank" program.
Scheduler / passwd problems on fimm
The scheduler on fimm has some problems at the moment. The passwd distribution system to the nodes does not work properly. Some jobs will fail to start, or fail to stop properly after starting. We are working on a fix.
Update: The problem is fixed now. Some of the nodes will be re-installed (fixed) after the current running jobs finish.
Reboot of node “to” of the regattas
Due to a security upgrade, node "to" will be rebooted. No jobs will be affected.
Update 09:25: machine up again. Downtime: 30 min. / 16 cpuhours