Reboot of fimm frontend due to excessive memory usage by an interactive process. No jobs affected. Down: 10min.
Author Archives: lsz075
New NOTUR CPU-quota
The CPU quota for the period 2006-2 has now been activated on tre, fire (and fimm). Send a request to support-uib@notur.no if your quota access is incorrect. Please note that, according to prior agreements, the projects nn2343k, nn2701k, nn2980k and nn4648k on fire have been transferred to fimm with a cpu-factor of 4:1. Other projects on fire need to send a request to move any quota.
Reboot of tre,to,en
Due to the NFS hang (see entry for Sep. 16), all the nodes (tre, to, en) had to be rebooted, and all running jobs were lost.
Please check all jobs, and in particular any jobs that should have copied data to or from /migrate or /net/bcmhsm!!
TO: down from 07:55 to 08:20
TRE: down from 08:35 to 08:55
EN: down from 09:15 to 09:35
Total downtime: 34 cpuhours.
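The cpuhours total can be cross-checked with a short calculation. The following is only a sketch: it assumes 32 CPUs per node (a figure inferred from the "30 min / 16 cpuhours" reported for node "to" in another entry, not stated here) and assumes EN came back up at 09:35. The function name and constant are illustrative.

```python
from datetime import datetime

# Assumption: 32 CPUs per node, inferred from "30 min / 16 cpuhours"
# reported for node "to"; the log does not state the CPU count.
CPUS_PER_NODE = 32


def cpu_hours(windows, cpus_per_node=CPUS_PER_NODE):
    """Sum lost CPU-hours over (start, end) downtime windows given as HH:MM."""
    total_minutes = 0.0
    for start, end in windows:
        t0 = datetime.strptime(start, "%H:%M")
        t1 = datetime.strptime(end, "%H:%M")
        total_minutes += (t1 - t0).total_seconds() / 60
    return total_minutes * cpus_per_node / 60


# Downtime windows for TO, TRE and EN (EN end time assumed to be 09:35).
windows = [("07:55", "08:20"), ("08:35", "08:55"), ("09:15", "09:35")]
print(cpu_hours(windows))
```

Under these assumptions the three windows add up to 65 node-minutes, which gives just under 35 cpuhours and matches the reported total of 34 when truncated.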
NFS problem accessing /net/bcmhsm on tre,to,en
The HSM filesystem for Bjerknes, /net/bcmhsm, mounted from jambu (NB: symlinked for some from /migrate), is currently not accessible on tre, to, en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the NFS hang. For urgent access to files, contact support-uib@notur.no and we will get the files from backup.
Reboot of fimm frontend
Fimm frontend had a hang due to excessive memory-swapping. Rebooted (downtime 5 minutes).
fimm maintenance upgrade to Rocks 4.1 (CentOS 4.3)
fimm will be upgraded to Rocks 4.1 (CentOS Linux 4.3).
More updates to follow.
Update 11:30: fimm is now back online. We had some trouble with the cpu-accounting "qbank" program.
/net/bjerknes1 filesystem on regattas has hardware problems
We are having some hardware problems with the disks on /net/bjerknes1. We are working on a fix.
Update: we have rebooted the server again to (temporarily) clear the issue while we are looking into the real cause.
Scheduler / passwd problems on fimm
The scheduler on fimm has some problems at the moment. The passwd distribution system to the nodes does not work properly. Some jobs will fail to start, or fail to stop properly after starting. We are working on fixing it.
Update: The problem is fixed now. Some of the nodes will be re-installed (fixed) after the current running jobs finish.
Reboot of node “to” of the regattas
Due to a security upgrade, node “to” will be rebooted. No jobs will be affected.
Update 09:25: machine up again. Downtime: 30 min. / 16 cpuhours
Totalview upgrade on fimm
Totalview has been upgraded from 7.1 to 7.2 on fimm.
