Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.
Update 16:48: Tre is now up again. Jobs running on tre were lost (jobs on "to" and "en" were not affected). 24 cpuhours downtime (0.75*32).
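The downtime figure above can be reproduced with a short calculation. This is only a sketch: it assumes the outage ran from 16:03 to 16:48 (the two times given in this entry) and that the factor 32 is the number of CPUs on tre. The function name is made up for illustration.

```python
from datetime import datetime

def cpu_hours_lost(start, end, cpus):
    """CPU-hours lost for an outage between two HH:MM times on the same day."""
    fmt = "%H:%M"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Convert the outage length to hours, then scale by the CPU count.
    return elapsed.total_seconds() / 3600 * cpus

# 45 minutes (0.75 h) on an assumed 32 CPUs
print(cpu_hours_lost("16:03", "16:48", 32))  # 24.0
```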
NFS problem on tre, to, en
The Regatta nodes have NFS problems. NFS hangs from the Regatta nodes to jambu (/net/bcmhsm) and to /migrate (on "to"), as well as from "en" and "to" to "tre".
This seems to be an NFS client issue. I am working to resolve the problem.
15:45 Update: Everything is up again. Had to reboot "tre" and "to" as well as jambu. Jobs were lost (25% load at the time of reboot).
NB! Due to problems with the NFS export of /migrate we have unmounted /migrate on "tre" and "en". Do all copying to and from /migrate on "to" (as stated in /migrate/README). For copying to (and from) /migrate from fimm, use:
scp something.tar.gz to:/migrate/myusername/
(Note that Bjerknes has a symlink from /migrate/username to /net/bcmhsm/username, which is NFS-exported from jambu.)
Cpuhours downtime: approx. 384
Tape robot and /migrate filesystem down for tape drive upgrade
The tape robot is getting 2 new tape drives installed and will be unavailable from 09:45 to approx. 11:00 on 16. Feb.
Files in /migrate (and /net/bcmhsm) will be unavailable.
This entry will be updated with more information later.
11:20 Update: The upgrade is taking somewhat longer than planned.
12:45 Update: The upgrade is complete and the filesystem is back.
Software update on backup server (jambu)
The backup server has been updated with the latest OS maintenance release for AIX (5200-08) and the latest tape-device drivers. In addition, the TSM backup server was updated to version 5.2.7 and the TSM client to version 5.2.4. Downtime for restore and /net/bcmhsm (/migrate for Bjerknes) was only a few minutes during the reboot.
Rebalancing of /work and /home/fimm on fimm
The GPFS filesystems /work and /home/fimm on fimm have become unbalanced. The needed filesystem balancing was started last night and is still running. It will increase the IO load until it finishes, hopefully sometime later today.
Matlab upgrade
Matlab on fimm upgraded to version 7.1.0.183 (R14) Service Pack 3
Memory and disk problem on regatta node “en”
Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected by this.
Update 13:45: node "en" is now up again.
Problem with /work on fimm
10:15 There is some problem with /work on fimm. We are working on it.
13:45 Update: /work is now accessible. The frontend had to be restarted, and gpfs restarted on one of the NAS boxes. All the compute nodes were OK and thus no running jobs were affected by this.
Fire cluster upgraded to Rocks 4.1 OS
The fire cluster has been upgraded to Rocks 4.1 OS. It was unavailable from 13:00 to 17:00 during the upgrade (no users were using fire at the time, and no jobs were running).
Vim updated to version 6.4 on tre
Vim was updated to version 6.4 on tre (run "vim --version" to check which version you use).