Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.
Update 16:48: Tre is now up again. Jobs running on tre were lost (but not to and en). 24 cpuhours downtime (0.75*32).
Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.
Update 16:48: Tre is now up again. Jobs running on tre were lost (but not to and en). 24 cpuhours downtime (0.75*32).
Regatta nodes has nfs problems. NFS hangs from regatta to jambu (/net/bcmhsm) and to /migrate (on "to") - as well as from en,to to tre.
Seems like a nfs-client issue. I am working to resolve the problem.
15:45 Update: Everything is up again. Had to reboot "tre" and "to" as well as jambu. Jobs were lost (25% load at the time of reboot).
NB! Due to problems with NFS-export of /migrate we have unmounted /migrate on "tre" and "en". Do all copying to and from /migrate on to (as stated in /migrate/README). For copying to (and from) /migrate from fimm use
scp something.tar.gz to:/migrate/myusername/
(Note that Bjerknes has symlink from /migrate/username to /net/bcmhsm/username which is nfs-exported from jambu).
Cpuhours downtime: approx. 384
The taperobot is getting 2 new tapedrives installed and will be unavailable from 09:45 to approx. 11:00 16. Feb.
Files in /migrate (and /net/bcmhsm) will be unavailable.
This entry will be updated with more information later.
11:20 Update: The upgrade takes somewhat longer than planned.
12:45 Update: The upgrade is complete and filesystem back.
The backup server has been updated with latest OS-maintenance release for AIX (5200-08) and latest tape-device drivers. In addition TSM backup server was updated to version 5.2.7 and TSM client to version 5.2.4. Downtime for restore and /net/bcmhsm (/migrate for Bjerknes) was only a few minutes during reboot.
The GPFS filesystems /work and /home/fimm on fimm has become unbalanced. The needed filesystem-balancing was started last night and is still running. It will increase the IO load untill finished - hopefully sometime later today.
Matlab on fimm upgraded to version 7.1.0.183 (R14) Service Pack 3
Regatta node "en" had a memory fault at 0923 10.01.06. The node was rebooted. After reboot the node rejected one of the disks in /work filesystem. We are working to correct the problem. The other nodes are unaffected by this.
Update 13:45: node "en" is now up again.
10:15 There is some problem with /work on fimm. We are working on it.
13:45 Update: /work is now accessible. The frontend had to be restarted, and gpfs restarted on one of the NAS boxes. All the compute nodes were OK and thus no running jobs were affected by this.
Fire cluster upgraded to Rocks 4.1 OS. It was therefore unavailable from 13:00 to 17:00 (no users were currently using fire, and no jobs were running).
Vim was updated to version 6.4 on tre (run "vim --version" to check which version you use).