Memory-hang on TRE

lsz075 • March 15, 2006

Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.

Update 16:48: Tre is now up again. Jobs running on tre were lost (jobs on "to" and "en" were not affected). Cpuhours downtime: 24 (0.75*32).
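The downtime figure above can be reproduced with a small shell sketch. It assumes that 0.75 is the outage length in hours (16:03 to 16:48 is 45 minutes) and that 32 is the number of CPUs on tre; the function name cpuhours_lost is hypothetical and only used for illustration.

```shell
# Hypothetical helper: CPU-hours lost = outage length in hours * number of CPUs.
cpuhours_lost() {
    # $1 = downtime in hours, $2 = number of CPUs on the node (assumed values)
    awk -v hours="$1" -v cpus="$2" 'BEGIN { printf "%g\n", hours * cpus }'
}

# 45 minutes of downtime on a node assumed to have 32 CPUs.
cpuhours_lost 0.75 32   # prints 24
```

The same helper could be applied to other outages in this log, provided the outage length and the CPU count of the affected nodes are known.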

NFS problem on tre,to,en

lsz075 • February 19, 2006

The Regatta nodes have NFS problems. NFS hangs from the Regatta nodes to jambu (/net/bcmhsm) and to /migrate (on "to"), as well as from "en" and "to" to "tre".
This seems to be an NFS-client issue. I am working to resolve the problem.

15:45 Update: Everything is up again. Had to reboot "tre" and "to" as well as jambu. Jobs were lost (25% load at the time of reboot).

NB! Due to problems with the NFS export of /migrate we have unmounted /migrate on "tre" and "en". Do all copying to and from /migrate on "to" (as stated in /migrate/README). For copying to (and from) /migrate from fimm, use:
scp something.tar.gz to:/migrate/myusername/

(Note that Bjerknes has a symlink from /migrate/username to /net/bcmhsm/username, which is NFS-exported from jambu.)

Cpuhours downtime: approx. 384

Tape robot and /migrate filesystem down for tapedrive upgrade

lsz075 • February 16, 2006

The tape robot is getting 2 new tape drives installed and will be unavailable from 09:45 to approx. 11:00 on 16 Feb.
Files in /migrate (and /net/bcmhsm) will be unavailable.
This entry will be updated with more information later.

11:20 Update: The upgrade takes somewhat longer than planned.

12:45 Update: The upgrade is complete and the filesystem is back.

Software update on backup server (jambu)

lsz075 • February 13, 2006

The backup server has been updated with the latest OS maintenance release for AIX (5200-08) and the latest tape-device drivers. In addition, the TSM backup server was updated to version 5.2.7 and the TSM client to version 5.2.4. Downtime for restore and /net/bcmhsm (/migrate for Bjerknes) was only a few minutes during the reboot.

Rebalancing of /work and /home/fimm on fimm

lsz075 • February 2, 2006

The GPFS filesystems /work and /home/fimm on fimm have become unbalanced. The necessary rebalancing was started last night and is still running. It will increase the IO load until it is finished, hopefully sometime later today.

Matlab upgrade

lsz075 • January 18, 2006

Matlab on fimm has been upgraded to version 7.1.0.183 (R14) Service Pack 3.

Memory and disk problem on regatta node “en”

lsz075 • January 10, 2006

Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected by this.

Update 13:45: node "en" is now up again.

Problem with /work on fimm

lsz075 • January 10, 2006

10:15 There is some problem with /work on fimm. We are working on it.

13:45 Update: /work is now accessible. The frontend had to be restarted, and GPFS had to be restarted on one of the NAS boxes. All the compute nodes were OK, so no running jobs were affected.

Fire cluster upgraded to Rocks 4.1 OS

lsz075 • January 5, 2006

The Fire cluster has been upgraded to Rocks 4.1 OS. It was therefore unavailable from 13:00 to 17:00 (no users were using fire at the time, and no jobs were running).

Vim updated to version 6.4 on tre

lsz075 • December 1, 2005

Vim was updated to version 6.4 on tre (run "vim --version" to check which version you use).

HPC Syslog

Log of changes and events on UiB's HPC systems