Downtime

tre reboot

lsz075 • June 12, 2006

Because of newly installed security updates, tre must be rebooted. This will hopefully also solve the problems with the TotalView debugger.
Expected downtime: 1h (starting from Mon, 10:00)

Update: Mon, 12:45 - a disk import problem caused a longer downtime. Everything should be up and running again.

Downtime: 2h 45'

bjerknes fileserver bregne down for os upgrade

lsz075 • May 2, 2006

bregne will be upgraded to CentOS 4. During this upgrade /net/bjerknes1 will be unavailable until approx. 13:00.

13:10 Update: The upgrade is complete and the filesystem is back.

Memory-hang on TRE

lsz075 • March 15, 2006

Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.

Update 16:48: Tre is now up again. Jobs running on tre were lost (jobs on "to" and "en" were not affected). 24 cpuhours downtime (0.75*32).
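The cpuhours figure above is simply wall-clock downtime multiplied by the number of CPUs on the node. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 32 CPUs and 45 minutes (16:03 to 16:48) are read off this entry rather than from any accounting tool.

```python
def cpu_hours_lost(downtime_minutes: float, cpus: int) -> float:
    """Return CPU-hours lost: wall-clock downtime in hours times CPU count."""
    return downtime_minutes / 60.0 * cpus


# 16:03 to 16:48 is 45 minutes (0.75 h); 32 CPUs is assumed from "0.75*32" above.
print(cpu_hours_lost(45, 32))  # 24.0
```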

NFS problem on tre,to,en

lsz075 • February 19, 2006

The Regatta nodes have NFS problems. NFS hangs from the Regatta nodes to jambu (/net/bcmhsm) and to /migrate (on "to"), as well as from "en" and "to" to "tre".
It looks like an NFS client issue. I am working to resolve the problem.

15:45 Update: Everything is up again. Had to reboot "tre" and "to" as well as jambu. Jobs were lost (25% load at the time of reboot).

NB! Due to problems with the NFS export of /migrate we have unmounted /migrate on "tre" and "en". Do all copying to and from /migrate on "to" (as stated in /migrate/README). To copy to (or from) /migrate from fimm, use:
scp something.tar.gz to:/migrate/myusername/

(Note that Bjerknes has a symlink from /migrate/username to /net/bcmhsm/username, which is NFS-exported from jambu.)

Cpuhours downtime: approx. 384

Tape robot and /migrate filesystem down for tapedrive upgrade

lsz075 • February 16, 2006

The tape robot is getting 2 new tape drives installed and will be unavailable on 16 Feb. from 09:45 to approx. 11:00.
Files in /migrate (and /net/bcmhsm) will be unavailable.
This entry will be updated with more information later.

11:20 Update: The upgrade is taking somewhat longer than planned.

12:45 Update: The upgrade is complete and the filesystem is back.

Memory and disk problem on regatta node “en”

lsz075 • January 10, 2006

Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected.

Update 13:45: node "en" is now up again.

Problem with /work on fimm

lsz075 • January 10, 2006

10:15 There is a problem with /work on fimm. We are working on it.

13:45 Update: /work is now accessible. The frontend had to be restarted, and GPFS had to be restarted on one of the NAS boxes. All the compute nodes were OK, so no running jobs were affected.

Crash on fimm frontend caused by excessive interactive use

lsz075 • November 11, 2005

A user crashed the fimm frontend by using up all available memory with a memory-intensive interactive process. The frontend was unavailable for login from 10.11.05 23:50 to 11.11.05 08:40. No jobs were affected.

Maintenance summary (fimm)

lsz075 • September 13, 2005

fimm was down on Tuesday Sep. 13 from 08:00 to 12:15 for a filesystem check (mmfsck) on the GPFS filesystem, an upgrade of GPFS, and a reboot of the satablade2 disk cabinet (which had failed to accept a new disk).

Scheduled downtime on fimm

lsz075 • September 5, 2005

Fimm will be down on Tuesday Sep. 13 from 08:00 to 12:00.
One of the SATABlade disk enclosures needs to be rebooted, and the /home/fimm GPFS filesystem needs to be unmounted for a filesystem check.

N.B.: Please delete any unnecessary files you may have on the /home/fimm or /work* filesystems before the downtime to speed up the filesystem check.

HPC Syslog

Log of changes and events on UiB's HPC systems