AIX

Hang of node “en” on tre

lsz075 • March 13, 2007

Node "en" on tre was down from 05:15 to 09:00. The reason is unknown. The node was rebooted at 08:15 and is now up again. Jobs running on the node were lost.

GPFS hang on node “en”

lsz075 • March 2, 2007

09:20: The GPFS daemon hung on node "en" of the regattas. The node had to be rebooted, so the jobs on node "en" were lost. The login node "tre" was unreachable from 09:20 to 10:00 (no jobs lost).

Update 11:00: node "en" is up again.

Scheduled maintenance on /net/bcmhsm and /net/bjerknes1

lsz075 • January 4, 2007

We need to perform scheduled maintenance (a firmware upgrade) on the disk system for /net/bcmhsm (symlinked from /migrate for users from BCCR) and /net/bjerknes1. Note that /net/bcmhsm is mounted as /bcmhsm on fimm.

/net/bcmhsm and /net/bjerknes1 will be unavailable on Monday the 15th from 09:00 to 11:00 (possibly earlier if all goes well).

Update (11:00): /net/bcmhsm and /net/bjerknes1 are now up again. The downtime was also used to apply a security update to the backup server (which hosts /net/bcmhsm).

NFS problem accessing /net/bcmhsm on tre, to, en

lsz075 • September 16, 2006

The HSM filesystem for Bjerknes, /net/bcmhsm (mounted from jambu; NB: symlinked from /migrate for some users), is currently not accessible on tre, to and en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the NFS hang. For urgent access to files, contact support-uib@notur.no and we will retrieve the files from backup.

Reboot of node “to” of the regattas

lsz075 • June 19, 2006

Due to a security upgrade, node "to" will be rebooted. No jobs will be affected.

Update 09:25: machine up again. Downtime: 30 min. / 16 cpuhours

Memory-hang on TRE

lsz075 • March 15, 2006

Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.

Update 16:48: tre is now up again. Jobs running on tre were lost; jobs on to and en were not affected. Downtime: 24 cpuhours (0.75*32).

Software update on backup server (jambu)

lsz075 • February 13, 2006

The backup server has been updated with the latest OS maintenance release for AIX (5200-08) and the latest tape device drivers. In addition, the TSM backup server was updated to version 5.2.7 and the TSM client to version 5.2.4. Downtime for restore and /net/bcmhsm (/migrate for Bjerknes) was only a few minutes during the reboot.

Memory and disk problem on regatta node “en”

lsz075 • January 10, 2006

Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot, the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected.

Update 13:45: node "en" is now up again.

Vim updated to version 6.4 on tre

lsz075 • December 1, 2005

Vim was updated to version 6.4 on tre (run "vim --version" to check which version you use).

/work on tre was 100% full

lsz075 • October 20, 2005

A user generated an 800 GB file in /work on tre during the night, causing jobs to fail when the filesystem reached 100% full. The file has now been deleted. /work on tre had to be remounted (/work on to and en was OK).

HPC Syslog

Log of changes and events on UiB's HPC systems
