Node "en" on tre was down from 05:15 to 09:00. Reason unknown. The node was rebooted at 08:15 and is now up again. Jobs running on the node were lost.
AIX
GPFS hang on node “en”
09:20 The GPFS daemon hung on node "en" of the regattas. The node had to be rebooted, so the jobs on node "en" were lost. The login node "tre" was unreachable from 09:20 to 10:00 (no jobs lost).
Update: 11:00 node "en" is up again.
Scheduled maintenance on /net/bcmhsm and /net/bjerknes1
We need to perform scheduled maintenance (a firmware upgrade) on the disk system for /net/bcmhsm (symlinked from /migrate for BCCR users) and /net/bjerknes1. Note that /net/bcmhsm is mounted as /bcmhsm on fimm.
/net/bcmhsm and /net/bjerknes1 will be unavailable on Monday the 15th from 09:00 to 11:00 (possibly ending earlier if all goes well).
Update (11:00): /net/bcmhsm and /net/bjerknes1 are now up again. The downtime was also used to apply a security update on the backup server (which hosts /net/bcmhsm).
NFS problem accessing /net/bcmhsm on tre, to, en
The HSM filesystem for Bjerknes, /net/bcmhsm, mounted from jambu (NB: symlinked from /migrate for some users), is currently not accessible on tre, to and en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the hang. For urgent access to files, contact support-uib@notur.no and we will retrieve the files from backup.
Reboot of node “to” of the regattas
Due to a security upgrade, node "to" will be rebooted. No jobs will be affected.
Update 09:25: machine up again. Downtime: 30 min. / 16 cpuhours
Memory-hang on TRE
Some process managed to use up all memory on tre around 16:03. The node is currently rebooting.
Update 16:48: Tre is now up again. Jobs running on tre were lost (but not to and en). 24 cpuhours downtime (0.75*32).
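The downtime figure is the wall-clock outage in hours multiplied by the number of CPUs on the node. A minimal sketch of that arithmetic, assuming 32 CPUs on tre (as implied by the 0.75*32 figure) and taking 16:03 to 16:48 as the outage window; the function name is hypothetical:

```python
from datetime import datetime


def cpu_hours_lost(start: str, end: str, cpus: int) -> float:
    """Return outage length in hours times the CPU count.

    start and end are same-day wall-clock times in HH:MM format.
    """
    fmt = "%H:%M"
    outage = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return outage.total_seconds() / 3600 * cpus


# 45 minutes on an assumed 32-CPU node
print(cpu_hours_lost("16:03", "16:48", 32))  # 24.0
```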
Software update on backup server (jambu)
The backup server has been updated with the latest OS maintenance release for AIX (5200-08) and the latest tape-device drivers. In addition, the TSM backup server was updated to version 5.2.7 and the TSM client to version 5.2.4. Downtime for restore and /net/bcmhsm (/migrate for Bjerknes) was only a few minutes during the reboot.
Memory and disk problem on regatta node “en”
Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected.
Update 13:45: node "en" is now up again.
Vim updated to version 6.4 on tre
Vim was updated to version 6.4 on tre (run "vim --version" to check which version you use).
/work on tre was 100% full
A user generated an 800 GB file in /work on tre during the night, causing jobs to fail when the filesystem became 100% full. The file has now been deleted. /work on tre had to be remounted (OK on to and en).
