09:20 The GPFS daemon hung on Regatta node "en". The node had to be rebooted, so the jobs running on "en" were lost. The login node "tre" was unreachable from 09:20 to 10:00 (no jobs lost).
We need to carry out scheduled maintenance (a firmware upgrade) on the disk system for /net/bcmhsm (symlinked from /migrate for BCCR users) and /net/bjerknes1. Note that /net/bcmhsm is mounted as /bcmhsm on fimm.
/net/bcmhsm and /net/bjerknes1 will be unavailable on Monday the 15th from 09:00 to 11:00 (possibly back earlier if all goes well).
Update (11:00): /net/bcmhsm and /net/bjerknes1 are now up again. The downtime was also used to apply a security update to the backup server (which hosts /net/bcmhsm).
The HSM filesystem for Bjerknes, /net/bcmhsm (mounted from jambu; NB: symlinked from /migrate for some users), is currently not accessible on tre, to and en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the hang. For urgent access to files, contact support-uib@notur.no and we will retrieve the files from backup.
The backup server has been updated with the latest OS maintenance release for AIX (5200-08) and the latest tape device drivers. In addition, the TSM backup server was updated to version 5.2.7 and the TSM client to version 5.2.4. Downtime for restore and /net/bcmhsm (/migrate for Bjerknes) was only a few minutes during the reboot.
Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot, the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected.
A user generated an 800 GB file in /work on tre during the night, causing jobs to fail when the filesystem reached 100% full. The file has now been deleted. /work on tre had to be remounted (/work is OK on to and en).