The Regatta nodes are having problems due to a cooling failure in the machine room.
We are investigating.
UPDATE: 18:35 - all nodes up again
Problems with HSM/Backup-server
There is a problem with the HSM/Backup-server jambu. /migrate and /net/bcmhsm are down. We are investigating.
Update: 15:15 We are waiting on external support to upgrade/fix the firmware on this machine. It is unknown when we will get the machine up again. Possibly tomorrow.
Update: Friday 11:00 We are still waiting for a part for the machine from abroad. The estimated time of arrival was yesterday afternoon, but it has still not arrived.
Update: Friday 15:00 The message from the transport company used by the vendor is now that they will not be able to deliver the part until Monday. Unfortunately, this means that HSM and backup will be unavailable until later in the day on Monday the 16th.
Update: Monday 14:45 The HSM/backup-server jambu is now up again and /migrate and /net/bcmhsm work.
New NOTUR cpu-quota
CPU-quota for the period 2007-1 has now been activated on tre (and fimm). Send a request to support-uib@notur.no if your quota access is incorrect. Please note that, according to prior agreements, the project nn4648k has had its quota moved from tre to fimm with a conversion factor of 1:1.
Note: the machine "fire" is no longer available in NOTUR. Users on tre should note that the machine is now in lower maintenance mode and is prepared to be removed from the system
Note: the machine "fire" is no longer available in NOTUR. Users on tre should note that the machine is now in lower maintenance mode and is prepared to be removed from the system
Crash/hardware problem on tre and to
Node "to" looks to have had an internal power/cooling failure and seems to have shut itself down at 20:15. It looks as the tre node has also shut itself down (at 20:35) to avoid filesystem corruption with the crash of to.
22:45: investigating
23:55: all nodes up again
Technical note: when tre was booted again, it blocked itself out from the rest of the nodes (an HACMP problem). After the other nodes were rebooted and back up, it was possible to start the HACMP service on tre again.
Hang of node “en” on tre
Node "en" on tre was down from 05:15 to 09:00. Reason unknown. The node was rebooted at 08:15 and is now up again. Jobs running on the node was lost.
GPFS hang on node “en”
09:20 The GPFS daemon was hung on node "en" of the regattas. The node had to be rebooted, and the jobs on node "en" were therefore lost. The login-node "tre" was unreachable from 09:20 to 10:00 (no jobs lost).
Update: 11:00 node "en" is up again.
Update: 11:00 node "en" is up again.
Online maintenance of filesystems
Maintenance is currently ongoing on the /work2 filesystem. Since this generates a general load on the IO system, all IO (including e.g. "ls") to the global filesystems - that is /work, /work2 and /home - will be slow.
The time needed to complete this maintenance is uncertain, but it could be another 2 days depending on the overall IO load.
This maintenance will fix a configuration error that leads to generally much slower IO performance on the system. Since it involves moving ALL the data away from ALL the disks, it will have to be done in stages. The current maintenance takes care of 1/3 of the total storage.
Users are advised to use the local /scratch partition on both the frontend node and the compute nodes to get faster access to disk while the maintenance is ongoing. Remember that /scratch is only visible to the local node, but it can be reached from other nodes by using the /net/hostname/scratch path. For example, on a compute node:
mkdir -p /scratch/$USER/something
cd /scratch/$USER/something
tar zxf /net/fimm.local/$USER/something.tar.gz
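In the same way, results written to a compute node's /scratch can later be fetched from another node (for example the frontend) through the /net/hostname/scratch path. A minimal sketch, continuing the directory from the example above; the hostname compute-0-1 and the output/ directory are only placeholders for your own node name and result files:
# On the compute node: pack up the results
cd /scratch/$USER/something
tar zcf results.tar.gz output/
# On the frontend: fetch the archive through the /net path
# (replace compute-0-1 with the actual compute node hostname)
cp /net/compute-0-1/scratch/$USER/something/results.tar.gz ~/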
Reboot of fimm frontend
The fimm frontend was rebooted to clear a filesystem hang.
Scheduled maintenance on /net/bcmhsm and /net/bjerknes1
We will need to do a scheduled maintenance (firmware upgrade) of the disk system for /net/bcmhsm (for BCCR users, symlinked from /migrate) and /net/bjerknes1. Note that /net/bcmhsm is mounted as /bcmhsm on fimm.
/net/bcmhsm and /net/bjerknes1 will be unavailable on Monday the 15th from 09:00 to 11:00 (possibly earlier if all goes well).
Update (11:00): /net/bcmhsm and /net/bjerknes1 are now up again. The downtime was also used to apply a security update to the backup-server (which hosts /net/bcmhsm).
Hang of fimm frontend
The fimm frontend had a hang that was discovered at 00:10 on Sunday 12.11. The actual time the frontend went down is uncertain, but it could have been some time on Saturday evening. The cause seems to have been extreme load from a large number of httpd processes (reason unknown). The frontend was rebooted and at the same time received some hardware and software maintenance, including a kernel upgrade that had been planned for a later date.
No jobs were affected, frontend up again at 13:00 Sunday 12.11.