Hardware

The scheduled maintenance of the fimm cluster is now (mostly) complete. Please note the following changes:

- Cluster is now running Rocks 4.3 which is based on CentOS 4.5
- Login to fimm.bccs.uib.no now ends up on one of the compute nodes acting as a login node. Currently this is called compute-1-14.
- Compilers are upgraded to Intel 10.0 and PGI 7.0
- Totalview is upgraded to 8.2
- MPI libraries are upgraded and located in /local
- Several libraries and programs in /local are upgraded

All jobs that were waiting on the old queue need to be submitted again into the new queue after the upgrade.

Send questions to support-uib@notur.no

The IBM p690 Regatta tre.bccs.uib.no / tre.ii.uib.no will be shut down and decommissioned on the morning of Monday, October 1st 2007, at 08:00.

All jobs must be finished, and all data and personal files must be copied off the machine before this time. The only exception is data on external disks such as /migrate, /net/bcmhsm and /net/bjerknes*.
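
One way to get a directory off the machine before the shutdown is to pack it into a single archive and check that the archive is readable before copying it elsewhere. The following is only a sketch: the function name archive_dir and the example paths are hypothetical, and it pipes tar through gzip instead of using a z option, on the assumption that the system tar may not support compression directly.

```shell
#!/bin/sh
# Pack a directory into a gzip-compressed tar archive and verify it is readable.
# Usage: archive_dir SOURCE_DIR ARCHIVE_FILE   (hypothetical helper)
archive_dir() {
    src=$1
    archive=$2
    # Archive relative to the parent directory so paths inside the archive stay short.
    (cd "$(dirname "$src")" && tar cf - "$(basename "$src")") | gzip > "$archive" || return 1
    # Read the archive back once; a truncated or corrupt file fails here.
    gzip -dc "$archive" | tar tf - > /dev/null
}
```

For example, archive_dir "$HOME/project" /tmp/project.tar.gz would produce one file that can then be copied to another machine with scp or a similar tool. The directory name and the /tmp location are placeholders.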

Any questions regarding this can be sent to support-uib@notur.no.

Maintenance is currently ongoing on the /work2 filesystem. Since this puts a general load on the IO system, all IO (including e.g. "ls") to the global filesystems, that is /work, /work2 and /home, will be slow.

The time needed to complete this maintenance is uncertain, but it could take another 2 days depending on the overall IO load.

This maintenance will fix a configuration error that leads to generally much slower IO performance on the system. Since this involves moving ALL the data away from ALL the disks, it has to be done in stages. The current maintenance covers 1/3 of the storage capacity.

Users are advised to use the local /scratch partition on both the frontend node and the compute nodes to get faster disk access while the maintenance is ongoing. Remember that /scratch is only visible on the local node, but it can be reached from other nodes through the /net/hostname/scratch path. For example, on a compute node:

mkdir -p /scratch/$USER/something
cd /scratch/$USER/something
tar zxf /net/fimm.local/scratch/$USER/something.tar.gz
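
When the job is done, the results have to leave the node-local /scratch again. A minimal sketch of that step, assuming the results should end up as one archive in some destination directory; the function name stage_out and the paths in the usage note are hypothetical:

```shell
#!/bin/sh
# Pack a scratch work directory into one archive in DESTDIR, then free the scratch space.
# Usage: stage_out WORKDIR DESTDIR   (hypothetical helper)
stage_out() {
    workdir=$1
    destdir=$2
    name=$(basename "$workdir")
    mkdir -p "$destdir" || return 1
    # Write a single archive instead of many small files.
    tar czf "$destdir/$name.tar.gz" -C "$(dirname "$workdir")" "$name" || return 1
    # Remove the scratch copy only after the archive was written successfully.
    rm -rf "$workdir"
}
```

On a compute node this could be called as stage_out /scratch/$USER/something /net/fimm.local/scratch/$USER, assuming that directory is writable from the node. Writing one archive usually costs fewer filesystem operations than copying many small files, which helps while the global filesystems are slow.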

The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.

http://www.parallaw.uib.no/syslog/56

We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed ASAP.

Downtime started Monday June 27 00:42:51.

fimm got back on-line at 11:32:00
Downtime: 10 hours 50 minutes
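
The downtime figure follows from the two timestamps above. A small sketch of the arithmetic, assuming both times fall on the same day; the helper names to_seconds and downtime are made up for this illustration:

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    IFS=: read -r h m s <<EOF
$1
EOF
    # Strip one leading zero so values like "08" are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the time between two same-day timestamps as "Xh Ym Zs".
downtime() {
    d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
    echo "$(( d / 3600 ))h $(( d % 3600 / 60 ))m $(( d % 60 ))s"
}

downtime 00:42:51 11:32:00   # prints: 10h 49m 9s
```

That is 10 hours 50 minutes when rounded to the nearest minute.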

Firmware on SATABlades upgraded to 'firmware 9037'. This will hopefully fix this 'failing disks' problem.

2 out of 3 raid arrays on fimm have failed, so /home/fimm and /work* are unavailable at the moment. This is a major fault, and it may take some time before it is fixed.

Will update this entry when I know more.


Wed Mar 30 04:32 Multiple disks and raid-controllers failed on two separate storage units.

10:15 Started restore of /home/fimm from backup, just in case we're unable to recover the filesystems on disk.

10:35 Got confirmation from Nexsan support.

13:20 Chatted with Nexsan-support. They'll call me back ASAP.

15:43 Called Nexsan up again.. Where's the support??

16:23 Got procedure to reset drives from serial menu. This seems to make the system functional again. Haven't tested accessing the volumes from linux yet.

18:39 Tried accessing the volumes from linux, and noticed that the third satablade has now also failed. Won't be able to reset this one before tomorrow morning. Hope Nexsan has some idea by then as to what triggered this problem.

Thu Mar 31 11:50 All disks and filesystems are up! Still no idea why this error occurred, so we might have to take the filesystems down again if Nexsan engineering comes up with a firmware upgrade that fixes the problem.

Total downtime: 31 hours, 30 minutes