The scheduled maintenance of the fimm cluster is now (mostly) complete. Please note the following changes:
- The cluster is now running Rocks 4.3, which is based on CentOS 4.5
- Login to fimm.bccs.uib.no now ends up on one of the compute nodes acting as a login node. Currently this is called compute-1-14.
- Compilers are upgraded to Intel 10.0 and PGI 7.0
- Totalview is upgraded to 8.2
- MPI libraries are upgraded and located in /local
- Several libraries and programs in /local are upgraded
All jobs that were waiting in the old queue must be resubmitted to the new queue after the upgrade.
Send questions to support-uib@notur.no
Hardware
Important notice: tre.bccs.uib.no will be taken out of service
The IBM p690 Regatta tre.bccs.uib.no / tre.ii.uib.no will be shut down and decommissioned on the morning of Monday, October 1st 2007, at 08:00.
All jobs must be finished, and all data and personal files must be copied off the machine before this time. The only exception is, of course, data on external disks like /migrate, /net/bcmhsm and /net/bjerknes*.
Any questions regarding this can be sent to support-uib@notur.no.
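For users unsure how to move many files off the machine, one approach is to bundle a whole directory into a single compressed tar archive, which transfers far faster than thousands of small files. The sketch below is self-contained and illustrative only: it uses a throwaway temp directory in place of a real home directory on tre, and all names ("project", "data.txt") are placeholders.

```shell
# Throwaway temp directory standing in for a home directory on tre;
# substitute your own paths when doing this for real.
src=$(mktemp -d)
mkdir -p "$src/project"
echo "keep me" > "$src/project/data.txt"

# Bundle the whole directory into one compressed archive:
tar -C "$src" -zcf "$src/project.tar.gz" project

# List the archive contents to confirm everything was captured:
tar -tzf "$src/project.tar.gz"

rm -rf "$src"
```

The resulting archive can then be copied out in a single transfer, e.g. with scp to another machine, before the October 1st shutdown.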
Online maintenance of filesystems
There is currently ongoing maintenance on the /work2 filesystem. Since this generates a general load on the I/O system, all I/O (including e.g. "ls") to the global filesystems /work, /work2 and /home will be slow.
The time to complete this necessary maintenance is uncertain, but it could take another two days depending on the overall I/O load.
This maintenance will fix a configuration error that causes generally much slower I/O performance on the system. Since it involves moving ALL the data off ALL the disks, it has to be done in stages. The current stage covers 1/3 of the total storage.
Users are advised to use the local /scratch partition on both the frontend and the compute nodes for faster disk access while the maintenance is ongoing. Remember that /scratch is only visible on the local node, but it can be reached from other nodes via the /net/hostname/scratch path. For example, on a compute node:
mkdir -p /scratch/$USER/something
cd /scratch/$USER/something
tar zxf /net/fimm.local/$USER/something.tar.gz
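When a job finishes, results left on a node's local /scratch must be copied back to a global filesystem before being cleaned up, since local scratch is not backed up. The sketch below is self-contained so it can run anywhere: temp directories stand in for the real /scratch/$USER and /net/hostname/scratch paths, which only exist on the cluster, and the file names are placeholders.

```shell
# Temp dirs standing in for /scratch/$USER/something and your home dir.
scratch=$(mktemp -d)
home=$(mktemp -d)

mkdir -p "$scratch/results"
echo "42" > "$scratch/results/answer.txt"

# Copy the results back "home". On the cluster the source would be the
# node's /net path, e.g.: cp -r /net/hostname/scratch/$USER/something/results ~/
cp -r "$scratch/results" "$home/"
cat "$home/results/answer.txt"

# Clean up the local scratch area once the results are safe:
rm -rf "$scratch" "$home"
```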
/net/bjerknes1 filesystem on the regattas has hardware problems
We are having some hardware problems with the disks on /net/bjerknes1. We are working on a fix.
Update: we have rebooted the server again to (temporarily) clear the issue while we look into the root cause.
Memory and disk problem on regatta node "en"
Regatta node "en" had a memory fault at 09:23 on 10.01.06. The node was rebooted. After the reboot, the node rejected one of the disks in the /work filesystem. We are working to correct the problem. The other nodes are unaffected.
Update 13:45: node "en" is now up again.
fimm filesystems down
The GPFS filesystems are unavailable because of several failed disks. The problem appears to be identical to what happened on March 30:
http://www.parallaw.uib.no/syslog/56
We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed ASAP.
Downtime started Monday June 27 00:42:51.
fimm came back on-line at 11:32:00
Downtime: 10 hours 50 minutes
Firmware on the SATABlades has been upgraded to 'firmware 9037'. This will hopefully fix the 'failing disks' problem.
disks on fimm failed
2 out of 3 RAID arrays on fimm have failed, so /home/fimm and /work* are gone at the moment. This is a major fault, and it can take some time before it's fixed.
Will update this entry when I know more.
Wed Mar 30 04:32 Multiple disks and raid-controllers failed on two separate storage units.
10:15 Started restore of /home/fimm from backup, just in case we're unable to recover the filesystems on disk.
10:35 Got confirmation from Nexsan support.
13:20 Chatted with Nexsan-support. They'll call me back ASAP.
15:43 Called Nexsan up again. Where's the support??
16:23 Got procedure to reset drives from serial menu. This seems to make the system functional again. Haven't tested accessing the volumes from linux yet.
18:39 Tried accessing the volumes from linux, and noticed that the third SATABlade has now also failed. Won't be able to reset this one before tomorrow morning. Hopefully Nexsan has some idea by then of what has triggered this problem.
Thu Mar 31 11:50 All disks and filesystems are up! Still no idea why this error occurred, so we might have to take the filesystems down again if Nexsan engineering has firmware upgrades that fix the problem.
Total downtime: 31 hours, 30 minutes
Low latency, fast interconnect on fimm
25 nodes of fimm are now interconnected in a 2D Torus using SCI interconnect from Dolphin Interconnect Solutions.
Please read http://www.parallaw.uib.no/resources/cluster/scampi for more details.
disk failed in /home/parallab
A disk failed in the /home/parallab filesystem. It should be redundant, so I'm working on removing it and re-replicating the filesystem.
node32 in linux cluster back online
The power supply in node32 in the linux cluster failed last week. IBM has replaced it, and the node is now back online.