Fire is now back online. All nodes have been reinstalled with the Rocks Linux cluster distribution. Unfortunately the installation took a bit longer than expected because the original front-end node for some reason refused to install the Rocks distribution. Everything worked well once we used node 5 as the front-end node.
Total downtime: 52 hours, 30 minutes
Disks on fimm failed
2 out of 3 RAID arrays on fimm have failed, so /home/fimm and /work* are unavailable at the moment. This is a major fault, and it may take some time before it's fixed.
Will update this entry when I know more.
Wed Mar 30 04:32 Multiple disks and raid-controllers failed on two separate storage units.
10:15 Started restore of /home/fimm from backup, just in case we're unable to recover the filesystems on disk.
10:35 Got confirmation from Nexsan support.
13:20 Chatted with Nexsan support. They'll call me back ASAP.
15:43 Called Nexsan again. Where's the support?
16:23 Got a procedure to reset the drives from the serial menu. This seems to make the system functional again. Haven't tested accessing the volumes from Linux yet.
18:39 Tried accessing the volumes from Linux, and noticed that the third SATABlade has now also failed. Won't be able to reset this one before tomorrow morning. Hope Nexsan has some idea by then as to what has triggered this problem.
Thu Mar 31 11:50 All disks and filesystems are up! We still have no idea why this error occurred, so we might have to take the filesystems down again if Nexsan engineering comes up with firmware upgrades that fix the problem.
Total downtime: 31 hours, 30 minutes
Linux cluster “FIRE” is being reinstalled
The Linux cluster fire.bccs.uib.no will be down Tuesday/Wednesday, March 29/30. All nodes will be re-installed with the Rocks Linux cluster distribution. This is a major system upgrade, so most software will have to be rebuilt to run on fire after the upgrade.
Amber8 installed on fimm
Amber8 was installed on fimm under /local/amber8/.
Information on failed tests is in /local/amber8/MakeTestResults.txt, but the differences mostly look like minor round-off errors in the last digit. A qualified evaluation from a user who understands the results would be much appreciated.
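If you'd like to take a quick look yourself, below is a minimal, hypothetical sketch (not part of the Amber8 distribution) of how one might compare the numbers in a test output against a reference file with a relative tolerance, to see whether the reported differences really are just round-off in the last digit. The file names and the tolerance are assumptions.

    import re

    # Matches floating-point numbers such as -1.234 or 6.02e23.
    NUMBER = re.compile(r"-?\d+\.\d+(?:[eE][+-]?\d+)?")

    def numbers(path):
        """Extract all floating-point numbers from a text file, in order."""
        with open(path) as f:
            return [float(x) for x in NUMBER.findall(f.read())]

    def compare(test_file, reference_file, rel_tol=1e-6):
        """Print value pairs whose relative difference exceeds rel_tol."""
        test, ref = numbers(test_file), numbers(reference_file)
        if len(test) != len(ref):
            print(f"Different number of values: {len(test)} vs {len(ref)}")
        suspicious = []
        for i, (t, r) in enumerate(zip(test, ref)):
            scale = max(abs(t), abs(r), 1e-30)
            if abs(t - r) / scale > rel_tol:
                suspicious.append((i, t, r))
        print(f"{len(suspicious)} of {min(len(test), len(ref))} values differ "
              f"by more than a relative {rel_tol}")
        for i, t, r in suspicious[:20]:
            print(f"  value #{i}: {t} vs {r}")

    if __name__ == "__main__":
        # Example invocation with made-up file names.
        compare("md_test.out", "md_test.out.save")

Differences well below the last printed digit are most likely harmless round-off; larger relative differences are the ones worth a closer look.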
/home/parallab and /work failure, regatta rebooted
An important network switch just failed and took down the GPFS filesystems on TRE. We will borrow a new switch from the IT department ASAP.
09:50
New switch in place. Rebooting the nodes to get everything back up in shape.
10:12
Everything on node TRE is up. Rebooting node TO.
10:26
TO is all up. Rebooting node EN.
10:48
All nodes are up. The /migrate and /net/bcmhsm issues are also resolved.
Total downtime:
09:10-10:48 = 1:38 on en, to, tre and fire.
Fimm was mostly unhurt; only jobs accessing /home/parallab were affected.
Home directory on fimm – Please read if you’re using fimm.
Because of major stability problems caused by NFS, the home directories on fimm have been moved from NFS to GPFS. We hope this will fix all performance issues for the home directories and also make fimm more robust. Fimm should no longer depend on external filesystems, and the load on the regatta shouldn't affect the fimm cluster anymore.
The new home directory on fimm is under /home/fimm/$department/$username/. This is only accessible on fimm.
The old home directory was /home/parallab/$department/$username/. This will still be the home directory on TRE, but if you only or mainly use FIMM, please move your files from /home/parallab to /home/fimm/.
We're very sorry for any inconvenience this sudden change has caused, but the situation on fimm was getting quite bad, and something needed to be done.
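If you want to copy everything across in one go, here is a minimal sketch of how that could be done. It assumes the department/username layout described above; this is not an official migration tool, and a plain recursive copy with cp or rsync from the command line works just as well.

    import os
    import shutil

    # Assumed layout -- adjust DEPARTMENT (hypothetical example) and USERNAME
    # to match your own account.
    DEPARTMENT = "bccs"
    USERNAME = os.environ.get("USER", "myuser")

    old_home = os.path.join("/home/parallab", DEPARTMENT, USERNAME)
    new_home = os.path.join("/home/fimm", DEPARTMENT, USERNAME)

    # Recursively copy everything from the old home directory into the new one.
    # Files already present in the destination with the same name are overwritten.
    shutil.copytree(old_home, new_home, dirs_exist_ok=True)
    print(f"Copied {old_home} -> {new_home}")

Please double-check that everything arrived before deleting anything from the old location.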
NFS problems on tre, fire, fimm
There's a hang on an NFS server causing problems on fire (all down), fimm (some hanging nodes) and tre (/net/bcmhsm and /migrate are hanging). We are working on resolving this.
Will update this log whenever there's progress.
Low latency, fast interconnect on fimm
25 nodes of fimm are now interconnected in a 2D Torus using SCI interconnect from Dolphin Interconnect Solutions.
Please read http://www.parallab.uib.no/resources/cluster/scampi for more details.
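As a rough illustration of what low latency means in practice, here is a minimal ping-pong sketch between two MPI ranks that measures the average one-way latency for small messages. It uses mpi4py purely as an example binding; the actual ScaMPI setup and job launch on fimm are described at the page linked above, and the message count below is an arbitrary choice.

    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    N = 10000            # number of round trips (arbitrary)
    msg = bytearray(8)   # tiny 8-byte message, so latency dominates over bandwidth
    buf = bytearray(8)

    comm.Barrier()
    start = time.time()
    for _ in range(N):
        if rank == 0:
            comm.Send(msg, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(msg, dest=0, tag=0)
    elapsed = time.time() - start

    if rank == 0:
        print(f"average one-way latency: {elapsed / (2 * N) * 1e6:.1f} microseconds")

Run it with two ranks placed on different SCI-connected nodes to measure the interconnect latency rather than shared-memory latency.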
Totalview upgraded to 6.7.0-1 on TRE and fimm
Totalview was upgraded to v6.7.0-1 on FIMM and TRE.
* New Memory Debugging Features
- Heap Debugging Filters
- Export Memory Debugging Information
- Error Event Reporting Controls
- Improved Memory Event Details window
- Graphical Heap Browsing
- Pointer Queries
For more detail, check http://www.etnus.com/TotalView/Latest_Release.html
and http://www.etnus.com/Documentation/rel6/pdf/new_features.pdf
Totalview debugger installed on fimm
Totalview v6.6.0-2 was installed on fimm. The license covers 2 simultaneous users, on a maximum of 2 nodes.
Totalview is an advanced debugger. Please check http://www.etnus.com/Documentation/rel6/html/index.html for features and documentation.