Several libraries have been updated on hexagon.

MPI:
xt-mpt 3.1.1 -> 3.1.2

Libs/math:
xt-libsci 10.3.2 -> 10.3.3
petsc 3.0.0 -> 3.0.0.1
hdf5 1.8.2 -> 1.8.2.1
netcdf_hdf5parallell 4.0 -> 4.0.0.1
netcdf 4.0 -> 4.0.0.1

Compiler:
xt-asyncpe 2.1 -> 2.3 (compiler wrapper)
pgi 8.0.3 -> 8.0.4

NOTES:

xt-mpt:

MPI_Reduce has been optimized to be SMP aware, and this optimization is
enabled by default. The SMP-aware algorithm performs significantly better
than the standard algorithm for most message sizes; improvements of over 3x have been observed for messages below 128K bytes. A new environment variable, MPICH_REDUCE_LARGE_MSG, can be used to adjust the cutoff at which this optimization is enabled. See the man page for more information.
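
For example, the cutoff could be adjusted before launching a job as sketched below (the cutoff value, rank count, and executable name are hypothetical; consult the man page for the actual default and units):

  export MPICH_REDUCE_LARGE_MSG=131072   # hypothetical cutoff
  aprun -n 64 ./myapp                    # hypothetical application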

xt-libsci:

- libGoto 1.29 includes moderate performance improvements in BLAS and LAPACK.

- CRAFFT 1.1 (Cray Adaptive FFT) is a productivity enhancement for the efficient use of Fast Fourier Transforms with little programming effort. CRAFFT 1.1 adds single precision support. See intro_crafft for a description of the double precision API; to use the single precision routines, replace the "z" and "d" in the double precision routine names with "c" and "s". E.g. crafft_d2z1d in double precision becomes crafft_s2c1d in single precision.
The fftw/3.2.0 module must be loaded to use CRAFFT 1.1. If the FFTW module is not loaded, the link stage will fail with unresolved references to FFTW routines.
Prior to running a CRAFFT-linked executable, users must copy the correct FFTW wisdom files into their current run directory. The wisdom files are fftw_wisdom-3.2 for double precision and fftw_wisdom_single-3.2 for single precision, and are found in /opt/xt-libsci/10.3.3/
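
As an illustration, a typical setup before running a CRAFFT-linked executable might look like the commands below (the rank count and application name are hypothetical; the module and wisdom file names are those given above):

  module load fftw/3.2.0                      # required by CRAFFT 1.1
  cp /opt/xt-libsci/10.3.3/fftw_wisdom-3.2 .  # double precision wisdom file
  aprun -n 32 ./my_crafft_app                 # hypothetical application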

netcdf_hdf5parallell:

Known problem:
When building with the PathScale compilers, the '-fsecond-underscore' compiler option is required; omitting it will result in a link error.
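
For example, a sketch of a PathScale build might look like this (myprog.f90 is a hypothetical source file; ftn is the compiler wrapper, and the module name is assumed to match the package name above):

  module load netcdf_hdf5parallell
  ftn -fsecond-underscore -o myprog myprog.f90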

Several libraries and programs have been updated on hexagon. Users are encouraged to recompile their programs to get fixes and performance increases. In particular, codes that use MPI_Bcast will see an improvement with the new xt-mpt release; see the notes below.

MPI:
xt-mpt 3.1.0 -> 3.1.1

COMPILER and tools:
pgi 8.0.2 -> 8.0.3
xt-asyncpe 2.0 -> 2.1 (compiler wrapper)
java 1.6.0-7 -> 1.6.0-11

LIBRARIES:
hdf5 and hdf5-parallell 1.6.7a -> 1.8.2
netCDF 3.6.2 -> 4.0
fftw 3.1.1 -> 3.2.0
PETSc 2.3.3a -> 3.0.0
ACML 4.1.0 -> 4.2.0 (previously installed but not listed)
xt-libsci 10.3.1 -> 10.3.2 (previously installed but not listed)
libfast 1.0 -> 1.0.2 (previously installed but not listed)

NEW LIBRARIES:
netcdf-hdf5parallell 4.0 (combined parallel netCDF/HDF5 library)

NEW TOOLS:
xt-lgdb 1.1 (Cray version of gdb to use for MPI debugging on XT)

NOTES FOR XT-MPT:

- MPI_Bcast has been optimized to be SMP aware and this optimization is enabled by default. The performance improvement varies depending on message size and number of ranks but improvements of between 10% and 35% for messages below 128K bytes have been observed.

- Improvements have been made to the MPICH_COLL_OPT_OFF environment variable, allowing a finer-grained switch to enable/disable the optimized collectives.
The user may now:
- Enable all of the optimized collectives (this is the default)
- Disable all of the optimized collectives (export MPICH_COLL_OPT_OFF=0)
- Disable a selected set of the optimized collectives by providing
a comma-separated list of the collective names,
e.g. export MPICH_COLL_OPT_OFF=MPI_Allreduce,MPI_Bcast,MPI_Alltoallv
If a user chooses to disable any Cray-optimized collective, the standard MPICH2 algorithm is used instead.
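
For instance, to compare the Cray-optimized MPI_Bcast against the standard MPICH2 algorithm, one could disable just that collective before launching (the rank count and executable name are hypothetical):

  export MPICH_COLL_OPT_OFF=MPI_Bcast
  aprun -n 64 ./myapp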

Hexagon crashed at 12:15 today. We are working on getting the system up and running again.

Update 13:30: Hexagon is now running again. Most probably the crash was caused by overuse of memory on several login nodes.

All jobs that were running when it crashed have to be resubmitted. We are sorry for the inconvenience.

Due to a double cooling failure (primary plus backup) of the building-provided chilled water supply, we were forced to shut down hexagon because of over-temperature in the room.

Update 17:30: we hope to have the cooling back Thursday morning.

Update Thursday 10:00: we now have partial cooling and have started the machine and allowed logins. Until we know more about when full cooling will be restored, we have a system reservation on all nodes; you can add jobs to the queue, but they will not start until we remove the reservation.

Update Thursday 11:00: we have now restored one of the cooling machines to operation, so we have full cooling again, and the reservation has been removed.

The /work file system on hexagon is hanging; we are doing debug dumps and will restart the system. Existing jobs will have to be resubmitted.

Update 15:00: one of the disk controllers has problems; the downtime will be longer than anticipated. We will update this note when we have more information.

Update 20:00: we will need to wait for support on Monday before continuing the work to fix the controller.

Update Monday Dec 29th, 12:00: we are currently waiting for a new controller.

Update Monday Dec 29th, 17:15: the shipment with the controller is expected to arrive on Wednesday 31st. We are sorry for this delay.

Update Wednesday Dec 31st, 14:50: we have received notice that the expected delivery of the replacement controller has been delayed even further, to Monday Jan. 5th. We are looking into other ways to get the file system working.

Update Thursday Jan 1st, 04:00: the system is running again with a workaround. We will have to reboot the system again when the replacement controller arrives (so long-running jobs will have to be resubmitted).

Update Monday Jan 5th, 13:50: the new controller has now arrived; the replacement is scheduled for Monday the 12th at 13:30.

The /home/fimm file system on the fimm cluster crashed this morning. We are working on solving the problem.

12:48 Update: The file system is up again. All jobs that were running before the file system crash have to be resubmitted. If you have experienced file loss, please contact support-uib@notur.no.

We are sorry for the inconvenience.