Software

Several libraries and programs have been updated on hexagon. Users are encouraged to recompile their programs to get fixes and performance increases. In particular, codes that use MPI_Bcast will see an improvement with the new xt-mpt release; see the notes below.

MPI:
xt-mpt 3.1.0 -> 3.1.1

COMPILER and tools:
pgi 8.0.2 -> 8.0.3
xt-asyncpe 2.0 -> 2.1 (compiler wrapper)
java 1.6.0-7 -> 1.6.0-11

LIBRARIES:
hdf5 and hdf5-parallel 1.6.7a -> 1.8.2
netCDF 3.6.2 -> 4.0
fftw 3.1.1 -> 3.2.0
PETSc 2.3.3a -> 3.0.0
ACML 4.1.0 -> 4.2.0 (previously installed but not listed)
xt-libsci 10.3.1 -> 10.3.2 (previously installed but not listed)
libfast 1.0 -> 1.0.2 (previously installed but not listed)

NEW LIBRARIES:
netcdf-hdf5parallel 4.0 (combined netcdf-hdf5-parallel)

NEW TOOLS:
xt-lgdb 1.1 (Cray version of gdb to use for MPI debugging on XT)

NOTES FOR XT-MPT:

- MPI_Bcast has been optimized to be SMP aware and this optimization is enabled by default. The performance improvement varies depending on message size and number of ranks but improvements of between 10% and 35% for messages below 128K bytes have been observed.

- The MPICH_COLL_OPT_OFF environment variable now provides a finer-grained switch to enable/disable the optimized collectives.
The user may now:
  - Enable all of the optimized collectives (this is the default)
  - Disable all of the optimized collectives (export MPICH_COLL_OPT_OFF=0)
  - Disable a selected set of the optimized collectives by providing a comma-separated list of the collective names,
    e.g. export MPICH_COLL_OPT_OFF=MPI_Allreduce,MPI_Bcast,MPI_Alltoallv
If a user chooses to disable any Cray-optimized collective, the standard MPICH2 algorithm is used instead.
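
To check whether the optimized MPI_Bcast actually helps a given code, the same job can be run twice, once with the defaults and once with only that collective switched off. The lines below are a minimal sketch of such a comparison; the aprun command, the core count and the program name myprog are hypothetical placeholders and are left commented out.

```shell
# Run 1: default settings, all optimized collectives enabled.
unset MPICH_COLL_OPT_OFF
# aprun -n 64 ./myprog > run_opt.log       # hypothetical launch line

# Run 2: use the standard MPICH2 algorithm for MPI_Bcast only.
export MPICH_COLL_OPT_OFF=MPI_Bcast
echo "MPICH_COLL_OPT_OFF=${MPICH_COLL_OPT_OFF}"
# aprun -n 64 ./myprog > run_noopt.log     # hypothetical launch line
```

Comparing the timings of the two runs gives a rough idea of how much the code depends on the optimized broadcast.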

Several key software and library packages have now been updated on hexagon.
We recommend that you recompile your programs to get the increased performance and fixes that have been introduced. Note that you need to log out and in again to get the new modules loaded by default.

See below for some excerpts from the release notes.

MPI and compiler wrappers:
xt-mpt 3.0.4 -> 3.1.0
xt-asyncpe 1.2 -> 2.0

Math libs (LAPACK, BLAS etc):
xt-libsci 10.3.0 -> 10.3.1

Notes regarding new MPI version from Cray:

This MPT 3.1 version contains the following new features.

* Move from MPICH2 1.0.4p1 to MPICH2 1.0.6p1
* Cpu affinity support
* Raise the maximum number of MPI ranks from 64,000 to 256,000 ranks.
* Raise the maximum number of SHMEM PEs from 32,000 to 256,000 SHMEM PEs.
* Automatically-tuned default values for MPICH environment variables
* Dynamic allocation of MPI internal message headers
* Improvements to start-up times when running at high process counts (40K
cores or more)
* Significant performance improvements for the MPI_Allgather collective
* Improvements for some error messages
* Wildcard matching for filenames in MPICH_MPIIO_HINTS
* Support for the Cray Compiling Environment (CCE) 7.0 compiler in
x86 ABI compatible mode
* MPI Barrier before collectives
* MPI-IO collective buffering alignment
* MPI Thread Safety
* Improved performance for on-node very large discontiguous messages

More detail for some of these below.

* Move from MPICH2 1.0.4p1 to MPICH2 1.0.6p1
- Performance improvements for derived datatypes (including packing and
communication) through loop-unrolling and buffer alignment.

- Performance improvements for MPI_Gather when non-power-of-two processes are
used, and when a non-zero ranked root is performing the gather.

- MPI_Comm_create now works for intercommunicators.

- Many other bug fixes, memory leak fixes and code cleanup.

- Includes a number of specific fixes from MPICH2 1.0.7 for regressions
introduced in MPICH2 1.0.6p1


* Automatically-tuned default values for MPICH environment variables

Several of the MPICH environment variable default values are now dependent
on the total number of processes in the job. Previously, these defaults
were set to static values. This feature is designed to allow higher scaling
of MPT jobs with fewer tweaks to environment variables. For more information
on how the new defaults are calculated, please see the "mpi" man page. As
before, the user is able to override any of these defaults by setting the
corresponding environment variable. The new default values are displayed
via the MPICH_ENV_DISPLAY setting.
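
As a minimal sketch of how the tuned defaults can be inspected: set MPICH_ENV_DISPLAY in the job script before launching the program. The value 1 is an assumption (the notes above only say that the variable must be set), and the aprun line with the program name myprog is a hypothetical placeholder.

```shell
# Ask the MPI library to print its MPICH environment settings at start-up.
# The value 1 is an assumption; check the "mpi" man page for details.
export MPICH_ENV_DISPLAY=1
echo "MPICH_ENV_DISPLAY=${MPICH_ENV_DISPLAY}"
# aprun -n 64 ./myprog     # hypothetical launch line
```

Running the same job at two different sizes should then show how the defaults change with the number of processes.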



* Dynamic allocation of MPI internal message headers

If additional message headers are required during program execution, MPI
dynamically allocates more message headers in quantities of MPICH_MSGS_PER_PROC.


* Significant performance improvements for the MPI_Allgather collective

This change adds in a new MPI_Allgather collective routine which scales well
for small data sizes. The default is to use the new algorithm for any
MPI_Allgather calls with 2048 bytes of data or less. The cutoff value can be
changed by setting the new MPICH_ALLGATHER_VSHORT_MSG environment variable.
In addition, some MPI functions use allgather internally and will now be
significantly faster. For example, MPI_Comm_split will be significantly faster
at high PE counts. Initial results show improvements ranging from about 2X at
around 16 cores to over 100X above 20K cores.
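
A minimal sketch of changing the cutoff, assuming the value is given in bytes like the 2048-byte default mentioned above. The value 1024 is an arbitrary example, not a recommendation, and the aprun line is a hypothetical placeholder.

```shell
# Use the new MPI_Allgather algorithm only for messages of 1024 bytes or less.
# Assumption: the cutoff is given in bytes; 1024 is an arbitrary example value.
export MPICH_ALLGATHER_VSHORT_MSG=1024
echo "MPICH_ALLGATHER_VSHORT_MSG=${MPICH_ALLGATHER_VSHORT_MSG}"
# aprun -n 64 ./myprog     # hypothetical launch line
```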


* Improvements for some error messages

This change fixes a small number of messages specific to Cray platforms that
were incorrect due to the merging of the Cray and ANL messages and message
handling processes.


* Wildcard matching for filenames in MPICH_MPIIO_HINTS

Support has been added for wildcard pattern matching for filenames in the
MPICH_MPIIO_HINTS environment variable. This allows easier specification of
hints for multiple files that are opened with MPI_File_open in the program.
The filename pattern matching follows standard shell pattern matching rules for
meta-characters ?, \, [], and *.
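
For users less familiar with shell pattern matching, the sketch below shows how the ?, [] and * meta-characters behave, using the shell's own case statement. It only illustrates the matching rules; it does not call the MPI library, details of the library's matching may differ, and the file names and the helper function matches are made up for the example.

```shell
# matches PATTERN NAME: succeed if NAME matches the shell PATTERN.
# (hypothetical helper, for illustration only)
matches() {
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Made-up file names, as they might be passed to MPI_File_open.
for name in dump.0001 dump.0002 restart.nc out_a.dat out_7.dat; do
    for pattern in 'dump.*' 'out_?.dat' 'out_[0-9].dat'; do
        if matches "$pattern" "$name"; then
            echo "$name matches $pattern"
        fi
    done
done
```

A single pattern such as dump.* can thus cover a whole series of output files that would otherwise each need their own entry.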


* MPI Barrier before collectives

In some situations a Barrier inserted before a collective may improve
performance due to load imbalance. This feature adds support for a new
MPICH_COLL_SYNC environment variable which will cause a Barrier call to
be inserted before all collectives or only certain collectives. See the
"mpi" man page for more information.

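A minimal sketch of enabling the barrier for selected collectives only. The comma-separated list syntax is an assumption, modelled on MPICH_COLL_OPT_OFF; please verify it against the "mpi" man page. The aprun line is a hypothetical placeholder.

```shell
# Insert a Barrier before MPI_Bcast and MPI_Reduce only.
# Assumption: the variable takes a comma-separated list of collective names,
# in the same style as MPICH_COLL_OPT_OFF.
export MPICH_COLL_SYNC=MPI_Bcast,MPI_Reduce
echo "MPICH_COLL_SYNC=${MPICH_COLL_SYNC}"
# aprun -n 64 ./myprog     # hypothetical launch line
```

Since the extra Barrier has a cost of its own, it is worth timing the job with and without the setting.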

* MPI-IO collective buffering alignment

This feature improves MPI-IO by aligning collective buffering file domains
on Lustre boundaries. The new algorithms take into account physical I/O
boundaries and the size of the I/O requests. The intent is to improve
performance by having the I/O requests of each collective buffering node
(aggregator) start and end on physical I/O boundaries and to not have more
than one aggregator reference for any given stripe on a single collective
I/O call. The new algorithms are enabled by setting the MPICH_MPIIO_CB_ALIGN
environment variable but may become the default in a future release.
Initial results have shown as much as a 4X improvement on some benchmarks.
See the "mpi" man page for more information.
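
To make the alignment idea concrete, here is a small arithmetic sketch. It is not Cray's algorithm, only an illustration of rounding each aggregator's file domain up to a whole number of stripes so that no stripe is shared between two aggregators. The stripe size, data size and aggregator count are made-up numbers.

```shell
# Made-up numbers, chosen only for illustration.
stripe=$((1024 * 1024))       # assumed Lustre stripe size: 1 MiB
total=$((10 * 1000 * 1000))   # bytes written in one collective call
aggregators=4

# Unaligned split: equal shares, so boundaries fall in the middle of a stripe.
share=$(( (total + aggregators - 1) / aggregators ))

# Aligned split: round each share up to a whole number of stripes.
aligned=$(( (share + stripe - 1) / stripe * stripe ))

echo "unaligned share: $share bytes, aligned share: $aligned bytes"

i=0
while [ "$i" -lt "$aggregators" ]; do
    start=$(( i * aligned ))
    end=$(( start + aligned ))
    # The last domain stops at the end of the data.
    if [ "$start" -gt "$total" ]; then start=$total; fi
    if [ "$end" -gt "$total" ]; then end=$total; fi
    echo "aggregator $i: bytes $start to $end"
    i=$(( i + 1 ))
done
```

With aligned domains every boundary between two aggregators is a multiple of the stripe size, at the price of a somewhat smaller share for the last aggregator.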


* MPI Thread Safety

The MPI Thread Safety feature provides a high-performance implementation
of thread-safety levels MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, and
MPI_THREAD_SERIALIZED in the main MPI library.

The MPI_THREAD_MULTIPLE thread-safety level support is in a separate
"mpich_threadm" library and is not a high-performance implementation.
Use "-lmpich_threadm" when linking to MPI_THREAD_MULTIPLE routines.

Set the MPICH_MAX_THREAD_SAFETY environment variable to the desired
thread-safety level (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
MPI_THREAD_SERIALIZED, or MPI_THREAD_MULTIPLE) to control the value
returned in the "provided" argument of the MPI_Init_thread() routine.

See the "mpi" man page and the MPI standard for more information.


* Improved performance for on-node very large discontiguous messages

This feature enables a new algorithm for the on-node SMP device to process large
discontiguous messages. The new algorithm allows the use of our on-node
Portals-assisted call that is used in our MPT 3.0 single-copy feature rather
than buffering the data into very small chunks, as was previously done.
Some applications have seen as much as a 3X speedup with discontiguous messages
in excess of 4M bytes.

Several key software and library packages have now been updated on hexagon.
We recommend that you recompile your programs to get the increased performance and fixes that have been introduced. Note that you need to log out and in again to get the new modules loaded by default.

Compiler and MPI:
xt-mpt 3.0.3 -> 3.0.4
pgi 7.2.5 -> 8.0.1

Profiler with supporting libraries:
xt-craypat 4.3.2 -> 4.4.0
apprentice2 4.3.0 -> 4.4.0
xt-papi 3.6.1a -> 3.6.2
dwarf 8.6.0 -> 8.8.0
elf 0.8.9 -> 0.8.10

The libsci library has been updated to version 10.3.0 and includes optimizations and new libraries. Users are encouraged to recompile their applications to benefit from the optimizations and bug fixes.

Description of new features in xt-libsci 10.3.0:

CRAFFT (Cray Adaptive FFT) is a new feature in libsci-10.3.0. CRAFFT uses
offline and online testing information to adaptively select the best FFT
algorithm from the available FFT options. CRAFFT provides a very simple
user interface into advanced FFT functionality and performance. Planning
and execution are combined into one call with CRAFFT. The library comes
packaged with pre-computed plans so that in many cases the planning stage
can be omitted. Please see the manual page intro_crafft for more information.

Usage note: for best results with CRAFFT, please copy the file
/opt/xt-libsci/10.3.0/fftw_wisdom into the Lustre directory from which the
executable is run.

LibGoto 1.26 includes enhanced BLAS performance. There are several libsci
library variants installed with the libsci-10.3.0 package.

To use threaded BLAS, the thread-enabled libsci library whose name is
suffixed with '_mp' should be linked explicitly

e.g. ftn -o myexec -lsci_quadcore_mp

Dependencies:
=============

The libsci-10.3 module now depends on fftw-3.1.1. If you wish to use fftw
version 2.1.5 instead, do the following:

module swap fftw/3.1.1 fftw/2.1.5.1

Since the last big software update on June 16th several libraries and programs have been updated.

MPT (MPI) 3.0.2
pgi 7.2.3
pathscale 3.2
CrayPat 4.3.1
libfast 1.0 (new library with some optimized math functions)
fftw 2.1.5.1
PAPI 3.6
Totalview 8.4.1b
gcc 4.2.4 (only for login-node programs)
xt-asyncpe 1.0c (new compiler wrappers)
xt-binutils-quadcore 2.0.1 (binutils for AMD quadcore)
Moab 5.2.3 scheduler (remember to log out and in again)

Users will need to log out and in again to get the above as default modules.
Because all applications that run on the compute nodes are statically compiled, we encourage re-compiling of applications and libraries, especially if you have experienced problems.

Planned maintenance for a software upgrade will take place on hexagon on Monday June 16th, starting at 14:00 and expected to last approximately 3 hours.

The Cray software release will be upgraded from 2.0.44 to 2.0.53.
This release will have more quad-core optimizations as well as a new version of the MPI library. We therefore recommend that you recompile your programs and libraries after the upgrade. We will notify you when we have re-compiled the libraries/modules installed by us.

Update 16th, 14:40 System taken down.
Update 16th, 19:30 System back online with version 2.0.53 and MPT 3.0

Status of the re-compiled libraries:

All compute-node (cnl) software has been re-compiled.
Most login node software has been recompiled, except GNUPLOT.
UPC is not re-compiled yet.

Early on March 26th hexagon will be shut down for the initial quad-core upgrade. We hope to be able to have parts of the machine up while the second half is upgraded. It will nevertheless mean that the entire machine will be taken down first, before being booted to a smaller size. The physical upgrade will probably take three days. There will then be some more days of tuning and reconfiguring.

One very important part of this is that ALL programs and libraries will have to be re-compiled when hexagon is booted up after the finished upgrade.

Wednesday, 09:00: Upgrade has started. Machine is now down for a while for diagnostics.

Wednesday, 12:30: Half of the machine is now running again, while the other half is being upgraded to quad-core. We expect to take the entire machine down Friday morning. Please consider the machine to be in a testing state, so unannounced downtime might occur.

Wednesday, 16:45: The upgrade is ahead of schedule, therefore the machine will be taken down tomorrow around 10am.

Thursday, 12:00: Two racks are now running and will stay up until tomorrow morning, Friday 28th, when the entire machine will be shut down at 8am. The machine will then stay down until at least Monday.

Friday, 08:00: The hardware part of the upgrade is now finished. The machine is unavailable until the software upgrade, diagnostics and testing have finished.

Saturday, 17:00: Main part of software upgrade is finished. The machine is running, but is unavailable due to testing.

Tuesday, April the 1st, 18:00: Hexagon is now available again, see http://www.parallaw.uib.no/syslog/153 for more details.