Updated libraries on hexagon, Dec. 15th

Several key software and library packages have now been updated on hexagon.
We recommend that you recompile your programs to get the increased performance and fixes that have been introduced. Note that you need to log out and log in again for the new modules to be loaded by default.

See below for some excerpts from the release notes.

MPI and compiler wrappers:
xt-mpt 3.0.4 -> 3.1.0
xt-asyncpe 1.2 -> 2.0

Math libs (LAPACK, BLAS, etc.):
xt-libsci 10.3.0 -> 10.3.1

Notes from Cray regarding the new MPI version:

The MPT 3.1 release contains the following new features:

* Move from MPICH2 1.0.4p1 to MPICH2 1.0.6p1
* CPU affinity support
* Raise the maximum number of MPI ranks from 64,000 to 256,000
* Raise the maximum number of SHMEM PEs from 32,000 to 256,000
* Automatically-tuned default values for MPICH environment variables
* Dynamic allocation of MPI internal message headers
* Improvements to start-up times when running at high process counts (40K
cores or more)
* Significant performance improvements for the MPI_Allgather collective
* Improvements for some error messages
* Wildcard matching for filenames in MPICH_MPIIO_HINTS
* Support for the Cray Compiling Environment (CCE) 7.0 compiler in
x86 ABI compatible mode
* MPI Barrier before collectives
* MPI-IO collective buffering alignment
* MPI Thread Safety
* Improved performance for on-node very large discontiguous messages

More detail on some of these features is given below.

* Move from MPICH2 1.0.4p1 to MPICH2 1.0.6p1
- Performance improvements for derived datatypes (including packing and
communication) through loop-unrolling and buffer alignment.

- Performance improvements for MPI_Gather when a non-power-of-two number of
processes is used, and when a non-zero ranked root is performing the gather
(see the sketch after this list).

- MPI_Comm_create now works for intercommunicators.

- Many other bug fixes, memory leak fixes and code cleanup.

- Includes a number of specific fixes from MPICH2 1.0.7 for regressions
introduced in MPICH2 1.0.6p1.
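
As an illustration of the MPI_Gather case above, here is a minimal C sketch
that gathers one integer per rank to a non-zero ranked root. The choice of
root rank 1 and the buffer contents are arbitrary illustration values, not
something prescribed by the release notes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int root = (size > 1) ? 1 : 0;   /* non-zero ranked root when possible */
        int sendval = rank;              /* each rank contributes its rank id */
        int *recvbuf = NULL;

        if (rank == root)
            recvbuf = malloc(size * sizeof(int));

        /* The gather path improved in MPICH2 1.0.6p1 covers non-power-of-two
           process counts and non-zero roots such as this one. */
        MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, root,
                   MPI_COMM_WORLD);

        if (rank == root) {
            printf("root %d gathered %d values\n", root, size);
            free(recvbuf);
        }

        MPI_Finalize();
        return 0;
    }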


* Automatically-tuned default values for MPICH environment variables

Several of the MPICH environment variable default values are now dependent
on the total number of processes in the job. Previously, these defaults
were set to static values. This feature is designed to allow higher scaling
of MPT jobs with fewer tweaks to environment variables. For more information
on how the new defaults are calculated, please see the "mpi" man page. As
before, the user is able to override any of these defaults by setting the
corresponding environment variable. The new default values are displayed
via the MPICH_ENV_DISPLAY setting.
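
A simple way to see the new defaults on hexagon is to run any MPI program with
MPICH_ENV_DISPLAY set (e.g. to 1) in the job environment. The sketch below is
nothing more than an MPI skeleton for that purpose; how you export the variable
(for example before the aprun line in your job script) depends on your setup.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* With MPICH_ENV_DISPLAY exported in the job environment, rank 0
           prints the MPICH environment variable settings, including the new
           process-count-dependent defaults, during MPI_Init. */
        MPI_Init(&argc, &argv);
        /* ... application code ... */
        MPI_Finalize();
        return 0;
    }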



* Dynamic allocation of MPI internal message headers

If additional message headers are required during program execution, MPI
dynamically allocates more message headers in quantities of MPICH_MSGS_PER_PROC.


* Significant performance improvements for the MPI_Allgather collective

This change adds a new MPI_Allgather collective routine which scales well
for small data sizes. The default is to use the new algorithm for any
MPI_Allgather call with 2048 bytes of data or less. The cutoff value can be
changed by setting the new MPICH_ALLGATHER_VSHORT_MSG environment variable.
In addition, some MPI functions that use allgather internally will now be
significantly faster; for example, MPI_Comm_split will be significantly faster
at high PE counts. Initial results show improvements of around 2X at 16
cores and over 100X above 20K cores.
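
For reference, the sketch below issues a small MPI_Allgather (one integer per
rank), which is the very short message regime described above and would
therefore use the new algorithm by default; the message size is purely
illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One int per rank is a very short message, the regime covered by
           the cutoff governed by MPICH_ALLGATHER_VSHORT_MSG. */
        int sendval = rank;
        int *recvbuf = malloc(size * sizeof(int));

        MPI_Allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("allgather of %d ints complete\n", size);

        free(recvbuf);
        MPI_Finalize();
        return 0;
    }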


* Improvements for some error messages

This change fixes a small number of error messages specific to Cray platforms
that were incorrect due to the merging of the Cray and ANL messages and
message-handling processes.


* Wildcard matching for filenames in MPICH_MPIIO_HINTS

Support has been added for wildcard pattern matching for filenames in the
MPICH_MPIIO_HINTS environment variable. This allows easier specification of
hints for multiple files that are opened with MPI_File_open in the program.
The filename pattern matching follows standard shell pattern matching rules for
meta-characters ?, \, [], and *.
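
The hint matching applies to files opened through MPI-IO, as in the minimal C
sketch below. The filename "output.dat" is only an example; a hint
specification in MPICH_MPIIO_HINTS whose filename pattern is, say, "*.dat"
would then cover this file and any other .dat file the program opens (see the
"mpi" man page for the exact hint-string syntax).

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Init(&argc, &argv);

        /* Hints matched via MPICH_MPIIO_HINTS are applied at open time,
           without passing an explicit MPI_Info object here. */
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* ... MPI-IO writes ... */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }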


* MPI Barrier before collectives

In some situations, when there is load imbalance among processes, inserting a
Barrier before a collective may improve performance. This feature adds support
for a new
MPICH_COLL_SYNC environment variable which will cause a Barrier call to
be inserted before all collectives or only certain collectives. See the
"mpi" man page for more information.


* MPI-IO collective buffering alignment

This feature improves MPI-IO by aligning collective buffering file domains
on Lustre boundaries. The new algorithms take into account physical I/O
boundaries and the size of the I/O requests. The intent is to improve
performance by having the I/O requests of each collective buffering node
(aggregator) start and end on physical I/O boundaries and to not have more
than one aggregator reference for any given stripe on a single collective
I/O call. The new algorithms are enabled by setting the MPICH_MPIIO_CB_ALIGN
environment variable but may become the default in a future release.
Initial results have shown as much as a 4X improvement on some benchmarks.
See the "mpi" man page for more information.


* MPI Thread Safety

The MPI Thread Safety feature provides a high-performance implementation
of thread-safety levels MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, and
MPI_THREAD_SERIALIZED in the main MPI library.

The MPI_THREAD_MULTIPLE thread-safety level support is in a separate
"mpich_threadm" library and is not a high-performance implementation.
Use "-lmpich_threadm" when linking to MPI_THREAD_MULTIPLE routines.

Set the MPICH_MAX_THREAD_SAFETY environment variable to the desired
thread-safety level (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
MPI_THREAD_SERIALIZED, or MPI_THREAD_MULTIPLE) to control the value
returned in the "provided" argument of the MPI_Init_thread() routine.

See the "mpi" man page and the MPI standard for more information.


* Improved performance for on-node very large discontiguous messages

This feature enables a new algorithm for the on-node SMP device to process
large discontiguous messages. The new algorithm uses the on-node
Portals-assisted call introduced with the MPT 3.0 single-copy feature, rather
than buffering the data in very small chunks as was previously done.
Some applications have seen as much as a 3X speedup with discontiguous messages
in excess of 4 MB.
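
The sketch below builds a large strided (and therefore discontiguous) datatype
with MPI_Type_vector and sends it between ranks 0 and 1; run with both ranks
on the same node, this is the kind of transfer the new algorithm targets. The
block count, block length, and stride are illustrative values chosen to give a
payload above 4 MB.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);          /* run with at least two ranks */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 65536 blocks of 16 doubles, strided by 32 doubles: an 8 MB
           discontiguous message spread over a 16 MB buffer. */
        const int count = 65536, blocklen = 16, stride = 32;
        double *buf = calloc((size_t)count * stride, sizeof(double));

        MPI_Datatype strided;
        MPI_Type_vector(count, blocklen, stride, MPI_DOUBLE, &strided);
        MPI_Type_commit(&strided);

        if (rank == 0)
            MPI_Send(buf, 1, strided, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, 1, strided, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&strided);
        free(buf);
        MPI_Finalize();
        return 0;
    }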