Hexagon crashed at 20:20 due to an EPO (emergency power off) problem on cabinet 3. We are running diagnostics and will then restart the machine.
Update 22:50: the machine is back up.
Author Archives: lsz075
Hexagon: Updated software/libraries
Hexagon has updated software/libraries. Users should recompile their applications to gain new features and updates.
MPI
xt-mpt 5.1.2 -> 5.1.3
Compilers and debugging
xt-asyncpe 4.5 -> 4.6
Intel compiler 11.1.073 -> 12.0.084
Chapel 1.2.0 -> 1.2.1
ATP 1.0.3 -> 1.1.0
STAT 1.1.1 -> 1.1.2
Performance tools and Math libs
xt-craypat/apprentice2 5.1.2 -> 5.1.3
xt-libsci 10.4.9 -> 10.5.0
NOTES:
xt-mpt
The following features were added to MPT 5.1.3:
- Improvements to MPI-IO collective buffering.
ATP
Abnormal Termination Processing (ATP) is a system that monitors Cray XT
System user applications, and should an application take a system trap,
ATP performs analysis on the dying application. With release 1.1 all of
the stack backtraces of the application processes are gathered into a
merged stack backtrace tree and written to disk as the file
"atpMergedBT.dot". The stack backtrace for the first process to die is
sent to stderr as is the number of the signal that caused the death.
atpMergedBT.dot can be viewed with 'statview', a component of the STAT
package (module load stat). The merged stack backtrace tree provides
a concise, yet comprehensive, view of what the application was doing at
the time of its death.
Further information on ATP can be found in the intro_atp man page.
Release notes for release 1.1.0
--------------------------------
1.1.0:
- ATP is now automatically linked in to user applications
and automatically initialized. That is, users do not need
to modify their source code nor their link line (and, in
fact, should not). One must use the Cray compiler drivers
(cc, CC, ftn) to achieve this.
In order for this to occur one must do all of the following:
o have the module atp/1.1.0 or greater loaded (which
is automatically done by the latest PrgEnv modules)
o use the Cray compiler drivers when linking
o relink your application
- It is now necessary to overtly define the environment variable
'ATP_ENABLED' so that the running of an application gets ATP
processing.
- ATP will now perform its analysis in the event of the
queuing system aborting the job due to the wall clock
expiring. Note that the amount of time between when the
queuing system signals that the wall clock has expired
and when the queuing system SIGKILLs the job is
something that sites can customize. If sufficient time
is not configured, this feature may not be able to
complete its task. Thirty seconds is typically more than
generous.
- The environment variable ATP_HOLD_TIME can be used to
define the number of minutes that ATP should hold a dying
application in stasis so that it can be attached via
a debugger.
- ATP is now willing to collect data, even if some nodes
have stopped responding. Since such a system is clearly
sick in some manner, this may not always be successful.
- Fixed a memory corruption bug that could cause various,
unpredictable symptoms.
- ATP no longer needs to be installed on a cross-mounted location.
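The run-time side of the ATP 1.1.0 notes above could be sketched in a job script roughly as follows. This is only an illustration, not an official recipe: the value 1 for ATP_ENABLED and the hold time of 5 minutes are assumptions, APP and NPROCS are hypothetical placeholders, and the application must already have been relinked with the Cray compiler drivers. The launch is guarded so the sketch does nothing harmful on a machine without aprun.

```shell
#!/bin/sh
# Sketch of the run-time half of the ATP 1.1.0 workflow.
# The relink step (with cc, CC or ftn) must already have been done.

# ATP processing must be switched on explicitly as of 1.1.0.
# Assumption: the value 1 enables it; see the intro_atp man page.
export ATP_ENABLED=1

# Optional: hold a dying application for some minutes so that a
# debugger can be attached. The value 5 is only an example.
export ATP_HOLD_TIME=5

# APP and NPROCS are hypothetical placeholders for your binary and core count.
APP=${APP:-./my_app}
NPROCS=${NPROCS:-32}

# Launch only where the Cray launcher exists.
if command -v aprun >/dev/null 2>&1; then
    aprun -n "$NPROCS" "$APP"
else
    echo "aprun not found; would run: aprun -n $NPROCS $APP"
fi

# After an abnormal exit, ATP writes the merged backtrace tree to disk.
if [ -f atpMergedBT.dot ]; then
    echo "atpMergedBT.dot written; view it with statview (module load stat)"
fi
```

Keeping the two exports in the job script rather than in a login profile makes it easy to see which runs had ATP active.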
xt-libsci
The following features were added to LibSci 10.5.0:
* Now includes CASE, a collection of simplified interfaces into high
performance LAPACK and ScaLAPACK style routines that find the
eigenvalues and eigenvectors of a symmetric or Hermitian matrix. CASE
is written in Fortran but has interfaces for C users as well. CASE is
provided for serial or parallel problems, with real or complex data
types and single or double precision. It has generic interfaces
callable from Fortran and specific interfaces callable from Fortran or
C. See the 'intro_case' man page for more information.
* The LibSci 10.5.0 release adds new interfaces for CRAFFT, giving users
the option to use CRAFFT serial and distributed routines in C
applications. CRAFFT offers a simpler interface for FFT routines to
improve application developer productivity. In some cases the
performance of the CRAFFT distributed transforms is 10-50% better than
FFTW2 MPI transforms. Users requiring more information on usage should
see the intro_crafft man page.
STAT
Eliminates the need to have the STAT daemon installed on the Lustre
file system.
Perftools (craypat/apprentice2)
IMPORTANT NOTE: The perftools modulefile needs to be loaded; otherwise
the following error will occur if a user attempts to load craypat or
apprentice2:
ERROR: xt-craypat and apprentice2 have been merged into one module
called perftools.
Please run the following to load perftools:
module unload xt-craypat apprentice2
module load perftools
General
* UPC and CAF require CCE version 7.2.7 or later (see bug 763219)
pat_build
* update the following trace groups: adios, dmapp, hdf5, netcdf,
petsc, pgas, pthreads, upc
* allow tracing of functions defined as WEAK (bug 764102)
* remove PAT_BUILD_ADDSYM and PAT_BUILD_TRACE_ARCHIVE environment
variables
* add new directives to control addsym utility features
* improve tracing functions that have aggregates as formal parameters
(bug 764058)
pat_report
* now shows inclusive loop times from CCE -hprofile_generate option
Hexagon: failed seastar in one module
Hexagon has a problem in its high-speed network. We are working to fix it. All running jobs failed.
Update: 22:55 Machine is back online.
Fimm: compute nodes down due to cooling failure
At 8:30 this morning all compute nodes on fimm were shut down due to a cooling failure in the machine room.
Cooling is now back to normal, and we will bring up all compute nodes within the next 20 minutes.
Hexagon: down due to cooling failure
Hexagon was shut down due to a cooling failure in the building at 06:30.
We are investigating.
Update: 08:00 We will do some already planned maintenance while the machine is down.
Update: 13:05 Machine is back online. Maintenance done to disk-system firmware and some Lustre config checks, as well as a couple of hardware replacements.
Hexagon: Updated software/libraries
Hexagon has updated software/libraries.
MPI / compilers
xt-mpt 5.1.1 -> 5.1.2
xt-asyncpe 4.4 -> 4.5
Math libraries
xt-libsci 10.4.8 -> 10.4.9
PETSc 3.1.03 -> 3.1.04
Trilinos 10.2.0 -> 10.6.0
libfast 1.0.7 -> 1.0.8
NOTES:
xt-mpt
Bugfixes.
xt-asyncpe
Bugs fixed in this release:
744483 linux asyncpe drivers fail to handle verbose flag correctly for GNU and PGI
750934 On a CADE system, using only the '-V' option does not pass target option
754620 cc -help does not give individual compiler options.
758200 cc -V and ftn -V no longer work properly
765639 compiling w/ -Bdynamic (icc) requires PrgEnv-intel at run time.
765956 pgi compiler wrapper with xt-libsci/10.4.8 has no openmp
765957 CPR - "-l mpich_cpr" could be added by a module.
765991 The CC driver is not setting the proper -hcpu type.
766162 Change lib search path for trilinos 10.6.
766779 xtpe-mc8 module missing istanbul_mp setting
766805 Intel compiler problem with Trilinos-10.6
766847 PrgEnv-cray needs to point to Trilinos built by GCC/4.4
Differences:
Beginning with xt-asyncpe 4.4, the f77 script for pgi is now aliased to
ftn. In a future release, it will be removed altogether.
xt-libsci
The following features were added to LibSci 10.4.9:
- Now includes new explicit entry-points for faster complex linear
solvers using the 3m algorithm for complex matrix-matrix
multiplication. The faster matrix multiplication can also be used
without code modification by setting the environment variable
LIBSCI_USE_3M in the job execution script. Note that in previous
LibSci versions this algorithm was used by default, so users may need
to set LIBSCI_USE_3M to obtain previous performance levels. This
affects the LAPACK driver routines ZPOSV, CPOSV, ZGESV, CGESV and the
LAPACK factorization routines ZPOTRF, CPOTRF, ZGETRF, CGETRF. See the
intro_lapack man page for usage information.
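As a rough illustration of the LIBSCI_USE_3M note above, a job script fragment might look like the following. The value 1 is an assumption (the note only says the variable must be set); the intro_lapack man page is the authority on accepted values.

```shell
#!/bin/sh
# Job-script fragment: opt back in to the 3m complex matrix-matrix
# multiplication, which LibSci 10.4.9 no longer uses by default.
# Assumption: the value 1 enables it; check the intro_lapack man page.
export LIBSCI_USE_3M=1

# Record the setting in the job log so that timing differences between
# runs can be traced back to it.
echo "LIBSCI_USE_3M=${LIBSCI_USE_3M:-unset}"
```

Setting the variable before the application is launched matters, since it is read from the job's environment.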
PETSc
Purpose:
--------
PETSc-3.1.04 includes new Cray Adaptive Sparse Kernels (CASK) for
triangular solution that allow further performance improvement for
very sparse matrices. This module is equivalent to the official
version of PETSc-3.1 with patch level 5, which includes several
bug fixes.
Product and OS Dependencies:
----------------------------
xt-asyncpe 4.4
xt-libsci 10.4.9 or later
MPT 5.0.2 or later for gcc 4.5 support
PETSc 3.1.04 is not supported with the PathScale compiler.
Documentation:
--------------
http://www.mcs.anl.gov/petsc/petsc-as/documentation/index.html
http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview
Trilinos
Purpose:
--------
Trilinos-10.6 includes many new features and bug fixes. Detailed
information is available at:
http://trilinos.sandia.gov/release_notes-10.6.html
Cray Trilinos provides 33 packages. Please see the man page,
intro_trilinos for the package names and descriptions.
Cray's Epetra package introduces the new version of Cray Adaptive
Sparse Kernels (CASK), including improved multiple-vector sparse
matrix vector multiplication and triangular solution kernels. These
kernels improve the performance of the operations with very sparse
matrices.
Cray's Amesos package provides an interface to the sparse direct
solvers from SuperLU-4.0, SuperLU_DIST-2.3 and MUMPS-4.9.2 available
in the Cray petsc module. Cray's Zoltan package provides an interface
to ParMetis-3.1.1 in the Cray petsc module. These interfaces allow
users to call these popular sparse matrix and graph partitioning
packages with ease and interoperate with the other packages of
Trilinos.
Product and OS Dependencies:
----------------------------
xt-asyncpe 4.5 or later
xt-libsci 10.4.9 or later
petsc-3.1.04 or later
PGI 10.0.0 or later
Known Problems:
---------------
PGI:
Due to several template-handling problems in the PGI C++ compiler, PGI
compiler users might see link-time or run-time errors when using
relatively new capabilities based on C++ templates, such as the Tpetra
and Teuchos packages. As a workaround, we recommend using another
compiler environment (Cray, GNU or Intel) instead of PGI, or avoiding
these new capabilities.
Intel:
To avoid link-time problems due to a missing libstdc++.a library, a gcc
module needs to be loaded. Please load gcc version 4.2, 4.3 or 4.4
only.
Example
module load PrgEnv-intel
module load petsc
module load trilinos
module load gcc/4.2.3
Documentation:
--------------
References and API guide are available at
http://trilinos.sandia.gov/index.html
To see descriptions of each individual Trilinos package, go to
http://trilinos.sandia.gov/capabilities.html
libfast
Fastmv 1.0.8 extends the domain of the sin, sincos and cos intrinsics to
all finite reals.
FIMM: file system crash
Yesterday, 18th November 2010, around 14:00, the GPFS file system on the fimm cluster crashed. We were replacing a switch, which should have been possible without taking down the GPFS file system, but unfortunately the file system crashed.
The problem was resolved around 15:30 the same day. Hopefully this will fix the recurring GPFS file system crashes on fimm.
Sorry for the inconvenience.
FIMM: file system crash
We are still experiencing problems with our new 10GB internal
network. Today around 18:30 the GPFS file system crashed, and all
running jobs were killed.
We brought the file system back up at 21:15, and
you can now submit your jobs again.
Sorry for the inconvenience.
FIMM: file system crash
We are still experiencing problems with our new 10GB internal
network. Yesterday around 21:30 the GPFS file system crashed, and all
running jobs were killed.
We brought the file system back up at 22:15, but we put a reservation on most
of the cluster. This morning the reservation was removed, and
you can submit your jobs again.
Sorry for the inconvenience.
Hexagon: new default permission settings from November 15th
Under the new policy we are going to change the default permission bits for your
home and work directories ($HOME and /work/$USER).
The new default will be: full control by owner, no access for group
and others (numeric: 700).
The change will be performed on Monday November 15th, at 12:00.
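To see what mode 700 means in practice, the following sketch applies it to a throwaway directory instead of real data. The 750 step is a hypothetical example of how a user might later reopen a directory to their group, if that is permitted; it is not part of the announced change.

```shell
#!/bin/sh
# Demonstrate the permission modes on a scratch directory, not on $HOME.
demo_dir=$(mktemp -d)

# New default: owner has full control, group and others have no access.
chmod 700 "$demo_dir"
mode_default=$(ls -ld "$demo_dir" | cut -c1-10)
echo "default: $mode_default"

# Hypothetical opt-in: let group members list and read the directory.
chmod 750 "$demo_dir"
mode_shared=$(ls -ld "$demo_dir" | cut -c1-10)
echo "shared:  $mode_shared"

# Clean up the scratch directory.
rmdir "$demo_dir"
```

The same ls -ld check can be run against $HOME and /work/$USER after the change to confirm the new mode.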