To use the checkpointing feature, the application must be compiled with BLCR and Cray MPT version 3.0.1 or later:

module load blcr

With the module loaded, all necessary options are added to the compiler wrapper automatically. Only the MPI and SHMEM programming models are supported.

The job script must contain at least the following parameter:
#PBS -c enabled

See man qsub for more parameters.

To checkpoint and hold the job, the user executes:
qhold JOBID

To release the job and continue:
qrls JOBID
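Putting the pieces together, a minimal checkpointable job script might look like the following sketch (the job name, walltime, process count, and executable name are placeholders, not values from this announcement):

```shell
#!/bin/bash
#PBS -N ckpt-example        # job name (placeholder)
#PBS -c enabled             # make the job checkpointable
#PBS -l walltime=01:00:00   # walltime (placeholder)

module load blcr

cd "$PBS_O_WORKDIR"
aprun -n 16 ./my_mpi_app    # placeholder MPI executable
```

The script is submitted with qsub as usual; qhold and qrls then checkpoint and resume the job as described above.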

The Cray checkpoint/restart solution uses the BLCR software from Berkeley Lab and inherits its limitations. For more information, refer to the BLCR documentation: http://upc-bugs.lbl.gov/blcr/doc/html/index.html.

To reconfigure the home file system setup on the Fimm cluster and avoid the missing-home-folder issue on the compute nodes, the whole Fimm cluster will have downtime on the 6th of April.
The entire Fimm cluster is reserved for maintenance from 11:00 on the 6th of April. Newly submitted jobs that cannot finish before that time will not start, and jobs already running that cannot finish before that time will be killed.

We will follow up with more information about the new home file system configuration on the Fimm cluster and keep you updated about the maintenance.

If you have any questions, please contact hpc-support@hpc.uib.no or support-uib@notur.no.

Several libraries and compilers have been updated on Hexagon.

MPI:
xt-mpt 4.0.2 -> 4.0.3

Math libs:
xt-libsci 10.4.2 -> 10.4.3
PETSc 3.0.0.9 -> 3.0.0.10
libfast 1.0.6 -> 1.0.7

Compilers:
PGI 10.2.0 -> 10.3.0
Intel 11.1.064 -> 11.1.069

NOTES:

xt-mpt:

Features:
The algorithms used for shmem_set_lock and shmem_clear_lock have been improved for much better scaling. In a basic test of calls to set_lock and clear_lock by a set of PEs all competing for the same lock, MPT 4.0.2 and MPT 4.0.3 perform about the same for a few nodes, but beyond just a few, the time per PE for MPT 4.0.2 steadily increases with the number of PEs whereas the time per PE for MPT 4.0.3 stays level. At just 128 PEs, MPT 4.0.3 is about 4 times faster than MPT 4.0.2 and the difference keeps increasing. In addition, the new algorithm grants the lock in the same order as the lock was requested whereas with the old algorithm it was somewhat random which PE waiting for the lock would get it next.

xt-libsci:

Bugs fixed in the Libsci 10.4.3 release:
757748 LIBSCI - */lib/libsci_mc12.so missing for all compilers.
757785 libsci_m12.a missing in gnu/lib/44 and gnu/lib/43 formats
757821 Libsci 10.4.2 is not compatible with PGI 9.0 and earlier

libfast:

This release of libfast_mv 1.0.7 contains two new routines:
* frda_sqrt(), an array version of the square root function, sqrt();
* frda_rsqrt(), an array version of the inverse square root function, 1/sqrt().

PETSc:

New hypre-2.6.0b https://computation.llnl.gov/casc/hypre/software.html

PGI:

The following bugs are fixed in the PGI 10.3.0 release:
754306 pgcc compiling #include with -Xa compiler option yields 968 lines of error messages [TPR 16276]
754847 SLES 11 missing macro def for __CPU_ISSET [TPR 16594]
755699 PGI pgf90 OpenMP doesn't issue message for missing SAVE attribute for var in THREADPRIVATE [16504]
756213 On XT the PGI (10.0.0) compiler fails with 'asm' instruction in [TPR 16620]
756425 PGF90-F-0000-Internal compiler error. [16527]
757047 PGI OpenMP pgf90 should give msg if ALLOCATABLE array in THREADPRIVATE doesn't have SAVE attribute [16504]
757169 PGI OpenMP pgf90 ignores task to create a file when task appears in sequential part of program [16602]
757662 PGI 10.2.0 incompatible with glibc >=2.7 CPU_SET [TPR 16594]

All users logging in to the Hexagon login nodes are automatically "niced" to +5, and each session on a login node is limited to 100 running processes. This does not affect the compute nodes in any way; jobs will not be affected.

This is done primarily to prevent one user's CPU-intensive tasks from affecting other users on the same login node.
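To see the effect, you can inspect the niceness of your own login session. A minimal sketch (the value 5 assumes the login-node policy described above; on other systems it will differ):

```shell
# Print the niceness of the current shell process; on a Hexagon
# login node after this change it should report 5.
ps -o ni= -p $$

# Child processes inherit the niceness; `nice` with no arguments
# prints the niceness at which new commands would run.
nice
```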

Please give feedback via support-uib@notur.no.

Several libraries and compilers have been updated on Hexagon.

NOTE: We have found that the module xtpe-barcelona was not loaded by default for a time. If you have not loaded it manually, your programs will not be fully optimized for Hexagon. Please log out and back in, then recompile your programs.

Note also that "xt-atp" has been renamed to "atp".

Updated libraries/compilers:

* xt-asyncpe 3.7
Bug Fixes and support for the CCE 7.2 compilers with DSLs.
* Libsci 10.4.2
OpenMP/SMP support and dynamic shared library support for
the CCE compiler.
* Trilinos 10.0.1
Performance enhancements.
* hdf5-netcdf 1.7
Support for the CCE C++ ABI compliant compiler.
* MPT 4.0.2
Support for the CCE C++ ABI compliant compiler.
* Cray Debugger tools
ATP 1.0.1
STAT 1.0.0
MRNet 2.2.0.1
Initial release of statview as part of STAT. Bug fixes to
ATP and MRNet.
* PGI 10.1.0 and 10.2.0
Bug Fix releases of PGI.
* GCC 4.4.3
Bug Fix releases of GNU.

More information:

xt-libsci:

Xt-libsci 10.4.2 contains dynamic shared libraries for the Cray compiler.
This release also contains new dynamic shared libraries for barcelona,
istanbul and mc12 hardware.

The multi-threaded libsci implementation has been significantly enhanced
for shared-memory parallel programs. The new implementation uses
OpenMP; therefore, the previous environment variable GOTO_NUM_THREADS is
no longer used.
Performance improvements of 2X or more are common for multi-threaded
Level 2 BLAS routines, and Level 3 BLAS routines are significantly
improved when running with OMP_NUM_THREADS greater than 1.

Loader options for OpenMP support:
To use the OpenMP libraries, link with the options specified below.
The examples below are for the Barcelona (quad-core) processor.

module load xtpe-barcelona
PGI
cc -mp foo.c *.o -lsci_quadcore_mp
ftn -mp foo.f90 *.o -lsci_quadcore_mp
GNU
cc -fopenmp foo.c *.o -lsci_quadcore_mp
ftn -fopenmp foo.f90 *.o -lsci_quadcore_mp
INTEL
cc -openmp foo.c *.o -lsci_quadcore_mp
ftn -openmp foo.f90 *.o -lsci_quadcore_mp
PATHSCALE
cc -mp foo.c *.o -lsci_quadcore_mp
ftn -mp foo.f90 *.o -lsci_quadcore_mp
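After linking, the thread count is chosen at run time via OMP_NUM_THREADS. A run sketch (the PE count, depth, and binary name are placeholders, not values from this announcement):

```shell
export OMP_NUM_THREADS=4    # 4 OpenMP threads per PE
aprun -n 2 -d 4 ./a.out     # -d reserves 4 cores per PE (placeholder binary)
```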

Trilinos:

Trilinos is an object-oriented and componentized framework for
scientific computation, and as such allows greater flexibility,
control, portability and performance than a collection of custom
or independent solvers. The CASK library (Cray Adaptive Sparse
Kernels) is integrated with Trilinos to provide extra performance
with no additional involvement required by the user. The Cray
Trilinos package therefore enables the full productivity advantages
of the Trilinos framework while providing solvers tuned specifically
to the Cray XT hardware.

The Trilinos release 10.0.1 includes improved Cray Adaptive Sparse
Kernels (CASK) routines for sparse matrix vector multiplication with multiple vectors. Applications using Epetra will gain some performance benefits from this improvement.

Fimm home file system performance will be degraded for 4-6 hours (especially logging in to Fimm and running ls on /home/fimm). The reason is that we are migrating the Fimm home file system from the current storage system to another, better storage system.
We will keep you updated regarding the time window.


23/02 08:00 Update

The file system migration is complete, and the performance of the Fimm home file system is back to normal.