Fimm: login node crash

lsz075 • June 18, 2010

The Fimm login node crashed because of a GPFS filesystem hang. The login node is up and running again. All open sessions were killed.

Hexagon: Updated software/libraries

lsz075 • May 27, 2010

Hexagon has updated libraries.

* MPT 4.1.1
  Bug fixes.
* xt-asyncpe 3.9
  Bug fixes.
* Cray Scientific and Math Libraries 4.13
  LibSci 10.4.4
  CRAFFT update
  Trilinos 10.2.0
  CASK update
* PGI 10.4
  Bug fix release update from PGI.
* Cray Debugger Supporting Tools 1.0.2
  ATP 1.0.2
  Bug fixes.
* TotalView 8.8.0
  Replay Engine Feature release.

Fimm: bjerknode/Compute-1-32 is down

lsz075 • May 25, 2010

Compute-1-32 crashed last night due to excessive memory usage. We are trying to reinstall the node but are currently dealing with a hardware failure.
We are still working on it and hope that compute-1-32 will be up tomorrow. We will keep you updated.

We are sorry for the inconvenience.

UPDATE 26th May 11:47 compute-1-32 is up and running.

Fimm: file server crashed

lsz075 • May 21, 2010

One of the file servers serving the work and home file systems on the Fimm cluster crashed at 14:10. All jobs using the work file system crashed with "Stale NFS file handle" errors.

The file server has been rebooted, and the home and work file systems have been remounted on all compute nodes.

Hexagon: ncl with aprun support

lsz075 • May 4, 2010

Hexagon now has an NCL version that can run with aprun. The latest module version, 5.2.0, is aprun compatible and is loaded by default when you run module load ncl_ncarg.

If you miss some features and want to run ncl on a login node, use module load ncl_ncarg/5.2.0-login instead.

Hexagon: down, HSN link problem

lsz075 • April 30, 2010

An HSN link failure has brought Hexagon down. We are working on the problem.
Update: 20:50 the machine is back online.

Hexagon: scheduled maintenance, Mon. May 10

lsz075 • April 30, 2010

Hexagon will have scheduled maintenance on Thursday, May 6th, from 12:00.
This is to fix the problem with cabinet 7.
The queue has a reservation in place, so a job will start only if, according to its requested walltime, it can complete before the maintenance begins.
This note will be updated when we have more information.

Update: Maintenance has been moved to Monday May 10th, from 12:00

Update: 10.05, 18:20 Maintenance finished, machine is back online.

Hexagon: down due to cabinet power problem

lsz075 • April 30, 2010

Hexagon went down at 12:35 due to a power problem in cabinet 7.
Update: 15:10 the machine is up without cabinet 7. We will have downtime on Thursday, May 6, at 12:00 to fix cabinet 7. The queue has a reservation in place, so a job will start only if, according to its requested walltime, it can complete before the maintenance begins.
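
To illustrate how the reservation affects your jobs, here is a minimal sketch of the check described above. It is not the scheduler's actual code: the function name can_start, the example submission time, and the choice to allow a job that ends exactly at the maintenance start are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Maintenance start announced in this note: Thursday, May 6, 2010 at 12:00.
MAINTENANCE_START = datetime(2010, 5, 6, 12, 0)


def can_start(now: datetime, walltime: timedelta,
              maintenance_start: datetime = MAINTENANCE_START) -> bool:
    """Return True if a job started at `now` would end before the maintenance.

    Hypothetical helper: the end time is computed from the requested
    walltime, not from how long the job really runs. Treating an end time
    equal to the maintenance start as acceptable is an assumption.
    """
    return now + walltime <= maintenance_start


if __name__ == "__main__":
    # Hypothetical moment at which the scheduler considers the job.
    now = datetime(2010, 5, 5, 18, 0)
    for hours in (12, 18, 24):
        ok = can_start(now, timedelta(hours=hours))
        status = "may start" if ok else "held until after the maintenance"
        print(f"requested walltime {hours:2d}h: {status}")
```

In practice this means that a job with a long requested walltime stays queued as the maintenance approaches, even if it would really finish much sooner. If your job can finish in less time, requesting a shorter walltime gives it a chance to start before the downtime.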

Hexagon: Updated software/libraries

lsz075 • April 16, 2010

Hexagon has updated libraries.

MPI
xt-mpt 4.0.3 -> 4.1.0.1

Math-libs
ACML 4.3.0 -> 4.4.0

Compilers
xt-asyncpe 3.7 -> 3.8

NOTES:

xt-mpt
Features:

The algorithms used for shmem_set_lock and shmem_clear_lock have
been improved for much better scaling. In a basic test of calls to set_lock
and clear_lock by a set of PEs all competing for the same lock, MPT
4.0.2 and MPT 4.0.3 perform about the same for a few nodes, but beyond
just a few, the time per PE for MPT 4.0.2 steadily increases with
the number of PEs whereas the time per PE for MPT 4.0.3 stays level.
At just 128 PEs, MPT 4.0.3 is about 4 times faster than MPT 4.0.2
and the difference keeps increasing. In addition, the new algorithm
grants the lock in the same order as the lock was requested whereas
with the old algorithm it was somewhat random which PE waiting for
the lock would get it next.

Adds support for dynamic libraries when using the cce compiler.

Bugs Fixed:
Bug 755075 MPICH2 threads/comm/ctxdup.c fails with "Too many communicators" in 4.0.0.3 vs 3.5.1
Bug 755698 MPI_Allgatherv hangs when using thread-safety
Bug 755490 SHMEM performance over Seastar needs improvements
Bug 755426 Divide by zero by MPIIO if file is not a Lustre file

ACML
See ACML documentation at AMD

Hexagon: login6 is going to be rebooted

lsz075 • April 15, 2010

The Hexagon login6 node has been evicted from ost8 of the Lustre /work filesystem. Files located on ost8 of the /work filesystem are not available from login6.

Please log off from login6 and use the other Hexagon login nodes. Login6 will be rebooted as soon as all jobs started from it have finished.

17/04 22:00 login6 has been rebooted and is available.