We are holding an HPC training week during week 46, so replies may take longer than usual.
Hexagon: issues with queue system
We had issues with the queue system on Hexagon; all jobs that were started before 13:30 were terminated.
We apologize for the inconvenience.
The system was recovered at around 14:00.
Hexagon: system crashed
Hexagon crashed and had to be restarted. We will provide more information later.
Update: 2015-10-21 19:30 - The system and job submission have been recovered.
Update: 2015-10-23 14:48 - Building maintenance has confirmed that the system crashed due to a power failure at around 16:45.
Hexagon: updated software
During the maintenance we have:
- applied various firmware updates and patches
- installed newer libraries, compilers and tools
Please note that all libraries compiled with the previous version of PGI will have to be recompiled.
Below you will find the complete list of the newly installed software:
- CCE 8.4.0
- Chapel 1.12.0
- Craype 2.4.2
- GCC 5.1.0
- FFTW 3.3.4.5
- HDF5 1.8.14
- PGI 15.3.0
- PerfTools 6.3.0
- MPI 7.2.5
- NetCDF 4.3.3.1
- Totalview 8.15.7
HPC training week 46
Please see the details at https://docs.hpc.uib.no/wiki/HPC_course_2015
Hexagon: scheduled maintenance on Oct. 20th
- Apply Cray SW patches to improve stability, especially of the /work filesystem.
- Add a qsub filter. Instead of sending email notifications when a job can't start or has suboptimal parameters, it will print this information to the terminal when the job is submitted.
P.S. During the maintenance /work-common will not be available on GRUNCH and FIMM.
Update: 2015-10-20 09:00 - Scheduled maintenance has started.
Update: 2015-10-20 17:57 - Maintenance is finished. Please see the changes at http://syslog.hpc.uib.no/2015/10/20/hexagon-updated-software/
Hexagon: MDS server crash
The MDS server has crashed again. Some jobs may fail.
We hope that the maintenance we are planning later this year will eliminate these MDS crashes (a separate announcement will follow).
Hexagon: MDS server crash
Today at 8:23 the primary MDS serving /work crashed, which suspended all I/O to /work.
The failover MDS has been up since 10:50 and is serving the /work filesystem. All I/O should now be restored.
We will investigate the cause of the primary MDS crash on Monday.
Hexagon: 2 login nodes crashed in the last 20 hours
Two login nodes were crashed by a user-space process that requested too much memory. Jobs running from these nodes have stopped.
The following jobs were affected:
1780945
1781097
1781123
1781528
1781848
1782040
1782089
1782093
1782097
1782101
1782121
1782155
1782280
Hexagon: new PGI compiler and TotalView debugger versions
PGI compiler version 15.3.0 has been installed on Hexagon. Version 14.9.0 is still the default, but we urge you to try out the latest version. You can find the list of changes and enhancements here:
http://www.pgroup.com/support/release_tprs_2015.htm
Starting this autumn, new Cray-compiled libraries for PGI will not support PGI versions older than 15.x.
A newer version of the TotalView debugger, 8.15.4, has been installed on Hexagon; version 8.15.0 is currently the default.
