Both metadata servers and all OSSes serving the /work filesystem crashed.
We had to stop the machine and power cycle hexagon.
Author Archives: Lóránd Szentannai
Hexagon: reboot needed
All OSTs for the /work filesystem are in read-only mode and we need to reboot hexagon. We will come back with more information later on.
Update:
15:20 25-04-2016 OST 8 has corrupted data and was marked read-only by the system. There are 379 inodes containing multiply-claimed blocks. We are trying to recover from this and to identify the corrupted files. Owners of the identified corrupted files will be notified.
If you have corrupted data on /work, please contact us at support-uib@notur.no.
15:45 25-04-2016 Users were logged out and access closed in order to be able to perform maintenance on the system.
16:35 26-04-2016 Corrupted files were identified and the /work filesystem is usable again. Hexagon was rebooted and access is reopened.
We will run further checks on the /work filesystem while keeping it online. As mentioned earlier, the owners of corrupted files will be notified once this last check has finished.
Hexagon: issues with /work storage
Some of the OSTs serving the /work filesystem have become full, which caused a few jobs to fail. We are working on rebalancing the usage between the OSTs, but this is fairly difficult since /work is 87% used at the moment. We have asked the top users of the /work filesystem to clean up unnecessary files.
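For users asked to clean up, one way to look for candidates is to list large files that have not been modified for a while. The sketch below is not an official tool: the function name `list_cleanup_candidates` and its thresholds are assumptions made for illustration, and it relies only on the standard `find` utility. It lists files and deletes nothing.

```shell
# Hypothetical helper (not part of any site tooling).
# Usage: list_cleanup_candidates DIR MIN_SIZE_KIB MIN_AGE_DAYS
# Prints regular files under DIR that are larger than MIN_SIZE_KIB KiB
# and were last modified more than MIN_AGE_DAYS days ago.
list_cleanup_candidates() {
    dir=$1
    min_kib=$2
    min_days=$3
    find "$dir" -type f -size +"${min_kib}"k -mtime +"${min_days}" -print
}
```

As an example, `list_cleanup_candidates "/work/$USER" 1048576 30` would list files larger than 1 GiB that have not been modified for more than 30 days, assuming your files live under `/work/$USER`.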
Hexagon: nodes/ppn directives are deprecated
Due to several bugs in the queuing system, affecting mostly OpenMP jobs, the nodes and the ppn directives are deprecated.
The new way of submitting OpenMP jobs is covered on the HPC docs site, available here:
https://docs.hpc.uib.no/wiki/Job_execution_(Hexagon)#Parallel.2FOpenMP_jobs.
If you have any questions about how to change your script, please contact us.
Hexagon: updated software
We have installed new libraries, compilers and tools on hexagon. Below you will find the complete list of the newly installed software:
- CCE 8.4.3
- Chapel 1.12.0
- Craype 2.5.1
- GCC 5.2.0
- FFTW 3.3.4.6
- Intel Compiler 16.0.1
- HDF5 1.8.16
- LibSCI 1.13.0
- MPI 7.3.1
- PGI 15.10.0
- Totalview 8.15.10
Hexagon: user environment passing through SSH is disabled
We have disabled passing users' environment over SSH due to multiple issues.
The default environment settings are:
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
You can override it by adjusting your ~/.profile file.
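A minimal sketch of such an override is shown here. The locale value `C` is only an illustrative assumption; use whichever locale you need. Because `LC_ALL` takes precedence over `LANG` and the other `LC_*` variables, changing `LANG` alone has no effect while `LC_ALL` is still set.

```shell
# ~/.profile (sketch; the values below are examples, not site defaults)
# LC_ALL overrides LANG and every other LC_* variable, so set both
# (or unset LC_ALL) when you want a different locale.
export LANG=C
export LC_ALL=C
```

The new values apply to login shells started after the file has been changed.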
Hexagon: down, power blink
Hexagon went down due to a power blink.
Update: 2015-12-10 20:47 Machine is up again.
Hexagon: down, power spike
We got a High Speed Network link error caused by cabinet fall-outs.
The cabinets fell out around 02:10 14-11-2015, most likely due to power spikes. We are still investigating the issue.
Update: since 04:45 14-11-2015 system is up again.
Hexagon: system crashed
Hexagon crashed and had to be restarted. We will come back with more information later on.
Update: 2015-10-21 19:30 System and job submission were recovered.
Update: 2015-10-23 14:48 We got confirmation from building maintenance that the system crashed due to an electricity failure around 16:45.
Hexagon: updated software
During the maintenance we have:
- applied different firmware updates and patches
- installed newer libraries, compilers and tools
Please note that all libraries compiled with the previous version of PGI will have to be recompiled.
Below you will find the complete list of the newly installed software:
- CCE 8.4.0
- Chapel 1.12.0
- Craype 2.4.2
- GCC 5.1.0
- FFTW 3.3.4.5
- HDF5 1.8.14
- PGI 15.3.0
- PerfTools 6.3.0
- MPI 7.2.5
- NetCDF 4.3.3.1
- Totalview 8.15.7