We had to reboot login5 due to a serious routing issue.
Our apologies for any inconvenience this may cause.
/work-common/shared/imr will not be available on March 3rd
The disk area /work-common/shared/imr will be unavailable from 8:30 for a few hours. We will send a separate notice to affected users when the file system is available again.
We encourage users who have data there to copy any data needed for runs during the maintenance to the /work file system. All jobs referencing /work-common/shared/imr will be stopped before the maintenance.
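To check which of your own job scripts mention the affected path before the maintenance, a small Python sketch along these lines may help. The function name `find_references`, the `*.pbs` file pattern, and the directory layout are illustrative assumptions, not part of any Hexagon tooling.

```python
from pathlib import Path

AFFECTED_PATH = "/work-common/shared/imr"  # path named in this notice


def find_references(script_dir, pattern="*.pbs", needle=AFFECTED_PATH):
    """Return sorted paths of files under script_dir whose text mentions needle.

    The function name, the default glob pattern, and the directory layout are
    hypothetical; adjust them to how your own job scripts are organised.
    """
    hits = []
    for path in Path(script_dir).rglob(pattern):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(str(path))
    return sorted(hits)
```

For example, `find_references("jobs")` would list the scripts under a hypothetical `jobs` directory that still point at /work-common/shared/imr, so they can be updated to use a copy on /work instead.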
Hexagon: new software and libraries available
The following new versions of software and libraries were installed on Hexagon:
- Cray Compiling Environment - CCE 8.3.8
- Cray Message Passing Toolkit - MPT 7.1.2
- Cray Debugging Support Tools - CDST 15.02 - lgdb 2.4.1
- Cray Scientific and Math Libraries - CSML 15.02 - LibSci 13.0.3
- TotalView 8.15.0
Please find more details here.
The following are now default versions:
- totalview 8.15.0
- acml_5.3.1
- cray-ccdb_1.0.5
- cray-lgdb_2.4.0
- cray-libpmi-5.0.6-1.0000.10439.140.3.gem
- fftw_3.3.4.1
- ga_5.3.0.1
- gcc_4.9.2
- java_jdk1.7.0_45
- mpt_7.1.1
- parallel-netcdf_1.5.0
- petsc_3.5.2.1
- tpsl_1.4.3
- trilinos_11.12.1.0
Hexagon: rebooted login3
We had to reboot login3 because of processes stuck in an uninterruptible state. The following jobs were terminated and need to be resubmitted:
- 1654462.sdb
- 1657052.sdb
- 1650122.sdb
- 1654844.sdb
- 1657054.sdb
- 1655817.sdb
- 1653859.sdb
- 1655140.sdb
Our apologies for any inconvenience this may cause.
Hexagon: new versions of software/libraries
We have installed new versions of the following packages:
- CCE 8.3.7
- Cray Message Passing Toolkit - MPT 7.1.1 - GA 5.3.0.1
- Cray Debugging Support Tools - CDST 15.01 - CCDB 1.0.5, lgdb 2.4.0
- Cray Scientific and Math Libraries - CSML 15.01 - PETSc 3.5.2.1, Trilinos 11.12.1.0, TPSL 1.4.3
- cray-modules 3.2.10.2
Please find details here.
We are introducing a new update routine for software and libraries. New versions will be installed as non-default and will become the default after one month.
Hexagon: cooling failure forced machine to shut down
Due to a cooling failure, Hexagon was forced to shut down. We are investigating the issue and will keep you updated.
Update:
08:45 - Service personnel are on site working to fix the cooling system. We will report back as soon as the issue is resolved.
10:50 - Machine is up again.
Reboot of Hexagon, Fimm, Grunch
Due to an important security update, we will shortly reboot the above-mentioned systems.
Our apologies for any inconvenience caused by this.
Update: Hexagon and Grunch were stopped at 11:45 and were available again at 12:35. Fimm login nodes were rebooted in the background.
Hexagon: another power blink, restarted
Another thunderstorm caused the power to drop for a short moment, but long enough to stop Hexagon. We are working on bringing it back up. The forecast indicates there could be more lightning in the next 24 hours.
The last two months have seen many weather-related power interruptions, which have prevented stable runs.
Update: 22:10 Hexagon is up.
Hexagon: power blink, down
Hexagon went down because of a power blink. Since there could be more power blinks, we will keep Hexagon down until storm Nina is over.
We expect to start it on Sunday morning.
Update: Hexagon has been restarted and has been up again since 11:30.
Hexagon: restarted, another power blink
Hexagon went down due to another power blink related to a thunderstorm. We are starting the machine, and it should be up soon.
Update: 00:00 Machine is up.
