Someone managed to crash login3 and login4 by oversubscribing memory. As a result, jobs started from these login nodes were killed.
We have restarted these login nodes and will investigate the cause tomorrow.
We occasionally see NFS timeouts on different login nodes; logged-in users experience them as short hangs. This has been going on for some weeks, but not very often. During the last week the timeouts have become very frequent and affect almost all nodes.
We have applied a patch that is supposed to fix this issue. For the changes to take effect, we need to restart Hexagon.
Update 15:30: Hexagon is up again.
The /work-common/shared/imr disk space will be unavailable from 8:30 for a few hours. We will send a separate notice to affected users when the file system is available again.
We encourage users with data there to copy any data needed for runs during this maintenance to the /work file system. All jobs referencing /work-common/shared/imr will be stopped before the maintenance.
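For users who prefer to script the copy, the following is a minimal sketch, not an official tool. The function name `copy_run_data` and the example directory names in the comments are hypothetical; only /work-common/shared/imr and /work come from this notice.

```python
import shutil
from pathlib import Path


def copy_run_data(src_root, dst_root, names):
    """Copy the named files or directories from src_root into dst_root.

    Returns the list of destination paths that were written.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = src_root / name
        dst = dst_root / name
        if src.is_dir():
            # dirs_exist_ok (Python 3.8+) lets the copy be re-run safely.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            # copy2 also preserves timestamps and permission bits.
            shutil.copy2(src, dst)
        copied.append(dst)
    return copied


# Hypothetical usage; replace the project and user directories with your own:
# copy_run_data("/work-common/shared/imr/myproject",
#               "/work/myuser/imr-copy",
#               ["input", "forcing.nc"])
```

Job scripts would then need to point at the copy under /work rather than at /work-common/shared/imr for the duration of the maintenance.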
We had to reboot login3 because of processes stuck in an uninterruptible state. The following jobs were terminated and need to be resubmitted:
1654462.sdb
1657052.sdb
1650122.sdb
1654844.sdb
1657054.sdb
1655817.sdb
1653859.sdb
1655140.sdb
Our apologies for any inconvenience this could cause.
We have installed new versions of the following packages:
CCE 8.3.7
Cray Message Passing Toolkit - MPT 7.1.1
MPT 7.1.1, GA 5.3.0.1
Cray Debugging Support Tools - CDST 15.01
CCDB 1.0.5, lgdb 2.4.0
Cray Scientific and Math Libraries - CSML 15.01
PETSc 3.5.2.1, Trilinos 11.12.1.0, TPSL 1.4.3
cray-modules 3.2.10.2
Please find details here.
We are introducing a new update routine for software and libraries. New versions will be installed as non-default and will become the default after one month.
Due to an important security update, we will shortly reboot the above-mentioned systems.
Our apologies for any inconvenience caused by this.
Update: Hexagon and Grunch were stopped at 11:45 and were available again at 12:35. Fimm login nodes were rebooted in the background.
There was another thunderstorm, and power went down for a short moment, but long enough to stop Hexagon. We are working on bringing it back up. The forecast says there could be more lightning in the next 24 hours.
The last two months have seen many weather-related power interruptions, which have prevented stable runs.
Update 22:10: Hexagon is up.
Hexagon went down because of a power blink. There could be more power blinks, so we will keep Hexagon down until storm Nina is over.
We expect to start it on Sunday morning.
Update: Hexagon has been started and has been up again since 11:30.