Login2 on Hexagon ran out of memory because of a bad user script. All jobs running on the node crashed and need to be restarted.
Please make sure that you use "aprun" to run your programs, so that they do not impact the login node.
Hexagon: scheduled maintenance on Oct. 10th
Hexagon will have a scheduled maintenance on Oct 10th. It will start at 10:00 and last approximately 2 hours.
A reservation is in place, so jobs that would start before the maintenance and finish after it has begun cannot be submitted.
During the maintenance we are going to replace failed and failing components.
Update: 15:15 Maintenance finished, machine is up.
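As a rough illustration of the reservation rule above, the sketch below checks whether a job with a given requested walltime would end before the maintenance begins. The function name and the example times are hypothetical, the announcement does not state a year so the one used here is an assumption, and the real decision is made by the batch system, not by this code.

```python
from datetime import datetime, timedelta


def fits_before_maintenance(start, walltime, maintenance_start):
    """Return True if a job starting at `start` with the requested
    `walltime` ends no later than `maintenance_start`."""
    return start + walltime <= maintenance_start


# Hypothetical example: the year is assumed, only "Oct 10th, 10:00" is announced.
maintenance_start = datetime(2012, 10, 10, 10, 0)
job_start = datetime(2012, 10, 10, 7, 0)

# A 2-hour job ends at 09:00 and fits; a 4-hour job would run into the window.
print(fits_before_maintenance(job_start, timedelta(hours=2), maintenance_start))
print(fits_before_maintenance(job_start, timedelta(hours=4), maintenance_start))
```

In practice this means that lowering the requested walltime is usually the way to get a short job through before a reserved window.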
Hexagon: power spike in building causes reboot
Hexagon needs to be rebooted after a power spike left too many nodes offline.
Update 18:40, machine is now up again after a mezzanine replacement.
Hexagon: updated software/libraries
Hexagon has updated software/libraries. Log out and back in again to refresh your modules.
Note that several modules are in the process of changing names. The xt-XXXX modules are being renamed to cray-XXXX (e.g. xt-mpich2 == cray-mpich2).
For a time both names will work.
Full release notes are in: http://docs.cray.com/books/S-9401-1209//S-9401-1209.pdf
Updates:
cray-mpich2: 5.5.3 -> 5.5.4
xt-asyncpe: 5.13 -> 5.14
CCE compiler: 8.0.7 -> 8.1.0
PGI compiler: 12.6.0 -> 12.8.0
java: 1.7.0-03 -> 1.7.0-07
STAT: 1.2.1.2 -> 1.2.1.3
perftools (CrayPat): 5.3 -> 6.0 (new major release, see notes!)
cray-libsci (xt-libsci): 11.1.00 -> 11.1.01
cray-papi: 4.3.0.1 -> 5.0.0
fftw: 2.1.5.3 -> 2.1.5.4
petsc: 3.2.02 -> 3.3.00
TPSL: 1.2.01 -> 1.3.00
Trilinos: 10.8.3.1 -> 10.12.1.0
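For job scripts that still load modules by their old names, the renaming mentioned above could be applied with a small helper like the following sketch. The mapping only contains the two renames this announcement names explicitly (xt-mpich2 and xt-libsci); the function name is hypothetical, and other xt- modules may or may not follow the same pattern.

```python
# Renames stated in the announcement; extend only after checking the
# module list on the system itself.
RENAMED = {
    "xt-mpich2": "cray-mpich2",
    "xt-libsci": "cray-libsci",
}


def update_module_line(line):
    """Rewrite old module names on a `module ...` line.

    Lines that do not start with the `module` command are returned
    unchanged. On rewritten lines, whitespace is normalized to single
    spaces.
    """
    words = line.split()
    if not words or words[0] != "module":
        return line
    return " ".join(RENAMED.get(word, word) for word in words)


print(update_module_line("module load xt-mpich2"))   # module load cray-mpich2
print(update_module_line("module load xt-asyncpe"))  # unchanged, not in the mapping
```

Since both names work for a time, such a rewrite is not urgent, but doing it early avoids surprises once the old names are removed.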
Fimm cluster downtime on 11th September
Dear fimm users:
We will update maui and torque on fimm.bccs.uib.no and upgrade the qbank accounting system to gold. For that reason fimm.bccs.uib.no will be down for 8 to 9 hours on 11th of September 2012. The downtime starts at 08:00 in the morning.
Jobs that are already running but cannot finish by 08:00 on 11th September will be killed. Since the cluster is reserved for maintenance, jobs that cannot finish by that time will not be started.
We will post progress updates on this page.
Updates:
Maintenance is extended to 12:00 on 12 September 2012 due to a software problem.
Update 10:55, 12/09/2012:
We have just completed the upgrade on fimm. The resource manager and scheduler on fimm.bccs.uib.no are now updated.
We are running:
Maui 3.3.1
Torque 4.1.0
Hexagon: updated software/libraries
Hexagon has updated software/libraries:
cray-mpich2 (the earlier name, xt-mpich2, is deprecated): 5.5.2 -> 5.5.3
PGI 12.5 -> 12.6
xt-asyncpe 5.12 -> 5.13
netcdf 4.2.0 (replaced for small bug-fix)
hdf5 1.8.8 (replaced for small bug-fix)
ATP 1.5.0 -> 1.5.1
STAT 1.2.1.1 -> 1.2.1.2
The details are not yet released but will be available from:
http://docs.cray.com/relnotes/
Hexagon: scheduled maintenance, August 23rd
Hexagon will have a scheduled maintenance on August 23rd from 09:00. We expect to be back up again later the same day.
We will put back cabinet 8 (taken out during the thunderstorm crash) and install 2 software patches for increased stability.
Update August 23rd at 18:30: Hexagon is now up again. The hardware and software work was more extensive than planned; we apologize for the inconvenience.
Fimm file system crash
To all fimm users:
The GPFS file system on fimm crashed today around 10:15. The home and work file systems were unmounted from all compute nodes, and therefore all running jobs were killed.
The problem is now resolved and the file system is back online. All crashed jobs have to be resubmitted.
We apologize for the inconvenience.
Hexagon: thunderstorm causes reboot
Hexagon needs a reboot after a thunderstorm caused a power blink in the building's power supply.
Update 21:10: Hexagon is up again without cabinet 8 (needs manual intervention).
Hexagon: scheduled maintenance on August 1st
Hexagon will have scheduled maintenance August 1st extending into August 2nd.
We will make changes to all the PDUs to increase the reliability of the system and additionally install the latest update of the OS (CLE and SMW). The latest software update will also increase stability and decrease startup and debug times for system failures.
The scheduled downtime will start August 1st at 09:00.
Please send any questions to support-uib@notur.no
An email concerning this was sent out on July 25th to all hexagon users.
Hexagon Sysadmins
Updates:
Update Thursday 10:30, The changes to the PDUs are taking longer than expected; therefore the maintenance will be extended into Friday August 3rd.
Update Friday 12:00, There will unfortunately be a further delay before the system is up again. Currently, we expect to boot the system on Saturday August 4th.
Update Sunday 19:25, The maintenance has been finished and the system is back online.