Downtime

The Hexagon software upgrade is planned for 26 May at 08:30. The system will be upgraded from UNICOS/lc 2.0 to CLE 2.1UP02. This is a major software upgrade and will take from several hours to several days. We will do our best to minimize the downtime.
The Lustre file system will be upgraded from version 1.4 to 1.6, which requires a check of the /work filesystem lasting several hours. We kindly ask Hexagon users to remove all unused files from the /work filesystem, as this will shorten the downtime.
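To help with the cleanup requested above, here is a minimal sketch for listing files that have not been accessed for a while. The function name `list_unused_files`, the 30-day default, and the `/work/$USER` path in the usage note are assumptions for illustration only; the sketch relies on nothing beyond standard `find`.

```shell
# Hypothetical helper: print regular files under a directory whose last
# access time is more than N days old. Nothing is deleted.
list_unused_files() {
    dir=$1
    days=${2:-30}   # assumed default threshold, adjust as needed
    find "$dir" -type f -atime +"$days" -print
}
```

It could be called as `list_unused_files /work/$USER 60`, assuming your files live in a per-user directory. Review the list before removing anything.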
ALL programs that will be used after the software upgrade MUST be recompiled! This is very important! Running an application compiled for the current OS release (2.0) can produce unexpected results after the upgrade.
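One possible way to find programs that still need recompiling is to look for executables that were built before the upgrade completed. The sketch below assumes GNU `find` (for `-newermt`); the function name `list_old_binaries` and the directory in the usage note are hypothetical, and the cutoff is whatever time the upgrade finishes.

```shell
# Hypothetical helper: print user-executable files under a directory that
# were last modified at or before a cutoff time, i.e. built before the
# upgrade. Assumes GNU find, which provides -newermt. Executable scripts
# are listed too, so the output is a starting point, not a final answer.
list_old_binaries() {
    dir=$1
    cutoff=$2    # a timestamp GNU find understands, e.g. 'YYYY-MM-DD HH:MM'
    find "$dir" -type f -perm -u+x ! -newermt "$cutoff" -print
}
```

For example, `list_old_binaries "$HOME/bin" '2009-05-27 21:30'` would list candidates under an assumed `$HOME/bin` directory.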

Update: The upgrade has been moved to 26.05.2009 08:30. Hexagon is therefore reserved from 08:30, 26th of May. Long jobs that cannot complete before the downtime will not start.

Update: May 27th 00:00: The upgrade will continue tomorrow. The machine will be unavailable until the upgrade is finished.

Update: May 27th 16:00: We have started recompiling software on Hexagon.

Update: May 27th 21:30: Software upgrade finished. Hexagon is back online.

As mentioned before, this was a MAJOR software upgrade. Hexagon is now running CLE 2.1UP02 with the Lustre 1.6 filesystem.

Notes:

* All programs MUST be RECOMPILED!

* The following programs/modules were removed as they are no longer supported:
gmalloc
gnet
iobuf
libscifft
openGLUT

* The following software was replaced by Cray versions:
all hdf5:
hdf5
hdf5-parallel
all netcdf:
netCDF (for version 3.6.2)
netcdf (for version 4.0.1)
java/jdk 1.6.0

* The following software will be recompiled shortly:
amber
antlr
berkley-upc
cdo
coreutils
git
gnuplot
grads
grib_api
gsl
ncl_ncarg
nco
ncview
nedit
nwchem
imagemagick
jasper
libdap
libnc-dap
matlab
pgplot
python (static)
subversion
vim 7.2 or newer
WPS
WRF

* Libraries such as:
zlib
libxml2
libpng
glib2.4.2
are available by default, without loading modules.

* Module names: the program name and version number structure has changed to:
%ProgramName/%Version
e.g. nwchem-cnl/5.1.1
netCDF/3.6.2
When loading modules, users are advised to give only the program name wherever possible; the optimal version will then be loaded by default:
module load nwchem

* Please update your PBS scripts as well as your environment to load the correct modules.
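As a sketch of the PBS script update suggested in the notes above, the filter below rewrites `module load` lines so that they name only the program and leave the version to the default. The function name `strip_module_versions` is hypothetical, and the sketch assumes each `module load` command sits on its own line; all other lines pass through unchanged.

```shell
# Hypothetical filter: on lines that start with "module load", drop every
# "/version" suffix so only the program names remain.
# Reads a script on stdin and writes the result to stdout.
strip_module_versions() {
    sed -E '/^[[:space:]]*module[[:space:]]+load[[:space:]]/ s#/[^[:space:]]+##g'
}
```

It could be used as `strip_module_versions < job.pbs > job.new.pbs`, where `job.pbs` stands for one of your own scripts. Check the result by hand, since a script may pin a version on purpose.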

We have to shut down Hexagon because of a major water leak in our machine room. Shutdown at 14:20, May 8th.

Update 17:20: Due to a clog in the drainage system, sewage has spilled under the computer floor. Because of the danger of short circuits we need to keep the system shut down. Hexagon will probably not be started until Saturday at the earliest. The Fimm room is currently operational, but may be affected if the sewage rises.

Update 19:30: The clog has now been removed. The sewage under the computer floor is currently being cleaned. Due to the humidity in the room, Hexagon will be down until the machine room is dry again.

Update 21:15: The computer room has now been cleaned. Hexagon will remain down until the floor is dry again.

Update May 9th 11:50: Hexagon has now been started again. All jobs that were running at the time of the crash have to be resubmitted.

We are sorry for any inconvenience.

Because of new equipment we need to expand the electrical power capacity in our machine room. We have therefore reserved the Fimm cluster from 06:00, 14th of May. Long jobs that cannot complete before the downtime will not start.

The exact length of the downtime is currently unknown, but it should not be more than half a day. We will provide more information as soon as we know more.

Update May 12th: We also discovered some issues with our file systems. We will therefore use this opportunity to perform complete file system checks. The downtime will therefore be longer.

Update May 13th: The power shutdown will be at 10:00 tomorrow. We will make Fimm unavailable from 09:00 because of necessary upgrades.

Update May 14th 09:30: Today's power shutdown has been postponed until tomorrow at 07:00. We will still use the current downtime to perform some maintenance.

Update May 14th 18:00: The file system checks are still not finished. We will monitor the progress through the evening. Fimm will be unavailable until after the power shutdown tomorrow.

Update May 15th 08:55: Fimm is now available for use again.

Hexagon went down again at 02:30. We are investigating the problem.

Update: It turns out that another rack has power issues. We have to investigate further before we can turn the machine back on.

Update 15:10, Hexagon is now running again.

Update Tuesday 12:15, Since few jobs are running we will take the system down at 12:30, earlier than planned.

Update 16:45, Hexagon is now running with all racks included.

Hexagon went down at 09:30 due to a power issue on cabinet c4.

We are investigating.

Update 10:00, We are carrying out maintenance work that was scheduled for later while we wait for the diagnostics.

Update 13:00, Hexagon is running again in degraded mode without cabinet c4. We are waiting for a replacement PDU for this cabinet. When we get the part we will need to shut down the machine to include the cabinet again. This will likely happen on Monday April 6th.

Update 16:50, Maintenance work is now scheduled for Tuesday April 7th at 14:00, after which the machine will be rebooted. To be able to run, all jobs need to finish (as specified by walltime) before 14:00 on Tuesday.
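The walltime rule above can be checked with a little arithmetic: a job can run only if its start time plus its requested walltime falls at or before the start of the maintenance window. The sketch below is illustrative only; the function names are hypothetical, times are passed as seconds since the epoch, and walltime is assumed to be in HH:MM:SS form. It does not query the scheduler.

```shell
# Hypothetical helper: convert an HH:MM:SS walltime to seconds.
# awk reads the fields as decimal numbers, so "08" is handled correctly.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: succeed if a job starting at $1 (epoch seconds)
# with walltime $2 (HH:MM:SS) ends at or before the deadline $3 (epoch seconds).
fits_before_deadline() {
    end=$(( $1 + $(walltime_to_seconds "$2") ))
    [ "$end" -le "$3" ]
}
```

For example, `fits_before_deadline "$(date +%s)" 12:00:00 "$deadline" && echo "job can start"`, where `deadline` holds the epoch time of the maintenance start.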

Update Tuesday 12:15, Since few jobs are running we will take the system down at 12:30, earlier than planned.

Hexagon crashed at 12:15 today. We are working on getting the system up and running again.

Update 13:30: Hexagon is now running again. Most probably the crash was caused by overuse of memory on several login nodes.

All jobs that were running when it crashed have to be resubmitted. We are sorry for the inconvenience.

Due to a double cooling failure (primary plus backup) in the building-provided chilled water supply, we were forced to shut down Hexagon because of over-temperature in the room.

Update 17:30: we hope to have the cooling back Thursday morning.

Update Thursday 10:00: We now have partial cooling and have started the machine and allowed logins. Until we know more about when we will get full cooling, we have a system reservation on all nodes; you can add jobs to the queue, but they will not start until we remove the reservation.

Update Thursday 11:00: We have now restored one of the cooling machines to operation, so we now have full cooling, and the reservation has been removed.