Regatta nodes TO and TRE were down from 08:00 to 12:45
for a firmware update.
Regatta node EN was down from 08:00 to 16:00
for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (unclear why).
Linux cluster FIRE was down from 08:00 to 16:00 due to its dependency on disks on EN.
Author Archives: lsz075
Scheduled downtime on TRE+FIRE
The regatta cluster TRE will be down on Tuesday June 14, 08:00-14:00, for firmware upgrades and replacement of a failed memory module on one of the nodes. Running jobs will be killed and will have to be resubmitted after the maintenance stop.
The Linux cluster FIRE will also be down during this period, because it depends on the regatta as its file server.
Switch-problems on fimm
Twelve ports on one of the switches in the cluster stopped working at 02:00 last night, so we lost connection to 12 of the nodes for ~7 hours.
Affected nodes:
compute-0-18 compute-0-16 compute-0-11 compute-0-8 compute-0-7 compute-0-6 compute-0-5 compute-0-4 compute-0-3 compute-0-2 compute-0-1 compute-0-0
To resolve the problem, the failing switch had to be rebooted. This led to a short (~30s) failure/unmount of the /work* and /home/fimm filesystems on all nodes. It is uncertain how this affected running jobs; most seem to have handled it without problems.
FIMM downtime
FIMM was down for scheduled maintenance 2005/05/09 08:00-10:00, i.e. 2 hours of downtime for the full cluster.
The work that was done was:
o upgraded firmware on SATABlades
o moved /local from the local disk of each node to a shared disk, to save precious space for local /scratch usage.
NOTUR 2005 conference, http://www.notur.no/notur2005
The 5th annual gathering on High Performance Computing in Norway will be held in Trondheim, May 30-31, 2005. Please see http://www.notur.no/notur2005 for details.
Scheduled downtime on fimm
Fimm will be down Monday May 9th, 08:00-12:00, for firmware upgrades on the SATABlade disk solution, and possibly other minor changes. This is to fix the bug that triggered the disk crashes on March 30th.
http://www.parallaw.uib.no/syslog/56
Intel compilers upgraded on fimm
The Intel Fortran and C/C++ compilers were upgraded from v8.1.023 to v8.1.027. This should fix a couple of internal compiler errors we've been triggering.
PGI compilers upgraded on fimm
The Portland Group compilers were upgraded from v5.2 to v6.0.
Portland compilers on fire + mpich
The Portland Group compilers v6.0 were installed on fire. mpich for the Portland compilers was also installed under /local/mpich/pgi/.
New NOTUR cpu-hour quota period starts April 11
The new NOTUR cpu-hour quota period starts April 11 instead of
April 4; all existing accounts and projects remain open until then.