There will be planned maintenance on hexagon for a software upgrade on Monday June 16th, starting at 14:00 and expected to last approximately 3 hours.
The Cray software release will be upgraded from 2.0.44 to 2.0.53.
This release will have more quad-core optimizations as well as a new version of the MPI library. We therefore recommend that you recompile your programs and libraries after the upgrade. We will notify you when we have re-compiled the libraries/modules installed by us.
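For example, an MPI program can be rebuilt with the Cray compiler wrappers once the new environment is in place (a minimal sketch; the source and program names below are only placeholders for your own):
cc -o myprog myprog.c
ftn -o mysim mysim.f90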
Update 16th, 14:40: System taken down.
Update 16th, 19:30: System back online with version 2.0.53 and MPT 3.0.
Look for updates here on when we have re-compiled the libraries:
All compute-node (cnl) software has been re-compiled.
Most login node software has been recompiled, except GNUPLOT.
UPC has not been re-compiled yet.
Lustre IO-node crash on hexagon
An IO-node for the Lustre filesystem (/work) on hexagon has crashed. We are doing a debugging dump and will restart hexagon.
Machine will be unavailable for about 30 min during the dump and restart.
Update 09:40: hexagon is now up again. Jobs will need to be resubmitted.
Partial power problem on fimm caused downtime
Due to a power failure on part of the rack, fimm was unavailable from 03:00 to 08:40. Most jobs failed because one network switch was without power; some jobs may still be running on the nodes without being visible in the queue. We are looking into this and cleaning up.
File system upgrade of t1home and uibkvant on fimm
Wednesday, 28th of May 08:00: /home/t1home and /home/uibkvant will be unavailable during a firmware upgrade of those file systems.
This downtime will probably last for a couple of hours.
Update 28th of May:
10:30: The file system is now being unmounted on all compute nodes, and will be unavailable until the upgrade is complete.
11:30: The firmware upgrade has now started.
12:30: One disk controller failed during the upgrade and has to be replaced. We are expecting the replacement to arrive later today. The downtime will therefore be extended.
16:30: /home/uibkvant is now up again. /home/t1home has to wait for the new controller to arrive.
01:30: Will continue tomorrow.
Update 29th of May:
09:00: We continue the work from yesterday.
18:00: Controllers have been revived. Working on recovering data disks.
00:00: Will continue tomorrow. It seems that no data has to be restored from backup.
Update 30th of May:
09:00: We continue recovering the data disks.
14:00: Running file system checks to verify data. If data verification is successful, /home/t1home will be up again soon.
14:50: Verification was successful. /home/t1home is now mounted on all compute nodes.
Upgrade of backup system Friday
Friday 9th of May, the backup system will be unavailable for a short time because of an upgrade of our system. File systems like /migrate and /bcmhsm will be unavailable during this upgrade, which will start at 12:00 and should be finished by 15:00.
Update: 15:30: Upgrade is finished.
Activated “Gold” accounting system on hexagon
Hexagon has now activated the accounting/allocation manager "Gold". This means that all jobs need a valid cpu-account with enough cpu-hours to run the submitted job.
At the time of job-submission a "credit-check" against the specified cpu-account is done and the requested cpu-hours are reserved until the job ends, at which time the actual amount of cpu-hours used is subtracted from the account. The number of cpu-hours reserved and subtracted is calculated as follows:
cpuhours = 4 * blocked nodes * wallclock time
For reservations, "wallclock time" is the specified "walltime" parameter used in the PBS script or on the command line (or the default of 1 hour).
For the account subtraction at job end, "wallclock time" is the actual wallclock time used (start-time -> end-time).
The number "4" comes from the 4 cores per node. A node is considered blocked if one or more cores on the node are reserved for the user, since only one job can run on a node at any time.
This means that setting e.g. mppnppn=1 and mppwidth=12 for 1 hour (12 processes spread over 12 nodes, so 12 nodes are blocked) gives a cpu-hour usage calculated as:
4 * 12 * 1 = 48 cpuhours
whereas a job with mppnppn=4 (the default) and mppwidth=12 for 1 hour (12 processes on 3 nodes, so only 3 nodes are blocked) will have its cpu-hour usage calculated as:
4 * 3 * 1 = 12 cpuhours
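As an illustrative sketch, a PBS script matching the first example above could specify (the account and program names are placeholders only):
#PBS -A mycpuaccount
#PBS -l mppwidth=12
#PBS -l mppnppn=1
#PBS -l walltime=01:00:00
aprun -n 12 -N 1 ./myprog
This job blocks 12 nodes, so 4 * 12 * 1 = 48 cpuhours are reserved at submission, and the actual usage is subtracted when the job ends.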
If your job fails to start, you should use the command:
checkjob -v jobnumber
where jobnumber is the PBS job number given to you upon job-submission. If the command returns "Cannot debit account", you need to check that your job specifies the correct "-A mycpuaccount" and that the account has enough credits to reserve and run the job.
You can check the names and balance of your available cpu-accounts with the "cost" command.
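For example, to check why a queued job does not start and to list your accounts (the job number here is just a placeholder):
checkjob -v 12345
cost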
Note also that the Moab scheduler was updated. Users currently logged in need to do a "module swap moab/5.2.1 moab/5.2.2" or log out and in again for the moab client commands to use the correct version.
(Re)Scheduled maintenance on hexagon
The previously postponed maintenance of hexagon (http://www.parallaw.uib.no/syslog/154) is now scheduled for Thursday April 24th from 14:00 to approximately 18:00.
This note will be updated as we know more about the maintenance.
Update, Thursday 24th:
14:10: System is taken down for diagnostics and init change.
14:30: Hardware work begins.
16:00: Hardware work ends.
17:10: System is up and running.
Global file system hang on fimm
The global file system on fimm hung around 17:30 today. We are working on solving the problem.
Update: 18:15: The file system is now working again. All jobs that were running at the time of the hang have crashed and have to be resubmitted.
Scheduled downtime on hexagon on Monday
Monday 14th at 14:00, hexagon will be shut down.
An upgrade of hexagon's firmware will solve the problem with the failing nodes on the system. The machine should be back online within two to three hours.
Update: 15:45: Hexagon is now up again. The mppmem bug is also fixed.
Failure on IO-node on hexagon
A failure on an IO-node for the /work filesystem on hexagon means we will have to stop hexagon briefly to fix the issue.