Thursday June 19th 21:10: A module (4 nodes) on hexagon crashed with hardware errors, which impacted the routing and the global file system. We are working on solving the problem.
Update Friday June 20th 02:20: Replaced the module and re-flashed its firmware, then ran diagnostics. The machine is now up again.
Final quad-core upgrade of Hexagon, June 24th
On Tuesday June 24th, Hexagon will be taken down for the final quad-core upgrade.
During the upgrade we will bring up parts of the machine so that short jobs can be run.
Updates:
Tuesday 24th, 08:00: Hexagon is shut down for upgrading.
Tuesday 24th, 09:00: Half of hexagon has been started, while the other half is being upgraded. The rest of the machine will be turned off tomorrow morning (Wednesday) at 08:00 for upgrading. The last two racks will be turned on and made available until 14:00, when the entire machine will be taken down for the final upgrade. From then on hexagon, including the file system, will be unavailable until the diagnostics and checkout procedures have been completed.
Wednesday 25th, 08:00: Only the last two racks are now running.
Wednesday 25th, 14:00: The entire machine is now down for the upgrade. We will update this page when the diagnostics are completed.
Wednesday 25th, 20:00: The machine is now booted with final hardware configuration, but not available to users due to diagnostics and checkout procedures.
Thursday 26th, 23:00: The machine is still going through checkout procedures and will start benchmarking tomorrow for the acceptance test of the system. More information on when the system will be available to users will be posted Friday at 11:00.
Friday 27th, 11:00: Hexagon is currently running benchmarks. These are scheduled to complete by 18:00 today, at which point users will be allowed to login.
Friday 27th, 18:00: Hexagon is now available to users. Note that there is a scheduled slot for further benchmarking on Tuesday July 8th starting at 16:00. Jobs must request a walltime short enough that they finish before that slot.
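For example (the submission time and walltime value here are only illustrative), a job submitted on Monday July 7th at 12:00 should request at most 28 hours in its PBS script:
#PBS -l walltime=28:00:00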
Brief power outage (blink) for hexagon
Hexagon experienced a short blink in the external power supply; since only part of the machine is on UPS, the machine went down.
The machine was down from 07:45 to 08:30 but is now up and running again. All running jobs were regrettably lost and will have to be submitted again.
Scheduled maintenance for hexagon, software upgrade, June 16th
There will be a planned maintenance on hexagon for a software upgrade on Monday June 16th, starting at 14:00 and expected to last approximately 3 hours.
The Cray software release will be upgraded from 2.0.44 to 2.0.53.
This release includes more quad-core optimizations as well as a new version of the MPI library. We therefore recommend that you recompile your programs and libraries after the upgrade. We will post an update here when we have re-compiled the libraries/modules installed by us.
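As a minimal sketch of such a rebuild (the program and file names are placeholders; the Cray compiler wrappers link against the currently loaded MPI automatically):
make clean
ftn -O2 -o myprog myprog.f90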
Update 16th, 14:40: System taken down.
Update 16th, 19:30: System back online with version 2.0.53 and MPT 3.0.
Check here for updates on when we have re-compiled the libraries:
All compute-node (cnl) software has been re-compiled.
Most login node software has been recompiled, except GNUPLOT.
UPC is not re-compiled yet.
Lustre IO-node crash on hexagon
An IO-node for the Lustre filesystem (/work) on hexagon has crashed. We are doing a debugging dump and will restart hexagon.
Machine will be unavailable for about 30 min during the dump and restart.
Update 09:40: hexagon is now up again. Jobs will need to be resubmitted.
Partial power problem on fimm caused downtime
Due to a power failure on part of the rack, fimm was unavailable from 03:00 to 08:40. Most jobs failed because one network switch was without power; some jobs may still be running on the nodes without being visible in the queue. We are looking into this and cleaning up.
File system upgrade of t1home and uibkvant on fimm
Wednesday, 28th of May 08:00: /home/t1home and /home/uibkvant will be unavailable during a firmware upgrade of those file systems.
This downtime will probably last for a couple of hours.
Update 28th of May:
10:30: The file systems are now being unmounted on all compute nodes, and will be unavailable until the upgrade is complete.
11:30: The firmware upgrade has now started.
12:30: One disk controller failed during the upgrade and has to be replaced. We expect the replacement to arrive later today. The downtime will therefore be extended.
16:30: /home/uibkvant is now up again. /home/t1home has to wait for the new controller to arrive.
01:30: Will continue tomorrow.
Update 29th of May:
09:00: We continue the work from yesterday.
18:00: Controllers have been revived. Working on recovering data disks.
00:00: Will continue tomorrow. It seems that no data will have to be restored from backup.
Update 30th of May:
09:00: We continue with recovering data disks.
14:00: Running file system checks to verify data. If the verification is successful, /home/t1home will be up again soon.
14:50: Verification was successful. /home/t1home is now mounted on all compute nodes.
Upgrade of backup system Friday
Friday 9th of May, the backup system will be unavailable for a short time because of an upgrade of our system. File systems like /migrate and /bcmhsm will be unavailable during this upgrade, which will start at 12:00 and finish by 15:00.
Update: 15:30: Upgrade is finished.
Activated “Gold” accounting system on hexagon
Hexagon has now activated the accounting/allocation manager "Gold". This means that all jobs need a valid cpu-account with enough cpu-hours to run the submitted job.
At job submission a "credit check" is done against the specified cpu-account, and the requested cpu-hours are reserved until the job ends, at which point the actual amount of cpu-hours used is subtracted from the account. The number of cpu-hours reserved and subtracted is calculated as follows:
cpuhours = 4 * blocked nodes * wallclock time
For reservations, "wallclock time" is the "walltime" parameter specified in the PBS script or on the command line (or the default of 1 hour).
For the job-end account subtraction, "wallclock time" is the actual wallclock time used (start-time to end-time).
The number "4" comes from 4 cores per node. A node is considered blocked if one or more cores on the node are reserved for the user, since only one job can run on a node at any time.
This means that with e.g. mppnppn=1 and mppwidth=12 for 1 hour, 12 nodes are blocked (one process per node), and the actual cpu-hour usage will be calculated as:
4 * 12 * 1 = 48 cpuhours
whereas a job with mppnppn=4 (the default) and mppwidth=12 for 1 hour blocks only 3 nodes (12 cores / 4 cores per node), so the cpu-hour usage will be calculated as:
4 * 3 * 1 = 12 cpuhours
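As a minimal sketch (the account name and executable are placeholders), a PBS script for the second example could look like:
#!/bin/bash
#PBS -A myaccount
#PBS -l mppwidth=12
#PBS -l mppnppn=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
aprun -n 12 -N 4 ./myprog
At submission this would reserve 4 * 3 * 1 = 12 cpu-hours against "myaccount"; the actual usage is charged when the job ends.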
If your job fails to start, you should use the command:
checkjob -v jobnumber
where jobnumber is the PBS job number given to you at job submission. If the command returns "Cannot debit account", check that your job specifies the correct "-A mycpuaccount" and that the account has enough credits to reserve and run the job.
You can check the names and balance of your available cpu-accounts with the "cost" command.
Note also that the version of the Moab scheduler was updated. Users currently logged in need to do a "module swap moab/5.2.1 moab/5.2.2" or log out and in again to make the moab client commands use the correct version.
(Re)Scheduled maintenance on hexagon
The previously postponed maintenance of hexagon (http://www.parallaw.uib.no/syslog/154) is now scheduled for Thursday April 24th from 14:00 to approximately 18:00.
This note will be updated as we know more about the maintenance.
Update, Thursday 24th:
14:10: System is taken down for diagnostics and init change.
14:30: Hardware work begins.
16:00: Hardware work ends.
17:10: System is up and running.