Hardware

As previously noted, we will have a scheduled downtime from 16:00 Tuesday July 8th. We will replace a faulty module and do some I/O-benchmarking which requires a reserved system. It is estimated that the machine will be available for login at 19:00.

Update, 16:00: hexagon is shut down for hardware replacement.
Update, 16:45: hexagon is up.
Update, 19:10: hexagon is up and allowing users to login.

Four compute nodes (one module) on hexagon have stopped responding, and as a result so have some of the login nodes and the Lustre filesystem. Unfortunately, we will need to reboot hexagon to clear the issue; jobs will need to be re-submitted.

Update, 20:30: hexagon is now up again. Note that 4 high-mem nodes are now unavailable due to hardware errors.

The previously postponed maintenance of hexagon (http://www.parallaw.uib.no/syslog/154) is now scheduled for Thursday April 24th from 14:00 to approximately 18:00.

This note will be updated as we know more about the maintenance.

Update, Thursday 24th:

14:10: System is taken down for diagnostics and init change.
14:30: Hardware work begins.
16:00: Hardware work ends.
17:10: System is up and running.

Hexagon will have planned downtime on Wednesday April 9th from 13:00 to approximately 16:00.
The maintenance will replace bad CPUs after the quad-core upgrade. A number of CPU replacements are expected after a major CPU upgrade, and the current failure rate is within expected levels.

We will update this note with more information.

Update, Wednesday 9th 13:00: The scheduled maintenance is postponed to a time that has not yet been determined. We will update this note when we know when the maintenance can be done.

Update, Friday 11th 19:00: The maintenance on Monday the 14th is related to this postponed work and will reduce the number of failed nodes.

Early on March 26th hexagon will be shut down for the initial quad-core upgrade. We hope to be able to have parts of the machine up while the second half is upgraded. It will nevertheless mean that the entire machine will be taken down first, before being booted to a smaller size. The physical upgrade will probably take three days. There will then be a few more days of tuning and reconfiguration.

One very important consequence of this is that ALL programs and libraries will have to be re-compiled once hexagon is booted up after the upgrade is finished.

Wednesday, 09:00: Upgrade has started. Machine is now down for a while for diagnostics.

Wednesday, 12:30: Half of the machine is now running again, while the other half is being upgraded to quad-core. We expect to take the entire machine down Friday morning. Please consider the machine to be in a testing state; unannounced downtime might occur.

Wednesday, 16:45: The upgrade is ahead of schedule, so the machine will be taken down tomorrow around 10am.

Thursday, 12:00: Two racks are now running and will keep running until tomorrow morning, Friday 28th, when the entire machine will be shut down at 8am. The machine will then stay down until at least Monday.

Friday, 08:00: The hardware part of the upgrade is now finished. The machine is unavailable until the software upgrade, diagnostics and testing have finished.

Saturday, 17:00: Main part of software upgrade is finished. The machine is running, but is unavailable due to testing.

Tuesday, April the 1st, 18:00: Hexagon is now available again, see http://www.parallaw.uib.no/syslog/153 for more details.

Because of a faulty fc-switch, we are scheduling maintenance on hexagon Wednesday the 6th of February at 14:00.
The maintenance should not take longer than an hour.

Update, Tuesday, 11:00: Due to the hexagon reboot, we replaced the fc-switch today. The maintenance tomorrow is therefore canceled.

Update, Tuesday, 12:15: The scheduled maintenance is completed and the machine is running again.

The new machine "hexagon.bccs.uib.no", a Cray XT4 MPP, is now available for users. The machine was installed in the beginning of January this year and has had a few test users until now. The machine will be upgraded from its current dual-core configuration to the final quad-core configuration later this spring. After the upgrade, the formal acceptance test will be executed.

More information about the machine can be found on the documentation pages.

The machine is available for "local" users from the University of Bergen, IMR and NERSC. These users can apply for an account by using the form found here

In addition, users that have acquired quota from the NOTUR consortium will have access to the NOTUR part of the machine.

Any questions regarding support, the documentation or other matters related to this can be sent to hpc-support@hpc.uib.no or the alias support-uib@notur.no

One of the switches has failed on fimm. Because the fileservers were connected to this switch, the filesystems went down.
We have now re-routed the fileservers and frontend to one of the other switches and changed the switch interconnect. Most of fimm is now back up again, including the filesystems, but jobs have failed due to the missing filesystems.