Due to a power failure on part of the rack, fimm was unavailable from 03:00 to 08:40. Most jobs failed because one network switch was without power. Some jobs may still be running on the nodes without being visible in the queue. We are looking into this and cleaning up.
Wednesday, 28th of May 08:00: /home/t1home and /home/uibkvant will be unavailable during a firmware upgrade of those file systems.
This downtime will probably last for a couple of hours.
Update 28th of May:
10:30: The file systems are now being unmounted on all compute nodes and will be unavailable until the upgrade is complete.
11:30: The firmware upgrade has now started.
12:30: One disk controller failed during the upgrade and has to be replaced. We expect the replacement to arrive later today. The downtime will therefore be extended.
16:30: /home/uibkvant is now up again. /home/t1home has to wait for the new controller to arrive.
01:30: Will continue tomorrow.
Update 29th of May:
09:00: We continue the work from yesterday.
18:00: Controllers have been revived. Working on recovering data disks.
00:00: Will continue tomorrow. It seems that no data will have to be restored from backup.
Update 30th of May:
09:00: We continue with recovering data disks.
14:00: Running file system checks to verify data. If data verification is successful, /home/t1home will be up again soon.
14:50: Verification was successful. /home/t1home is now mounted on all compute nodes.
The previously postponed maintenance of hexagon (http://www.parallaw.uib.no/syslog/154) is now scheduled for Thursday April 24th from 14:00 to approximately 18:00.
This note will be updated as we know more about the maintenance.
Update, Thursday 24th:
14:10: System is taken down for diagnostics and init change.
14:30: Hardware work begins.
16:00: Hardware work ends.
17:10: System is up and running.
Monday 14th at 14:00, hexagon will be shut down.
An upgrade of hexagon's firmware will solve the problem with all the failing nodes on the system. The machine should be online within two to three hours.
Update: 15:45: Hexagon is now up again. The mppmem bug is also fixed.
Hexagon will have planned downtime on Wednesday April 9th from 13:00 to approximately 16:00.
The maintenance will replace bad CPUs after the quad-core upgrade. A number of CPU replacements are expected after a major CPU upgrade, and the current failure rate is within expected levels.
We will update this note with more information.
Update, Wednesday 9th 13:00: The scheduled maintenance is postponed to a time not yet determined. We will update this note once we know when the maintenance can take place.
Update, Friday 11th 19:00: The maintenance on Monday the 14th is related to this postponed work and will reduce the number of failed nodes.
Early on March 26th hexagon will be shut down for the initial quad-core upgrade. We hope to be able to have parts of the machine up while the second half is upgraded. It will nevertheless mean that the entire machine will be taken down first, before being booted to a smaller size. The physical upgrade will probably take three days. There will then be a few more days of tuning and reconfiguring.
One very important part of this is that ALL programs and libraries will have to be re-compiled when hexagon is booted up after the finished upgrade.
Wednesday, 09:00: Upgrade has started. Machine is now down for a while for diagnostics.
Wednesday, 12:30: Half of the machine is now running again, while the other half is being upgraded to quad-core. We expect to take the entire machine down Friday morning. Please consider the machine to be in a testing state, so unannounced downtime might occur.
Wednesday, 16:45: The upgrade is ahead of schedule, so the machine will be taken down tomorrow around 10am.
Thursday, 12:00: Two racks are now running and will stay up until tomorrow morning, Friday 28th, when the entire machine will be shut down at 8am. The machine will then stay down until at least Monday.
Friday, 08:00: The hardware part of the upgrade is now finished. The machine is now unavailable until the software upgrade, diagnostics and testing have finished.
Saturday, 17:00: Main part of software upgrade is finished. The machine is running, but is unavailable due to testing.
Tuesday, April the 1st, 18:00: Hexagon is now available again, see http://www.parallaw.uib.no/syslog/153 for more details.
Because of some additional power installation work in our machine room, fimm has to be shut down on Thursday, February the 28th at 07:00.
The shutdown should not last more than an hour.
We are very sorry for the inconvenience caused by this.
Update Feb. 26th: Shutdown has been moved to 07:00, which means fimm will be shut down shortly before then.
Update Feb. 28th, 08:00: Fimm should now be running as normal.