As previously announced, the IBM p690 Regatta system "tre" is now decommissioned.
Questions regarding access to files, etc. should be sent promptly to
support-uib@notur.no
Important notice: tre.bccs.uib.no will be taken out of service
The IBM p690 Regatta tre.bccs.uib.no / tre.ii.uib.no will be shut down and decommissioned on the morning of Monday, October 1st 2007, at 08:00.
All jobs must be finished, and all data and personal files must be copied off the machine before this time. The only exceptions are data on external disks such as /migrate, /net/bcmhsm and /net/bjerknes*.
Any questions regarding this can be sent to support-uib@notur.no.
GPFS hang on node “en”
Node en had a GPFS hang. GPFS was restarted. Downtime was approximately 30 minutes.
Hang of node “en”
Node "en" of the Regatta cluster has hung. We are investigating.
Update 09:45: Node rebooted. No jobs lost.
Hang on tre
The tre node crashed at 15:30, possibly due to disk issues. We are investigating.
Update 16:30: Machine is now up. Jobs on the "tre" node were lost.
Filesystem hang on tre and en
The GPFS filesystem on tre and en stopped working sometime after Friday 16:30. It was restored (GPFS restarted) on Saturday at 12:00. All jobs were lost.
/work filesystem failure on “tre”
The /work filesystem on tre has been down since 07:30, most probably due to a disk failure.
We are looking into the issue.
Update (10:55): The /work filesystem is unrecoverable. We are creating a new /work filesystem from the remaining functioning disks. All data on /work is therefore lost.
Update (12:30): All nodes and filesystems are up again.
NB: Make sure the necessary folders exist in /work before you submit your jobs.
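A minimal sketch of recreating job directories before resubmitting (the username and directory layout are placeholders; the demo uses a scratch path by default so it can run anywhere, whereas on tre you would use /work directly):

```shell
# Hedged example: recreate the directories your jobs expect.
# "username" and "myjob" are placeholders for your own layout.
# WORKDIR defaults to a scratch path for demonstration; on tre it
# would be /work itself.
WORKDIR=${WORKDIR:-/tmp/work-demo}
# -p creates intermediate directories and does not fail if they exist.
mkdir -p "$WORKDIR/username/myjob/output"
ls -ld "$WORKDIR/username/myjob/output"
```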
Maintenance stop of tre (to fix cooling)
We need to replace a fan motor in the cooling unit that cools "tre".
During this maintenance the temperature is expected to rise enough that we will be forced to shut down the regatta nodes.
The maintenance is expected to take place tomorrow, Friday the 20th, from 08:00 to 10:00.
Update: Friday 11:30, The nodes are now up again.
Problems with HSM/Backup-server
There is a problem with the HSM/backup server jambu; /migrate and /net/bcmhsm are down. We are investigating.
Update: 15:15 We are waiting for external support to upgrade/fix the firmware on this machine. It is unknown when we will get the machine up again; possibly tomorrow.
Update: Friday 11:00 We are still waiting for a replacement part from abroad. The estimated time of arrival was yesterday afternoon, but it has still not arrived.
Update: Friday 15:00 The transport company used by the vendor now says it will not be able to deliver the part until Monday. Unfortunately, this means that HSM and backup will be unavailable until later in the day on Monday the 16th.
Update: Monday 14:45 The HSM/backup server jambu is up again, and /migrate and /net/bcmhsm work.
Crash/hardware problem on tre and to
Node "to" appears to have had an internal power/cooling failure and shut itself down at 20:15. The tre node appears to have shut itself down as well (at 20:35) to avoid filesystem corruption following the crash of to.
22:45: investigating
23:55: all nodes up again
Technical note: when tre was booted again, it blocked itself out from the rest of the cluster (an HACMP problem). After the other nodes were rebooted and back up, it was possible to start the HACMP service on tre again.