We encountered errors on the /home file system and therefore have to
shut down the machine immediately for maintenance.
We will use this opportunity to rerun the HPL benchmark on the whole
machine right after the maintenance. This means that your submitted jobs
will start 5 hours after the maintenance is finished.
We apologize for any inconvenience.
Update 13:00: Machine is back online. The inconsistency on /home has been fixed.
Author Archives: lsz075
Hexagon: system crash
Hexagon is down due to a power issue. We are investigating.
Update 13:00: Hexagon is back online. We found that a panel breaker had tripped, causing loss of power and a crash of the system. All jobs that were running need to be resubmitted.
Fimm.bccs.uib.no maintenance
Dear fimm cluster users,
We will have scheduled downtime for the cluster fimm.bccs.uib.no on
April 1st at 08:00. The cluster has been reserved for this downtime as
of today, 13:30. The reservation will last 24 hours, until 08:00 on
April 2nd, 2012.
We will enforce quotas on the home file system during the maintenance.
We ask all users to check their home file system usage (repquota.sh),
compare the hard quota with the actual usage, and remove files
accordingly.
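As a rough illustration of the comparison requested above, the following Python sketch sums the sizes of the files under a directory and reports how much would have to be removed to get below a hard quota. The function names are hypothetical and not part of any tool on fimm. Summed file sizes can also differ from the usage the file system accounts against your quota, so the numbers reported by repquota.sh remain the authoritative ones.

```python
import os


def directory_usage_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; leave it out of the sum.
                continue
    return total


def bytes_over_quota(usage_bytes, hard_quota_bytes):
    """Return how many bytes must be removed to fit the quota (0 if none)."""
    return max(0, usage_bytes - hard_quota_bytes)
```

A call such as `bytes_over_quota(directory_usage_bytes(os.path.expanduser("~")), quota)` would give an estimate for your home directory, where `quota` is the hard quota in bytes taken from the repquota.sh output.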
If you do not, your home file system will be "locked" and you won't
be able to do anything, even once you are logged in.
We will also perform hardware and software maintenance, which
includes upgrading firmware, reinstalling all compute nodes, and some
cable and switch changes.
All jobs that have not finished by 08:00 on April 1st, 2012
* WILL BE KILLED *. We kindly ask you to save, remove or otherwise take
care of any job that will not finish in time.
If you submit a job after the reservation was set (today, 13:30), the
system will check whether your job can finish before the downtime. If it
cannot, it will be queued until the maintenance is over; if it can, it
will simply run.
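The rule described above can be sketched in a few lines of Python. This is a simplified illustration only, not the scheduler's actual implementation: it assumes the job could start immediately at submission time and ignores node availability. The function name and the submission time are hypothetical; the downtime start is taken from this announcement.

```python
from datetime import datetime, timedelta


def can_run_before_downtime(start, walltime, downtime_start):
    """True if a job starting at `start` ends no later than the downtime."""
    return start + walltime <= downtime_start


# Start of the downtime as announced: April 1st, 2012 at 08:00.
downtime = datetime(2012, 4, 1, 8, 0)
# Hypothetical submission time, assuming the job could start right away.
submitted = datetime(2012, 3, 31, 14, 0)

# A 12-hour job ends at 02:00 on April 1st, so it would run.
short_job_runs = can_run_before_downtime(submitted, timedelta(hours=12), downtime)
# A 24-hour job would end at 14:00 on April 1st, so it would stay queued.
long_job_runs = can_run_before_downtime(submitted, timedelta(hours=24), downtime)
```

In practice this means that requesting a realistic, shorter walltime increases the chance that a job starts before the maintenance.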
We will post any updates here.
Let us know if you have any further questions.
Update: Downtime extended until 18:00 on April 2nd, 2012.
Update 15:05/02: Maintenance is finished. Due to a network driver issue we have reserved some of the nodes for further maintenance. The reservation on the cluster is removed, but fewer nodes are available.
Hexagon: major upgrade on March 9th
Update 19.03 18:40: Upgrade is finished, machine is open for SSH access.
Update, Monday 19th: We are finalizing the upgrade, the machine is up and we expect to allow logins later today. When logging in for the first time, please remember to recompile ALL your applications and libraries to be compatible with the new system.
Hexagon will get a major hardware and software upgrade in the first week of March.
The current schedule is for the upgrade to start on March 9th 2012 at 8:00 (a delay of 1 week from initial announcement) and to last for about 1 week.
NOTE: A reservation is set in the queue system. Jobs will therefore only be allowed to start if their walltime is set so that they can finish before the maintenance begins.
The upgraded hexagon will have the following specs:
* Cray XE6m-200
* 204.9 TFlops peak performance
* 22272 cores
* AMD Opteron 6276 (2.3GHz "Interlagos")
* 1392 CPUs (sockets)
* 696 nodes
* 32 cores per node
* 32GB RAM per node (1GB/core)
* New interconnect: Cray Gemini
* New topology: 2.5D Torus
* OS: Cray Linux Environment, CLE 4.0 (Based on Novell Linux SLES11sp1)
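The figures in the list above are consistent with each other, which the following short Python check illustrates. The value of 4 double-precision floating-point operations per cycle per core is an assumption that is not stated in this announcement; with it, the quoted peak performance is reproduced.

```python
# Figures taken from the specification list above.
NODES = 696
CORES_PER_NODE = 32
SOCKETS = 1392
CLOCK_GHZ = 2.3
RAM_PER_NODE_GB = 32

# Assumption (not from the announcement): 4 flops per cycle per core.
FLOPS_PER_CYCLE_PER_CORE = 4

cores = NODES * CORES_PER_NODE              # total core count
sockets_per_node = SOCKETS // NODES         # CPUs (sockets) in each node
cores_per_socket = cores // SOCKETS         # cores in each CPU
ram_per_core_gb = RAM_PER_NODE_GB / CORES_PER_NODE

# GFlops = cores * GHz * flops per cycle; divide by 1000 for TFlops.
peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE / 1000.0

print(f"{cores} cores, {sockets_per_node} sockets per node, "
      f"{cores_per_socket} cores per socket, {ram_per_core_gb:.0f} GB per core, "
      f"{peak_tflops:.1f} TFlops peak")
```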
Although the user experience will be very much the same after the upgrade (with just newer versions of familiar software, and a faster machine) please observe the following critical point:
IMPORTANT! All applications MUST be recompiled to be compatible with the new and upgraded hexagon.
You can expect the list of software available via "modules" to be short right after the upgrade and then to grow during the following weeks. Please be patient while we recompile and install the necessary applications and libraries.
We remind you that you have to move all files not related to your current runs out from the /work file system. Please see our previous email for details.
IMPORTANT! The old /work will be available on new hexagon only up to April 9th. On April 11th it will be completely DESTROYED!
It is therefore very important that you participate in moving data out of hexagon or transfer it to the new file-system. The old /work will be mounted back after a reformat and used as secondary storage.
You can follow the upgrade at our Syslog:
http://computing.uni.no/syslog
Please contact support-uib at notur.no if you have any questions regarding the upgrade.
GPFS file system crash on fimm
Hi,
We had an unexpected GPFS file system crash today at 14:00 while we were doing maintenance on the storage system.
Your jobs may have been affected by the loss of the GPFS file system.
Please check your running jobs on fimm and restart them if necessary.
Sorry for the inconvenience.
Fimm: filesystem glitch on login node
There was a temporary filesystem failure on the login node. Seems OK after reboot.
Hexagon: system crash 25.12.2011
We had to restart hexagon due to multiple SeaStar heartbeat failures in cabinets c10 and c12. This was probably related to power issues and the extreme weather we had.
This happened on 25.12.2011 23:30.
Fimm: job scheduling problems
Hello,
The Maui job scheduler on fimm is still behaving strangely. Jobs get scheduled to random nodes, which can break jobs already running on those nodes. Please check the results of completed jobs and expect irregular job cancellations over the next few days.
We are working on resolving the problem and will let you know when we're back with regular job running conditions.
Fimm: maui down
Hi,
Update: 11:00
The Maui job scheduler on fimm has been taken down due to a problem.
We are working on resolving it and will keep you updated.
Update: 13:20
We restarted Maui and some other processes. Due to the restart some of your jobs were killed. Please check your job status and resubmit if necessary.
We are sorry for the inconvenience.
Hexagon: scheduled maintenance, Dec 19th
We will have scheduled maintenance for hexagon on Monday, December 19th. The approximate time slot is from 10:00 to 14:00.
We need to replace 2 PDUs in failed cabinets, as well as some CPUs and memory.
Update 14:50: The machine is available again.