Hexagon shut down automatically after a power blink during a thunderstorm. We are diagnosing the problem.
Update: 22:00 Machine is up again.
Author Archives: lsz075
Hexagon: cabinet power failure
Hexagon cabinets c1 and c8 experienced an Emergency Power Off failure on Dec 2 at 23:41. We are investigating.
Because of the cabinets involved (and the topology of the interconnect), we cannot simply start the machine without the two cabinets. We are looking into possibilities.
Update: 2011-12-05 12:45 Two cabinets cannot be started because of PDU failures. We have now started the machine without two cabinets (c6 and c8).
Fimm: work file system is down
The work file system on the fimm cluster has been taken down due to a misconfiguration of the GPFS file system.
We are working on correcting the configuration and will keep you updated.
10/11/2011 The work file system is back online with more space (3.7 TB).
Update 11/11/2011
We are rebalancing data across the disks of the work file system, since we added a new disk to it. This puts load on the GPFS file system on fimm, so file system operations will be slow. We expect the balancing process to finish during the weekend.
/bcmhsm downtime on Nov. 3rd
We are going to stop /bcmhsm for maintenance at 10:00 on Thursday, November 3rd. /bcmhsm will be unavailable for a few hours.
Update: 03.11 10:10 Filesystem is back online
Fimm software update
We have updated the following main software on fimm:
PGI/11.8
GCC/4.6.1
intel/12.1.6_233
openmpi/1.4.4 compiled with pgi/11.8 gcc/4.6.1
netcdf/4.1.3 compiled with pgi/11.8 gcc/4.6.1
HDF5/1.8.7 compiled with pgi/11.8 gcc/4.6.1
szip/2.1 compiled with pgi/11.8 gcc/4.6.1 intel/12.1.6_233
zlib/2.3.1 compiled with pgi/11.8 gcc/4.6.1 intel/12.1.6_233
We have also implemented PrgEnv-pgi and PrgEnv-gcc on fimm, which work the same way as on hexagon. A PrgEnv is a software environment set that helps you load the right set of software.
We keep the rest of the software updated.
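As a rough illustration of how a PrgEnv might be used, the sketch below loads one environment and lists the result. It assumes the environments are exposed through the usual `module` command under the names given above; the variable `PRGENV` is only a placeholder for this example, and the exact module names on fimm should be checked with `module avail`.

```shell
# Hypothetical choice of programming environment (names taken from the notice).
PRGENV="PrgEnv-pgi"   # or PrgEnv-gcc

if command -v module >/dev/null 2>&1; then
    module load "$PRGENV"   # load the environment set for the chosen compiler
    module list             # show which modules are now loaded
else
    # Assumption: the module command is only available on the cluster itself.
    echo "module command not found; run this on a fimm login node" >&2
fi
```

On hexagon, changing compilers is normally done by swapping one PrgEnv for another (for example `module swap PrgEnv-pgi PrgEnv-gcc`). Whether fimm behaves identically in this respect is an assumption based on the statement that the environments work the same way.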
Hexagon: Part of /work has problems
There is an issue with part of the /work filesystem on Hexagon. We are investigating.
Update Tuesday 09:30, Still diagnosing the issue. No known fix-time as of now.
Update Tuesday 10:00, Machine goes down for maintenance.
Update Tuesday 13:30, Part of the filesystem has been checked with e2fsck.
Update Tuesday 14:00, Machine up again after maintenance.
HSM downtime, Oct 7 12:00-14:00
We are going to change the physical location of the HSM server and HSM storage. Downtime for /migrate and /bcmhsm will therefore take place on Friday, October 7th, from 12:00 to 14:00.
Update: The downtime has to be extended by half an hour.
Fimm network down
Due to a firmware update of the fimm.bccs.uib.no cluster core switches, we will take down both the internal and external core switches for maintenance tomorrow from 13:00 to 15:00. The actual downtime may be shorter than this.
All running jobs will be killed.
We are sorry for the inconvenience and the short notice.
We will keep you updated.
10:30 Fimm login node is blocked.
16:00 Both internal and external switches are updated to the new firmware.
17:10 Maintenance is finished. The fimm cluster is operational.
Fimm: backend node crashed. queueing system not available.
The backend machine of Fimm crashed and has ongoing problems.
This means the queueing system and most other services are not available.
13.08.2011, 10:00, service is back.