The fimm frontend had a hang that was discovered at 00:10 on Sunday 12.11. The exact time the frontend went down is uncertain, but it may have been some time on Saturday evening. The cause appears to be an extreme load from a large number of httpd processes (reason unknown). The frontend was rebooted and at the same time received some hardware and software maintenance, including a kernel upgrade that had been planned for a later date.
No jobs were affected; the frontend was up again at 13:00 on Sunday 12.11.
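For reference only (the log entry itself gives no commands or output), a minimal Python sketch of the kind of check one might run to spot such a condition, counting httpd processes and reading the load average from /proc on a Linux frontend; the process name "httpd" is taken from the entry above, everything else is assumed:

    import os

    def count_processes(name="httpd"):
        # Walk /proc and count processes whose command name matches.
        count = 0
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/comm" % pid) as f:
                    if f.read().strip() == name:
                        count += 1
            except OSError:
                pass  # the process exited while we were scanning
        return count

    with open("/proc/loadavg") as f:
        load1, load5, load15 = f.read().split()[:3]

    print("httpd processes:", count_processes())
    print("load average (1/5/15 min):", load1, load5, load15)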
Reboot of fimm frontend
Reboot of fimm frontend due to excessive memory usage by an interactive process. No jobs affected. Down: 10min.
Reboot of fimm frontend
The fimm frontend had a hang due to excessive memory swapping. Rebooted (downtime 5 minutes).
Scheduler / passwd problems on fimm
The scheduler on fimm has some problems at the moment. The passwd distribution system to the nodes does not work properly. Some jobs will fail to start, or fail to stop properly after starting. We are working on fixing it.
Update: The problem is now fixed. Some of the nodes will be reinstalled (fixed) after the currently running jobs finish.
Rebalancing of /work and /home/fimm on fimm
The GPFS filesystems /work and /home/fimm on fimm have become unbalanced. The necessary filesystem rebalancing was started last night and is still running. It will increase the IO load until it finishes, hopefully sometime later today.
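For illustration only, since the entry does not say how the balancing was started or monitored: GPFS provides mmdf to show per-disk usage and mmrestripefs with -b to rebalance data across the disks of a filesystem. A minimal sketch, assuming the GPFS device name for /work is "work" (a hypothetical name, not from the log):

    import subprocess

    FS = "work"  # assumed GPFS device name for the /work filesystem

    # Show how unevenly data is currently spread across the disks.
    subprocess.run(["mmdf", FS], check=True)

    # Rebalance the filesystem. This is IO-intensive and runs until it
    # finishes, which is why the entry warns about increased IO load.
    subprocess.run(["mmrestripefs", FS, "-b"], check=True)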
Problem with /work on fimm
10:15 There is a problem with /work on fimm. We are working on it.
13:45 Update: /work is now accessible. The frontend had to be restarted, and GPFS had to be restarted on one of the NAS boxes. All the compute nodes were OK, so no running jobs were affected.
Backup and HSM problems
The Tivoli Storage Manager database recovery log ran full, and the server could then no longer process backup or HSM requests. The problem was noticed at about 09:30 and resolved by 10:20.
fimm filesystems down
The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.
http://www.parallaw.uib.no/syslog/56
We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed as soon as possible.
Downtime started Monday June 27 at 00:42:51.
fimm came back online at 11:32:00.
Downtime: 10 hours 50 minutes.
The firmware on the SATABlades was upgraded to firmware 9037. This will hopefully fix the 'failing disks' problem.
Maintenance summary
Regatta nodes TO and TRE had downtime from 08:00 to 12:45 for a firmware update.
Regatta node EN had downtime from 08:00 to 16:00 for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (it is unclear why).
The Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.
Scheduled downtime on TRE+FIRE
The Regatta cluster TRE will be down Tuesday June 14, 08:00-14:00, for firmware upgrades and replacement of a failed memory module on one of the nodes. Running jobs will be killed and will have to be resubmitted after the maintenance stop.
The Linux cluster FIRE will also be down during this period, because it depends on the Regatta as its file server.