The fimm frontend had a hang that was discovered at 00:10 on Sunday 12.11. The exact time the frontend went down is uncertain, but it may have been some time on Saturday evening. The cause appears to be an extreme load from a large number of httpd processes (reason unknown). The frontend was rebooted and at the same time received some hardware and software maintenance, including a kernel upgrade that had been planned for a later date.
No jobs were affected; the frontend was up again at 13:00 on Sunday 12.11.
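For reference only (the log entry itself gives no commands or output), a minimal Python sketch of the kind of check one might run to spot such a condition, counting httpd processes and reading the load average from /proc on a Linux frontend; the process name "httpd" is taken from the entry above, everything else is assumed:

    import os

    def count_processes(name="httpd"):
        # Walk /proc and count processes whose command name matches.
        count = 0
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/comm" % pid) as f:
                    if f.read().strip() == name:
                        count += 1
            except OSError:
                pass  # the process exited while we were scanning
        return count

    with open("/proc/loadavg") as f:
        load1, load5, load15 = f.read().split()[:3]

    print("httpd processes:", count_processes())
    print("load average (1/5/15 min):", load1, load5, load15)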
Reboot of fimm frontend
Reboot of fimm frontend due to excessive memory usage by an interactive process. No jobs affected. Down: 10min.
Reboot of fimm frontend
The fimm frontend had a hang due to excessive memory swapping. Rebooted (downtime 5 minutes).
Scheduler / passwd problems on fimm
The scheduler on fimm has some problems at the moment. The passwd distribution system to the nodes does not work properly. Some jobs will fail to start, or fail to stop properly after starting. We are working on fixing it.
Update: The problem is now fixed. Some of the nodes will be reinstalled (fixed) after the currently running jobs finish.
Rebalancing of /work and /home/fimm on fimm
The GPFS filesystems /work and /home/fimm on fimm have become unbalanced. The necessary filesystem rebalancing was started last night and is still running. It will increase the IO load until it finishes, hopefully sometime later today.
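For illustration only, since the entry does not say how the balancing was started or monitored: GPFS provides mmdf to show per-disk usage and mmrestripefs with -b to rebalance data across the disks of a filesystem. A minimal sketch, assuming the GPFS device name for /work is "work" (a hypothetical name, not from the log):

    import subprocess

    FS = "work"  # assumed GPFS device name for the /work filesystem

    # Show how unevenly data is currently spread across the disks.
    subprocess.run(["mmdf", FS], check=True)

    # Rebalance the filesystem. This is IO-intensive and runs until it
    # finishes, which is why the entry warns about increased IO load.
    subprocess.run(["mmrestripefs", FS, "-b"], check=True)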
Problem with /work on fimm
10:15 There is a problem with /work on fimm. We are working on it.
13:45 Update: /work is now accessible. The frontend had to be restarted, and GPFS had to be restarted on one of the NAS boxes. All the compute nodes were OK, so no running jobs were affected.
Backup and HSM problems
The Tivoli Storage Manager database recovery log ran full, and the server could then no longer process backup or HSM requests. The problem was noticed at about 09:30 and resolved by 10:20.
fimm filesystems down
The GPFS filesystems are unavailable because of several failed disks. The problem seems to be identical to what happened on March 30.
http://www.parallaw.uib.no/syslog/56
We had installed a firmware fix for this problem, but that fix seems to be incomplete. A newer, more complete fix will be installed as soon as possible.
Downtime started Monday June 27 at 00:42:51.
fimm came back online at 11:32:00.
Downtime: 10 hours 50 minutes.
The firmware on the SATABlades was upgraded to firmware 9037. This will hopefully fix the 'failing disks' problem.
Maintenance summary
Regatta nodes TO and TRE had downtime from 08:00 to 12:45 for a firmware update.
Regatta node EN had downtime from 08:00 to 16:00 for a firmware update and replacement of a 32GB memory module.
This node had problems booting from its root disks after the hardware changes.
Moving the disks to TO and back again made EN bootable (it is unclear why).
The Linux cluster FIRE had downtime from 08:00 to 16:00 due to its dependency on disks on EN.
Scheduled downtime on TRE+FIRE
The Regatta cluster TRE will be down Tuesday June 14, 08:00-14:00, for firmware upgrades and replacement of a failed memory module on one of the nodes. Running jobs will be killed and will have to be resubmitted after the maintenance stop.
The Linux cluster FIRE will also be down during this period, because it depends on the Regatta as its file server.