Downtime

GPFS hang on node “en”

lsz075 • March 2, 2007

09:20 The GPFS daemon hung on node "en" of the regattas. The node had to be rebooted, and thus the jobs on node "en" were lost. The login node "tre" was unreachable from 09:20 to 10:00 (no jobs lost).

Update: 11:00 node "en" is up again.

Reboot of fimm frontend

lsz075 • January 23, 2007

Reboot of fimm frontend to clear filesystem hang.

Hang of fimm frontend

lsz075 • November 12, 2006

The fimm frontend hung; the hang was discovered at 00:10 on Sunday 12.11. The exact time the frontend went down is uncertain, but it may have been some time on Saturday evening. The cause seems to be extreme load from a large number of httpd processes (reason unknown). The frontend was rebooted and, at the same time, given some hardware and software maintenance, including a kernel upgrade that had been planned for a later date.

No jobs were affected, frontend up again at 13:00 Sunday 12.11.

Reboot of fimm frontend

lsz075 • October 18, 2006

Reboot of fimm frontend due to excessive memory usage by an interactive process. No jobs affected. Down: 10min.

Reboot of tre,to,en

lsz075 • September 18, 2006

Due to the NFS hang (see entry for Sep. 16), all the nodes (tre,to,en) had to be rebooted - all running jobs were lost.

Please check all jobs, and in particular any jobs that should have copied data to or from /migrate or /net/bcmhsm!!

TO: down from 07:55 to 08:20
TRE: down from 08:35 to 08:55
EN: down from 09:15 to 08:35

Total downtime: 34 cpuhours.

NFS problem accessing /net/bcmhsm on tre,to,en

lsz075 • September 16, 2006

The HSM filesystem for Bjerknes, /net/bcmhsm (mounted from jambu; NB: symlinked from /migrate for some users), is currently not accessible on tre,to,en due to an NFS hang that seems to be related to an NFS client bug. We are looking into the problem. We may have to reboot some or all of the machines to clear the hang. For urgent access to files, contact support-uib@notur.no and we will retrieve the files from backup.

Reboot of fimm frontend

lsz075 • August 21, 2006

Fimm frontend had a hang due to excessive memory-swapping. Rebooted (downtime 5 minutes).

fimm maintenance upgrade to Rocks 4.1 (CentOS 4.3)

lsz075 • August 18, 2006

fimm will be upgraded to Rocks 4.1 (CentOS Linux 4.3).
More updates to follow.

Update 11:30: fimm is now back online. We had some trouble with the cpu-accounting "qbank" program.

Scheduler / passwd problems on fimm

lsz075 • June 22, 2006

The scheduler on fimm has some problems at the moment. The passwd distribution system to the nodes does not work properly. Some jobs will fail to start, or fail to stop properly after starting. We are working on fixing it.

Update: The problem is fixed now. Some of the nodes will be re-installed (fixed) after the current running jobs finish.

Reboot of node “to” of the regattas

lsz075 • June 19, 2006

Due to a security upgrade, node "to" will be rebooted. No jobs will be affected.

Update 09:25: machine up again. Downtime: 30 min. / 16 cpuhours

HPC Syslog

Log of changes and events on UiB's HPC systems
