The norgrid.bccs.no node was having trouble scanning the
SCSI buses. This seems to have been a problem with the SCSI
backplane. IBM replaced it today, and the problem disappeared.
disk crash on backupserver – hsmbcm unavailable
A disk crashed on the backup and fileserver for /net/bcmhsm,
so the /net/bcmhsm filesystem was unavailable for a couple
of minutes while the server crashed and rebooted.
XLF compiler upgrade
The February 2004 XLF compiler fixes were installed. Bugs fixed:
IY43656 - error msg 1587-113 is not very useful
IY49972 - Incorrect output at -O5
IY50157 - Seg Fault at (WHERE(.NOT.LOG) TAB=TAB2)
IY50896 - -qcheck option causes SIGTRAP
IY51019 - vector routines incorrect result with -qhot
IY51075 - Error in calcStk fails
IY51167 - Free heap overwritten when call to MPI_IRECV
IY51237 - -qsmp produces ICE in search_threadprivate_mbr
IY51264 - ICE: lbound(typ%i) on array of derived type
IY51426 - -qipa results in unresolved symbol dbgincut
IY51436 - ICE when compiling three nested modules.
IY51486 - EOSHIFT() function produces INCORROUT
IY51597 - improve real-to-integer conversion performance
IY51634 - 3 objects allocated with same statement
IY52006 - ICE in XLF V8.1.1 for AIX
IY52183 - Missing procedure list entries
IY52363 - ICE in IPA with -C and -qsmp=omp
IY52827 - Large environment causes compiler crash
IY52928 - -O3 causes wrong line numbers to be created
IY53532 - Feb 2004 XL Fortran V8.1 for AIX Compiler PTF
IY53533 - Feb 2004 XL Fortran V8.1 for AIX Runtime PTF
IY53015 - SMP Runtime Lib 1.3.8 January 2004 PTF
IY53435 - XLOPT 132 February 2004 PTF
Replaced failing LTO drive
IBM replaced a failing LTO-1 drive in the tape robot.
Was: tape1/rmt3/serial=6811092091
Is now: tape1/rmt3/serial=6811147790
NOTUR: deadline for application for CPU hours
Reminder: application for CPU-hours for April - September 2004
The CPU-hour grants given out by the NOTUR programme on the
HPC facilities at the four Norwegian universities expire
March 31.
Application forms for the next quota period
(April-September) and detailed instructions can be found on
http://www.notur.org/metacenter/grants/index.html
Available hardware:
http://www.notur.org/metacenter/hardware.html
Available software:
http://www.notur.org/metacenter/sw-overview.html
Deadline for application: MONDAY MARCH 1 2004
Power outage
The power failed in large parts of Bergen, and the UPS
didn't cover the downtime.
Downtime: 20040213 14:01 -> 15:12 = 1 hour 11 minutes on all nodes.
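The downtime figure above can be reproduced from the two timestamps. A minimal sketch, assuming the `YYYYMMDD HH:MM` layout used in this entry; the function name `downtime` is hypothetical and not part of any site tooling:

```python
from datetime import datetime

def downtime(start: str, end: str, fmt: str = "%Y%m%d %H:%M"):
    """Return (days, hours, minutes) elapsed between two timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # timedelta keeps whole days separately; split the remainder into h/m.
    hours, rem = divmod(delta.seconds, 3600)
    return delta.days, hours, rem // 60

# Power outage of 20040213, start and end as logged above.
print(downtime("20040213 14:01", "20040213 15:12"))  # (0, 1, 11)
```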
Tape robot firmware upgrades
The tape robot and drives received a firmware upgrade to correct
some problems we have had with the tape drives. Access to
/migrate and /net/bcmhsm may have been slow, or may even have
failed temporarily, during the upgrade (14:00-15:00).
Support for prioritized and unprioritized NFR quotas
The Maui scheduler will now notice when a project has run out
of prioritized CPU quota, and move its jobs to the FREECPU QoS.
There they will get the lowest priority.
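The rule described above can be illustrated as a small decision function. This is only a sketch of the logic, not the actual Maui configuration or site script: the function name, its parameters, and the `NORMAL` QoS name are hypothetical, and only the `FREECPU` name comes from this entry.

```python
def select_qos(used_cpu_hours: float, prioritized_quota: float,
               normal_qos: str = "NORMAL") -> str:
    """Pick the QoS for a project's jobs (illustrative only)."""
    # Once the prioritized quota is used up, jobs fall back to FREECPU,
    # which this log describes as having the lowest priority.
    if used_cpu_hours >= prioritized_quota:
        return "FREECPU"
    return normal_qos

print(select_qos(1200.0, 1000.0))  # FREECPU
print(select_qos(300.0, 1000.0))   # NORMAL
```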
UPS battery undervoltage
The UPS failed at some point during the last couple of days.
The display said "ALARM: BATTERY UV", so it looks like the
battery voltage had dropped too low (UV=UnderVoltage), and
the UPS therefore went into bypass mode.
I switched it back to battery-backed power, and it
immediately went into normal operation.
node14 and node15 rebooted
The networking failed on node14 and node15 of the Linux
cluster. Both nodes were complaining about::
eth0: card reports no resources.
Not sure if this is a hardware or software bug, but it has
happened once before. This time we lost 4 of flikka's jobs.
They were most likely doing heavy NFS I/O; maybe this
triggered the crash?
Will upgrade all nodes to the latest kernel from Red Hat to
see if this fixes the problem.
Node14 downtime: 20:25 20040123 - 07:45 20040126 = 2 days,
11 hours, 20 minutes
Node15 downtime: 12:26 20040124 - 07:45 20040126 = 1 day,
19 hours, 19 minutes
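To spot this failure early on other nodes, the kernel message quoted above can be searched for in collected log text. A minimal sketch, assuming the log lines are already available as strings; the function name and the second sample line are hypothetical:

```python
import re

# Message text as quoted in this entry; the interface number is generalized.
NO_RESOURCES = re.compile(r"\b(eth\d+): card reports no resources")

def interfaces_reporting_no_resources(lines):
    """Return the sorted list of interfaces that logged the error."""
    hits = set()
    for line in lines:
        match = NO_RESOURCES.search(line)
        if match:
            hits.add(match.group(1))
    return sorted(hits)

sample = [
    "eth0: card reports no resources.",
    "some unrelated kernel message",  # hypothetical filler line
]
print(interfaces_reporting_no_resources(sample))  # ['eth0']
```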