Hexagon reboot after power blink

Alexander Oltu • December 8, 2017

After a series of power blinks, the Hexagon high-performance network and some of the nodes are in an inconsistent state. We have to restart the whole machine.

Hexagon is up after reconfiguration

Alexander Oltu • December 5, 2017

New configuration:

312 compute nodes
9984 processing elements
/work - 175TB
/shared - 217TB
SLURM scheduler
UiB usernames for UiB users

Please find below a short list of changes:

All users (except IMR) have to reapply for access at https://skjemaker.app.uib.no/view.php?id=2901837
SLURM is the new job scheduler
1. Documentation link https://docs.hpc.uib.no/wiki/Job_execution_(Hexagon)
2. External Torque/Moab to Slurm reference https://www.glue.umd.edu/hpcc/help/slurm-vs-moab.html
Please use Support for help and support.
The hexagon@hpc.uib.no mailing list will shortly be migrated to a self-managed mailing list. All current mailing list users will be removed soon. If you want to subscribe, please check back on our syslog at https://syslog.hpc.uib.no in a week; we will post a link to the new mailing list there.

We did not manage to replace all the HW components we had planned to replace during this maintenance. We are planning a shorter maintenance sometime in winter/spring to finish this job.

All software provided as modules is still available; we will review it and remove old modules in the coming weeks.

There are some major changes, such as the new job scheduler and the HW configuration. Some things may have stopped working for you, and some configurations are not yet finally in place. We will continue improving this and updating the documentation over the following weeks, and we ask for your patience. All feedback is of course welcome at support@hpc.uib.no.

The following changes will come in the coming months:

/shared will be bigger in a few weeks
Bigger /home after the next maintenance window

All local HPC will be down for ~2 weeks starting tomorrow

Alexander Oltu • November 20, 2017

This is a reminder that tomorrow morning (2017.11.21) Hexagon and Fimm will be shut down for the reconfiguration.

ALL DATA on /work, /work/shared (/work-common), /home and /fimm filesystems will be deleted.

Please find more details at https://docs.hpc.uib.no

Hexagon: login2 rebooted

Alexander Oltu • October 16, 2017

Login2 was rebooted due to hardware errors on its Ethernet card, which made login2 unreachable over the network. The problem should be resolved now.

Hexagon: slow IO on login nodes

Alexander Oltu • October 11, 2017

Most of the login nodes currently have a high disk (IO) load, mostly due to ongoing copy processes.

You can find less busy nodes with the following workaround:

module load pdsh
pdsh -w login[1-5] uptime
login2: 11:05am up 14 days 19:06, 18 users, load average: 4.62, 4.55, 3.98
login3: 11:05am up 14 days 19:06, 7 users, load average: 2.47, 2.96, 2.89
login1: 11:05am up 14 days 19:06, 9 users, load average: 16.21, 11.97, 13.34
login4: 11:05am up 14 days 19:06, 13 users, load average: 0.68, 0.31, 0.21
login5: 11:05am up 14 days 19:06, 8 users, load average: 40.72, 35.99, 23.38

In this example login4 is the least busy and login5 is heavily overloaded, so you can ssh to login4 and try working there.
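
The same check can be scripted so that the nodes come out sorted by load. This is only a minimal sketch: it assumes the uptime output format shown above (a "load average: " label followed by comma-separated values), and rank_by_load is a hypothetical helper name, not something installed on Hexagon.

```shell
# Rank login nodes by their 1-minute load average, least busy first.
# Reads `pdsh ... uptime` output on stdin and prints "<load> <host>" lines.
# Assumption: each line looks like "loginN: ... load average: a, b, c".
rank_by_load() {
    awk -F'load average: ' 'NF == 2 {
        split($1, head, ":")    # head[1] is the host name that pdsh prefixes
        split($2, load, ", ")   # load[1] is the 1-minute load average
        print load[1], head[1]
    }' | LC_ALL=C sort -n       # C locale so "." is read as the decimal point
}

# Usage on a login node (after `module load pdsh`):
#   pdsh -w login[1-5] uptime | rank_by_load | head -n 1
```

The first line of the output is the least loaded node. Lines that do not contain a load average (for example pdsh error messages) are skipped.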

We will see what we can do to reduce the effect of file transfers on interactive user sessions. As a general rule, we recommend running file transfers at night to reduce the disk load on interactive sessions on the login nodes.

Grunch: down

Lóránd Szentannai • October 5, 2017

Both operating system disks in Grunch failed within a short timeframe, making the system inoperable. We are trying to recover from the failure ASAP.

Update 14:00_06.10.2017: The Grunch server is up again. Both OS disks have been replaced and the server has been reinstalled.

Hexagon: planned power maintenance – September 25th

Lóránd Szentannai • September 19, 2017

Maintenance on the electric power line in the server room is planned for the 25th of September. Therefore Hexagon, the related file systems and the storage enclosures have to be taken offline.

The maintenance is scheduled to start at 20:00. According to plan, Hexagon should be back by the end of the day.

Running jobs will be stopped. All scheduled jobs in the queue will be started automatically when the system is operational again.

Update 20.09.2017: Please note the time change. The maintenance window has been moved from Saturday, 23rd of September to Monday, 25th of September 20:00.

Update 26.09.2017: The UPS maintenance was completed last night, but we have a problem bringing Hexagon online due to some filesystem storage issues. We are working on it and apologize for the inconvenience.

Update 26.09.2017: Hexagon is up and available again since 09:08 AM.

Update 14:00_26.09.2017: The Hexagon work file system crashed unexpectedly; we are working on it. Sorry for the inconvenience.

Update 14:40_26.09.2017: Hexagon has to be taken down due to hardware issues related to the work filesystem. We are trying to resolve the problems as soon as possible.

Update 16:40_26.09.2017: Hexagon is back online. The problem with the work filesystem is resolved.

/migrate and /bcmhsm offline on September 4th

Alexander Oltu • September 4, 2017

Due to physical rearrangements in the server room the tape robot hosting /migrate and /bcmhsm will be unavailable today after 12:00 for several hours. Updates will be posted here.

Update 2017-09-11:

Uni Computing is experiencing trouble with the backend holding /migrate and /bcmhsm, and it is not yet known when this will be fixed. As these file systems were supposed to be decommissioned in June this year, we will not mount them back in their ordinary place even after they are healthy again. However, as agreed, we will finish the transfer of IMR/HI files as soon as the file systems are healthy. We will issue a separate update for this.
Users other than IMR/HI who need files from those file systems are advised to contact the Uni Computing helpdesk at trouble@computing.uni.no.

Hexagon: IMR volumes offline

Lóránd Szentannai • August 25, 2017

The network equipment connecting Hexagon and IMR has to be changed, which requires a downtime of at most two hours.

Therefore IMR volumes will be unmounted on Tuesday, 29th of August from 09:00 AM for approximately two hours. By that time, please stop all your processes on Hexagon which are using the IMR volumes.

Hexagon & Grunch: Planned downtime for 25th of August

Lóránd Szentannai • August 18, 2017

On Friday, 25th of August, maintenance on the electric lines in the server room will be carried out. Therefore Hexagon must be switched off. All related file systems (/work, /work-common) will also be offline.

The maintenance will start at 07:00 and, according to the plan, should last until 13:00.

During this time /work-common will not be available on Grunch.

Update:

25.08.2017 07:00: Maintenance has started.
25.08.2017 12:50: Storage controller issues are delaying startup of the machine. We are working on a fix.
25.08.2017 15:05: Storage controller issues were remediated. Some disks of the /work-common filesystem are rebuilding, so a performance impact may be expected for a couple of days.
25.08.2017 15:20: Hexagon is up again.

HPC Syslog

Log of changes and events on UiB's HPC systems