Hexagon Crashed

saerda • March 11, 2018

Hexagon crashed today around 09:30. We are working on resolving the problem and bringing Hexagon back up.

Update 12:45: Hexagon is up, but we have a hardware problem with the file server that provides the work filesystem.

Work filesystem crashed again on Hexagon and Grunch

saerda • February 27, 2018

The work filesystem has crashed again on Hexagon. We are having a severe problem with the work filesystem on Hexagon and Grunch. We are working to find the root cause; meanwhile, the work filesystem will be unstable on Hexagon. We will keep all users updated on the progress.

We are sorry for the inconvenience and appreciate your understanding.

Hexagon work crashed

saerda • February 16, 2018

The Hexagon work filesystem is down due to a crashed Lustre MDS server. We are working on the issue.

Update 15:00: The Hexagon work filesystem is back online. Jobs that were running during the crash have probably died. We are looking into the root cause of the problem.

HPC course on January 25-26

Alexander Oltu • January 2, 2018

We are happy to announce a 2-day introductory HPC course at UiB on January 25-26.
https://docs.hpc.uib.no/wiki/HPC_course_2018.1

Hexagon: login2 rebooted

Alexander Oltu • October 16, 2017

Login2 was rebooted due to hardware errors with its Ethernet card, which had made login2 unreachable from the network. The problem should be resolved now.

Hexagon: slow IO on login nodes

Alexander Oltu • October 11, 2017

Most of the login nodes currently have a high disk (IO) load, mostly due to copy processes that are running.

You can find less busy nodes with the following workaround:

module load pdsh
pdsh -w login[1-5] uptime
login2: 11:05am up 14 days 19:06, 18 users, load average: 4.62, 4.55, 3.98
login3: 11:05am up 14 days 19:06, 7 users, load average: 2.47, 2.96, 2.89
login1: 11:05am up 14 days 19:06, 9 users, load average: 16.21, 11.97, 13.34
login4: 11:05am up 14 days 19:06, 13 users, load average: 0.68, 0.31, 0.21
login5: 11:05am up 14 days 19:06, 8 users, load average: 40.72, 35.99, 23.38

In this example login4 is the least busy and login5 is heavily overloaded, so you can ssh to login4 and try working there.
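
The comparison can also be scripted. The following is only a minimal sketch, assuming the `uptime` output has the same `load average:` layout as shown in the example; `least_loaded` is a hypothetical helper name, not a tool installed on Hexagon. It prints the node with the lowest 1-minute load average.

```shell
# Hypothetical helper: read "host: ... load average: a, b, c" lines on stdin
# and print the host with the lowest 1-minute load average.
least_loaded() {
    awk -F'load averages?: *' '
        NF == 2 {
            split($1, head, ":")      # host name is the text before the first colon
            split($2, loads, /, */)   # loads[1] is the 1-minute average
            print loads[1], head[1]
        }' | LC_ALL=C sort -n | head -n 1 | awk '{ print $2 }'
}

# Usage on a login node (assumes "module load pdsh" has been run):
#   pdsh -w login[1-5] uptime | least_loaded
```

A low load average does not guarantee a responsive node, so treat the result as a hint rather than a definitive answer.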

We will see what we can do to reduce the effect of file transfers on interactive user sessions. As a general rule, we recommend running file transfers at night to reduce the disk load on the login nodes during interactive sessions.

Hexagon: scheduled maintenance on May 22nd

Lóránd Szentannai • May 8, 2017

We will have planned maintenance on Hexagon, starting on May 22nd at 09:00. The maintenance is expected to last one day.
During the maintenance we will carry out software and firmware upgrades as well as service the hardware.

The job submission system has a reservation in place, so jobs that cannot finish before the maintenance starts will not be started.

/work-common will be unavailable during the maintenance period and will be unmounted from Grunch and Fimm.

UPDATES:

2017-05-22 09:00: Maintenance has started.
2017-05-22 14:16: /work-common is available again and remounted on Grunch.
2017-05-22 15:59: Maintenance has finished and access to Hexagon is re-opened.

Hexagon: decommissioned end of June 2017

Lóránd Szentannai • January 25, 2017

All local CPU quotas will cease after 01.04.2017.

Login will be closed after 30.06.2017 so please make sure that all your data is transferred prior to that. Please plan this well in advance so that we avoid overload of the filesystem.

Software Developer Course for the new HPC-system

Lóránd Szentannai • November 14, 2016

UNINETT Sigma2 is organizing a Software Developer Course for the new HPC-system.

We are pleased to inform you that there will be a second HPC course this autumn in Trondheim, on 30 November and 1 December.
Registration is open at https://response.questback.com/uninett/hpctrainingseminar

Please refer to the announcement on www.sigma2.no for further details.

Hexagon: queue system problem

saerda • August 23, 2016

We have a problem with the queuing system on Hexagon and are working on it. We will get back with more details later.

Update 13:15: The queue system has recovered. A handful of services that the queue system and supporting tools rely on were in a "split-brain" or hanging state.

HPC Syslog

Log of changes and events on UiB's HPC systems