Hexagon crashed today around 09:30. We are working on resolving the problem and bringing Hexagon back up.
Update 12:45: Hexagon is up, but we have a hardware problem with the file server that provides the work filesystem.
Work filesystem crashed again on Hexagon and Grunch
The work filesystem has crashed again on Hexagon. We are having a severe problem with the work filesystem on Hexagon and Grunch. We are working to find the root cause; meanwhile, the work filesystem will be unstable on Hexagon. We will keep all users updated on the progress.
We are sorry for the inconvenience and appreciate your understanding.
Hexagon work crashed
The Hexagon work filesystem is down due to a crashed Lustre MDS server. We are working on the issue.
Update 15:00: The Hexagon work filesystem is back online. Jobs that were running during the crash have probably died. We are looking into the root cause of the problem.
HPC course on January 25-26
We are happy to announce a 2-day introductory HPC course at UiB on January 25-26.
https://docs.hpc.uib.no/wiki/HPC_course_2018.1
Hexagon: login2 rebooted
Login2 was rebooted because of hardware errors on its Ethernet card, which made login2 unreachable from the network. The problem should be resolved now.
Hexagon: slow IO on login nodes
Most of the login nodes currently have a high disk (I/O) load, mostly due to ongoing file copying.
You can find less busy nodes with the following workaround:
module load pdsh
pdsh -w login[1-5] uptime
login2: 11:05am up 14 days 19:06, 18 users, load average: 4.62, 4.55, 3.98
login3: 11:05am up 14 days 19:06, 7 users, load average: 2.47, 2.96, 2.89
login1: 11:05am up 14 days 19:06, 9 users, load average: 16.21, 11.97, 13.34
login4: 11:05am up 14 days 19:06, 13 users, load average: 0.68, 0.31, 0.21
login5: 11:05am up 14 days 19:06, 8 users, load average: 40.72, 35.99, 23.38
In this example login4 is the least busy and login5 is heavily overloaded, so you can ssh to login4 and try working there.
We will see what we can do to reduce the effect of file transfers on interactive user sessions. As a general rule, we recommend running file transfers at night to reduce the disk load seen by interactive sessions on the login nodes.
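The manual comparison of load averages above can also be scripted. The following is a minimal sketch, not a supported tool: `least_busy_node` is a hypothetical helper name, and it assumes the `uptime` output has the same "load average: a, b, c" layout as in the example above.

```shell
# Hypothetical helper (not part of pdsh): reads `pdsh ... uptime` output on
# stdin and prints the name of the node with the lowest 1-minute load average.
least_busy_node() {
    awk -F'load average: ' '
        NF == 2 {
            split($1, host, ":")    # host[1] is the node name before the first colon
            split($2, load, ", ")   # load[1] is the 1-minute load average
            if (best == "" || load[1] + 0 < min) {
                min = load[1] + 0
                best = host[1]
            }
        }
        END { if (best != "") print best }
    '
}
```

A possible use, after `module load pdsh`, is `pdsh -w 'login[1-5]' uptime | least_busy_node`. The 1-minute value is only a snapshot, so the result may change quickly while large copy jobs start and stop.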
Hexagon: scheduled maintenance on May 22nd
We will have planned maintenance on Hexagon, starting on May 22nd at 09:00 AM. The maintenance is expected to last one day.
During the maintenance we will carry out software and firmware upgrades as well as service the hardware.
The job submission system has a reservation in place, so jobs that cannot finish before the maintenance starts will not be started.
/work-common will be unavailable during the maintenance period and will be unmounted from Grunch and Fimm.
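Because of the reservation, a job only starts if its requested walltime ends before the maintenance begins. As a rough illustration of that arithmetic, here is a small sketch; `max_walltime_before` is a hypothetical helper, both arguments are Unix epoch seconds, and the scheduler's own rules may be slightly stricter than this simple subtraction.

```shell
# Hypothetical helper: prints the longest walltime (HH:MM:SS) that still ends
# before a reservation begins. Arguments: current time and reservation start,
# both as Unix epoch seconds.
max_walltime_before() {
    now=$1
    start=$2
    left=$(( start - now ))
    if [ "$left" -le 0 ]; then
        # The reservation has already started; nothing fits.
        echo "00:00:00"
        return 1
    fi
    printf '%02d:%02d:%02d\n' \
        $(( left / 3600 )) $(( left % 3600 / 60 )) $(( left % 60 ))
}
```

Assuming GNU `date` is available, it could be called as `max_walltime_before "$(date +%s)" "$(date -d '2017-05-22 09:00' +%s)"`. Requesting a walltime somewhat below the printed value leaves room for the job to be scheduled.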
UPDATES:
- 2017-05-22 09:00: Maintenance has started.
- 2017-05-22 14:16: /work-common is available again and remounted on Grunch.
- 2017-05-22 15:59: Maintenance has finished and access to Hexagon is re-opened.
Hexagon: decommissioned end of June 2017
All local CPU quotas will cease after 01.04.2017.
Login will be closed after 30.06.2017, so please make sure that all your data is transferred before then. Please plan this well in advance so that we avoid overloading the filesystem.
Software Developer Course for the new HPC-system
UNINETT Sigma2 is organizing a Software Developer Course for the new HPC-system.
We are pleased to inform you that there will be a second HPC course this autumn in Trondheim, on 30 November - 1 December.
Registration is open at https://response.questback.com/uninett/hpctrainingseminar
Please refer to the announcement on www.sigma2.no for further details.
Hexagon: queue system problem
We have a problem with the queuing system on Hexagon and we are working on it. We will get back with more details later.
Update 13:15: The queue system has recovered. A handful of services that the queue system and its supporting tools rely on were in a "split-brain" or hanging state.