Grunch server has a problem with the Lustre filesystem

saerda • June 28, 2016

We have discovered a Lustre-related problem on the grunch server. The server will be rebooted while we continue to investigate. Until we find the cause, the grunch server will be unstable.

Sorry for the inconvenience.

Hexagon: login1 rebooted

Lóránd Szentannai • June 28, 2016

Login node login1 ran out of memory and had to be rebooted.

The following jobs have been affected: 1890515, 1890671, 1891136, 1891264, 1891269, 1891328, 1891105, 1891385.

DSA type SSH keys deprecated

Lóránd Szentannai • June 14, 2016

DSA type SSH keys are considered unsafe and will be deprecated on UiB's HPC resources starting on 28 June 2016. We therefore kindly ask you to check your SSH keys and replace DSA type keys with RSA type keys.

You may use the "ssh-keygen -t rsa" command to generate RSA type keys. For more options, please see the man page of the ssh-keygen command.
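
As an illustration, checking for DSA keys and generating a replacement could look like the sketch below. The helper name list_dsa_keys, the 4096-bit key size, and the file paths are assumptions made for this example and are not part of the announcement. DSA public keys are recognised by the "ssh-dss" key type string.

```shell
# Hypothetical helper: print file:line:content for every DSA ("ssh-dss")
# public key found in the files given as arguments.
list_dsa_keys() {
    # /dev/null is added so grep always prefixes matches with the file name.
    grep -n -E '(^|[[:space:]])ssh-dss[[:space:]]' "$@" /dev/null 2>/dev/null
}

# Example usage (paths are the usual OpenSSH defaults, adjust as needed):
#   list_dsa_keys ~/.ssh/authorized_keys ~/.ssh/*.pub
#
# Generate a replacement RSA key pair (key size and file name are assumptions):
#   ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa
```

After generating a new key, the new public key has to be added to authorized_keys on the remote side and the old ssh-dss line removed.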

Hexagon: SSH server upgraded

Lóránd Szentannai • June 8, 2016

We have upgraded the SSH server on hexagon to address some security issues.

If you can no longer log in with your SSH key, but only with your password, please check the ~/.ssh/authorized_keys file for entries like from="myhost.example.com" and change the hostname to an IP address. Starting with this version, only IP addresses, not host names, may be used in the authorized_keys file.
Edit: The authorized_keys file allows fine-tuning of client access. To find out more, see the sshd(8) man page.
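
To locate affected entries, something like the following sketch may help. The function name list_from_patterns is hypothetical, and 192.0.2.10 in the comment is a documentation placeholder address, not a real host.

```shell
# Hypothetical helper: print the value of every from="..." option
# found in an authorized_keys file (default: ~/.ssh/authorized_keys).
list_from_patterns() {
    sed -n 's/.*from="\([^"]*\)".*/\1/p' "${1:-$HOME/.ssh/authorized_keys}"
}

# An entry restricted by host name, such as
#   from="myhost.example.com" ssh-rsa AAAA... user@myhost
# would be rewritten with the client's IP address, for example
#   from="192.0.2.10" ssh-rsa AAAA... user@myhost
```

Any value printed by the helper that is a host name rather than an IP address is a candidate for the change described above.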

DSA key types are no longer considered safe and we will phase them out soon. A new syslog entry with more information will be added in the coming days.

Grunchfs file system change

saerda • June 3, 2016

This only concerns users who have data under /export/grunchfs/, not /fimm/home.

Dear Grunchfs file system users:

On grunch.bccs.uib.no, we are in the process of changing the underlying file system of grunchfs from GPFS to XFS. This is due to the increasing maintenance cost of the GPFS file system. Current usage of grunchfs is 221 TB. As part of the migration we have to move the data from grunchfs to a temporary file system. We have created a temporary file system called grunchxfs and mounted it on grunch under /mnt/ro/grunchxfs. We have already completed the first rsync between the two file systems and plan to run two more to finish the whole process.

During the second rsync grunchfs will still be online, but during the last rsync we will take it offline. This is necessary to ensure file system consistency. It means that while the last rsync runs, grunchfs will not be accessible to users for some period of time. We will have a more accurate estimate of the downtime when the second rsync finishes. To keep this downtime as short as possible (it depends on the size of the file system), we kindly ask all grunch users to do the following:

check and remove any duplicated files and folders,
check and remove any unnecessary files and folders,
if you have a huge number of small files (inodes), as we have observed for some users, please pack them up (tar) where possible.
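
A minimal sketch of the last point, assuming a POSIX shell with find and a tar that supports gzip compression (-z). The function names count_files and pack_dir are made up for this example.

```shell
# Hypothetical helper: count regular files below a directory,
# to spot directories holding very many small files.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical helper: pack a directory into <dir>.tar.gz next to it,
# then list the archive once to confirm that it is readable.
pack_dir() {
    dir=${1%/}
    tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" &&
        tar -tzf "$dir.tar.gz" > /dev/null
}

# Example usage (the path is a placeholder):
#   count_files /export/grunchfs/myproject/results
#   pack_dir /export/grunchfs/myproject/results
# Remove the original directory only after checking the archive.
```

Packing reduces the number of inodes that rsync has to walk, which is what the request above is aiming for.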

These steps will significantly speed up the migration to XFS and keep the downtime as short as possible.

Thank you in advance and we appreciate your understanding.

Please contact hpc-support@hpc.uib.no if you have any questions.

Update 10:30 08-06-2016

The second rsync is running and will hopefully finish late today. We are planning the last offline rsync.

Update 10:40 10-06-2016

We have postponed the grunch server downtime to next Friday, 17-06-2016, starting at 09:00. All grunch users must log off from the grunch server before 09:00. We will then run the last offline rsync; grunchfs will hopefully be online again by 10:00 on Monday, 20-06-2016. Please plan your work accordingly.

Update 11:00 20-06-2016

We will mount the temporary grunchfs again as soon as the last rsync process finishes (around 13:00 today). Since this is a temporary file system, you may experience reduced performance on grunchfs.

Update 13:30 20-06-2016

The rsync processes did not finish as planned, therefore we cannot open access to the grunch server yet. We are monitoring the process closely and will open access as soon as we can.

Update 15:30 20-06-2016

The rsync processes are still running; we expect them to finish later today.

Sorry for the inconvenience.

Update 10:30 21-06-2016

Maintenance is finally finished and the grunch server is back online. We will continue the file system change and keep you updated.

Update 12:00 28-06-2016

We plan one last downtime to finish the whole migration process. It will start at 13:00 on 01-07-2016 and end at 13:00 on 04-07-2016. During this time we will run the last offline rsync. Users will not be able to access the grunch server during this downtime.

We appreciate your understanding and support.

Please do not hesitate to contact us if you have any further questions.

Update 10:36 04-07-2016

Due to unexpected changes and high usage of the grunch file system, our rsync process is delayed. Therefore we have to extend the downtime until Wednesday 06-07-2016. By then we plan to have completed the grunchfs move. The downtime may finish earlier; we will open access as soon as the last rsync finishes.

We appreciate your understanding and support.

Please do not hesitate to contact us if you have any further questions.

Update 10:00 06-07-2016

We finished the grunchfs file system transfer yesterday afternoon; grunchfs is now mounted again as XFS.

Grunch server will be rebooted Friday 12:00 noon

saerda • June 1, 2016

The grunch server will be rebooted due to a kernel update and some other library updates. All users are advised to log out before 12:00 noon.

Fimm Lustre file system crash

saerda • May 23, 2016

We currently have a problem with the /fimm Lustre file system. We are working to resolve it and will keep you updated.

During this time, the queue system and /fimm will not be stable.

Thank you for your understanding, and sorry for the inconvenience.

Hexagon: new Intel compiler

Lóránd Szentannai • May 19, 2016

Intel compiler version 16.0.3 was installed on hexagon. This is a replacement for version 16.0.1.

A list of enhancements and changes is available at https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2016-update-3-readme.

Hexagon: rebooted

Lóránd Szentannai • April 27, 2016

Both metadata servers and all OSSes serving the /work filesystem crashed.
We had to stop the machine and power cycle hexagon.

Hexagon: reboot needed

Lóránd Szentannai • April 25, 2016

All OSTs for the /work filesystem are in read-only mode and we need to reboot hexagon. We will come back with more information later.

Update:
15:20 25-04-2016 OST 8 has corrupted data and was marked read-only by the system. There are 379 inodes containing multiply-claimed blocks. We are trying to recover from this and to identify the corrupted files. Owners of the identified corrupted files will be notified.
If you have corrupted data on /work, please contact us at support-uib@notur.no.

15:45 25-04-2016 Users were logged out and access closed in order to be able to perform maintenance on the system.

16:35 26-04-2016 Corrupted files were identified and the /work filesystem is usable again. Hexagon was rebooted and access has been reopened.
We will run further checks on the /work filesystem while keeping it online. After this last check is finished, the owners of corrupted files will be notified, as mentioned earlier.

HPC Syslog

Log of changes and events on UiB's HPC systems