Scheduled maintenance


Update 12.11 21:30:

The migration is complete. We have brought the Lustre filesystem up with the new MDS server. The /shared and /work filesystems are mounted on cyclone.hpc.uib.no and grunch.hpc.uib.no. Hexagon is up and running again. Samba and NFS exports are also running on Leo.hpc.uib.no.

Update 12.11 15:00:

The migration is still ongoing. We will keep you posted.

Update 02.11 09:30:

Due to the delayed delivery of physical parts, we have to postpone the downtime to 12th November. The corresponding node reservation on Hexagon is also postponed to 12th November.

Thank you for your consideration!

Dear HPC User,

The metadata server for the /shared file system has to be replaced/upgraded, and therefore the file system must be unmounted from all clients.

This will result in scheduled downtime for the Hexagon, Grunch and Cyclone machines. We will start at 08:00 on the 5th of November and expect to be finished by the end of the working day.

Thank you for your consideration!

Hexagon will have planned maintenance on 15th August from 08:00.

Currently the /work filesystem is running at reduced performance due to a broken storage controller.

During the maintenance, we will replace the broken storage controller for the storage system where the /work filesystem resides. Due to the high risk of data loss, we urge all /work filesystem users to back up their important, non-reproducible data.
Please keep in mind that /work is a scratch filesystem and is not backed up.
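
As a rough illustration of such a backup, the Python sketch below copies a directory tree to another location. The function name `backup_tree` and the example paths in the comments are hypothetical and not part of any site tooling. For large data sets a dedicated transfer tool may be a better fit.

```python
import shutil
from pathlib import Path


def backup_tree(source: Path, destination: Path) -> int:
    """Copy everything under `source` into `destination`.

    Returns the number of regular files found under `source`.
    Requires Python 3.8+ for the `dirs_exist_ok` argument.
    """
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # Existing files in the destination with the same name are overwritten.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return sum(1 for path in source.rglob("*") if path.is_file())


# Example call (both paths are assumptions, adjust to your own layout):
# backup_tree(Path("/work/myuser/results"), Path("/shared/myproject/backup/results"))
```

The destination should be on storage other than /work, otherwise the copy is exposed to the same risk as the original.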


After the maintenance we expect the /work filesystem to be back at full performance.

We appreciate your understanding.

Update 15.08.2018 11:00 

Hexagon maintenance is over. We have successfully replaced the broken controller. The /work file system is back to its expected performance.

Update 23.05.2018 15:19: The file system issues have been resolved and the file system is mounted again on both Hexagon and Grunch. Access is reopened.


A downtime is scheduled for Hexagon and the /shared file system on Wednesday, 23rd of May. It will start at 09:00 and we expect to have the systems back by 16:00 the same day.


We apologize for any inconvenience this downtime may cause you.

Hexagon has accumulated a number of hardware failures that have to be fixed to ensure stable operation. Hexagon will be fully stopped and the login nodes will not be accessible. We expect to finish in 4 hours.

We have also discovered a bug in our SLURM statistics. Because of it, we will have to delete all jobs from the queue system during this downtime, including PENDING jobs.

We apologize for any inconvenience this downtime may cause you.

Date: March 26
Timeslot: 9:00-13:00

Update:

  • 26.03.18 15:20 The machine is still down due to hardware issues. We are working on it. We will keep you updated.
  • 27.03.18 14:00 Hardware problems are fixed and access to the machine is reopened now.

We will shut down Hexagon for maintenance on January 3rd at 09:00 to continue the reconfiguration tasks. We expect to have Hexagon up again the same day at around 16:00.

Update 2018-01-03 19:23:
  • Access to Hexagon is re-opened.
  • The /work file system had to be reformatted. Please accept our apologies for any inconvenience this may have caused.
  • The /home storage area has been increased and the default quota doubled from 10GB to 20GB for each user.

New configuration:


  • 312 compute nodes
  • 9984 processing elements
  • /work - 175TB
  • /shared - 217TB
  • SLURM scheduler
  • UiB usernames for UiB users

Please find below a short list of changes:

  1. All users (except IMR) have to reapply for access at https://skjemaker.app.uib.no/view.php?id=2901837
  2. SLURM is the new job scheduler
    1. Documentation link https://docs.hpc.uib.no/wiki/Job_execution_(Hexagon)
    2. External Torque/Moab to Slurm reference https://www.glue.umd.edu/hpcc/help/slurm-vs-moab.html
  3. Please use Support for help and support.
  4. The hexagon@hpc.uib.no mailing list will shortly be migrated to a self-managed mailing list. All current mailing list users will be removed soon. If you want to subscribe, please check our syslog at https://syslog.hpc.uib.no in a week, where we will post a link to the new mailing list.

We did not manage to replace all the HW components we had planned to replace during this maintenance. We are planning a shorter maintenance sometime in winter/spring to finish the job.

All software provided as modules is still available. We will review it and remove old modules in the coming weeks.

There are some major changes, such as the new job scheduler and the HW configuration. Some things may have stopped working for you, and some configurations are not yet final. We will continue improving this, as well as updating the documentation, during the following weeks, and we ask for your patience. All feedback is of course welcome at support@hpc.uib.no.

The following changes will come in the next months:

  • /shared will be bigger in a few weeks
  • Bigger /home after the next maintenance window

Maintenance on the electric power line in the server room is planned for the 25th of September. Therefore Hexagon, the related file systems and the storage enclosures have to be taken offline.

The maintenance is scheduled to start at 20:00. According to plan, Hexagon should be back by the end of the day.

Running jobs will be stopped. All scheduled jobs in the queue will be started automatically when the system is operational again.

Update 20.09.2017: Please note the time change. The maintenance window has been moved from Saturday, 23rd of September to Monday, 25th of September 20:00.

Update 26.09.2017: The UPS maintenance was completed last night, but we are having problems bringing Hexagon online due to some filesystem storage issues. We are working on it and apologize for the inconvenience.

Update 26.09.2017: Hexagon has been up and available again since 09:08.

Update 26.09.2017 14:00: The Hexagon /work file system crashed unexpectedly. We are working on it. Sorry for the inconvenience.

Update 26.09.2017 14:40: Hexagon has to be taken down due to hardware issues related to the /work filesystem. We are trying to resolve the problems as soon as possible.

Update 26.09.2017 16:40: Hexagon is back online. The problem with the /work filesystem is resolved.

On Friday, 25th of August, maintenance on the electric lines in the server room will be carried out. Therefore Hexagon must be switched off. All related file systems (/work, /work-common) will also be offline.

The maintenance will start at 07:00 and, according to the plan, should last until 13:00.

During this time /work-common will not be available on Grunch.

Update:
  • 25.08.2017 07:00: Maintenance has started.
  • 25.08.2017 12:50: Storage controller issues are delaying startup of the machine. We are working on a fix.
  • 25.08.2017 15:05: The storage controller issues have been remediated. Some disks for the /work-common filesystem are rebuilding, so reduced performance can be expected for a couple of days.
  • 25.08.2017 15:20: Hexagon is up again.