How to upgrade Slurm

cluster, slurm

I've been asked to upgrade our Slurm Workload Manager installation. We run Slurm 2.3.4 on a Debian 7.0 (wheezy) cluster (1 master + 8 nodes). I didn't install it myself, so I'm unsure how to proceed without breaking anything. (I can't really back up the data: there are too many terabytes to copy anywhere else.)

I was thinking of upgrading at least to Jessie (Debian 8), but what about Slurm? I've carefully read the upgrade section of the admin guide (https://slurm.schedmd.com/quickstart_admin.html), which says the upgrade must be done incrementally rather than by jumping straight from 2.3.4 to 17, for example.

It's still not clear to me exactly how to do this. How would you proceed if asked to upgrade a cluster you know nothing about? What would you check? Which OS and Slurm versions would you choose? What would you back up? And how would you proceed?

Any info is gold! Thank you

Best Answer

I have done similar upgrades with Torque/Moab, though not with Slurm, but I can offer some advice. Ideally, get a test system or a VM and verify that things still work after the upgrade. Otherwise, this is the tricky part that the doc mentions:

Slurm permits upgrades between any two versions whose major release numbers differ by two or less (e.g. 15.08.x or 16.05.x to 17.02.x) without loss of jobs or other state information. State information from older versions will not be recognized and will be discarded, resulting in loss of all running and pending jobs.

This means that if you jump across more than two releases, any running and pending jobs will be gone after the upgrade. Users will need to submit their jobs again, which means they will lose priority and other job-related metadata and state information.
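To make the two-release rule concrete, here is a minimal sketch that plans the intermediate stops between two versions. The `plan_upgrade` helper is hypothetical, and the `RELEASES` list is my recollection of the Slurm release series, so check it against the official release archive before relying on it. It is also an assumption that the rule applies to the old 2.x series in the same way.

```python
def plan_upgrade(releases, current, target, max_hop=2):
    """Return the releases to install, in order, to get from current to target.

    `releases` must be sorted oldest to newest. Each step moves forward by at
    most `max_hop` releases, which is the limit quoted from the Slurm docs.
    """
    start = releases.index(current)
    end = releases.index(target)
    if end < start:
        raise ValueError("downgrades are not supported")
    path = []
    pos = start
    while pos < end:
        pos = min(pos + max_hop, end)  # take the biggest allowed jump
        path.append(releases[pos])
    return path


# ASSUMPTION: release series listed from memory; verify before use.
RELEASES = ["2.3", "2.4", "2.5", "2.6", "14.03", "14.11", "15.08", "16.05", "17.02"]

if __name__ == "__main__":
    steps = plan_upgrade(RELEASES, "2.3", "17.02")
    print(" -> ".join(["2.3"] + steps))
```

Each stop is a full upgrade in its own right, so the number of stops is a rough measure of how long the "keep the queue" route would take compared with a clean reinstall.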

With Torque/Moab there was a job folder that could usually be copied and migrated to the new version. Is there anything similar in Slurm?
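As far as I can tell, the closest Slurm counterpart is the directory named by `StateSaveLocation` in `slurm.conf`, where the controller keeps its job and queue state. Below is a rough sketch for finding that setting so the directory can be copied before touching anything. The `read_slurm_conf` helper is hypothetical, and the sample paths are made up for illustration, not taken from the cluster in the question.

```python
def read_slurm_conf(text):
    """Parse 'Key=Value' lines from slurm.conf text into a dict.

    Keys are lowercased because slurm.conf keywords are case insensitive.
    Lines holding several pairs (e.g. NodeName=... CPUs=...) are stored
    under their first key only, which is enough for single-value settings.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


# Illustrative fragment only; the paths are hypothetical.
SAMPLE = """
# slurm.conf (excerpt)
ControlMachine=master
StateSaveLocation=/var/lib/slurm-llnl/slurmctld   # hypothetical path
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
"""

if __name__ == "__main__":
    conf = read_slurm_conf(SAMPLE)
    print("Back up this directory first:", conf["statesavelocation"])
```

On a real system you would pass in the contents of the actual `slurm.conf` instead of `SAMPLE`. Keep in mind that a saved state directory only helps within the two-release window described in the docs.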

Basically, if you cannot get a test machine, you will need to schedule a downtime and inform the users that all jobs currently in the queue will be lost, so they will have to resubmit everything. If that is not an option, you need to find a way to migrate the jobs to the upgraded system.
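For the downtime announcement it helps to know who actually has jobs in the queue. The sketch below counts jobs per user and state from text in the two-column form that `squeue -h -o "%u %T"` should produce (user name, then job state). The `jobs_per_user` helper is hypothetical and the sample users are invented.

```python
from collections import Counter


def jobs_per_user(squeue_text):
    """Count jobs per user and state from 'user STATE' lines."""
    counts = {}
    for line in squeue_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        user, state = fields
        counts.setdefault(user, Counter())[state] += 1
    return counts


# Made-up sample standing in for real squeue output.
SAMPLE = "alice RUNNING\nalice PENDING\nbob PENDING\nbob PENDING\n"

if __name__ == "__main__":
    for user, states in sorted(jobs_per_user(SAMPLE).items()):
        summary = ", ".join(f"{n} {s}" for s, n in sorted(states.items()))
        print(f"{user}: {summary}")
```

Users with only pending jobs lose little beyond their queue position, while users with long-running jobs may want the downtime scheduled after those jobs finish.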

