HPC Cluster (SLURM): recommended ways to set up a secure and stable system

best-practices, cluster, hpc

I'm working with a SLURM-driven HPC cluster consisting of 1 control node and 34 compute nodes. Since the current system is not exactly stable, I'm looking for guidelines or best practices on how to build such a cluster so that it becomes more stable and secure. To be clear, I'm not looking for detailed answers about resource management or additional tools, but for advice about the very basic setup (see "Question" below).

My current Setup

  • 1 Control Node

    This machine has slurm installed in /usr/local/slurm and runs the slurmctld daemon. The complete slurm directory (including all the executables and slurm.conf) is exported via NFS.

  • 34 Computation Nodes

    These machines mount the exported slurm directory from the control node to /usr/local/slurm and run the slurmd daemon.

I don't use a backup control node.

If our control node goes down, it seems to be a matter of luck whether a currently running job survives or not, so I'm looking for a way to create a more stable setup.

Possible issues with the current setup

1) The shared slurm directory. I couldn't find anything on the net about whether this is actually good or bad practice, but since the slurm config file has to be the same on all machines, I thought I might as well share the complete slurm installation. But of course, if the control node gets lost, all of those files become unavailable on the compute nodes too.

2) The missing backup control node. This requires a shared NFS directory where the current state can be saved. The question is where this directory should be located. Of course it doesn't make sense to put it on the control node, but should it be on the backup control node, or on an entirely different machine?

Question

So, are there guidelines to follow when building an HPC cluster? Specifically: what different kinds of nodes are involved, what is their job, what kind of data should be shared via NFS, and where should those shared directories live? I would also be thankful for any literature or tutorials that point me in the right direction.

Best Answer

It has been a while since I touched SLURM so take the following with a grain of salt. Also, the design of your cluster will be determined by your workload. Generally, you start with a head node and a number of compute nodes and you build up from there. A package like Rocks can be a good place to start.

I can see the shared directory being a problem. Depending on your workload you may already have a lot of traffic going over NFS, so I would install SLURM locally on each node. You can keep a copy of your slurm config on an NFS-exported volume and copy it into place with a 'fornodes' script or a scripted scp (a sketch follows below). If you change your slurm config often, you could even add the slurmd restart to the script.
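A minimal Python sketch of that idea, assuming password-less SSH from the head node; the node names, the config path, and the restart command are illustrative and need to match your environment:

```python
#!/usr/bin/env python3
"""Push slurm.conf to every compute node and restart slurmd.

A sketch of the "scripted scp" approach: the node names, the config path
and the restart command are assumptions -- adapt them to how your nodes
are named and to the init system they actually run.
"""
import subprocess

CONF = "/usr/local/slurm/etc/slurm.conf"        # master copy kept on the head node
NODES = [f"node{i:02d}" for i in range(1, 35)]  # node01 .. node34

for node in NODES:
    # Copy the config into place, then restart slurmd so the change takes effect.
    subprocess.run(["scp", CONF, f"{node}:{CONF}"], check=True)
    subprocess.run(["ssh", node, "systemctl restart slurmd"], check=True)
```

SLURM also provides `scontrol reconfigure` to make the running daemons re-read slurm.conf, which avoids restarting every slurmd by hand.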

In regards to the backup control node, I wouldn't worry about it. Your head node is probably a single point of failure anyway, so if you lose it you are already going to have problems with your jobs. I'm also not sure how the backup mechanism interacts with SLURM accounting, if that is enabled, since accounting normally uses a database like MySQL.
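If you do decide to configure one anyway, only a few slurm.conf parameters are involved. Here is a sketch, with hypothetical hostnames and a hypothetical shared path (ControlMachine, BackupController and StateSaveLocation are the actual slurm.conf keys):

```python
#!/usr/bin/env python3
"""Render the slurm.conf lines involved in running a backup controller.

ControlMachine, BackupController and StateSaveLocation are real slurm.conf
parameters; the hostnames and the shared path below are made-up examples.
StateSaveLocation must live on storage that both controllers can reach
(e.g. an NFS directory served by a third machine), or failover won't help.
"""

HA_SETTINGS = {
    "ControlMachine": "ctl01",                   # primary slurmctld host (hypothetical)
    "BackupController": "ctl02",                 # standby slurmctld host (hypothetical)
    "StateSaveLocation": "/shared/slurm/state",  # shared state directory (hypothetical)
}

def render(settings):
    """Return the slurm.conf fragment for the given key/value pairs."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())

if __name__ == "__main__":
    print(render(HA_SETTINGS))
```

The shared StateSaveLocation is the part that bears on the question above: it has to be reachable from both controllers, so the usual recommendation is to put it on a third machine or a dedicated storage node rather than on either controller.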

For exports, I normally export the /home directory and /opt on smaller clusters. Depending on your data needs, you could consider a separate storage node with additional storage, which would distribute your NFS load. Since you mention that you are having stability issues, you may want to use a package like Ganglia, which will monitor node load, memory utilization, network throughput, and other values and present them in a series of graphs. You can also learn quite a bit with command-line tools like top running on your compute nodes (see the sketch after this paragraph). You will also want to test how your jobs scale: if they run horribly when they span nodes (MPI?), you may need a faster, lower-latency interconnect like InfiniBand or 10Gb Ethernet.
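Until proper monitoring is in place, even a quick Python sketch like the following can show you which nodes are overloaded or unreachable (node names are assumptions; Ganglia does this properly and continuously):

```python
#!/usr/bin/env python3
"""Quick-and-dirty load check across compute nodes.

Grabs the 1-minute load average from every node over ssh and prints the
busiest nodes first. Node names are assumptions; a monitoring system like
Ganglia is the proper tool for this.
"""
import subprocess

NODES = [f"node{i:02d}" for i in range(1, 35)]  # node01 .. node34

loads = {}
for node in NODES:
    try:
        out = subprocess.run(
            ["ssh", node, "cat /proc/loadavg"],
            capture_output=True, text=True, timeout=10, check=True,
        )
        loads[node] = float(out.stdout.split()[0])  # 1-minute load average
    except Exception:
        loads[node] = None  # unreachable nodes are worth a look as well

# Print busiest nodes first; unreachable ones sort to the bottom.
for node, load in sorted(loads.items(),
                         key=lambda kv: kv[1] if kv[1] is not None else -1.0,
                         reverse=True):
    print(f"{node}: {load if load is not None else 'unreachable'}")
```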

Good luck with SLURM. I liked using it before I changed jobs, but since it isn't as popular as Torque/Maui or Sun/Oracle Grid Engine, answers to my odd questions were always hard to find.
