How to reserve complete nodes on Sun Grid Engine

clustergridengine

How do you use SGE to reserve complete nodes on a cluster?

I don't want 2 processors from one machine, 3 processors from another, and so on. I have a quadcore cluster and I want to reserve 4 complete machines, each having 4 slots. I cannot just specify that I want 16 slots because it does not guarantee that I will have 4 slots on 4 machines each.

Changing the allocation rule to FILL_UP isn't enough because if there are no machines that are completely idle, SGE will simply "fill up" the least loaded machines as much as possible instead of waiting for 4 idle machines and then scheduling the task.

Is there any way I can do this? Is there a better place to ask this question?

Best Answer

SGE is weird with this, and I haven't found a good way to do this in the general case. One thing that you can do, if you know the memory size of the node you want, is to qsub while reserving an amount of memory almost equal to the full capacity of the node. This will ensure it grabs a system with nothing else running on it.

Related Solutions

Overlapping queues on Sun Grid Engine

Sure, this is totally possible. SGE queues are independent of one another, so you can assign whatever nodes you would like to each queue, letting them overlap however you wish.

To create a queue, type qconf -aq: this will open up your default editor (usually vim). Type the name of the queue as the qname, add the hosts you would like to assign in the hostlist, and for slots, add a comma-delimited list of entries of the format [hostname=numslots]. Typically the number of slots is the number of cores in the host, but you can under- or over-subscribe if you prefer. If you want the queues to overlap, just add the same hosts to multiple queues.

Note, however, that by default the overlapping queues are not aware of each others' usage. They will both cheerfully assign jobs to the same node and expect them to run.

The most common way to prevent this is to makes nodes job-exclusive, so only one job may run at a time. (This is the default in other schedulers like PBS.) SGE makes this a little complicated, and involves creating a virtual "resource" which can only be used once per node. To do this, type qconf -mc to manage consumable resources. This will open an editor listing consumable resources: add a new one called "exclusive", like so:

#name          shortcut     type         relop  requestable  consumable  default  urgency
#-----------------------------------------------------------------------------------------
exclusive      excl         BOOL         EXCL   YES          YES          1        1000

For more information, see the grid engine wiki.

You can also configure what are called subordinate queues. In this, you set one queue up so that it will automatically override the other when over a certain number of slots-per-node are assigned. To set this up, run qconf -mq queue1 and under "subordinate", specify queue2=N. Then whenever the number of slots used on a node in queue1 is over N, the job in queue2 will be suspended until the queue1 job is complete.

SGE Auto configured consumable resource

The solution I found is to make a new parallel environment that has the $pe_slots allocation rule (see man sge_pe). I set the number of slots available to that parallel environment to be equal to the max since $pe_slots limits the slot usage to per-node. Since starcluster sets up the slots at cluster bootup time, this seems to do the trick nicely. You also need to add the new parallel environment to the queue. So just to make this dead simple:

qconf -ap by_node

and here are the contents after I edited the file:

pe_name            by_node
slots              9999999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Also modify the queue (called all.q by starcluster) to add this new parallel environment to the list.

qconf -mq all.q

and change this line:

pe_list               make orte

to this:

pe_list               make orte by_node

I was concerned that jobs spawned from a given job would be limited to a single node, but this doesn't seem to be the case. I have a cluster with two nodes, and two slots each.

I made a test file that looks like this:

#!/bin/bash

qsub -b y -pe by_node 2 -cwd sleep 100

sleep 100

and executed it like this:

qsub -V -pe by_node 2 test.sh

After a little while, qstat shows both jobs running on different nodes:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
     25 0.55500 test       root         r     10/17/2012 21:42:57 all.q@master                       2      
     26 0.55500 sleep      root         r     10/17/2012 21:43:12 all.q@node001                      2

I also tested submitting 3 jobs at once requesting the same number of slots on a single node, and only two run at a time, one per node. So this seems to be properly set up!

Best Answer

Related Solutions

Overlapping queues on Sun Grid Engine

SGE Auto configured consumable resource

Related Topic