Optimizing Ansible playbooks to run against many hosts

ansible

I'm running Ansible 2.0 on SLES 11 SP4 against about 430 machines, and it is very slow. I can't tell why, but it goes much faster if I limit the number of machines in the inventory. It took about 7 hours to run a 3-task playbook (including gathering facts), and the 3rd task was a local action. Gathering the facts for 2 machines while running against the full inventory of 430 takes about as long as fully processing 6 machines.

And it uses 99.9% of the CPU right off the bat:

root     11646 99.8  0.4 220188 61016 pts/1    Rl+  07:24   6:41                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...
root     11651  0.1  0.4 187396 58828 pts/1    Sl+  07:24   0:00                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...
root     11652  0.1  0.4 187812 59216 pts/1    Sl+  07:24   0:00                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...
root     11653  0.1  0.4 188052 59428 pts/1    Sl+  07:24   0:00                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...
root     11654  0.1  0.4 186148 57496 pts/1    Sl+  07:24   0:00                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...
root     11655  0.1  0.4 186552 57924 pts/1    Sl+  07:24   0:00                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...
root     11656  0.4  0.2 154948 25828 pts/1    Sl+  07:24   0:01                          \_ /usr/bin/python /usr/bin/ansible-playbook /etc/ansible/playbooks/checkhostnames.yml ...

This is worrying, since I was hoping Ansible would speed up our serialized ssh processes. Instead, it looks like it's just going to eat all the resources.

When I strace the main PID, it appears to just be running stat on the inventory file over and over again.

I'm keeping all my host vars in one inventory file that I generate from a database. I tried using a dynamic inventory, but that took too long to even initialize (I'm guessing it's running the SQL query over and over again).

So, is there a trick to running it against lots of machines?

I have already tried all the tricks in https://www.ansible.com/blog/ansible-performance-tuning

I've also tried breaking it up by putting the host_vars for each host in their own file, since I figured strace was telling me that it was constantly parsing my 500k inventory file. But that didn't help much.


I switched my playbook to just echo hello, with no fact gathering.

When I run it with an inventory file containing only 3 hosts, I get:

real    0m1.996s
user    0m0.400s
sys     0m0.112s

When I run it with an inventory file containing all 430 hosts and limit it to just the first 3, I get (note: these are different hosts, but the same make of machine):

real    0m11.989s
user    0m13.693s
sys     0m0.552s

And when I run it with an inventory file containing all 430 hosts and no limit (pressing Ctrl-C after the 3rd one), I get:

real    2m50.961s
user    2m56.495s
sys     0m0.764s

So, it makes me think that not a lot is really going on behind the scenes and something is intensely blocking.

Best Answer

First of all, you need to consider caching the facts.

See here for how to set it up:

http://docs.ansible.com/ansible/playbooks_variables.html#fact-caching

You should see a dramatic improvement in fact-gathering performance, even when caching to a file.
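
As a rough sketch, file-based fact caching in ansible.cfg can look like the following. The cache directory and timeout are assumed values, not taken from the question, so adjust them to your environment:

```ini
[defaults]
# Only gather facts for hosts that have no valid cached facts.
gathering = smart
# Store each host's facts as a JSON file on the control machine.
fact_caching = jsonfile
# Assumed path: any directory writable by the user running ansible-playbook.
fact_caching_connection = /var/tmp/ansible_fact_cache
# Assumed lifetime: cached facts expire after 24 hours (value is in seconds).
fact_caching_timeout = 86400
```

With this in place, the first run still gathers facts from every host, but later runs within the timeout should reuse the cached files instead of contacting each host again.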

Then consider increasing the level of parallelism with -f:

man ansible-playbook

   -f NUM, --forks=NUM
       Level of parallelism.  NUM is specified as an integer, the default is 5.

Set it to something bigger than 5.
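
The same setting can also be made persistent in ansible.cfg. The value 50 below is only an illustrative assumption; a sensible number depends on the CPU and memory available on the control machine:

```ini
[defaults]
# Number of parallel worker processes; equivalent to passing -f 50 on the command line.
forks = 50
```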