RabbitMQ: with hundreds of Celery workers, beam.smp consumes > 200% CPU

Tags: celery, cpu-usage, rabbitmq

I have one machine (test-server) running a RabbitMQ server and 4 Celery workers, and another machine (test-worker) running 240 Celery workers, which connect to the RabbitMQ server on test-server.

All queues are currently empty.

With this setup, beam.smp (which I gather is the Erlang VM process that runs RabbitMQ) sits at 200-250% CPU and consumes a few hundred MB of RAM (the memory usage might be OK, I'm not sure).

If I stop the workers on the remote machine, CPU usage returns to normal. If I start only 40 workers instead of 240, it's more or less OK: beam.smp still consumes CPU, but around 50%.
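The load does look roughly linear in the worker count. A back-of-the-envelope guess (the per-worker connection count and the heartbeat interval below are assumptions, not measured values) is that each worker holds a few AMQP connections, and each connection exchanges periodic heartbeat frames that the broker must service even when the queues are empty:

```shell
# Rough estimate of the broker's idle heartbeat load.
# Assumptions (NOT measured): each worker holds 3 AMQP connections
# (e.g. consumer, producer, events), and RabbitMQ's default 60 s
# heartbeat interval applies.
workers=240
conns_per_worker=3
heartbeat=60
echo $(( workers * conns_per_worker / heartbeat ))  # heartbeat frames/s -> 12
```

That number alone wouldn't explain the CPU usage, but it scales with the worker count the same way the CPU usage seems to.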

The main beam.smp thread is blocked in select, which I think is fine, since it's just waiting on the child threads. Below is an strace of the child threads: it shows many epoll_wait calls with a zero timeout, and also many futex calls.

I also found this bug report against oslo.messaging (an OpenStack library, as far as I can tell): https://bugs.launchpad.net/oslo.messaging/+bug/1518430. It also mentions zero-timeout epoll_wait calls in connection with RabbitMQ.

Is this the expected behavior from RabbitMQ under these conditions? Where should I look to find the cause?

Thanks

test-server$ sudo strace -p 26866 2>&1 | head -n 50

Process 26866 attached
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
epoll_wait(3, {}, 256, 0)               = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 785829269}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
writev(473, [{NULL, 0}, {"\1\0\3\0\0\0-\0<\0<\5None3\0\0\0\0\0\0\5\326\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\3\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 786592082}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
epoll_wait(3, {}, 256, 0)               = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 787427449}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
epoll_wait(3, {}, 256, 0)               = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 788308663}) = 0
writev(201, [{NULL, 0}, {"\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\35\245\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 789017598}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 789278489}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
writev(392, [{NULL, 0}, {"\1\0\3\0\0\0-\0<\0<\5None3\0\0\0\0\0\0\16\270\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\3\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 792374556}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 792553480}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 792796024}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793154206}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793493003}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793842449}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 794054061}) = 0
writev(318, [{NULL, 0}, {"\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\25\370\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\25\371\0\10cele"..., 73}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 10) = 890
clock_gettime(CLOCK_MONOTONIC, {87999, 794411001}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 795090977}) = 0
epoll_wait(3, {}, 256, 0)               = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 796129182}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1

Another excerpt:

Process 26867 attached
clock_gettime(CLOCK_MONOTONIC, {88350, 863599878}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x82e500, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {88350, 865231792}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 865436250}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 865776903}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 872757864}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 872984686}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873209787}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873382297}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873578979}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
epoll_wait(3, {}, 256, 0)               = 0
clock_gettime(CLOCK_MONOTONIC, {88350, 875428570}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 875624976}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 875847357}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 876478262}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x82e500, FUTEX_WAIT_PRIVATE, 2, NULL) = 0

Best Answer

I didn't manage to actually solve this, but I worked around it by reducing the number of workers and increasing the concurrency of each one. There seems to be a per-worker overhead on the RabbitMQ side...

So instead of

celery multi start -A proj 240 -c2

I now do

celery multi start -A proj 20 -c24
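The total number of worker processes is the same in both setups; only the number of top-level workers (and hence, presumably, broker connections) changes:

```shell
# Sanity check: both setups provide the same total concurrency.
echo $(( 240 * 2 ))   # old setup: 240 workers x concurrency 2  -> 480
echo $(( 20 * 24 ))   # new setup: 20 workers x concurrency 24 -> 480
```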

FWIW