Bacula backups very slow with TCP socket in ‘unkn-4’ state

baculadebian-wheezylinux-networkingslow-connection

Several bacula-fd clients are backing up to one bacula SD (using 2G disk files on massive RAID, not tape), usually 2-3 of them concurrently. It works fine, except when 2 bigger clients (about 400-900GB each) need to run full backups, which get terribly slow (usually about 200-2500 Kbyte/s), so full backup does not finish in several days, which is a problem for us (so we cancel it and go with incremental).

The server and clients are in different VLANs/subnets, so are routed over another debian wheezy machine with VLANs and few switches. NICs are 1Gbps on all machines (one having active-passive network bonding – failing it over does not help with speed), as are the switches. Machines are quad and 8-core, having 8-64GB RAM, not going into swap, and having load between 0.2-2, so I guess it is not CPU, I/O nor memory starvation. Bacula-sd is also on hardware RAID which does not seem to be under load. Network is also mostly idle at the time, so it should not be bandwidth congestion. Bacula version is standard wheezy 5.2.6+dfsg-9, and Linux kernel is also standard wheezy 3.2.60-1+deb7u3.

It seems the transfer starts just fine (at about 20+ Mbytes/sec, which is to be expected with lots of small files), than at one moment Send-Q goes up and does not come down for dozens of seconds (or minutes), and netstat shows socket in "unkn-4" timer, restarting with exponentially rising timeouts:

# netstat -tpno   | grep bacula
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0 967688 10.66.3.135:49668   10.66.2.11:9103         ESTABLISHED 2964/bacula-fd   unkn-4 (1.86/0/0)
tcp        0      0 10.66.3.135:9102    10.66.2.11:54499        ESTABLISHED 2964/bacula-fd   keepalive (5882.64/0/0)

then after some time, the packets seems to start again:

# netstat -tpno    | grep bacula
tcp        0  38054 10.66.3.135:49668   10.66.2.11:9103         ESTABLISHED 2964/bacula-fd   on (0.21/0/0)
tcp        0      0 10.66.3.135:9102    10.66.2.11:54499        ESTABLISHED 2964/bacula-fd   keepalive (385.49/0/0)

and backup continues (confirmed with status client=blowgun-fd on bconsole). For example:

* status client=axe-fd
newaxe-fd Version: 5.2.6 (21 February 2012)  x86_64-pc-linux-gnu debian 7.0
Daemon started 24-Oct-14 17:27. Jobs: run=0 running=0.
 Heap: heap=683,600 smbytes=761,617 max_bytes=807,280 bufs=396 max_bufs=426
 Sizeof: boffset_t=8 size_t=8 debug=200 trace=1 
Running Jobs:
JobId 12640 Job axe.2014-10-24_23.05.01_40 is running.
    Full Backup Job started: 24-Oct-14 23:05
    Files=2,529,050 Bytes=253,018,715,824 Bytes/sec=1,479,901 Errors=6
    Files Examined=2,529,050
    Processing file: /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
    SDReadSeqNo=5 fd=5
Director connected at: 26-Oct-14 21:34

bg.jpg is 1.2MB in size, and it was staying on it for minutes…

Hearbeat interval is set to 120 in director, SD and file deamon configurations, and seems to be working fine. Enabling debug trace file with setdebug level=200 trace=1 all, I see:

newaxe-fd: backup.c:371-0 FT_REG saving: /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
newaxe-fd: backup.c:469-0 bfiled: sending /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg to stored
newaxe-fd: crypto.c:607-0 crypto_digest_new jcr=2f01748
newaxe-fd: backup.c:1467-0 No strip for /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
newaxe-fd: backup.c:609-0 type=3 do_read=1
newaxe-fd: bfile.c:963-0 open file /home/users/novoselo/public_html/~2511/templates/js_corp/images/bg.jpg
newaxe-fd: backup.c:1194-0 Send data to SD len=65135
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: backup.c:1194-0 Send data to SD len=65562
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0
newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0

strace seems to confirm that:

# strace -tt -ff -s999 -p 3907
Process 3907 attached with 4 threads - interrupt to quit
[pid 27650] 22:25:15.705796 write(5, "[....]"..., 55110 <unfinished ...>
[pid 27661] 22:25:15.706103 select(6, [5], NULL, NULL, {2, 804806} <unfinished ...>
[pid  3912] 22:25:15.706147 restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid  3907] 22:25:15.706168 select(4, [3], NULL, NULL, NULL <unfinished ...>
[pid  3912] 22:25:16.619938 <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid  3912] 22:25:16.620008 futex(0x397d82d0240, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  3912] 22:25:16.620092 futex(0x397d82d0284, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 13229, {1414358746, 620076000}, ffffffff <unfinished ...>
[pid 27661] 22:25:18.513819 <... select resumed> ) = 0 (Timeout)
[pid 27661] 22:25:18.513858 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:18.513928 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout)
[pid 27661] 22:25:23.519025 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:23.519139 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout)
[pid 27661] 22:25:28.524240 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:28.524317 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout)
[pid 27661] 22:25:33.529409 write(15, "newaxe-fd: heartbeat.c:96-0 wait_intr=0 stop=0\n", 47) = 47
[pid 27661] 22:25:33.529508 select(6, [5], NULL, NULL, {5, 0}^C <unfinished ...>

fd 5 is network connection to bacula-sd, and the process is blocking when writing to that.
Researching unkn-4 seems to indicate it is actually "zero window probe timer is pending".

So, it looks to me like it is either bacula-sd doing the throttling for some reason (bug?) or (more likely IMHO) some sort of networking issue.

it doesn't seem like there are errors on active ethernet adapters. I've tried using ethtool to disable offloading and other features, it didn't help. ping -f does not lose packets even when the problem in TCP is manifesting itself:

axe# ping -s1400 -f -c1000 10.66.2.11
--- slingshot.tomsoft.lan ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 607ms
rtt min/avg/max/mdev = 0.391/0.582/0.672/0.039 ms, ipg/ewma 0.608/0.585 ms

I'm looking for ideas how to proceed to troubleshoot (and finally fix, of course) this?

UPDATE1: even after tuning up TCP buffers situation is not any better – just the queue gets bigger, but still blocks, and backup is still slow. After looking some more at wireshark dumps, it seems to be bacula-sd software problem, and that TCP ZeroWindow is normal kernel way of throttling TCP in that case. So machine seems to receive data OK, but bacula-sd does not seem to be able to process data fast enough, although the machine is not under load; this is on bacula-sd side:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp   3952353      0 10.66.2.11:9103         10.66.3.132:45226   ESTABLISHED 15839/bacula-sd  keepalive (4863.09/0/0)
# uptime
 05:23:13 up 2 days, 14:40,  2 users,  load average: 0.42, 0.32, 0.27

Best Answer

It was SQL.

By default, each time bacula-fd sends a new file, bacula-sd tries to (via bacula-dir) insert file attributes into SQL batch table. If you have a lots of small files, and your SQL is not blazingly fast, it will insert small delay. Many small delays = throttled speed = canceled backups due to Max Run Sched Time exceeded. And due to architecture even if you had multiple backups running, all would slow down.

Solution (kind of) was to add:

Spool Data = no
Spool Attributes = yes

in JobDefs {...} section of bacula-dir.conf (note I'm using Spool Data = no because it is disk storage, not tape storage, so it would just add overhead). With Spool Attributes = yes bacula writes file attributes to a file, and only when job is finished is the file sent to SQL server. Note that in that connection to bacula-fd is freed (and disk/network load on client ends) as soon as data transfer is done (so it does not wait for SQL insertion to finish).

Some notes:

  • "your SQL is not blazingly fast" means it only does meager few dozens queries per second.
  • in my case, the problematic server had about 3 million files and 350 GB disk space. It couldn't finish full backup in 4 days, with transfer speeds dropping down to 200 Kbyte/sec.
  • with the attribute spooling, it manages to finish in 2 days and 13 hours. Not such a great improvement at fist sight (although original backups never finished, so hey, maybe it is), but: the data transfer itself takes only 4.5 hours, with average speed of ~18 Mbytes/sec (which is not that bad for a lots of small files, and compression, while server does other things). However, attribute despooling from local file to SQL server takes next 56.5 hours! More than two days!!
  • the SQL server in question is dedicated MySQL (5.5.40-0+wheezy1) on 8-core I7 @ 3.70GHz with 64GB RAM, on 4 disk RAID-10. It does other things too and has binlogs enabled, but is mostly idle while backups run. It work quite fine for other load, and innodb_buffer_pool_size is bigger than all the indexes of bacula. It should be working fine.
  • it seems that even with attribute spooling enabled, bacula does not use batch inserts (nor LOAD DATA LOCAL INFILE), but sends that 3 million inserts one by one, waiting for confirmation on each (attribute data is stored in compact binary file about 1GB big. Translated to ASCII SQL INSERT statements it will surely be double that). So the increased latency due to MySQL being on other machine seems to totally kill the performance.
  • I'll try to look more into why is SQL slow, but I guess best solution would be to put MySQL on (much lower performance) local machine. Probably replacing it by preferred PostgreSQL while I'm at it.

Edit1: upgrading bacula to (manually created package) 5.2.13 which does contain support for batch inserts (instead of packaged 5.2.6 in Debian wheezy/jessie which doesn't), and tuning MySQL database so temporary tables are created in tmpfs, reduced the mentioned attribute despooling time from 56.5 hours down to 30 minutes. I guess replacing it with PostgreSQL on local machine (same where bacula-sd & bacula-dir are) might probably improve even on that, but this is good enough for us.

Related Topic