SSH – dd command piped through gzip and ssh gets faster and faster as dd progresses

Tags: dd, gzip, ssh

I am running the following command to copy an LVM logical volume from one host to another:

dd if=/dev/vg_1/lv1 conv=noerror,sync bs=4M | gzip | ssh user@ip 'gzip -d | dd of=/dev/vg_2/lv1 bs=4M'

When the transfer started about an hour ago, I was getting a speed of about 11 MB/s. Since then the transfer rate has grown to about 34.4 MB/s and is still increasing steadily.

I am very curious to know why.

My best guess is that the logical volume I am copying is very large but only a small part of it actually contains data, so large regions of it are filled with zeroes. Would this make the gzip compression much more efficient?
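One quick way to test that guess locally (just compressing a single all-zero block rather than the real data) would be something like:

# 4 MiB of zeroes piped through gzip; print the compressed size in bytes
dd if=/dev/zero bs=4M count=1 2>/dev/null | gzip -c | wc -c

If the zeroes compress down to a few kilobytes, the empty regions of the volume would transfer almost for free.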

Best Answer

Your command could be simplified by leaving out the two gzip commands. If compression is useful in your case, it is much simpler to compress the data in transit by passing the -C option to ssh; it is also less error-prone, as you cannot accidentally enable compression on one end and not the other.
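For example, assuming the same devices and host as in the question, the whole pipeline could be reduced to something like:

# ssh -C compresses and decompresses the stream itself
dd if=/dev/vg_1/lv1 conv=noerror,sync bs=4M | ssh -C user@ip 'dd of=/dev/vg_2/lv1 bs=4M'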

To answer your original question, and to say whether compression is improving throughput or not, you first need to find out where the bottleneck is.

There are five candidates for the bottleneck:

  1. I/O on the source
  2. CPU on the source
  3. Network throughput
  4. CPU on the target
  5. I/O on the target

Looking at top on each computer, you should be able to see whether any process related to the transfer is spending close to 100% of its CPU time. If so, it is a sure sign that CPU on that computer is the bottleneck.
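For instance, on a typical Linux system with procps-style top and pgrep, you could restrict top to the processes involved in the transfer (the pattern below is only an illustration):

# show only the dd, gzip and ssh processes; look for one pinned near 100% CPU
top -p "$(pgrep -d, -f 'dd|gzip|ssh')"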

If, on the other hand, you see the dd command at either end spending lots of time in the D state (uninterruptible sleep), it is an indication that I/O on that computer is the bottleneck.
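One way to check for that on Linux is to look at the process state column, for example:

# a state of 'D' means the process is blocked in uninterruptible (I/O) sleep
ps -C dd -o pid,state,wchan:20,cmd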

To find out whether the network is the bottleneck, look at the netstat output. If the network is the bottleneck, you should see a large send queue on the source and an empty receive queue on the destination.

If both the send queue and the receive queue are large, the bottleneck is on the destination. If both are empty, the bottleneck is on the source.
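For example, assuming ssh is running on the default port 22, the relevant numbers are the Recv-Q and Send-Q columns of:

# run on both hosts; compare Send-Q on the source with Recv-Q on the destination
netstat -tn | grep ':22 '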

If a copy without compression ends up bottlenecked on the network connection, then compression is likely to improve performance. If the bottleneck is somewhere else, compression is unlikely to help. If the CPU time spent encrypting and decrypting data was the bottleneck in the first place, compression may hurt performance, unless the data is very redundant and achieves a high compression ratio.

Throughput can change over time for a number of reasons, and this can cause the location of the bottleneck to shift while you are trying to locate it. Compression is likely to cause much larger variations in throughput, due to variations in the compression ratio, which is the most likely explanation for what you are seeing; a way to observe the transfer rate directly is sketched after the list below.

But throughput can vary for many other reasons, including:

  • Fragmentation on the underlying media
  • Bad sectors on the media slowing data transfers down
  • Physical properties of the media causing variation in throughput depending on location on the media
  • Load on the computer caused by other unrelated processes
  • Variations in available network capacity
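
If you want to watch the transfer rate and its variation over time directly rather than infer it, one option (assuming the pv utility is installed) is to insert pv into the pipeline, for example:

# pv reports the current throughput of the data passing through it
dd if=/dev/vg_1/lv1 conv=noerror,sync bs=4M | pv | ssh -C user@ip 'dd of=/dev/vg_2/lv1 bs=4M'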