Multi-threaded network file transfer

file-transfer

Many of the links that connect NY and AMS are often saturated. That means a single-stream transfer over them (e.g., moving 300 GB at 1 MB/s) takes far longer than what our connections should be able to offer.

I ran into this problem about three years ago, when I was quite new to coding and Linux, and I came up with the solution posted at the bottom of this post. However, it's dirty and I don't like it. The script doesn't work as-is, since it was written for a very specific environment, but it gives you the idea.

My question is: do you know of any better alternatives for transferring files across the ocean quickly?

#!/bin/bash
# Split a file into chunks, upload them over parallel scp sessions,
# then reassemble them on the remote host.
# (Arrays require bash, so the shebang must not be /bin/sh.)

upto="$1"        # remote host prefix (target is ${upto}.domain.com)
filepath="$2"    # local file to transfer
remotepath="$3"  # destination path on the remote host

if [ ! -f "$filepath" ]; then
    echo "No such file: $filepath" >&2
    exit 1
fi

password=$(/all/script/password 10)
filesize=$(stat -c %s "$filepath")

# Pick a number of chunks based on the file size.
if [ "$filesize" -gt 5368709120 ]; then      # > 5 GiB
    parts=80
elif [ "$filesize" -gt 2147483648 ]; then    # > 2 GiB
    parts=50
elif [ "$filesize" -gt 1310720 ]; then       # > 1.25 MiB
    parts=20
else
    parts=2
fi

# Round up so split produces at most $parts chunks.
splitsize=$(( filesize / parts + 1 ))

split -b "$splitsize" -a 2 "$filepath" "/all/tmp/cup/${password}_"

# UPLOAD: one scp per chunk, all in the background.
pwait=()
for tmpfile in /all/tmp/cup/${password}_*; do
    scp "$tmpfile" "root@${upto}.domain.com:/all/tmp/cup/" &
    pwait+=("$!")
done

# WAIT for all uploads to finish.
for prid in "${pwait[@]}"; do
    wait "$prid"
done

# MERGE the chunks on the remote host, then clean up there.
ssh "root@${upto}.domain.com" \
    "cat /all/tmp/cup/${password}_* > '${remotepath}' && rm -f /all/tmp/cup/${password}_*"

# CLEAN UP locally.
rm -f /all/tmp/cup/${password}_*

exit 0

Best Answer

Assuming that your network links are not saturated (contrary to what you state in the question), you should tune your link to deal with the (comparatively) high bandwidth-delay product, as Andrew mentioned. (The articles referenced at that link include some information on what to tweak, when, and why.)
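As a rough illustration (the 1 Gbit/s link speed and 90 ms NY-AMS round-trip time below are assumptions, not figures from the question), the bandwidth-delay product tells you roughly how large the TCP window must be to keep a transatlantic link full:

```shell
#!/bin/bash
# Bandwidth-delay product: the amount of data "in flight" on the link.
# Assumed figures: 1 Gbit/s link, 90 ms round-trip time NY <-> AMS.
bandwidth_bits=1000000000   # 1 Gbit/s
rtt_ms=90                   # round-trip time in milliseconds

# BDP in bytes = bandwidth (bits/s) * RTT (s) / 8
bdp_bytes=$(( bandwidth_bits * rtt_ms / 1000 / 8 ))
echo "$bdp_bytes"
```

That works out to roughly 11 MB of data in flight, so if the TCP window (on Linux, the maxima in net.ipv4.tcp_rmem / tcp_wmem) is much smaller than that, a single stream can never fill the pipe no matter how much bandwidth is free.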


If in fact your network links ARE saturated (moving the maximum amount of data they can), the only solution is to add more bandwidth: more fiber trunks between the two sites, paying another carrier for transit to offload some of the peak-period traffic, or, if you're using "dedicated" links, paying for a higher CIR or adding more circuits to the loop.


How can you tell the difference?
Well, if starting more streams gets you more speed you haven't saturated your link. You're probably getting hit by the relatively long round-trip time from the US to Europe (as compared to the round-trip time on a local network).
(There's a point of diminishing returns here as the overhead for more TCP connections will eventually cause other bottlenecks to show up.)

If adding more streams provides no net increase in speed (two streams each run at half the speed of one), your link is saturated, and you need to add bandwidth to improve performance.
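A sketch of that decision, assuming hypothetical throughput figures (e.g., from timing a one-stream and a four-stream scp run yourself; the numbers below are made up):

```shell
#!/bin/bash
# Hypothetical measurements in MB/s -- substitute your own timings.
one_stream=12
four_streams_total=40

# If N streams move meaningfully more aggregate data than one, the link
# was not saturated; if the aggregate stays flat, you are at capacity.
if [ "$four_streams_total" -gt $(( one_stream + one_stream / 10 )) ]; then
    verdict="not saturated: latency-bound, more streams help"
else
    verdict="saturated: add bandwidth"
fi
echo "$verdict"
```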


Other stuff to consider

You should seek to minimize the data being pushed over the pipe, using rsync or similar protocols if appropriate (rsync works best with small-ish change sets to large-ish collections of data).

Never underestimate the bandwidth of a FedEx overnight package with a couple of hard disks in it. Especially for initial syncs.
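To put numbers on that (the disk sizes and delivery time below are assumptions for illustration): two 4 TB disks delivered in 24 hours work out to an effective throughput that most WAN links can't match.

```shell
#!/bin/bash
# Effective bandwidth of shipping disks: assumed 2 x 4 TB, 24 h delivery.
bytes=$(( 2 * 4 * 1000000000000 ))   # 8 TB in bytes
seconds=$(( 24 * 3600 ))             # overnight: 24 hours

mbytes_per_s=$(( bytes / seconds / 1000000 ))
echo "$mbytes_per_s"
```

That comes to roughly 92 MB/s sustained (about 740 Mbit/s), with zero impact on your production links.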