Linux – copy large number of files over ssh

copy, file-transfer, linux

I mount a remote server over ssh (using sshfs). I want to copy a large number of files from the remote server to the local machine:

cp -rnv /mounted_path/source/* /local_path/destination

The command copies recursively and does not overwrite existing files, but it is rather slow. I also notice that it does not copy files in order. So my question is: can I speed up the copying by opening multiple terminals and running the same command in each? Is the copying process smart enough not to overwrite files already copied by the other processes?

Best Answer

…to answer the original question as stated…

There are two things to discuss here.

Using SSHFS

SSHFS uses the SFTP "subsystem" of the SSH protocol to make a remote filesystem appear as if it were mounted locally.
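
For reference, the kind of mount used in the question is set up along these lines; the host name and paths are placeholders, and sshfs with FUSE is assumed to be installed:

# mount the remote directory locally; all access goes through one SFTP session
sshfs user@remote-server:/source /mounted_path/source

# unmount when done
fusermount -u /mounted_path/source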

The crucial thing to note here is that SSHFS translates the client's low-level syscalls into relatively high-level SFTP commands; the SFTP server translates those back into syscalls executed on the server, and their results are then sent back to the client and translated once more.

There are several sources of slowness with this process:

  • There are distinct syscalls for distinct operations on files, and they are executed in the order the client issues them. Say the client stat(2)-s a file, then open(2)-s it, then reads its data by executing several read(2) calls in a row, and finally close(2)-s the file: every one of those syscalls has to be translated to an SFTP command, sent to the server, processed there, and its result sent back to the client and translated back.
  • Even though SSHFS implements certain clever hacks such as "read ahead" (speculatively reading more data than the client requested), each syscall still results in a round trip to the server and back. That is, we send a request to the server, wait for it to respond, and then process its response. IIUC, SFTP does not implement "pipelining", a mode of operation where new commands are sent before the earlier ones have completed, so basically each syscall pays the full round-trip latency. While such processing is technically possible to a certain degree, sshfs does not appear to implement it.

    IOW, each syscall cp makes on your client machine is translated into a request to the server, followed by waiting for the server to respond and then processing its response.
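
A rough way to see this on your own system is to trace a single-file copy over the mount (somefile below is just a placeholder); the syscalls that touch the SSHFS mount are the ones that turn into SFTP round trips:

# summarize the syscalls one cp issues; the stat/open/read calls on the
# source side turn into SFTP request-response exchanges
strace -c cp /mounted_path/source/somefile /local_path/destination/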

Multiple cp -n processes run in parallel

The answer to the question of whether it's OK to employ multiple cp -n processes copying files in parallel depends on several considerations.

First, if they all run over the same SSHFS mount, there will obviously be no speedup: all the syscalls issued by the multiple cp processes will eventually hit the same SFTP client connection and be serialized by it, for the reasons explained above.

Second, running several instances of cp -n over distinct SSHFS mount points may be worthwhile, up to the limits imposed by the network throughput and by the I/O throughput of the medium/media under the target filesystem. In this case it's crucial to understand that, since SSHFS does not use any locking on the server, the different instances of cp -n must operate on distinct directory hierarchies, simply so they don't step on each other's toes.
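
A minimal sketch of that, assuming the source tree has two independent top-level directories dirA and dirB (names and mount points here are placeholders):

# two separate SSHFS mounts give two independent SFTP connections
sshfs user@remote-server:/source /mnt/src1
sshfs user@remote-server:/source /mnt/src2

# each cp -n works on its own subtree, so they never touch the same files
cp -rnv /mnt/src1/dirA /local_path/destination/ &
cp -rnv /mnt/src2/dirB /local_path/destination/ &
wait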

Different / more sensible approaches

First, piping a data stream created by tar, cpio or another streaming archiver and unpacking it on the other side has the advantage that all the round trips for individual filesystem operations are avoided: the archiver on the sending side creates the stream as fast as the I/O throughput of the source filesystem allows and sends it as fast as the network allows; the archiver on the receiving side extracts data from the stream and updates its filesystem as fast as it allows. No round trips to execute elementary "commands" are involved: you simply go as fast as the slowest I/O point in the pipeline allows; it's impossible to go faster.
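
For the direction in the question (remote to local) such a pipeline could look roughly like this; the host name is a placeholder, and --skip-old-files, which makes GNU tar leave already-existing files alone much like cp -n does, needs a reasonably recent GNU tar:

# the remote tar streams the tree over ssh; the local tar unpacks it,
# skipping any file that already exists at the destination
ssh user@remote-server 'tar -C /source -cf - .' \
  | tar -C /local_path/destination --skip-old-files -xvf -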

Second, another answer suggested using rsync and you rejected that suggestion on the grounds of

rsync is slow as it has to checksum the files.

This is simply wrong. To cite the rsync manual page:

-c, --checksum

This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a "quick check" that (by default) checks if each file's size and time of last modification match between the sender and receiver. This option changes this to compare a 128-bit checksum for each file that has a matching size.

and

-I, --ignore-times

Normally rsync will skip any files that are already the same size and have the same modification timestamp. This option turns off this "quick check" behavior, causing all files to be updated.

--size-only

This modifies rsync's "quick check" algorithm for finding files that need to be transferred, changing it from the default of transferring files with either a changed size or a changed last-modified time to just looking for files that have changed in size. This is useful when starting to use rsync after using another mirroring system which may not preserve timestamps exactly.

and finally

--existing skip creating new files on receiver

--ignore-existing skip updating files that exist on receiver

That is,

  • By default rsync does not hash the file's contents to see whether a file has changed.
  • You can tell it to behave exactly like cp -n, that is, skip updating a file if it already exists on the receiving side, regardless of its contents (--ignore-existing).
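
Applied to the paths from the question, that would be something along these lines (the host name is a placeholder; the trailing slash on the source makes rsync copy the directory's contents rather than the directory itself):

# copy the tree over ssh, skipping anything that already exists locally
rsync -av --ignore-existing user@remote-server:/source/ /local_path/destination/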