FTP – the best way to remove duplicate files on a web host's FTP server

Tags: ftp, web-hosting

For some reason (it happened before I started working on this project), my client's website has two duplicates of every single file, effectively tripling the size of the site.

The files look much like this:

wp-comments-post.php    |    3,982 bytes
wp-comments-post (john smith's conflicted copy 2012-01-12).php    |    3,982 bytes
wp-comments-post (JohnSmith's conflicted copy 2012-01-14).php    |    3,982 bytes

The hosting the website is on offers no bash or SSH access.

In your opinion, what is the easiest and least time-consuming way to delete these duplicate files?

Best Answer

I wrote a duplicate finder script in PowerShell using the WinSCP .NET assembly.

An up-to-date and enhanced version of this script is now available as the WinSCP extension
Find duplicate files in SFTP/FTP server.

The script first iterates over a remote directory tree and looks for files with the same size. When it finds any, by default it downloads the files and compares them locally.

If you know that the server supports a protocol extension for calculating checksums, you can improve the script's efficiency by adding the -remoteChecksumAlg switch, which makes the script ask the server for the checksum and spares the file download.

powershell.exe -File find_duplicates.ps1 -sessionUrl ftp://user:password@example.com/ -remotePath /path
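
For example, with a server that supports a checksum extension (the algorithm name to pass depends on the server; sha-1 below is just an illustration):

powershell.exe -File find_duplicates.ps1 -sessionUrl ftp://user:password@example.com/ -remotePath /path -remoteChecksumAlg sha-1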

The script is:

param (
    # Use Generate URL function to obtain a value for -sessionUrl parameter.
    $sessionUrl = "sftp://user:mypassword;fingerprint=ssh-rsa-xxxxxxxxx...=@example.com/",
    [Parameter(Mandatory)]
    $remotePath,
    $remoteChecksumAlg = $Null
)

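# Returns a checksum of a remote file, caching results in $checksums.
# By default it downloads the file and hashes it locally with SHA-1;
# with -remoteChecksumAlg set, it asks the server for the checksum instead.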
function FileChecksum ($remotePath)
{
    if (!($checksums.ContainsKey($remotePath)))
    {
        if ($Null -eq $remoteChecksumAlg)
        {
            Write-Host "Downloading file $remotePath..."
            # Download file
            $localPath = [System.IO.Path]::GetTempFileName()
            $transferResult = $session.GetFiles($remotePath, $localPath)

            if ($transferResult.IsSuccess)
            {
                $stream = [System.IO.File]::OpenRead($localPath)
                $checksum = [BitConverter]::ToString($sha1.ComputeHash($stream))
                $stream.Dispose()

                Write-Host "Downloaded file $remotePath checksum is $checksum"

                Remove-Item $localPath
            }
            else
            {
                Write-Host ("Error downloading file ${remotePath}: " +
                    $transferResult.Failures[0])
                $checksum = $False
            }
        }
        else
        {
            Write-Host "Request checksum for file $remotePath..."
            $buf = $session.CalculateFileChecksum($remoteChecksumAlg, $remotePath)
            $checksum = [BitConverter]::ToString($buf)
            Write-Host "File $remotePath checksum is $checksum"
        }

        $checksums[$remotePath] = $checksum
    }

    return $checksums[$remotePath]
}

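# Recursively walks a remote directory tree, grouping file paths by size
# in $sizes and comparing checksums of same-sized files to spot duplicates.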
function FindDuplicatesInDirectory ($remotePath)
{
    Write-Host "Finding duplicates in directory $remotePath ..."

    try
    {
        $directoryInfo = $session.ListDirectory($remotePath)

        foreach ($fileInfo in $directoryInfo.Files)
        {
            $remoteFilePath = $remotePath + "/" + $fileInfo.Name

            if ($fileInfo.IsDirectory)
            {
                # Skip references to current and parent directories
                if (($fileInfo.Name -ne ".") -and
                    ($fileInfo.Name -ne ".."))
                {
                    # Recurse into subdirectories
                    FindDuplicatesInDirectory $remoteFilePath
                }
            }
            else
            {
                Write-Host ("Found file $($fileInfo.FullName) " +
                    "with size $($fileInfo.Length)")

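                # A file of the same size was seen before; compare
                # checksums to tell real duplicates from coincidences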
                if ($sizes.ContainsKey($fileInfo.Length))
                {
                    $checksum = FileChecksum $remoteFilePath

                    foreach ($otherFilePath in $sizes[$fileInfo.Length])
                    {
                        $otherChecksum = FileChecksum $otherFilePath

                        # Skip pairs where a checksum could not be obtained
                        if ($checksum -and ($checksum -eq $otherChecksum))
                        {
                            Write-Host ("Checksums of files $remoteFilePath and " +
                                "$otherFilePath are identical")
                            $duplicates[$remoteFilePath] = $otherFilePath
                        }
                    }
                }
                else
                {
                    $sizes[$fileInfo.Length] = @()
                }

                $sizes[$fileInfo.Length] += $remoteFilePath
            }
        }
    }
    catch [Exception]
    {
        Write-Host "Error processing directory ${remotePath}: $($_.Exception.Message)"
    }
}

try
{
    # Load WinSCP .NET assembly
    Add-Type -Path "WinSCPnet.dll"

    # Setup session options from URL
    $sessionOptions = New-Object WinSCP.SessionOptions
    $sessionOptions.ParseUrl($sessionUrl)

    $session = New-Object WinSCP.Session
    $session.SessionLogPath = "session.log"

    try
    {
        # Connect
        $session.Open($sessionOptions)

        $sizes = @{}
        $checksums = @{}
        $duplicates = @{}

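        # Hasher for checksumming downloaded files locally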
        $sha1 = [System.Security.Cryptography.SHA1]::Create()

        # Start recursion
        FindDuplicatesInDirectory $remotePath
    }
    finally
    {
        # Disconnect, clean up
        $session.Dispose()
    }

    # Print results
    Write-Host

    if ($duplicates.Count -gt 0)
    {
        Write-Host "Duplicates found:"

        foreach ($path1 in $duplicates.Keys)
        {
            Write-Host "$path1 <=> $($duplicates[$path1])"
        }
    }
    else
    {
        Write-Host "No duplicates found."
    }

    exit 0
}
catch [Exception]
{
    Write-Host "Error: $($_.Exception.Message)"
    exit 1
}
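
Note that the script only reports duplicates; it does not delete anything. If you want to act on the report, a minimal sketch of a removal loop follows, using Session.RemoveFiles from the WinSCP .NET assembly. It would have to run inside the inner try block above, while $session is still open. Review the report first and decide which file of each pair to keep; the script records pairs, not which one is the original:

# Sketch only: delete each reported duplicate, keeping the other file
# of the pair. Must run while $session is still open (i.e. inside the
# inner try block, after FindDuplicatesInDirectory returns).
foreach ($path1 in $duplicates.Keys)
{
    Write-Host "Removing duplicate $path1 (identical to $($duplicates[$path1]))"
    $removalResult = $session.RemoveFiles($path1)

    if (!$removalResult.IsSuccess)
    {
        Write-Host "Error removing ${path1}: $($removalResult.Failures[0])"
    }
}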

(I'm the author of WinSCP)
