Mirror HTTP website, excluding certain files

wget

I'd like to mirror a simple password-protected web portal containing data that I want to keep mirrored and up to date. The site is essentially just a directory listing, with the data organised into folders, and I don't really care about keeping the HTML files or other formatting elements.
However, some file types are huge and too large to download, so I want to ignore these.

Using wget -m with the -R/--reject flag nearly does what I want, except that every file still gets downloaded and is only deleted afterwards if it matches the -R pattern.

Here's how I'm using wget:

wget --http-user userName --http-password password -R "index.html,*tiff,*bam,*bai" -m http://web.server.org/

This produces output like the following, confirming that an excluded file (index.html) (a) gets downloaded and (b) is then deleted:


--2012-05-23 09:38:38--  http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s

Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]

Removing web.server.org/folder/index.html since it should be rejected.

Is there a way to force wget to reject a file before downloading it?
Is there an alternative tool I should consider?

Also, why do I get a 401 Authorization Required response for every downloaded file, despite supplying the username and password? It looks like wget tries to connect unauthenticated every time before retrying with the credentials.
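(One possible workaround, assuming the server accepts HTTP Basic authentication: recent wget versions have an --auth-no-challenge option that sends the credentials with the first request instead of waiting for the server's 401 challenge. A sketch, reusing the command above:

# Send Basic credentials up front rather than waiting for a 401 challenge
wget --auth-no-challenge --http-user userName --http-password password \
     -R "index.html,*tiff,*bam,*bai" -m http://web.server.org/
)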

thanks, Mark

Best Answer

Pavuk (http://www.pavuk.org) looked like a promising alternative: it lets you mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults/dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).

FYI, here's how I was using it:

pavuk -mode mirror -force_reget -preserve_time -progress -Robots \
      -auth_scheme 3 -auth_name x -auth_passwd x \
      -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old \
      -cdir /path/to/root -subdir /path/to/root \
      -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' \
      -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' \
      http://web.server.org/folder 2>&1 | tee pavuk-date.log

In the end, wget --exclude-directories did the trick:

wget --mirror --continue --progress=dot:mega --no-parent \
     --no-host-directories --cut-dirs=1 \
     --http-user x --http-password x \
     --exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
     --directory-prefix /path/to/local/mirror \
     http://my.server.org/folder

Since the --exclude-directories wildcards don't span '/', you need to form your patterns quite specifically to avoid downloading entire folders.
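For example, --exclude-directories takes a comma-separated list, so you can give one pattern per directory depth. A sketch with hypothetical folder names (big_data at two different depths):

# Hypothetical layout: large data lives under both
#   folder/<run>/big_data/  and  folder/<run>/extra/big_data/
# so each depth gets its own comma-separated pattern.
wget --mirror --no-parent \
     --http-user x --http-password x \
     --exclude-directories='folder/*/big_data*,folder/*/*/big_data*' \
     --directory-prefix /path/to/local/mirror \
     http://my.server.org/folder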

Mark
