I'd like to mirror a simple password-protected web portal so that I have a local, up-to-date copy of its data. Essentially, the website is just a directory listing with data organised into folders, and I don't really care about keeping HTML files and other formatting elements.
However, some of the file types are too large to download, so I want to ignore these.
Using wget -m with the -R/--reject flag nearly does what I want, except that every file gets downloaded first and is then deleted if it matches the -R pattern.
Here's how I'm using wget:
wget --http-user userName --http-password password -R index.html,*tiff,*bam,*bai -m http://web.server.org/
Which produces output like this, confirming that an excluded file (index.html) (a) gets downloaded, and (b) then gets deleted:
…
--2012-05-23 09:38:38--  http://web.server.org/folder/
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 401 Authorization Required
Reusing existing connection to web.server.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 2677 (2.6K) [text/html]
Saving to: `web.server.org/folder/index.html'
100%[======================================================================================================================>] 2,677 --.-K/s in 0s
Last-modified header missing -- time-stamps turned off.
2012-05-23 09:38:39 (328 MB/s) - `web.server.org/folder/index.html' saved [2677/2677]
Removing web.server.org/folder/index.html since it should be rejected.
…
Is there a way to force wget to reject a file before downloading it?
Is there an alternative that I should consider?
Also, why do I get a 401 Authorization Required error for every downloaded file, despite supplying a username and password? It's as if wget tries to connect unauthenticated every time before trying the username/password.
Thanks, Mark
Best Answer
Pavuk (http://www.pavuk.org) looked like a promising alternative: it lets you mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults or dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).
FYI, here's how I was using it:
pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-date.log
In the end, wget --exclude-directories did the trick.
Since the --exclude-directories wildcards don't span '/', you need to form your patterns quite specifically to avoid downloading entire folders.
Mark
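For anyone landing here, a minimal sketch of what such a command might look like. The directory names (/icons, /styles, /folder/*/bam, /folder/*/tiff) are made up for illustration and are not the real layout of the server; the point is only that each '*' stays within one path component, so every directory level has to be written out. The helper function mirror_command is likewise hypothetical: it just prints the command so the patterns can be checked before anything is downloaded.

```shell
#!/bin/sh
# Hypothetical directory layout: adjust these patterns to the real server.
# A '*' in these patterns does not cross '/', so each directory level is
# spelled out explicitly (e.g. '/folder/*/bam' rather than '*bam*').
EXCLUDE_DIRS='/icons,/styles,/folder/*/bam,/folder/*/tiff'

# Print (rather than run) the mirror command for a user, password and URL.
# The pattern list is single-quoted in the output so the shell leaves the
# wildcards for wget to interpret.
mirror_command() {
    printf "wget --http-user %s --http-password %s --exclude-directories='%s' -m %s\n" \
        "$1" "$2" "$EXCLUDE_DIRS" "$3"
}

mirror_command userName password http://web.server.org/
```

Once the printed line looks right, it can be pasted into the shell as is. Excluded directories are skipped without being requested, unlike the -R patterns above, which only take effect after the download.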