I'm trying to download two sites for inclusion on a CD:
http://boinc.berkeley.edu/trac/wiki
http://www.boinc-wiki.info
The problem I'm having is that both of these are wikis, so when I download with, e.g.:
wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
I get far too many files, because wget also follows links like …?action=edit and …?action=diff&version=…
Does somebody know a way to get around this?
I just want the current pages, without images, and without diffs etc.
P.S.:
wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
This worked for the Berkeley site, but boinc-wiki.info is still giving me trouble. :/
P.P.S.:
I got what appears to be the most relevant pages with:
wget -r -k -nv -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
Best Answer
Use the --reject-regex option to filter out URLs by regular expression (the regex type is posix by default; it can be changed with --regex-type). This works only with recent versions of wget (>= 1.14), according to other comments.
Beware that you can apparently use --reject-regex only once per wget call. That is, you have to combine alternatives with | in a single regex if you want to reject several patterns.
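Putting that together with the filters from the question, a combined call might look like the sketch below. The action names in the regex are my assumption about what this wiki's edit/diff URLs contain; adjust them to the query strings you actually see.

```shell
# POSIX ERE rejecting edit/diff/history query-string URLs, with the
# alternatives combined via '|' into one regex (the action names are
# illustrative assumptions, not taken from the wiki itself):
REJECT='action=(edit|diff|history)'

# Sanity-check the regex with grep -E, which uses the same POSIX ERE
# syntax that wget applies by default (--regex-type posix):
echo 'http://www.boinc-wiki.info/?title=Foo&action=edit' | grep -Eq "$REJECT" && echo rejected
echo 'http://www.boinc-wiki.info/Sample_Page' | grep -Eq "$REJECT" || echo kept

# The actual mirror run (requires network access and wget >= 1.14):
# wget -r -k -np -nv -R jpg,jpeg,gif,png,tif \
#      --reject-regex "$REJECT" http://www.boinc-wiki.info/
```

If a page the regex rejects is still fetched, check whether it arrives via a redirect with a different query string; the regex is matched against the URL wget is about to retrieve.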