Linux – wget recursive download, but I don’t want to follow all links

Tags: linux, mirror, mirror-site, wget

I'm trying to mirror a website with wget, but I don't want to download every file, so I'm using wget's --reject option to skip the ones I don't need. However, wget still downloads each file and only removes it afterwards if it matches my reject pattern.

Is there some way to tell wget not to follow certain links if they match a shell wildcard? If wget can't do this, is there some other common Linux command that can?

Best Answer

You might also try HTTrack, which has, IMO, more flexible and intuitive include/exclude logic. Something like this:

httrack "https://example.com" -O ExampleMirrorDirectory \
"-*" \
"+https://example.com/images/*" \
"-*.swf"

The rules are applied in order, and a later rule overrides any earlier rule that matches the same URL:

  1. Exclude everything
  2. But include https://example.com/images/*
  3. But exclude anything ending in .swf
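
If the "last matching rule wins" ordering is hard to picture, here is a rough sketch of it in plain POSIX shell. This is not HTTrack's code: `decide` is a hypothetical helper made up for illustration, it uses ordinary shell globs (HTTrack's own filter syntax has more features), and the "include" default when no rule matches is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTTrack): report whether a URL would be
# kept by an ordered list of +pattern / -pattern rules.
# Every rule is checked in order, and the last one that matches decides.
decide() {
    url=$1
    shift
    verdict=include                  # assumed default when no rule matches
    for rule in "$@"; do
        sign=${rule%"${rule#?}"}     # first character of the rule: + or -
        pattern=${rule#?}            # rest of the rule, used as a shell glob
        case $url in
            $pattern)
                if [ "$sign" = "+" ]; then
                    verdict=include
                else
                    verdict=exclude
                fi
                ;;
        esac
    done
    echo "$verdict"
}

# The three rules from the httrack command above, tried on sample URLs.
decide "https://example.com/images/logo.png"  "-*" "+https://example.com/images/*" "-*.swf"
decide "https://example.com/images/intro.swf" "-*" "+https://example.com/images/*" "-*.swf"
decide "https://example.com/about.html"       "-*" "+https://example.com/images/*" "-*.swf"
```

The first URL is matched by rules 1 and 2, so rule 2 decides and it is kept. The second is matched by all three, so rule 3 decides and it is dropped. The third is only matched by rule 1, so it is dropped too.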