Wget – download all links from an HTTP location (not recursively)

wget

I have a link to an HTTP page that has a structure like this:

Parent Directory –
[DIR] _OLD/ 01-Feb-2012 06:05 –
[DIR] _Jan/ 01-Feb-2012 06:05 –
[DIR] _Dec/ 01-Jan-2012 06:05 –
……
[DIR] _Apr/ 01-May-2011 06:05 –
[DIR] _Mar/ 01-Apr-2011 06:05 –
[DIR] _Feb/ 01-Mar-2011 06:05 –
[DIR] WEB-INF/ 21-Aug-2009 13:44 –
[ ] nohup_XXX_XXX21.out 14-Feb-2012 09:05 1.6M
[ ] XXX_XXX21.log 14-Feb-2012 09:04 64K
[ ] XXX_XXX21_access.log 14-Feb-2012 08:31 8.0K
[ ] XXX_XXX21_access.log00013 14-Feb-2012 00:01 585K

I would like to download only the files present in the root directory… the XXX files.

I have a solution using

curl -A Mozilla http://yourpage.com/bla.html > page
grep -o 'http://[^[:space:]]*\.log[^[:space:]]*' page > links
wget -i links
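
For reference, the same pipeline can be joined into one pass, a minimal sketch only: http://yourpage.com/ stays a placeholder, and it assumes the index page uses relative href attributes, as Apache listings usually do.

# placeholder base URL; hrefs are assumed to be relative links in the index page
base=http://yourpage.com/
curl -sA Mozilla "$base" \
  | grep -oE 'href="[^"]*\.log[^"]*"' \
  | sed -e 's/^href="//' -e 's/"$//' -e "s|^|$base|" \
  | wget -i -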

but I wonder, is it not possible to do that using only wget?

Best Answer

To get all files from the root directory matching the pattern *.log*:

wget --user-agent=Mozilla --no-directories --accept='*.log*' -r -l 1 http://yourpage.com/bla.html
  • --user-agent=Mozilla sets the User-Agent header
  • --no-directories saves all files in the current directory
  • --accept='*.log*' pattern of accepted file names
  • -r recursive retrieval
  • -l 1 one level of recursion

You avoid grepping HTML links out yourself (which can be error prone) at the cost of a few more requests to the server.
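
If the listing sits below directories you do not want touched, a variant worth trying is the same command with --no-parent, which stops wget from ascending above the starting directory during recursion (short option forms shown; the URL is still the placeholder from above):

# -U = --user-agent, -nd = --no-directories, -np = --no-parent, -A = --accept
wget -U Mozilla -nd -np -r -l 1 -A '*.log*' http://yourpage.com/bla.html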