Linux – wget: how to download a file whose URL params change dynamically, once only

linux, wget

I have a problem with wget: I need to download an entire site, including the images and other files linked from the main pages. I'm using these options:

wget --load-cookies /tmp/cookie.txt -r -l 1 -k -p -nc 'https://www.example.com/mainpage.do'

(-l 1 is used for testing; I may need to go to level 3 or even 4)

The problem is that I can't figure out how to bypass the 'random' GET parameter that is added after some recursion cycles, so my final result in the /tmp folder looks like this:

/tmp/www.example.com/mainpage.do
/tmp/www.example.com/mainpage.do?cx=0.0340590343408
/tmp/www.example.com/mainpage.do?cx=0.0348934786475
/tmp/www.example.com/mainpage.do?cx=0.0032878284787
/tmp/www.example.com/mainpage.do?cx=0.0266389459023
/tmp/www.example.com/mainpage.do?cx=0.0103290334732
/tmp/www.example.com/mainpage.do?cx=0.0890345378478

Since the page is always the same, I don't need to download it more than once. I tried the -nc option, but it doesn't work; I also tried -R (reject), but it only matches file extensions, not URL parameters.

I looked extensively through the wget manual but couldn't find a way to do it. Using wget is not mandatory; if you know how to do this another way, suggestions are welcome.

Best Answer

Write a local proxy server that modifies the responses sent to wget.

Assuming your URLs are in links such as:

<a href="/path/to/mainpage.do?cx=0.0123412341234">

Then you can run a Ruby proxy server like this:

require 'webrick/httpproxy'

# Rewrite every response body so links lose the random cx= parameter;
# wget then sees a single canonical URL for the page and -nc can work.
s = WEBrick::HTTPProxyServer.new(
  :Port => 2200,
  :ProxyContentHandler => Proc.new { |req, res|
    res.body.gsub!(/mainpage\.do\?cx=[0-9.]+/, "mainpage.do")
  }
)
trap("INT") { s.shutdown }
s.start
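The substitution the proxy applies can be sanity-checked in isolation; a minimal sketch, using a made-up sample link:

```ruby
# Demonstrates the rewrite the proxy performs on each response body:
# a link with a random cx= parameter collapses to the bare page URL.
body = '<a href="/path/to/mainpage.do?cx=0.0123412341234">'
body.gsub!(/mainpage\.do\?cx=[0-9.]+/, "mainpage.do")
puts body  # -> <a href="/path/to/mainpage.do">
```

To route wget through the proxy, set the proxy environment variable it honors, e.g. `http_proxy=http://localhost:2200 wget …`. Note that this rewriting only works for plain HTTP; HTTPS traffic is tunneled through the proxy encrypted, so the handler never sees the body.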
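Alternatively, if upgrading wget is an option: newer releases (1.14 and later, if I recall the version correctly) add --reject-regex, which, unlike -R, is matched against the whole URL including the query string. A sketch based on the original command:

```shell
# Assumes wget >= 1.14 (--reject-regex). The regex is applied to the
# full URL, so links carrying the random ?cx= parameter are skipped.
wget --load-cookies /tmp/cookie.txt -r -l 1 -k -p -nc \
     --reject-regex 'mainpage\.do\?cx=' \
     'https://www.example.com/mainpage.do'
```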