How to refresh an online website mirror created with `wget --mirror`

unix wget

One month ago, I used `wget --mirror` to create a mirror of our public website for temporary use during an upcoming scheduled maintenance window. Our primary website runs HTML, PHP & MySQL, but the mirror just needs to be HTML-only; no dynamic content, PHP, or database is needed.

The following command will create a simple, online mirror of our website:

wget --mirror http://www.example.org/

Note that the Wget manual says --mirror "is currently equivalent to -r -N -l inf --no-remove-listing" (the human-readable equivalent is `--recursive --timestamping --level=inf --no-remove-listing`).
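
For reference, here is the same mirror command spelled out with those long-form flags:

# Equivalent to `wget --mirror`, flag by flag
wget --recursive --timestamping --level=inf --no-remove-listing http://www.example.org/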

Now it's a month later, and much of the website content has changed. I want wget to check all the pages and re-download any that have changed. However, this isn't working.

My question:

What do I need to do to refresh the mirror of the website, short of deleting the directory and re-running the mirror?

The top level file at http://www.example.org/index.html has not changed, but there are many other files which have changed.

I thought all I needed to do was re-run `wget --mirror`, because `--mirror` implies the flags `--recursive` ("specify recursive download") and `--timestamping` ("Don't re-retrieve files unless newer than local"). I thought this would check all of the pages and only retrieve files which are newer than my local copies. Am I wrong?
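
For a single file, timestamping does behave that way: wget compares the server's Last-Modified header against the local file's mtime and skips the download if the server copy is not newer. A minimal check against the same example host:

# Re-fetches index.html only if the server copy is newer than the local one
wget --timestamping http://www.example.org/index.html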

However, wget doesn't recurse through the site on the second run. `wget --mirror` checks http://www.example.org/index.html, notices that this page did not change, and then stops.

--2010-06-29 10:14:07--  http://www.example.org/
Resolving www.example.org (www.example.org)... 10.10.6.100
Connecting to www.example.org (www.example.org)|10.10.6.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Server file no newer than local file "www.example.org/index.html" -- not retrieving.

Loading robots.txt; please ignore errors.
--2010-06-29 10:14:08--  http://www.example.org/robots.txt
Connecting to www.example.org (www.example.org)|10.10.6.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 136 [text/plain]
Saving to: "www.example.org/robots.txt"

     0K                                                       100% 6.48M=0s
2010-06-29 10:14:08 (6.48 MB/s) - "www.example.org/robots.txt" saved [136/136]

--2010-06-29 10:14:08--  http://www.example.org/news/gallery/image-01.gif
Reusing existing connection to www.example.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 40741 (40K) [image/gif]
Server file no newer than local file "www.example.org/news/gallery/image-01.gif" -- not retrieving.

FINISHED --2010-06-29 10:14:08--
Downloaded: 1 files, 136 in 0s (6.48 MB/s)

Best Answer

The following workaround seems to work for now. It deletes the local copy of index.html, which forces wget to re-download the page and check all of its child links again. However, shouldn't wget check all child links automatically?

rm www.example.org/index.html && wget --mirror http://www.example.org/
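
If the mirror needs to be refreshed regularly, the same workaround can be wrapped in a small script, shown below as a sketch (the hostname and local path are assumed to match the example above):

#!/bin/sh
# refresh-mirror.sh -- sketch of the workaround above.
# Deleting the locally saved top-level page forces wget to re-download it
# and re-crawl every link it contains; --timestamping then skips any child
# file whose server copy is not newer than the local copy.
rm -f www.example.org/index.html
wget --mirror http://www.example.org/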