What's the best way to create a static backup of a website?

backup, command-line-interface, static-content, website

I have an old Joomla! site that I would like to convert to a static set of HTML pages (since it's not being updated anymore and I don't want the overhead of keeping a MySQL database running for something that never changes).

Is there a command-line tool that can crawl and download the entire public-facing website?

Best Answer

I just made static pages from an old Joomla site with this command:

wget --adjust-extension --mirror --page-requisites --convert-links http://my.domain.com

Its short version is:

wget -E -m -p -k http://my.domain.com

This saves pages with an .html extension and gets (almost) all the CSS, JS and image files the pages need.
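
For reference, the short options are just abbreviations of the long ones; the annotated command below is the same fetch, not an extra step:

# -E  =  --adjust-extension   save HTML pages with an .html extension
# -m  =  --mirror             recursive download with infinite depth and timestamping
# -p  =  --page-requisites    also fetch the images, CSS and JS the pages use
# -k  =  --convert-links      rewrite links so the local copy works on its own
wget -E -m -p -k http://my.domain.com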

But I wanted my static mirror to have the same links as the original, so the file names couldn't have the .html extension, which made me remove the -E option.

Then I found that the -p option (and -k) doesn't work the same way without -E. Still, using -E and -p is the best way to get most of the page requisites, so I did a first fetch with them, deleted all the .html files, and then fetched everything again without -E.

As -k without -E also doesn't convert all links, I had to make some substitutions. The complete list of commands I used is:

# To get almost everything:
wget --adjust-extension --mirror --page-requisites --convert-links http://my.domain.com

# Remove the files with .html in their names:
find my.domain.com/ -name '*.html*' -exec rm {} \;

# Get the pages again, without the .html extension:
wget --mirror --page-requisites --convert-links http://my.domain.com

# Check whether there are unconverted absolute URLs, and which ones they are:
grep -lr "http:\/\/my.domain.com" my.domain.com/ | sort -u | xargs sed -ne '/http:\/\/my\.domain\.com/s/.*"http:\/\/\([^"]*\).*/http:\/\/\1/p' | sort -u
# Unconverted absolute URLs correspond to needed files that are missing, so get them:
grep -lr "http:\/\/my.domain.com" my.domain.com/ | sort -u | xargs sed -ne '/http:\/\/my\.domain\.com/s/.*"http:\/\/\([^"]*\).*/http:\/\/\1/p' | sort -u | wget -x -i -
# Then convert all remaining absolute URLs to relative ones:
grep -lr "http:\/\/my.domain.com" my.domain.com/ | sort -u | xargs sed -i -e '/http:\/\/my\.domain\.com/s/http:\/\/my\.domain\.com\/\([^"]*\)/\1/g'

# Convert every "?" in a URL to its URL-encoded equivalent (%3F), so the links
# resolve to the files wget saved (their names contain a literal "?") instead of
# being treated as query strings:
grep -lr --exclude=*.{css,js} '=\s\{0,1\}"[^?"]*?[^"]*"' my.domain.com/ | sort -u | xargs sed -i -e '/\(=\s\{0,1\}"[^?"]*\)?\([^"]*"\)/s/\(=\s\{0,1\}"[^?"]*\)?\([^"]*"\)/\1%3F\2/g'

As I was mirroring a site under a path on my domain, I ended up with this file:

my.domain.com/subsite/index.html

That index.html was deleted by my second command, which is fine. When I ran the second wget, it wanted to create a file with the same name as the directory (so the file got a .1 suffix) and it also created an index.php inside the directory, like this:

my.domain.com/subsite.1
my.domain.com/subsite/index.php

...and it converted (at least some of) the home links to subsite.1. If all the home links are the same, only one of those two files is needed, and index.php is the best choice, as it is served automatically when a client asks for http://my.domain.com/subsite.

To solve that I ran:

# To verify if there were links to subsite.1:
grep -r 'subsite\.1' my.domain.com/
# To convert links from subsite.1 to subsite:
grep -lr 'subsite\.1' my.domain.com/ | sort -u | xargs sed -i -e '/subsite\.1/s/\(subsite\)\.1/\1/g'
# Then I could delete the "duplicated" index file:
rm my.domain.com/subsite.1
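
Before deploying, it is worth clicking through the whole mirror from a local static file server to spot leftover broken links. A minimal sketch, assuming Python 3 is available (any static server will do; note that this simple one only treats index.html as a directory index, so the index.php behaviour described above still has to be checked on the real web server):

# Serve the mirror locally, then browse it at http://localhost:8080/
cd my.domain.com/
python3 -m http.server 8080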

In the end, using a web developer tool (Firebug), I found that some files pulled in by JavaScript or CSS were still missing. I got them one by one.
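
To automate part of that hunt, something along these lines can at least list the assets referenced from the mirrored CSS (a rough sketch using the same my.domain.com placeholder; files that are only requested from JavaScript still have to be found by hand, for example with the browser's network panel):

# List url(...) references in the downloaded CSS, to compare against what is on disk:
grep -rhoE --include='*.css' 'url\([^)]*\)' my.domain.com/ | sort -u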