What's the best way to create a static backup of a website?

backup, command-line-interface, static-content, website

I have an old Joomla! site that I would like to convert to a static set of HTML pages (since it's not being updated anymore and I don't want the overhead of keeping a MySQL database running for something that never changes).

Is there a command-line tool that can crawl and download the entire public-facing website?

Best Answer

I just made static pages from an old Joomla site with this command:

wget --adjust-extension --mirror --page-requisites --convert-links http://my.domain.com

Its short version is:

wget -E -m -p -k http://my.domain.com

This saves pages with an .html extension and gets (almost) all the CSS, JS and image files the pages need.
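
For reference, the short options are just abbreviations of the long ones; the annotated command below is the same fetch, not an extra step:

# -E  =  --adjust-extension   save HTML pages with an .html extension
# -m  =  --mirror             recursive download with infinite depth and timestamping
# -p  =  --page-requisites    also fetch the images, CSS and JS the pages use
# -k  =  --convert-links      rewrite links so the local copy works on its own
wget -E -m -p -k http://my.domain.com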

But I wanted my static mirror to have the same links as the original, so the file names couldn't have the .html extension, which made me remove the -E option.

Then I found that the -p option (and -k) doesn't work the same way without -E. Still, using -E and -p is the best way to get most of the page requisites, so I did a first fetch with them, deleted all the .html files, and then fetched everything again without -E.

As -k without -E also doesn't convert all links, I had to make some substitutions. The complete list of commands I used is:

# To get almost everything:
wget --adjust-extension --mirror --page-requisites --convert-links http://my.domain.com

# Remove the files with .html in their names:
find my.domain.com/ -name '*.html*' -exec rm {} \;

# Get the pages again, without the .html extension:
wget --mirror --page-requisites --convert-links http://my.domain.com

# Check whether there are unconverted absolute URLs, and which ones they are:
grep -lr "http:\/\/my.domain.com" my.domain.com/ | sort -u | xargs sed -ne '/http:\/\/my\.domain\.com/s/.*"http:\/\/\([^"]*\).*/http:\/\/\1/p' | sort -u
# Unconverted absolute URLs correspond to needed files that are missing, so get them:
grep -lr "http:\/\/my.domain.com" my.domain.com/ | sort -u | xargs sed -ne '/http:\/\/my\.domain\.com/s/.*"http:\/\/\([^"]*\).*/http:\/\/\1/p' | sort -u | wget -x -i -
# Then convert all remaining absolute URLs to relative ones:
grep -lr "http:\/\/my.domain.com" my.domain.com/ | sort -u | xargs sed -i -e '/http:\/\/my\.domain\.com/s/http:\/\/my\.domain\.com\/\([^"]*\)/\1/g'

# Convert every "?" in a URL to its URL-encoded equivalent (%3F), so the links
# resolve to the files wget saved (their names contain a literal "?") instead of
# being treated as query strings:
grep -lr --exclude=*.{css,js} '=\s\{0,1\}"[^?"]*?[^"]*"' my.domain.com/ | sort -u | xargs sed -i -e '/\(=\s\{0,1\}"[^?"]*\)?\([^"]*"\)/s/\(=\s\{0,1\}"[^?"]*\)?\([^"]*"\)/\1%3F\2/g'

As I was mirroring a site under a path on my domain, I ended up with this file:

my.domain.com/subsite/index.html

That index.html was deleted by my second command, which is fine. When I ran the second wget, it wanted to create a file with the same name as the directory (so the file got a .1 suffix) and it also created an index.php inside the directory, like this:

my.domain.com/subsite.1
my.domain.com/subsite/index.php

...and it converted (at least some of) the home links to subsite.1. If all the home links are the same, only one of those two files is needed, and index.php is the best choice, as it is served automatically when a client asks for http://my.domain.com/subsite.

To solve that I ran:

# To verify if there were links to subsite.1:
grep -r 'subsite\.1' my.domain.com/
# To convert links from subsite.1 to subsite:
grep -lr 'subsite\.1' my.domain.com/ | sort -u | xargs sed -i -e '/subsite\.1/s/\(subsite\)\.1/\1/g'
# Then I could delete the "duplicated" index file:
rm my.domain.com/subsite.1
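
Before deploying, it is worth clicking through the whole mirror from a local static file server to spot leftover broken links. A minimal sketch, assuming Python 3 is available (any static server will do; note that this simple one only treats index.html as a directory index, so the index.php behaviour described above still has to be checked on the real web server):

# Serve the mirror locally, then browse it at http://localhost:8080/
cd my.domain.com/
python3 -m http.server 8080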

In the end, using a web developer tool (Firebug), I found that some files pulled in by JavaScript or CSS were still missing. I got them one by one.
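
To automate part of that hunt, something along these lines can at least list the assets referenced from the mirrored CSS (a rough sketch using the same my.domain.com placeholder; files that are only requested from JavaScript still have to be found by hand, for example with the browser's network panel):

# List url(...) references in the downloaded CSS, to compare against what is on disk:
grep -rhoE --include='*.css' 'url\([^)]*\)' my.domain.com/ | sort -u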