New customer. Old server. Unused files and directories abound. Five core directories (each attached to a different domain), plus 10-20 possibly extraneous directories with files at the same level as the core directories.
Creating something to run in each core directory and do the following:
A script that reduces a few months' worth of raw log files down to just the URIs, gathers a directory listing, loops through that listing, and reports everything without a corresponding appearance in the condensed log file.
Does anything like that already exist? Is there a better method of accomplishing the end goal? Any suggested languages or tools to build it with?
Honestly, I'm looking for where to begin on this if it were done right.
Best Answer
Honestly, I'm looking for where to begin on this if it were done right.
With a good backup and a new server, built with only what you need.
The danger of removing stuff based on access is that you'll lose the long tail stuff (that one super-critical file that is accessed twice a year by your vendor in Tahiti, without which they can't ship your shiny widgets to you and the entire company goes belly-up). This is where the backup comes in (so you can get the shiny widgets file back).
The danger of trying to "clean up" an old server filled with cruft is not knowing what's cruft and what's important.
Since you're asking us this question instead of shoving your fist into the server and tearing out its rotting digital guts, we can assume you don't know for certain what is/isn't cruft. Even the best tool will have fuzz on one side or the other: either you will leave cruft because you don't know if you need it, or you will remove something you need and have to go for those backups.
If you still want to write the script you described you can do it with a (relatively) simple shell script:
- cat the log files together
- awk to grab the URLs
- sort and uniq the URL list to eliminate duplicates
- awk and sed manipulation to turn the URLs into filenames on-disk
- tar up the known-accessed files and stash them somewhere safe

(Implementation is left as an exercise for the reader, mostly because your access log format may be different than mine, which affects the awk expression(s) you need to use to turn URLs into files on the filesystem.)
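As a starting point, here is a minimal sketch of the steps above, wrapped in a function. It assumes the common Apache/nginx "combined" log format, where the URI is the 7th whitespace-separated field; the function name, paths, and that field number are all assumptions you will need to adjust for your own logs:

```shell
#!/bin/sh
# list_unaccessed DOCROOT LOGFILE...
# Prints every file under DOCROOT that never appears as a URI in the logs.
# ASSUMPTION: "combined" log format, e.g.
#   1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
# where the URI is field 7. Adjust the awk expression for your format.
list_unaccessed() {
  docroot=$1; shift
  uris=$(mktemp)

  # cat the logs together, awk out the URI, sed off any query string,
  # then sort -u (sort + uniq combined) to eliminate duplicates
  cat "$@" | awk '{print $7}' | sed 's/?.*$//' | sort -u > "$uris"

  # List every file on disk whose path (relative to the docroot)
  # never shows up in the deduplicated URI list
  find "$docroot" -type f | while read -r f; do
    grep -qxF "${f#$docroot}" "$uris" || printf '%s\n' "$f"
  done

  rm -f "$uris"
}
```

Run it as, say, `list_unaccessed /var/www/example.com /var/log/httpd/access*.log` (hypothetical paths), then tar up everything it did *not* print before touching anything.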