PHP – How to identify and remove unused files / directories from an Apache web server

apache-2.2, bash, PHP, web-server

New customer. Old server. Unused files and directories abound. Five specific core directories (attached to differing domains). 10-20 possible extraneous directories with files on the same level as the core directories.

I'm creating something to run in each core directory to do the following:

A script that reduces a few months' worth of raw log files down to just the URI, gathers a directory listing, loops through that listing, and reports everything without a corresponding appearance in the condensed log file.

Does anything like that already exist? Is there a better method of accomplishing the end goal? Any suggested languages / tools to build with?

Honestly, I'm looking for where to begin on this if it were done right.

Best Answer

"Honestly, I'm looking for where to begin on this if it were done right."

With a good backup and a new server, built with only what you need.

The danger of removing stuff based on access is that you'll lose the long tail stuff (that one super-critical file that is accessed twice a year by your vendor in Tahiti, without which they can't ship your shiny widgets to you and the entire company goes belly-up). This is where the backup comes in (so you can get the shiny widgets file back).

The danger of trying to "clean up" an old server filled with cruft is not knowing what's cruft and what's important. Since you're asking us this question instead of shoving your fist into the server and tearing out its rotting digital guts, we can assume you don't know for certain what is and isn't cruft. Even the best tool will have fuzz on one side or the other: either you'll leave cruft behind because you don't know whether you need it, or you'll remove something you do need and have to go for those backups.


If you still want to write the script you described, you can do it with a (relatively) simple shell script (a sketch of the whole pipeline follows the list):

  • cat the log files together
  • use awk to grab the URLs
  • sort and uniq the URL list to eliminate duplicates
    • You may need to do further awk and sed manipulation to turn URLs into filenames on-disk...
  • Take your list of known-accessed files, review it manually & add in anything your scripts may have missed
  • tar up the known-accessed files and stick them somewhere safe.
  • Move the old directory aside (keep it safe as a backup) & untar your known-accessed files.
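
Here's a minimal sketch of that pipeline. It assumes the stock Apache common/combined log format; DOCROOT and LOGDIR are hypothetical placeholders you'd adjust for each core directory:

    #!/bin/bash
    # Sketch only: assumes the stock Apache common/combined log format.
    # DOCROOT and LOGDIR are hypothetical placeholders; adjust per core directory.
    DOCROOT="/var/www/example"
    LOGDIR="/var/log/apache2"

    # cat the logs together, awk out the request URI (7th field in the
    # common/combined formats), strip query strings, then sort/uniq.
    cat "$LOGDIR"/access.log* | awk '{print $7}' | sed 's/?.*//' | sort -u > accessed-uris.txt

    # Naively map URIs to on-disk paths. This ignores aliases, rewrites,
    # and DirectoryIndex, so review the result by hand.
    sed "s|^|$DOCROOT|" accessed-uris.txt | sort -u > accessed-files.txt

    # List everything actually on disk and report files never seen in the logs.
    find "$DOCROOT" -type f | sort > all-files.txt
    comm -13 accessed-files.txt all-files.txt > never-accessed.txt

    # Keep only the accessed paths that really exist on disk, then tar them
    # up somewhere safe before moving the old directory aside.
    grep -Fxf all-files.txt accessed-files.txt > accessed-existing.txt
    tar -czf known-accessed.tar.gz -T accessed-existing.txt

never-accessed.txt is your candidate cruft list; the manual review, archive, and move-aside steps above still apply before you delete anything.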

(A full implementation is left as an exercise for the reader, mostly because your access log format may differ from mine, which affects the awk expression(s) you need to turn URLs into files on the filesystem.)
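
For example, with the stock common or combined LogFormat the request URI is the seventh whitespace-separated field, while a custom format that prefixes the virtual host (%v) shifts it to the eighth:

    # Stock common/combined format: URI is the 7th field.
    awk '{print $7}' access.log

    # Vhost-prefixed format ("%v %h %l %u %t \"%r\" %>s %b"): URI is the 8th.
    awk '{print $8}' access.log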