New customer. Old server. Unused files and directories abound. Five core directories (each attached to a different domain), plus 10-20 possibly extraneous directories with files at the same level as the core directories.
Creating something to run in each core directory and do the following:
A script that reduces a few months' worth of raw log files down to just the URIs, gathers a directory listing, loops through that listing, and reports everything without a corresponding appearance in the condensed log file.
Does anything like that already exist? Is there a better method of accomplishing the end goal? Any suggested languages or tools to build it with?
Honestly, I'm looking for where to begin on this if it were done right.
Best Answer
Honestly, I'm looking for where to begin on this if it were done right.
With a good backup and a new server, built with only what you need.
The danger of removing stuff based on access is that you'll lose the long tail stuff (that one super-critical file that is accessed twice a year by your vendor in Tahiti, without which they can't ship your shiny widgets to you and the entire company goes belly-up). This is where the backup comes in (so you can get the shiny widgets file back).
The danger of trying to "clean up" an old server filled with cruft is not knowing what's cruft and what's important.
Since you're asking us this question instead of shoving your fist into the server and tearing out its rotting digital guts, we can assume you don't know for certain what is/isn't cruft. Even the best tool will have fuzz on one side or the other: either you will leave cruft because you don't know if you need it, or you will remove something you need and have to go for those backups.
If you still want to write the script you described you can do it with a (relatively) simple shell script:
- cat the log files together
- awk to grab the URLs
- sort and uniq the URL list to eliminate duplicates
- awk and sed manipulation to turn the URLs into filenames on-disk
- tar up the known-accessed files and stash them somewhere safe

(Implementation is left as an exercise for the reader, mostly because your access log format may be different than mine, which affects the awk expression(s) you need to use to turn URLs into files on the filesystem.)
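As a starting point, here is a minimal sketch of the steps above, wrapped in a function. It assumes the common Apache/nginx "combined" log format, where the URI is the 7th whitespace-separated field; the function name, paths, and that field number are all assumptions you will need to adjust for your own logs:

```shell
#!/bin/sh
# list_unaccessed DOCROOT LOGFILE...
# Prints every file under DOCROOT that never appears as a URI in the logs.
# ASSUMPTION: "combined" log format, e.g.
#   1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
# where the URI is field 7. Adjust the awk expression for your format.
list_unaccessed() {
  docroot=$1; shift
  uris=$(mktemp)

  # cat the logs together, awk out the URI, sed off any query string,
  # then sort -u (sort + uniq combined) to eliminate duplicates
  cat "$@" | awk '{print $7}' | sed 's/?.*$//' | sort -u > "$uris"

  # List every file on disk whose path (relative to the docroot)
  # never shows up in the deduplicated URI list
  find "$docroot" -type f | while read -r f; do
    grep -qxF "${f#$docroot}" "$uris" || printf '%s\n' "$f"
  done

  rm -f "$uris"
}
```

Run it as, say, `list_unaccessed /var/www/example.com /var/log/httpd/access*.log` (hypothetical paths), then tar up everything it did *not* print before touching anything.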