Blocking Bad Bots – Apache 2.2.15


Over several months I've tried various versions of the code below to block bad bots, but I've come to realize that it has never actually worked.

My server has a number of virtual hosts, and so I'd like to have the code in httpd.conf, rather than separate .htaccess files, as it makes it that much easier to maintain.

Server Info:
Apache Version: Apache/2.2.15 (Unix)
OS: CentOS release 6.2

I realize the version of Apache is not the latest, but that's what I have to work with.

So, the code below is an abbreviated extract from my httpd.conf file, with just one virtual host section listed, and just a portion of the bots listed:

<Location *> 
SetEnvIfNoCase User-Agent ".*MJ12bot.*" bad_bot 
SetEnvIfNoCase User-Agent ".*Baiduspider.*" bad_bot 
SetEnvIfNoCase User-Agent ".*Vagabondo.*" bad_bot 
SetEnvIfNoCase User-Agent ".*lwp-trivial.*" bad_bot 
SetEnvIfNoCase User-Agent ".*libwww.*" bad_bot 
SetEnvIfNoCase User-Agent ".*Wget.*" bad_bot 
SetEnvIfNoCase User-Agent ".*XoviBot.*" bad_bot 
SetEnvIfNoCase User-Agent ".*xovibot.*" bad_bot 
SetEnvIfNoCase User-Agent ".*AhrefsBot.*" bad_bot 
SetEnvIfNoCase User-Agent "SemrushBot" bad_bot 
Deny from env=bad_bot 
</Location>

DocumentRoot "/var/www/sites/xxx" 

ScriptAlias /cgi-bin/   "/var/www/sites/xxx/cgi-bin/" 
AddType application/x-httpd-php .html .php 

<Directory "/var/www/sites/xxx"> 
Order allow,deny 
Allow from all 
Deny from env=bad_bot 
Options FollowSymLinks +ExecCGI +Includes 
RewriteEngine On 
AllowOverride All 
Include "/var/www/sites/xxx/.htaccess" 
</Directory>

CustomLog "/var/www/sites/logs/xxx_access.log" combined 
ErrorLog  "/var/www/sites/logs/xxx_error.log" 

I've tried various ways of writing the bot patterns: wrapping the name in wildcards, using just the bot name in quotes, or prefixing it with a ^ symbol in the hope of catching the bot when the User-Agent actually begins with its name.

However, nothing I do seems to make the slightest difference. These bots are still served everything with a 200 for local content, or a 302 if they're following a link to off-site content. I expected them to get a 403.

Any assistance appreciated.

Many thanks.

Best Answer

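Your basic idea is correct, but you need to use <Location /> instead of <Location *>. As I read the docs, the wildcard characters in a Location path do not match a /, and every request path begins with /, so <Location *> never matches anything and the Deny inside it is never applied. I would suggest reading the docs for Location and LocationMatch to see when wildcards can be used.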
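
As a minimal sketch of that change (Apache 2.2 syntax, with a single bot shown for brevity), the block could look something like this:

```apache
# "/" is a prefix of every request path, so this section applies to all requests.
<Location />
    SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
    # No Order directive here, so the 2.2 default (Deny,Allow) applies:
    # requests with bad_bot set are refused, all others are allowed.
    Deny from env=bad_bot
</Location>
```

With this in place, a request whose User-Agent contains MJ12bot should come back as a 403 rather than a 200.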

Also, you do not need .* at the start and end of your User-Agent patterns. SetEnvIfNoCase does an unanchored regex match, so "MJ12bot" already matches anywhere in the header. You also do not need the Deny from env=bad_bot in the Directory block in your virtual host. The one in the Location block is sufficient.
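
For illustration, the pattern lines from the question could be trimmed to something like the following. This is only a sketch, meant to sit inside the Location block next to the Deny line, and it assumes a plain substring match on the bot name is what you want:

```apache
# SetEnvIfNoCase matches the regex anywhere in the header value and ignores case,
# so no leading or trailing .* is needed, and one XoviBot line also covers "xovibot".
SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "Vagabondo" bad_bot
SetEnvIfNoCase User-Agent "lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "libwww" bad_bot
SetEnvIfNoCase User-Agent "Wget" bad_bot
SetEnvIfNoCase User-Agent "XoviBot" bad_bot
SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
```

A leading ^ would only be useful if you wanted to match user agents that begin with the bot name. Most of these bots put their name somewhere in the middle of the string, so the unanchored form is usually the safer choice.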