Blocking Bad Bots – Apache 2.2.15

apache-2.2

Over the past several months I've used various versions of the code below to try to block bad bots, but I've come to the realization that it never actually works.

My server has a number of virtual hosts, so I'd like to keep the code in httpd.conf rather than in separate .htaccess files, as that makes it much easier to maintain.

Server Info:
Apache Version: Apache/2.2.15 (Unix)
OS: CentOS release 6.2

I realize the version of Apache is not the latest, but that's what I have to work with.

So, the code below is an abbreviated extract from my httpd.conf file, with just one virtual host section shown and only a portion of the bot list:

<Location *> 
SetEnvIfNoCase User-Agent ".*MJ12bot.*" bad_bot 
SetEnvIfNoCase User-Agent ".*Baiduspider.*" bad_bot 
SetEnvIfNoCase User-Agent ".*Vagabondo.*" bad_bot 
SetEnvIfNoCase User-Agent ".*lwp-trivial.*" bad_bot 
SetEnvIfNoCase User-Agent ".*libwww.*" bad_bot 
SetEnvIfNoCase User-Agent ".*Wget.*" bad_bot 
SetEnvIfNoCase User-Agent ".*XoviBot.*" bad_bot 
SetEnvIfNoCase User-Agent ".*xovibot.*" bad_bot 
SetEnvIfNoCase User-Agent ".*AhrefsBot.*" bad_bot 
SetEnvIfNoCase User-Agent "SemrushBot" bad_bot 
Deny from env=bad_bot 
</Location> 

<VirtualHost xx.xxx.xx.xxx:80> 
DocumentRoot "/var/www/sites/xxx" 
ServerName www.xxx.com 
ServerAlias xxx.com 

ScriptAlias /cgi-bin/   "/var/www/sites/xxx/cgi-bin/" 
AddType application/x-httpd-php .html .php 

<Directory "/var/www/sites/xxx"> 
Order allow,deny 
Allow from all 
Deny from env=bad_bot 
Options FollowSymLinks +ExecCGI +Includes 
RewriteEngine On 
AllowOverride All 
Include "/var/www/sites/xxx/.htaccess" 
</Directory> 

CustomLog "/var/www/sites/logs/xxx_access.log" combined 
ErrorLog  "/var/www/sites/logs/xxx_error.log" 
</VirtualHost>

I've tried various approaches to writing the bot patterns, such as wildcarding them, quoting just the bot name, or prefixing the name with a ^ anchor in the hope of matching when the User-Agent string actually begins with the bot name, and so on.

However, nothing I do seems to make the slightest difference: everything is still served to these bots with a 200 for local content, or a 302 when they follow a link to off-site content. I would expect the server to be returning 403 errors instead.
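
A quick way to see this without waiting for the bots to return (a sketch, reusing the placeholder host name from the config above) is to fake a bot User-Agent with curl and look at the status line:

curl -I -A "MJ12bot" http://www.xxx.com/
# Currently this comes back "HTTP/1.1 200 OK"; once the blocking works,
# it should come back "HTTP/1.1 403 Forbidden" instead.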

Any assistance appreciated.

Many thanks.

Best Answer

Your basic idea is correct, but you need to use <Location /> instead of <Location *>; the wildcard characters allowed in a <Location> path do not match a / in the URL-path, so <Location *> never matches an ordinary request. I would suggest reading the docs for Location and LocationMatch to see when wildcards and regular expressions can be used.
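
For illustration (hypothetical paths, not from your config), the two forms look like this:

<Location />
    # literal URL-path prefix; "/" matches every request
</Location>

<LocationMatch "^/private/">
    # regular-expression form, for when you need pattern matching on the path
</LocationMatch>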

Also, you do not need .* at the start and end of your User-Agent patterns, since SetEnvIfNoCase does an unanchored regular-expression match, and you do not need the Deny from env=bad_bot in the Directory block of your virtual host. The one in the Location block is sufficient.
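
Putting that together, a trimmed sketch of what the server-wide block could look like (only a few of the bots from your list shown; the explicit Order/Allow lines are optional with 2.2's defaults but make the intent clear):

<Location />
    # SetEnvIfNoCase does a case-insensitive, unanchored match,
    # so the bare bot name is enough (one line per bot)
    SetEnvIfNoCase User-Agent "MJ12bot"     bad_bot
    SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
    SetEnvIfNoCase User-Agent "AhrefsBot"   bad_bot
    SetEnvIfNoCase User-Agent "SemrushBot"  bad_bot

    # 2.2-style access control: allow everyone, then deny (403) any
    # request that picked up the bad_bot environment variable
    Order allow,deny
    Allow from all
    Deny from env=bad_bot
</Location>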