I've used various versions of the code below over several months to try to block bad bots, but I've come to the realization that it never actually works.
My server has a number of virtual hosts, so I'd like to keep the code in httpd.conf rather than in separate .htaccess files, as that makes it much easier to maintain.
Server Info:
Apache Version: Apache/2.2.15 (Unix)
OS: CentOS release 6.2
I realize the version of Apache is not the latest, but that's what I have to work with.
So, the code below is an abbreviated extract from my httpd.conf file, showing just one virtual host section and only a portion of the bots:
<Location *>
SetEnvIfNoCase User-Agent ".*MJ12bot.*" bad_bot
SetEnvIfNoCase User-Agent ".*Baiduspider.*" bad_bot
SetEnvIfNoCase User-Agent ".*Vagabondo.*" bad_bot
SetEnvIfNoCase User-Agent ".*lwp-trivial.*" bad_bot
SetEnvIfNoCase User-Agent ".*libwww.*" bad_bot
SetEnvIfNoCase User-Agent ".*Wget.*" bad_bot
SetEnvIfNoCase User-Agent ".*XoviBot.*" bad_bot
SetEnvIfNoCase User-Agent ".*xovibot.*" bad_bot
SetEnvIfNoCase User-Agent ".*AhrefsBot.*" bad_bot
SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
Deny from env=bad_bot
</Location>
<VirtualHost xx.xxx.xx.xxx:80>
DocumentRoot "/var/www/sites/xxx"
ServerName www.xxx.com
ServerAlias xxx.com
ScriptAlias /cgi-bin/ "/var/www/sites/xxx/cgi-bin/"
AddType application/x-httpd-php .html .php
<Directory "/var/www/sites/xxx">
Order allow,deny
Allow from all
Deny from env=bad_bot
Options FollowSymLinks +ExecCGI +Includes
RewriteEngine On
AllowOverride All
Include "/var/www/sites/xxx/.htaccess"
</Directory>
CustomLog "/var/www/sites/logs/xxx_access.log" combined
ErrorLog "/var/www/sites/logs/xxx_error.log"
</VirtualHost>
I've tried various approaches to writing the bot patterns, such as wildcarding them, using just the bot name in quotes, or prefixing them with a ^ so the pattern would match when the User-Agent actually begins with the bot name, and so on.
However, nothing I do seems to make the slightest difference: these bots still get everything served up with a 200 for local content, or a 302 if they're following a link to off-site content. I'm figuring the server should be returning 403 errors instead.
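(As a quick sanity check, a request with a spoofed User-Agent, roughly along these lines, should show whether a 403 is actually being returned; the domain here is just the placeholder from the config above:)

# -I sends a HEAD request, -A sets the User-Agent header
curl -I -A "MJ12bot" http://www.xxx.com/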
Any assistance appreciated.
Many thanks.
Best Answer
Your basic idea is correct, but you need to use <Location /> instead of <Location *>. I would suggest reading the docs for Location and LocationMatch to see when wildcards can be used.

Also, you do not need .* at the start and end of your User-Agent patterns, and you do not need the Deny from env=bad_bot in the Directory block in your virtual host; the one in the Location block is sufficient.
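For reference, a minimal sketch of what the corrected block might look like, keeping the bot list from the question (abbreviated here) and relying on Apache 2.2's default Order of Deny,Allow inside the Location block (mod_setenvif and mod_authz_host must be loaded):

<Location />
    # SetEnvIfNoCase already matches case-insensitively, so a plain
    # substring such as "MJ12bot" is enough - no leading/trailing .*
    # and no separate lowercase entries are needed
    SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
    SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
    SetEnvIfNoCase User-Agent "XoviBot" bad_bot
    SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
    SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
    # With the default Order Deny,Allow, only requests tagged bad_bot
    # are refused (403); everything else is allowed through
    Deny from env=bad_bot
</Location>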