Trouble filtering googlebot from apache access log

apache-2.4 googlebot logging

It seems like it should be straightforward, but I have been unable to configure Apache so that Googlebot's requests are kept out of the access log. I've tried the following lines:

SetEnvIfNoCase User-Agent googlebot dontlog
BrowserMatchNoCase googlebot dontlog
CustomLog "/foo/bar/access_log" combined env=!dontlog

I restarted Apache after adding them, but the log still records all of Googlebot's requests. My understanding is that SetEnvIf User-Agent and BrowserMatch do the same thing. I tried each of them, but neither works.

Best Answer

Find a log entry that you suspect is the Googlebot and make a note of the IP address.
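
If the log is large, a short script can list the candidate addresses for you. The following is only a sketch, assuming the log uses the combined format (client IP first, User-Agent as the last quoted field); the function name candidate_ips and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Combined log format: the client IP is the first field and the
# User-Agent is the last quoted field on the line.
# Assumption: the User-Agent contains no embedded double quotes.
LINE_RE = re.compile(r'^(?P<ip>\S+) .* "(?P<agent>[^"]*)"\s*$')


def candidate_ips(lines, needle="googlebot"):
    """Count requests per client IP whose User-Agent contains `needle`.

    The comparison is case-insensitive. `lines` can be any iterable of
    log lines, such as an open file object.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and needle.lower() in match.group("agent").lower():
            counts[match.group("ip")] += 1
    return counts


# Illustrative sample lines, not taken from a real log.
sample = [
    '66.249.64.156 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for ip, hits in candidate_ips(sample).most_common():
    print(ip, hits)
```

To run it against a real log, pass an open file instead of the sample list. The User-Agent header is easy to fake, so treat the addresses it prints as candidates only.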

Next, do a reverse DNS lookup on that IP address with the following command:

host 66.249.64.156

Remember to replace the IP address in this command with the one you recorded earlier.

If the result looks something like this, you know it's the Googlebot. Make sure the host name ends in googlebot.com:

156.64.249.66.in-addr.arpa domain name pointer crawl-66-249-64-156.googlebot.com.
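
The same check can be scripted. Google's verification page (linked at the end of this answer) also recommends a forward lookup on the returned name to confirm that it maps back to the original address, which the host command above does not do by itself. Here is a minimal sketch, assuming IPv4 addresses only; the function names and the suffix list are my own choices.

```python
import socket


def reverse_lookup(ip):
    """Return the PTR host name for `ip`, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of IPv4 addresses that `hostname` resolves to."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except OSError:
        return set()


def is_verified_crawler(ip, suffixes=(".googlebot.com", ".search.msn.com"),
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check.

    1. The PTR name for `ip` must end in one of `suffixes`.
    2. That name must resolve back to `ip`.
    `reverse` and `forward` are injectable so the logic can be tried
    without touching the network.
    """
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(tuple(suffixes)):
        return False
    return ip in forward(hostname)
```

Calling is_verified_crawler("66.249.64.156") with the default resolvers performs live DNS queries, so the answer depends on your resolver and on the current DNS records.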

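Next, go to your Apache2 VirtualHost and add these directives, adapted for your site. SetEnvIf tests only one attribute per directive and has no AND keyword, so the combined address and User-Agent check is written with Apache 2.4's SetEnvIfExpr:

SetEnvIfExpr "-R '66.249.64.156' && %{HTTP_USER_AGENT} =~ /Googlebot/" do_not_log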
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!do_not_log

You can repeat this process for the bingbot:

host 157.55.39.247

The result should end in search.msn.com, like this:

247.39.55.157.in-addr.arpa domain name pointer msnbot-157-55-39-247.search.msn.com.

So you would add one more line to the VirtualHost file after the Googlebot line:

SetEnvIfExpr "-R '157.55.39.247' && %{HTTP_USER_AGENT} =~ /bing/" do_not_log

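Usually the Googlebot and the MSN bot will keep using the same IP address to check your pages, but if they do not, you will need to add entries for the other addresses. For convenience you may prefer to match a prefix such as "^66" instead, keeping in mind that it matches every address beginning with 66, not only Google's.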
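
If you end up with more than a couple of addresses, you can generate the entries instead of typing them. This is a rough sketch that assumes Apache 2.4, where SetEnvIfExpr and the -R address test are available; the helper name setenvif_lines and the crawler list are illustrative.

```python
def setenvif_lines(crawlers, env_var="do_not_log"):
    """Build one SetEnvIfExpr directive per (address, agent_pattern) pair.

    `address` is an IP address or CIDR range for the -R test.
    `agent_pattern` is placed verbatim inside /.../, so it is assumed
    to be a plain word with no slashes or quotes.
    """
    lines = []
    for address, agent in crawlers:
        expr = f"-R '{address}' && %{{HTTP_USER_AGENT}} =~ /{agent}/"
        lines.append(f'SetEnvIfExpr "{expr}" {env_var}')
    return lines


# Addresses taken from the lookups earlier in this answer.
crawlers = [
    ("66.249.64.156", "Googlebot"),
    ("157.55.39.247", "bing"),
]

for line in setenvif_lines(crawlers):
    print(line)
```

Paste the printed lines into the VirtualHost above the CustomLog line, and add each newly verified address to the list as it shows up in the log.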

https://support.google.com/webmasters/answer/80553

https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/
