Go – Detecting well behaved / well known bots

bots, googlebot

I found this question very interesting: Programmatic Bot Detection
I have a very similar question, but I'm not bothered about 'badly behaved bots'.

I am tracking (in addition to google analytics) the following per visit :

  • Entry URL
  • Referer
  • UserAgent
  • Adwords (by means of query string)
  • Whether or not the user made a purchase
  • etc.

The problem is that to calculate any kind of conversion rate I'm ending up with lots of 'bot' visits that are greatly skewing my results.

I'd like to ignore as many bot visits as possible, but I want a solution that I don't need to monitor too closely, that won't itself be a performance hog, and that preferably still works if someone has JavaScript disabled.

Are there good published lists of the top 100 bots or so? I did find a list at http://www.user-agents.org/ but that appears to contain hundreds if not thousands of bots. I don't want to check every referer against thousands of links.

Here is the current googlebot UserAgent. How often does it change?

 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
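
For reference, the kind of low-maintenance check I have in mind is a substring match on the 'Googlebot' token rather than on the full string, so a version bump wouldn't break it. A minimal Go sketch (the helper name is my own):

    package main

    import (
        "fmt"
        "strings"
    )

    // looksLikeGooglebot matches on the "Googlebot" token rather than the
    // full User-Agent string, so minor version changes don't break the check.
    func looksLikeGooglebot(userAgent string) bool {
        return strings.Contains(userAgent, "Googlebot")
    }

    func main() {
        ua := "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        fmt.Println(looksLikeGooglebot(ua)) // true
    }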

Best Answer

You could try importing the Robots database from robotstxt.org and using that to filter out requests from those User-Agents. It might not be much different from user-agents.org, but at least the robotstxt.org list is 'owner-submitted' (supposedly).
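
To illustrate, here's a rough Go sketch of that idea, assuming you export the database to a plain-text file with one User-Agent substring per line (the file name and format here are placeholders; the real download may need extra parsing):

    package main

    import (
        "bufio"
        "log"
        "os"
        "strings"
    )

    // loadBotList reads one User-Agent substring per line. It assumes the
    // robots database has already been exported to that plain-text shape.
    func loadBotList(path string) ([]string, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        var tokens []string
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if line := strings.TrimSpace(scanner.Text()); line != "" {
                tokens = append(tokens, line)
            }
        }
        return tokens, scanner.Err()
    }

    // isBot reports whether the User-Agent contains any of the loaded tokens.
    func isBot(userAgent string, tokens []string) bool {
        for _, t := range tokens {
            if strings.Contains(userAgent, t) {
                return true
            }
        }
        return false
    }

    func main() {
        tokens, err := loadBotList("robot-useragents.txt") // hypothetical export
        if err != nil {
            log.Fatal(err)
        }
        ua := "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        log.Println(isBot(ua, tokens))
    }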

That site also links to botsvsbrowsers.com, although I don't immediately see a downloadable version of their data.

Also, you said

I don't want to check every referer against thousands of links.

which is fair enough - but if runtime performance is a concern, just 'log' every request and filter them out as a post-process (an overnight batch, or as part of the reporting queries).
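
Something along these lines, for example (a sketch only: the tab-separated log layout and file name are made up for illustration, so adapt it to however you actually store visits):

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "strings"
    )

    // Assumed (hypothetical) log layout: one visit per line, tab-separated:
    //   entryURL \t referer \t userAgent \t purchased(0|1)
    func main() {
        f, err := os.Open("visits.log") // hypothetical file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        botTokens := []string{"googlebot", "bingbot", "slurp", "bot", "crawler", "spider"}

        var visits, purchases int
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            cols := strings.Split(scanner.Text(), "\t")
            if len(cols) < 4 {
                continue
            }
            ua := strings.ToLower(cols[2])

            // Drop rows whose User-Agent contains a known bot token.
            isBot := false
            for _, t := range botTokens {
                if strings.Contains(ua, t) {
                    isBot = true
                    break
                }
            }
            if isBot {
                continue
            }

            visits++
            if cols[3] == "1" {
                purchases++
            }
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
        if visits > 0 {
            fmt.Printf("conversion rate: %.2f%%\n", 100*float64(purchases)/float64(visits))
        }
    }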

This point also confuses me a bit:

preferably still work if someone has javascript disabled.

Are you writing your log on the server side as part of every page you serve? JavaScript should not make any difference in this case (although obviously visitors with JavaScript disabled will not get reported via Google Analytics).
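
For example, in Go a server-side logging wrapper might look roughly like this (just a sketch; in practice you'd write to your visit store rather than the standard logger):

    package main

    import (
        "log"
        "net/http"
    )

    // logVisits wraps a handler and records every request on the server side,
    // so it works whether or not the client runs JavaScript.
    func logVisits(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // In a real setup you would write these fields to your visit store;
            // this sketch just uses the standard logger.
            log.Printf("url=%q referer=%q ua=%q", r.URL.String(), r.Referer(), r.UserAgent())
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello"))
        })
        log.Fatal(http.ListenAndServe(":8080", logVisits(mux)))
    }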

P.S. Having mentioned robotstxt.org, it's worth remembering that well-behaved robots will request /robots.txt from your website root. Perhaps you could use that to your advantage by logging, or notifying you of, possible robot User-Agents that you might want to exclude (although I wouldn't exclude a UA automatically, in case a regular web user types /robots.txt into their browser and your code ends up ignoring real people). I don't think that would add much maintenance overhead over time...
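
A rough Go sketch of that idea (hypothetical names; it only records the User-Agents for later review rather than excluding anyone automatically):

    package main

    import (
        "log"
        "net/http"
        "sync"
    )

    // suspectedBots collects User-Agents that have requested /robots.txt so you
    // can review them later; nothing is excluded automatically, since a curious
    // human can also type /robots.txt into the address bar.
    var (
        mu            sync.Mutex
        suspectedBots = map[string]bool{}
    )

    func robotsTxt(w http.ResponseWriter, r *http.Request) {
        mu.Lock()
        suspectedBots[r.UserAgent()] = true
        mu.Unlock()
        log.Printf("robots.txt fetched by ua=%q", r.UserAgent())
        w.Write([]byte("User-agent: *\nDisallow:\n"))
    }

    func main() {
        http.HandleFunc("/robots.txt", robotsTxt)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }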
