Programmatic Bot Detection

bots

I need to write some code to analyze whether or not a given user on our site is a bot. If it's a bot, we'll take some specific action. Looking at the User Agent is not something that is successful for anything but friendly bots, as you can specify any user agent you want in a bot. I'm after behaviors of unfriendly bots. Various ideas I've had so far are:

If you don't have a browser ID
If you don't have a session ID
Unable to write a cookie

Obviously, there are some cases where a legitimate user will look like a bot, but that's ok. Are there other programmatic ways to detect a bot, or either detect something that looks like a bot?

Best Answer

User agents can be faked. Captchas have been cracked. Valid cookies can be sent back to your server with page requests. Legitimate programs, such as Adobe Acrobat Pro can go in and download your web site in one session. Users can disable JavaScript. Since there is no standard measure of "normal" user behaviour, it cannot be differentiated from a bot.

In other words: it can't be done short of pulling the user into some form of interactive chat and hope they pass the Turing Test, then again, they could be a really good bot too.

1) Simplest

Is the fastest when processing a miss. i.e. traffic from a non-bot – a normal user. Catches 99+% of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);

2) Medium

Is the fastest when processing a hit. i.e. traffic from a bot. Pretty fast for misses too. Catches close to 100% of crawlers. Matches ‘bot’, ‘crawler’, ‘spider’ upfront. You can add to it any other known crawlers.

List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));

3) Paranoid

Is pretty fast, but a little slower than options 1 and 2. It’s the most accurate, and allows you to maintain the lists if you want. You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future. If we get a short match we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;

Notes:

It’s tempting to just keep adding names to the regex option 1. But if you do this it will become slower. If you want a more complete list then linq with lambda is faster.
Make sure .ToLower() is outside of your linq method – remember the method is a loop and you would be modifying the string during each iteration.
Always put the heaviest bots at the start of the list, so they match sooner.
Put the lists into a static class so that they are not rebuilt on every pageview.

Honeypots

The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.

Postives: It will match some unknown crawlers that aren’t declaring themselves.

Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.

Best Answer

Related Solutions

Stopping scripters from slamming your website

C# – Detecting honest web crawlers

1) Simplest

2) Medium

3) Paranoid

Related Topic