Detecting ‘stealth’ web-crawlers

web-crawler

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp.
I consider a bot nice if it:

identifies itself as a bot in the user agent string
reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed updated list (thanks Chris, gs):

Adding a directory only listed (marked as disallow) in the robots.txt,
Adding invisible links (possibly marked as rel="nofollow"?),
- style="display: none;" on link or parent container
- placed underneath another element with higher z-index
detect who doesn't understand CaPiTaLiSaTioN,
detect who tries to post replies but always fail the Captcha.
detect GET requests to POST-only resources
detect interval between requests
detect order of pages requested
detect who (consistently) requests https resources over http
detect who does not request image file (this in combination with a list of user-agents of known image capable browsers works surprisingly nice)

Some traps would be triggered by both 'good' and 'bad' bots.
you could combine those with a whitelist:

It trigger a trap
It request robots.txt?
It doest not trigger another trap because it obeyed robots.txt

One other important thing here is:
Please consider blind people using a screen readers: give people a way to contact you, or solve a (non-image) Captcha to continue browsing.

What methods are there to automatically detect the web crawlers trying to mask themselves as normal human visitors.

Update
The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler.

Some spiders are really good, and actually parse and understand html, xhtml, css javascript, VB script etc…
I have no illusions: I won't be able to beat them.

You would however be surprised how stupid some crawlers are. With the best example of stupidity (in my opinion) being: cast all URLs to lower case before requesting them.

And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.

Best Answer

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically as needed by checking claimed user-agent strings and, if the client claimed to be a legitimate spider but not on the whitelist, it performed DNS/reverse-DNS lookups to verify that the source IP address corresponds to the claimed owner of the bot. As a failsafe, these actions were reported to the admin by email, along with links to black/whitelist the address in case of an incorrect assessment.

I haven't talked to that client in 6 months or so, but, last I heard, the system was performing quite effectively.

Side point: If you're thinking about doing a similar detection system based on hit-rate-limiting, be sure to use at least one-minute (and preferably at least five-minute) totals. I see a lot of people talking about these kinds of schemes who want to block anyone who tops 5-10 hits in a second, which may generate false positives on image-heavy pages (unless images are excluded from the tally) and will generate false positives when someone like me finds an interesting site that he wants to read all of, so he opens up all the links in tabs to load in the background while he reads the first one.

1) Simplest

Is the fastest when processing a miss. i.e. traffic from a non-bot – a normal user. Catches 99+% of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);

2) Medium

Is the fastest when processing a hit. i.e. traffic from a bot. Pretty fast for misses too. Catches close to 100% of crawlers. Matches ‘bot’, ‘crawler’, ‘spider’ upfront. You can add to it any other known crawlers.

List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));

3) Paranoid

Is pretty fast, but a little slower than options 1 and 2. It’s the most accurate, and allows you to maintain the lists if you want. You can maintain a separate list of names with ‘bot’ in them if you are afraid of false positives in future. If we get a short match we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;

Notes:

It’s tempting to just keep adding names to the regex option 1. But if you do this it will become slower. If you want a more complete list then linq with lambda is faster.
Make sure .ToLower() is outside of your linq method – remember the method is a loop and you would be modifying the string during each iteration.
Always put the heaviest bots at the start of the list, so they match sooner.
Put the lists into a static class so that they are not rebuilt on every pageview.

Honeypots

The only real alternative to this is to create a ‘honeypot’ link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.

Postives: It will match some unknown crawlers that aren’t declaring themselves.

Negatives: Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.

What techniques can be used to detect so called “black holes” (a spider trap) when creating a web crawler

Well, you've asked very challenging question. There are many issues:

First, do you think someone would do something like that to prevent web spidering? A web spider could act as DoS attack if it would got stuck in such structure.

Secondly, if page is made for users, how would they react to large number of senseless links linking to random generated 'trash sites'? This links should be invisible for user, either a few of them or they would be hidden somehow - you should then check, if links have display: none, 1 px font etc.

Third, how google would behave? Well, google does not index everything it can. It adds links to queue, but not follows them immediately. He does not like to follow deeply referenced links, that are not linked from pages previously indexed. It makes him not index everything, but index what users are most likely to visit is finally visited. Otherwise such pages as you describe will be extremally often used by SEO spammers ;)

I would build priority queue. Each link to each URL adds 1 point priority (more, when from main page). Pages with priority 1 are at the end list. I would limit visited pages count, so at worst case I would visity most important pages. I would be suspitious againt pages that contains too much links with too little content. In short words, simulate google behaviour as much as it is needed.

Best Answer

Related Solutions

C# – Detecting honest web crawlers

1) Simplest

2) Medium

3) Paranoid

What techniques can be used to detect so called “black holes” (a spider trap) when creating a web crawler

Related Topic