C# – Prevent Custom Web Crawler from being blocked

c, google-crawlers, web-crawler

I am creating a new web crawler using C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried adding delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp
  • using multiple IP addresses (even fake IP addresses) as the crawler client IP

Any solution would help.

Best Answer

If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through them. Your crawler will then have a randomly changing IP address.
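A sketch of what the routing side looks like in C#, assuming a default local install where Privoxy listens on port 8118 and is chained to Tor (the port number and setup are assumptions about a stock configuration, not part of the original answer):

```csharp
using System;
using System.Net;
using System.Net.Http;

class TorProxyExample
{
    static void Main()
    {
        // Assumes Privoxy is running locally on its default port (8118)
        // and is configured to forward traffic to Tor.
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy("http://127.0.0.1:8118"),
            UseProxy = true
        };
        var client = new HttpClient(handler);

        // Requests made with this client exit through Tor, so the IP
        // the target site sees changes as Tor builds new circuits:
        // var html = client.GetStringAsync("http://example.com").Result;

        Console.WriteLine(handler.Proxy.GetProxy(new Uri("http://example.com")));
    }
}
```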

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
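Rate-limiting can be as simple as sleeping between requests; a randomized delay looks less mechanical than a fixed one. The 2-5 second range below is an arbitrary choice for illustration, not a recommendation from the answer.

```csharp
using System;
using System.Threading;

class RateLimitExample
{
    static readonly Random Rng = new Random();

    // Sleep a random 2-5 seconds between requests so the traffic
    // pattern is less regular. Tune the range to the target site.
    static void PoliteDelay()
    {
        int delayMs = Rng.Next(2000, 5000);
        Thread.Sleep(delayMs);
    }

    static void Main()
    {
        foreach (var url in new[] { "http://example.com/a", "http://example.com/b" })
        {
            // ...fetch url here with your crawler's HTTP client...
            Console.WriteLine("Fetched " + url);
            PoliteDelay();
        }
    }
}
```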
