C# – Prevent Custom Web Crawler from being blocked

c, google-crawlers, web-crawler

I am creating a new web crawler using C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried adding delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp
  • using multiple IP addresses (even fake IP addresses) as the crawler client IP

Any solution would help.

Best Answer

If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through them. Your crawler will then have a randomly changing IP address.
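A sketch of what the routing side looks like in C#, assuming a default local install where Privoxy listens on port 8118 and is chained to Tor (the port number and setup are assumptions about a stock configuration, not part of the original answer):

```csharp
using System;
using System.Net;
using System.Net.Http;

class TorProxyExample
{
    static void Main()
    {
        // Assumes Privoxy is running locally on its default port (8118)
        // and is configured to forward traffic to Tor.
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy("http://127.0.0.1:8118"),
            UseProxy = true
        };
        var client = new HttpClient(handler);

        // Requests made with this client exit through Tor, so the IP
        // the target site sees changes as Tor builds new circuits:
        // var html = client.GetStringAsync("http://example.com").Result;

        Console.WriteLine(handler.Proxy.GetProxy(new Uri("http://example.com")));
    }
}
```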

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
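Rate-limiting can be as simple as sleeping between requests; a randomized delay looks less mechanical than a fixed one. The 2-5 second range below is an arbitrary choice for illustration, not a recommendation from the answer.

```csharp
using System;
using System.Threading;

class RateLimitExample
{
    static readonly Random Rng = new Random();

    // Sleep a random 2-5 seconds between requests so the traffic
    // pattern is less regular. Tune the range to the target site.
    static void PoliteDelay()
    {
        int delayMs = Rng.Next(2000, 5000);
        Thread.Sleep(delayMs);
    }

    static void Main()
    {
        foreach (var url in new[] { "http://example.com/a", "http://example.com/b" })
        {
            // ...fetch url here with your crawler's HTTP client...
            Console.WriteLine("Fetched " + url);
            PoliteDelay();
        }
    }
}
```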
