JavaScript – How to create a web crawler with Node.js?

javascript, node.js, web-crawler

I recently got interested in how search engines work and found out that they use "bots" or "web crawlers". I immediately started wondering how these things work, and I wanted to create one! So, first off: how do you make a program that requests a page from a server? It would be awesome if you gave me a simple example in JavaScript (I'm running it as a normal scripting language using Node). Next, is there a Node module that lets me interpret HTML, that is, one that creates a DOM for me so I can cycle through all the links and so on? Correct me if I'm wrong, but I guess that's how it's done… Any examples in C++, C, or Python are warmly welcome as well, although I'd prefer JS or Python because I'm more familiar with high-level scripting languages.

Best Answer

  • Getting HTTP pages: Node's built-in http.get (its documentation includes an example)
  • DOM documents: jsdom (its documentation also includes examples)
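
To make the two steps above concrete, here is a minimal sketch of how they might fit together using only Node's built-in modules. The function names (`fetchPage`, `extractLinks`, `crawl`) and the `maxPages` limit are hypothetical, chosen for illustration. The link extraction uses a naive regular expression as a stand-in for a real DOM parser; with jsdom installed you would more likely build a document and query it for `a[href]` elements instead. The sketch also does not follow redirects or honor robots.txt, which a real crawler should.

```javascript
const http = require('http');
const https = require('https');

// Fetch a page and resolve with its body as a string.
// Note: http.get does not follow redirects, so any non-200 response is rejected.
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    const client = url.startsWith('https:') ? https : http;
    client.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error(`Request failed with status ${res.statusCode}`));
        return;
      }
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Naive link extraction (assumption: a regex stand-in for a DOM parser such as jsdom).
// Resolves relative links against baseUrl and keeps only http(s) URLs.
function extractLinks(html, baseUrl) {
  const links = new Set();
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const resolved = new URL(match[1], baseUrl);
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        resolved.hash = ''; // treat page#a and page#b as the same page
        links.add(resolved.href);
      }
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return [...links];
}

// Breadth-first crawl starting at startUrl, visiting at most maxPages pages.
async function crawl(startUrl, maxPages = 10) {
  const visited = new Set();
  const queue = [startUrl];
  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    try {
      const html = await fetchPage(url);
      for (const link of extractLinks(html, url)) {
        if (!visited.has(link)) queue.push(link);
      }
    } catch (err) {
      console.error(`Skipping ${url}: ${err.message}`);
    }
  }
  return [...visited];
}
```

Calling `crawl('http://example.com/', 5).then(console.log)` would print the URLs that were visited. The visited set is what keeps the crawler from looping forever on pages that link to each other, and the page limit keeps a first experiment from wandering across the whole web.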