Sure, we all know what a web scraper is, and how questionable its legitimacy can be, but for personal use you can always code one! I was required to make a simple app using Django (Python) to scrape content from this blog, find the number of posts written by each author in a given month, tabulate it, and send a mail to everyone with the general information.
Rather than scraping the content of the whole blog, I thought: why not simply scrape the RSS feeds of this WordPress-based blog? And since I’ve been learning Node.js, why use Python when you can try this in Node.js? It turned out to be a very good experience for me, learning more about events in Node.js.
Coming back to the problem, how do we approach this? Isn’t it simple: make an HTTP request to the RSS feed link (maybe cURL it), then process the data, use some heavy regex to find the relevant data for the given month, and fire off an email?

I knew there is a module by the name of “request” to make HTTP requests, and later I found out about another module by the name of “cheerio” which does all the parsing part, so scraping off 10 pages of the RSS feed would be as simple as this code:

var request = require("request");
var cheerio = require("cheerio");

var page = 1;
while (page <= 10) {
    var url = "http://jellyfishtechnologies.com/feed/?paged=" + page;
    request({uri: url}, function (error, response, body) {
        // load the RSS (XML) body so we can query its tags like a DOM
        var blog = cheerio.load(body, {xmlMode: true});
        var date = blog("pubDate").text();
        var author = blog("dc\\:creator").text();
        console.log(date + "--------------" + author);
        console.log("\n");
    });
    page += 1;
}

I made an HTTP request to the given feed URL using request({uri: urlName}, callback). Our callback function has the signature function(error, response, body), where body is the result obtained from the RSS feed.

We then use the cheerio module to load the content of the body and use its API to find the content within the <pubDate> and <dc:creator> tags. I’m still left with the job of processing the strings, tabulating, and sending the mail (not much of a task though).
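
Just to sketch that remaining step (this is only an outline of one way it could be done, reusing the blog object loaded above; the month filter and the nodemailer mention are my own assumptions, not part of the code I actually wrote), the per-author counts can be built by walking over each <item> in the feed:

var counts = {};
blog("item").each(function () {
    var item = blog(this);
    var author = item.find("dc\\:creator").text();
    var pubDate = new Date(item.find("pubDate").text());
    // keep only posts from the month we care about, e.g. January (month index 0)
    if (pubDate.getMonth() === 0) {
        counts[author] = (counts[author] || 0) + 1;
    }
});
// counts now looks like { "Author A": 3, "Author B": 1 } and can be
// formatted into a table and mailed out (e.g. with the nodemailer module).
console.log(counts);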

request() makes a call to the given URL. Now, web scraping can be a very slow process, given the amount of data returned from the URL, and if we write our code without keeping this asynchronous behaviour in mind we may get garbage results. My initial code was something like this:

var url = "http://jellyfishtechnologies.com/feed/?paged=" + page;
console.log("Making request to " + url);
request({uri: url}, function (error, response, body) {});

Now what happened here is that I got 10 lines of output “Making request to …” and only then my scraped results, while I expected every scraped result to be preceded by its URL. Where did I make a mistake? It’s pretty obvious now: since the console.log() statement is inside the while loop, every iteration prints to the console and also fires a request to the URL. However, a request is slow and takes its own time to complete, so 10 request events are initiated to different URLs within moments, and depending on the content present at each URL, any of them can finish before the others! Wow.

So the output you get doesn’t follow any serial order; it simply depends on when an event gets completed. This is why callback functions are absolutely necessary: we should know how to handle the data whenever it arrives.
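
To see this ordering issue in isolation, here is a made-up illustration (not part of the scraper) that uses setTimeout to stand in for a slow network request:

for (var i = 1; i <= 3; i++) {
    console.log("starting request " + i);
    // a random delay plays the role of a request of unpredictable duration
    setTimeout(function () {
        console.log("finished a request");
    }, Math.random() * 1000);
}
// all three "starting request" lines print first, and the "finished" lines
// arrive later, in whatever order the delays happen to expire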

Callbacks act as event handlers: everything defined within them gets executed only when the event is completed. So the better approach to the above code would be:

var url = "http://jellyfishtechnologies.com/feed/?paged=" + page;
request({uri: url}, function (error, response, body) {
    console.log("Making request to " + url);
});
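
One extra caveat worth flagging (my own observation, not something the original snippet deals with): because url is declared with var inside the loop, all ten callbacks share the same variable, so by the time a callback runs, url may already hold the last page’s address. One way around it is to wrap each request in its own function so every callback keeps the URL it was started with:

function fetchPage(page) {
    var url = "http://jellyfishtechnologies.com/feed/?paged=" + page;
    request({uri: url}, function (error, response, body) {
        // this url is the one this particular request was fired at
        console.log("Making request to " + url);
    });
}

for (var page = 1; page <= 10; page++) {
    fetchPage(page);
}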