Using Node.js and jQuery to scrape websites

I have been playing with Node.js for the last few days and am totally head over heels.

Madly in love with it! It’s awesome to see how much you can build with so little. I have ranted about Node.js earlier and did some comparisons too. It’s fast, really fast. And it’s the plain old Javascript we have been using for many, many years now. I thought I would build a real-world application with it to see how well it holds water. At first I thought of making something on top of Riak, but that felt like running too fast. Instead I picked something simpler that deals only with Node.js. But first, I think it makes sense to brush up on some Javascript fundamentals.


Javascript objects

Yes, Javascript is an object-oriented language. But it’s different from traditional classical OO languages like Java and Ruby:
  1. One obvious difference is in syntax.
  2. The other major one is that other languages have methods, while Javascript has first-class functions.
First-class functions. What does that mean? It means that functions are values: they can be assigned to a variable and easily passed around. Does that sound like a closure in Ruby? It does indeed. Well, it’s a little more than that, though. I will come back to this some other time. For now, let’s find out how we can create objects and use them. I will show you two ways to do it.
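To make the idea of first-class functions concrete, here is a small sketch. The names square, applyTwice and makeCounter are made up for illustration.

```javascript
// A function is a value: it can be stored in a variable...
var square = function (n) {
  return n * n;
};

// ...passed to another function as an argument...
function applyTwice(fn, value) {
  return fn(fn(value));
}

// ...and returned from a function, keeping access to the
// variables of the scope it was created in (a closure).
function makeCounter() {
  var count = 0;
  return function () {
    count += 1;
    return count;
  };
}

var counter = makeCounter();
console.log(applyTwice(square, 3)); // 81
console.log(counter());             // 1
console.log(counter());             // 2
```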

The Classical way

Here is a constructor function for object Shape. It accepts two parameters and saves them into respective instance variables.

function Shape(width, height) {
 this.width = width;        // instance variable width
 this.height = height;      // instance variable height
 this.getArea = function() {     // function to calculate Area, notice the assignment.
  return this.width * this.height;
 };
}

var rectangle = new Shape (2, 5);    // instantiate a new Shape object
console.log (rectangle.getArea());   // calculate the area: 10

Javascript looks up properties through prototype chains, which lets you add new functions or variables to existing objects on the fly. You can read more about it here: http://www.packtpub.com/article/using-prototype-property-in-javascript
I will add a new function to calculate the perimeter of my Shape object.

Shape.prototype.getPerimeter = function() {
 return 2 * (this.width + this.height);
}

console.log (rectangle.getPerimeter());   // calculate the perimeter: 14
What happened here? Did you notice that even though ‘rectangle’ was already created, it could access the newly added function to calculate the perimeter? Wasn’t that awesome? Javascript is intelligent, dude. If you ask for something, it looks in the current object, and if it’s not found there, it goes up the object’s prototype chain to look for what you asked for. And since we added the new function to the prototype, it’s found without any fuss. There is a lot of interesting stuff going on here; you should read up on it. I would suggest buying Manning’s Secrets of the JavaScript Ninja if you are really serious about it.
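If you want to see the lookup for yourself, the following sketch may help. It adds a made-up method, isSquare, after the object has been created, and then checks where each property actually lives.

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
}

var rectangle = new Shape(2, 5);

// Hypothetical method, added after 'rectangle' was created.
Shape.prototype.isSquare = function () {
  return this.width === this.height;
};

console.log(rectangle.hasOwnProperty('width'));    // true: stored on the object itself
console.log(rectangle.hasOwnProperty('isSquare')); // false: not on the object...
console.log('isSquare' in rectangle);              // true: ...but reachable via the chain
console.log(Object.getPrototypeOf(rectangle) === Shape.prototype); // true
console.log(rectangle.isSquare());                 // false
```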
Now, let’s try to extend Shape. I will create a new constructor function for Square.

function Square(side){
 this.width = side;
 this.height = side;
}

Square.prototype = new Shape();

var sq = new Square(4);
console.log(sq.getArea());
I created a new Square constructor and replaced its prototype with a Shape instance, so every Square gets all the functionality and behavior of Shape. Easy… huh?
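One thing to be aware of: new Shape() runs the Shape constructor with no arguments, so the prototype object itself ends up carrying width and height set to undefined. A common alternative, sketched below assuming an ES5 environment such as Node.js, is to link the prototypes with Object.create and call the parent constructor explicitly. Note that in this sketch getArea lives on Shape.prototype rather than being assigned inside the constructor.

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
}
Shape.prototype.getArea = function () {
  return this.width * this.height;
};

function Square(side) {
  Shape.call(this, side, side); // reuse Shape's initialisation for this object
}
// Link the chains without running the Shape constructor.
Square.prototype = Object.create(Shape.prototype);
Square.prototype.constructor = Square;

var sq = new Square(4);
console.log(sq.getArea());         // 16
console.log(sq instanceof Shape);  // true
console.log(sq instanceof Square); // true
```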

The Prototypal way

Let’s do the same thing without using constructors now. Just plain prototypes!


var Shape = {
 getArea: function () {
  return this.width * this.height;
 },
 getPerimeter: function() {
  return 2 * (this.width + this.height);
 }
};

var rec = Object.create(Shape);
rec.width = 2;
rec.height = 5;
console.log(rec.getArea());
Now that you have the Shape object, you can easily add new functions to it, or have another object inherit from it. However, I find this approach a little clumsy. I would rather stick to the classical way. Take your pick. To each his own!
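For completeness, here is how inheriting from the Shape object might look in the prototypal style. The Square object and its init function are hypothetical names chosen for this sketch.

```javascript
var Shape = {
  getArea: function () {
    return this.width * this.height;
  }
};

// Square delegates to Shape and adds an initialiser of its own.
var Square = Object.create(Shape);
Square.init = function (side) {
  this.width = side;
  this.height = side;
  return this; // allows chaining right after Object.create
};

var sq = Object.create(Square).init(4);
console.log(sq.getArea());            // 16
console.log(Shape.isPrototypeOf(sq)); // true
```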

Node.js Modules

Node uses the CommonJS module system: a simple module loading scheme in which files and modules are in one-to-one correspondence. Here is the API: http://nodejs.org/api.html. The example above can be ported to a Node.js module as explained below.
First, create Shape.js:

function Shape(width, height) {
 this.width = width;        // instance variable width
 this.height = height;      // instance variable height
 this.getArea = function() {     // function to calculate Area, notice the assignment.
  return this.width * this.height;
 };
}

// Export this module
module.exports = Shape;
And now, use it from another file:
var Shape = require('./Shape');
var rectangle = new Shape (2, 5);
console.log (rectangle.getArea());
Node.js loads and runs each module in its own scope, which staves off any possible name collision. That’s the benefit you get, apart from having a properly structured code base.
 

Writing a screen scraping application

I will write a simple application to capture details from various websites. The beautiful thing is that Javascript has been handling DOM objects for years. In fact, Javascript was created to handle DOM objects, so it is no wonder that its DOM tooling is so mature. Also, given that there are many elegant frameworks like Prototype, MooTools, jQuery etc. available, scraping websites with Node.js should be easy and fun. Let’s do it. Let’s write an application to collect data from various book-selling websites.
Create a basic searcher.js module. It provides the skeleton for writing website-specific tools.
// External Modules
var request = require('ahr'), // Abstract-HTTP-request https://github.com/coolaj86/abstract-http-request
sys = require('sys'),  // System
events = require('events'), // EventEmitter
jsdom = require('jsdom'); // JsDom https://github.com/tmpvar/jsdom

var jQueryPath = 'http://code.jquery.com/jquery-1.4.2.min.js';
var headers = {'content-type':'application/json', 'accept': 'application/json'};

// Export searcher
module.exports = Searcher;

function Searcher(param) {
 if (param.headers) {
  this.headers = param.headers;
 } else {
  this.headers = headers;
 }

 this.merchantName = param.merchantName;
 this.merchantUrl = param.merchantUrl;
 this.id = param.merchantUrl;
}

// Inherit from EventEmitter (the 'events' module is required above)
Searcher.prototype = new events.EventEmitter();

Searcher.prototype.search = function(query, collector) {
 var self = this;
 var url = self.getSearchUrl(query);

 console.log('Connecting to... ' + url);

 request({uri: url, method: 'GET', headers: self.headers, timeout: 10000}, function(err, response, html) {
  if (err) {
   self.onError({error: err, searcher: self});
   self.onComplete({searcher: self});
  } else {
   console.log('Fetched content from... ' + url);
   // create DOM window from HTML data
   var window = jsdom.jsdom(html).createWindow();
   // load jquery with DOM window and call the parser!
    jsdom.jQueryify(window, jQueryPath, function() {
    self.parseHTML(window);
    self.onComplete({searcher: self});
   });
  }
 });
}

// Implemented in inherited class
Searcher.prototype.getSearchUrl = function(query) {
 throw "getSearchUrl() is unimplemented!";
}
// Implemented in inherited class
Searcher.prototype.parseHTML = function(window) {
 throw "parseHTML() is unimplemented!";
}
// Emits 'item' events when an item is found.
Searcher.prototype.onItem = function(item) {
 this.emit('item', item);
}
// Emits 'complete' event when searcher is done
Searcher.prototype.onComplete = function(searcher) {
 this.emit('complete', searcher);
}
// Emit 'error' events
Searcher.prototype.onError = function(error) {
 this.emit('error', error);
}

Searcher.prototype.toString = function() {
 return this.merchantName + "(" + this.merchantUrl + ")";
}
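The event plumbing is the heart of this skeleton, so here is a network-free sketch of just that part. It uses util.inherits, which is one common way to inherit from EventEmitter in Node. FakeSearcher and its canned items are invented for illustration and are not part of the application.

```javascript
var events = require('events');
var util = require('util');

// Hypothetical stand-in for Searcher: it emits canned items instead of scraping.
function FakeSearcher(name) {
  events.EventEmitter.call(this);
  this.merchantName = name;
}
util.inherits(FakeSearcher, events.EventEmitter);

FakeSearcher.prototype.search = function (query) {
  this.emit('item', { title: query + ' (first hit)' });
  this.emit('item', { title: query + ' (second hit)' });
  this.emit('complete', { searcher: this });
};

var found = [];
var done = false;
var fake = new FakeSearcher('Fake Books');

// Listeners are called synchronously, in order, each time emit() runs.
fake.on('item', function (item) { found.push(item.title); });
fake.on('complete', function () { done = true; });

fake.search('Salman');
console.log(found.length); // 2
console.log(done);         // true
```

One caveat worth knowing: if an 'error' event is emitted and nobody is listening for it, EventEmitter throws. So anything that uses a searcher should register an 'error' listener as well.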
Now, the code to scrape Rediff Books. I will name it searcher-rediff.js:
var Searcher = require('./searcher');

var searcher = new Searcher({
 merchantName: 'Rediff Books',
 merchantUrl: 'http://books.rediff.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
 return this.merchantUrl + "/book/" + query;
}

searcher.parseHTML = function(window) {
 var self = this;

 window.$('div[id="prod_detail"]').each(function(){
  var item  = window.$(this);

  var title = item.find('#prod_detail2').find('font[id="book-titl"]').text();
  var link = item.find('#prod_detail2').find('a').attr('href');
  var author = item.find('#prod_detail2').find('font[id="book-auth"]').text();
  var price = item.find('#prod_detail2').find('font[id="book-pric"]').text();

  self.onItem({
   title: title,
   link: link,
   author: author,
   price: price
  });
 });
}
Run it now.

var searcher = require('./searcher-rediff');

searcher.on('item', function(item){
 console.log('Item found >> ' + JSON.stringify(item)); // stringify, otherwise it prints [object Object]
});

searcher.on('complete', function(searcher){
 console.log('searcher done!');
});

searcher.search("Salman");

 

What did I do?

  1. First, I wrote a skeleton searcher class. This class
    1. makes the request to the merchant’s search URL (this URL is built in the getSearchUrl function),
    2. fetches the HTML data from there,
    3. uses the ‘jsdom’ module to create a DOM window object from that HTML,
    4. loads ‘jquery’ into that window, and
    5. executes the parseHTML function.
  2. Second, I created a searcher instance for Rediff that overrides two functions of the skeleton:
    1. getSearchUrl, to return the appropriate search URL to connect to, and
    2. parseHTML, to scrape data from the DOM window object. This is the interesting part. You can use all your jquery knowledge to pick elements and parse data from inside them, just like in the old days when you added styles or data to random elements.
Now, if I want to search, say, Flipkart along with Rediff, I just need to write a Flipkart-specific implementation, say searcher-flipkart.js:

var Searcher = require('./searcher');

var searcher = new Searcher({
 merchantName: 'Flipkart',
 merchantUrl: 'http://www.flipkart.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
 return this.merchantUrl + "/search-book" + '?query=' + query;
}

searcher.parseHTML = function(window) {
 var self = this;

 window.$('.search_result_item').each(function(){
  var item  = window.$(this);

  var title = item.find('.search_result_title').text().trim().replace(/\n/g, "");
  var link = self.merchantUrl + item.find('.search_result_title').find("a").attr('href');
  var price = item.find('.search_results_list_price').text().trim().replace(/\n/g, "");

  self.onItem({
   title: title,
   link: link,
   price: price
  });
 });
}
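
Both getSearchUrl implementations paste the query straight into the URL. That is fine for a single word like "Salman", but it will misbehave for queries containing spaces or characters such as &. A small sketch of a safer variant, using the standard encodeURIComponent function, follows. buildSearchUrl is a hypothetical helper, not part of the original code.

```javascript
// Hypothetical helper: escape the query before appending it to the URL.
function buildSearchUrl(merchantUrl, query) {
  return merchantUrl + '/search-book?query=' + encodeURIComponent(query);
}

console.log(buildSearchUrl('http://www.flipkart.com', 'Salman Rushdie & co'));
// http://www.flipkart.com/search-book?query=Salman%20Rushdie%20%26%20co
```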

I have also written a Runner class to execute multiple searchers in parallel and collect the results into an array. You can find the entire source code here: https://github.com/anismiles/jsdom-based-screen-scraper Chill!
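If you would rather not dig through the repository, the general idea of such a runner can be sketched as follows. This is not the Runner from the repository, just a guess at its overall shape. runAll and the stub searchers are invented for this example, and it relies on every searcher emitting 'complete' exactly once, as the skeleton above does even after an error.

```javascript
var events = require('events');

// Start every searcher, gather 'item' events, and call back once all have completed.
function runAll(searchers, query, callback) {
  var results = [];
  var pending = searchers.length;
  if (pending === 0) { return callback(results); }

  searchers.forEach(function (searcher) {
    searcher.on('item', function (item) { results.push(item); });
    searcher.on('error', function (err) { console.error('Search failed:', err); });
    searcher.on('complete', function () {
      pending -= 1;
      if (pending === 0) { callback(results); }
    });
    searcher.search(query);
  });
}

// Hypothetical stub that behaves like a searcher without touching the network.
function makeStub(name) {
  var stub = new events.EventEmitter();
  stub.search = function (query) {
    stub.emit('item', { merchant: name, title: query });
    stub.emit('complete', { searcher: stub });
  };
  return stub;
}

var collected = null;
runAll([makeStub('A'), makeStub('B')], 'Salman', function (results) {
  collected = results;
});
console.log(collected.length); // 2
```

The stubs here finish synchronously, one after another. With real searchers the HTTP requests are asynchronous, so they are all in flight at the same time and the callback fires when the last one completes.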
What’s next? I am going to write about Node.js pretty feverishly. Stay tuned. How about a blog engine on Riak?
