Madly in love with this book! It’s awesome to see how much you can build with so little. I have ranted about Node.js earlier and did some comparisons too. It’s fast, really fast. And it’s the plain old Javascript we have been using for many, many years now. I thought I would build a real-world application with it to see how well it holds water. Earlier I thought of building something on top of Riak, but that felt like running too fast. Instead I picked something simpler that deals only with Node.js. Now, I think it would make sense to brush up on some Javascript fundamentals.
Javascript objects
Yes, Javascript is an object-oriented language. But it’s different from your traditional classical OO languages like Java and Ruby.
- One obvious difference is in syntax.
- The other major one is that other languages have methods, while Javascript has first-class functions.
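To make "first-class functions" concrete, here is a minimal sketch. The names doubleIt, applyTwice and makeMultiplier are made up for illustration; the point is only that a function can be stored, passed around and returned like any other value.

```javascript
// A function is just a value: it can be stored in a variable...
var doubleIt = function (n) { return n * 2; };

// ...passed to another function as an argument...
function applyTwice(fn, value) {
  return fn(fn(value));
}

// ...and returned from a function (makeMultiplier is a hypothetical helper).
function makeMultiplier(factor) {
  return function (n) { return n * factor; };
}

var triple = makeMultiplier(3);

console.log(applyTwice(doubleIt, 5)); // 20
console.log(triple(7));               // 21
```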
The Classical way
Here is a constructor function for the object Shape. It accepts two parameters and saves them into the respective instance variables.

    function Shape(width, height) {
        this.width = width;              // instance variable width
        this.height = height;            // instance variable height
        this.getArea = function() {      // function to calculate area, notice the assignment
            return this.width * this.height;
        };
    }

    var rectangle = new Shape(2, 5);     // instantiate a new Shape object
    console.log(rectangle.getArea());    // calculate the area: 10
Javascript uses prototype chains to add new functions or variables to an object on the fly. You can read more about it here: http://www.packtpub.com/article/using-prototype-property-in-javascript
I will add a new function to calculate the perimeter of my Shape object.
    Shape.prototype.getPerimeter = function() {
        return 2 * (this.width + this.height);
    };

    console.log(rectangle.getPerimeter());

What happened here? Did you notice that even though ‘rectangle’ was already defined, it could access the newly added function to calculate the perimeter? Wasn’t that awesome? Javascript is intelligent, dude. If you ask for something, it looks in the current object, and if it is not found there, it goes up the object’s prototype chain to look for what you asked for. And since we added the new function to the prototype, it’s found without any fuss. There is a lot of interesting stuff going on here, and you must read about it. I would suggest buying Manning’s Secrets of the JavaScript Ninja if you are really serious about it.
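The lookup described above can be observed directly. This is a small sketch using the standard hasOwnProperty and Object.getPrototypeOf calls; the Shape constructor is trimmed down so the snippet stands alone.

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
}

var rectangle = new Shape(2, 5);

// Added after rectangle was created, yet still reachable through the chain.
Shape.prototype.getPerimeter = function () {
  return 2 * (this.width + this.height);
};

// 'width' lives on the object itself...
console.log(rectangle.hasOwnProperty('width'));        // true
// ...while getPerimeter is found one step up, on Shape.prototype.
console.log(rectangle.hasOwnProperty('getPerimeter')); // false
console.log(Object.getPrototypeOf(rectangle) === Shape.prototype); // true
console.log(rectangle.getPerimeter());                 // 14
```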
Now, let’s try to extend Shape. I will create a new constructor function for Square.
    function Square(side) {
        this.width = side;
        this.height = side;
    }

    Square.prototype = new Shape();

    var sq = new Square(4);
    console.log(sq.getArea());    // 16

I created a new Square class and replaced its prototype with an instance of Shape. Square got all the functionality and behavior of Shape. Easy… huh?
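One thing to be aware of: `new Shape()` runs the Shape constructor with no arguments just to build the prototype. A commonly used alternative, shown here only as a sketch and not as what this post's code does, is to link the prototypes with Object.create and call the parent constructor explicitly. In this variant getArea is placed on Shape.prototype, which is my own rearrangement.

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
}
Shape.prototype.getArea = function () {
  return this.width * this.height;
};

function Square(side) {
  Shape.call(this, side, side);   // reuse Shape's initialisation
}
// Link the chains without invoking the Shape constructor.
Square.prototype = Object.create(Shape.prototype);
Square.prototype.constructor = Square;

var sq = new Square(4);
console.log(sq.getArea());         // 16
console.log(sq instanceof Shape);  // true
console.log(sq instanceof Square); // true
```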
The Prototypal way
Let’s do the same thing without using constructors now. Just plain prototypes!
    var Shape = {
        getArea: function() {
            return this.width * this.height;
        },
        getPerimeter: function() {
            return 2 * (this.width + this.height);
        }
    };

    var rec = Object.create(Shape);
    rec.width = 2;
    rec.height = 5;

    console.log(rec.getArea());    // 10

Now that you have the Shape object, you can easily add new functions to its prototype chain, or even have another object inherit from it. However, I find this approach a little clumsy. I would rather stick to the classical way. Take your pick. To each his own!
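Inheriting from Shape the prototypal way could look something like this. It is a sketch: the init function is a hypothetical helper of mine, not part of the example above.

```javascript
var Shape = {
  getArea: function () { return this.width * this.height; }
};

// Square delegates to Shape and adds an initialiser of its own.
var Square = Object.create(Shape);
Square.init = function (side) {   // init is a hypothetical helper name
  this.width = side;
  this.height = side;
  return this;
};

var sq = Object.create(Square).init(4);
console.log(sq.getArea());            // 16
console.log(Shape.isPrototypeOf(sq)); // true
```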
Node.js Modules
Node uses the CommonJS module system. It has a simple module loading system where files and modules are in one-to-one correspondence. Here is the API: http://nodejs.org/api.html. The example above can be ported to the Node.js module ecosystem as explained below.
First, create Shape.js
    function Shape(width, height) {
        this.width = width;              // instance variable width
        this.height = height;            // instance variable height
        this.getArea = function() {      // function to calculate area, notice the assignment
            return this.width * this.height;
        };
    }

    // Export this module. It must be module.exports; exports.module would
    // only add a property named 'module' to the exports object.
    module.exports = Shape;

And now, use it:

    var Shape = require('./Shape');

    var rectangle = new Shape(2, 5);
    console.log(rectangle.getArea());    // 10
Node.js loads and runs each module in its own scope, which staves off any possible name collision between files. That’s the benefit you get, apart from having a properly structured code base.
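The idea behind that isolation can be sketched with a toy loader. This is not Node's real implementation, and loadModule and counterSource are names I made up. It only illustrates that a module body runs inside its own function, so its variables stay local and only what is assigned to module.exports gets out.

```javascript
// A toy loader: NOT Node's real implementation, just the idea behind it.
function loadModule(source) {
  var module = { exports: {} };
  // The module body runs inside its own function, so its `var`s stay local.
  var wrapper = new Function('exports', 'module', source);
  wrapper(module.exports, module);
  return module.exports;
}

var counterSource =
  'var count = 0;' +                                    // private to the module
  'module.exports = function () { return ++count; };';

var counterA = loadModule(counterSource);
var counterB = loadModule(counterSource);

console.log(counterA());   // 1
console.log(counterA());   // 2
console.log(counterB());   // 1, because it has its own separate count
console.log(typeof count); // 'undefined', nothing leaked out
```

Unlike this toy, Node also caches a module after the first require, so requiring the same file twice gives you the same exports object.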
Writing a screen scraping application
I will write a simple application to capture details from various websites. The beautiful thing is that Javascript has been handling DOM objects for years. It grew up in the browser working with the DOM, so it is no wonder that its tooling for picking apart HTML is so mature. Also, given that there are many elegant frameworks like Prototype, MooTools, jQuery etc. available to use, scraping websites with Node.js should be easy and fun. Let’s do it. Let’s write an application to collect data from various book selling websites.
Create a basic searcher.js module. It provides the fundamental skeleton for writing website-specific tools.
    // External modules
    var request = require('ahr'),      // Abstract-HTTP-request https://github.com/coolaj86/abstract-http-request
        events = require('events'),    // EventEmitter
        jsdom = require('jsdom');      // JsDom https://github.com/tmpvar/jsdom

    var jQueryPath = 'http://code.jquery.com/jquery-1.4.2.min.js';
    var headers = {'content-type': 'application/json', 'accept': 'application/json'};

    // Export searcher
    module.exports = Searcher;

    function Searcher(param) {
        if (param.headers) {
            this.headers = param.headers;
        } else {
            this.headers = headers;
        }

        this.merchantName = param.merchantName;
        this.merchantUrl = param.merchantUrl;
        this.id = param.merchantUrl;
    }

    // Inherit from EventEmitter
    Searcher.prototype = new events.EventEmitter();

    Searcher.prototype.search = function(query, collector) {
        var self = this;
        var url = self.getSearchUrl(query);

        console.log('Connecting to... ' + url);

        request({uri: url, method: 'GET', headers: self.headers, timeout: 10000}, function(err, response, html) {
            if (err) {
                self.onError({error: err, searcher: self});
                self.onComplete({searcher: self});
            } else {
                console.log('Fetched content from... ' + url);
                // create DOM window from HTML data
                var window = jsdom.jsdom(html).createWindow();
                // load jquery with DOM window and call the parser!
                jsdom.jQueryify(window, jQueryPath, function() {
                    self.parseHTML(window);
                    self.onComplete({searcher: self});
                });
            }
        });
    };

    // Implemented in inherited class
    Searcher.prototype.getSearchUrl = function(query) {
        throw new Error("getSearchUrl() is unimplemented!");
    };

    // Implemented in inherited class
    Searcher.prototype.parseHTML = function(window) {
        throw new Error("parseHTML() is unimplemented!");
    };

    // Emits 'item' events when an item is found
    Searcher.prototype.onItem = function(item) {
        this.emit('item', item);
    };

    // Emits 'complete' event when searcher is done
    Searcher.prototype.onComplete = function(searcher) {
        this.emit('complete', searcher);
    };

    // Emits 'error' events
    Searcher.prototype.onError = function(error) {
        this.emit('error', error);
    };

    Searcher.prototype.toString = function() {
        return this.merchantName + "(" + this.merchantUrl + ")";
    };

Now, the code to scrape Rediff Books. I will name it searcher-rediff.js
    var Searcher = require('./searcher');

    var searcher = new Searcher({
        merchantName: 'Rediff Books',
        merchantUrl: 'http://books.rediff.com'
    });

    module.exports = searcher;

    searcher.getSearchUrl = function(query) {
        return this.merchantUrl + "/book/" + query;
    };

    searcher.parseHTML = function(window) {
        var self = this;

        window.$('div[id="prod_detail"]').each(function() {
            var item = window.$(this);

            var title = item.find('#prod_detail2').find('font[id="book-titl"]').text();
            var link = item.find('#prod_detail2').find('a').attr('href');
            var author = item.find('#prod_detail2').find('font[id="book-auth"]').text();
            var price = item.find('#prod_detail2').find('font[id="book-pric"]').text();

            self.onItem({
                title: title,
                link: link,
                author: author,
                price: price
            });
        });
    };

Run it now.
    var searcher = require('./searcher-rediff');

    searcher.on('item', function(item) {
        // stringify the item, otherwise it prints as [object Object]
        console.log('Item found >> ' + JSON.stringify(item));
    });

    searcher.on('complete', function(searcher) {
        console.log('searcher done!');
    });

    searcher.search("Salman");
What did I do?
- First, I wrote a skeleton searcher class. This class
  - makes the request to the merchant’s search URL (this URL is built in the getSearchUrl function), then
  - fetches the HTML data from there, then
  - uses the ‘jsdom’ module to create the DOM’s window object,
  - loads ‘jquery’ into that window, and
  - executes the parseHTML function.
- Second, I wrote another module that extends searcher and interacts with Rediff. It implements
  - the getSearchUrl function to return the appropriate search URL to connect to, and
  - the parseHTML function to scrape data from the DOM’s window object. This is very interesting. You can use all your jquery knowledge to pick elements and parse data from inside the elements, just like you did in the old days when you added styles or data to random elements.
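The division of labour above can be tried without any network access. The following is a stripped-down sketch, not the searcher.js from this post: fetch and parse are hypothetical stand-ins for the HTTP/jsdom step and for parseHTML, and the "page" is a canned array instead of real HTML.

```javascript
var EventEmitter = require('events').EventEmitter;
var util = require('util');

// Skeleton: knows the workflow, leaves the site-specific parts abstract.
function Searcher(name) {
  EventEmitter.call(this);
  this.name = name;
}
util.inherits(Searcher, EventEmitter);

Searcher.prototype.search = function (query) {
  var page = this.fetch(query);   // stands in for the HTTP request + jsdom step
  this.parse(page);               // stands in for parseHTML
  this.emit('complete', this);
};
Searcher.prototype.fetch = function () {
  throw new Error('fetch() is unimplemented!');
};
Searcher.prototype.parse = function () {
  throw new Error('parse() is unimplemented!');
};

// A site-specific searcher overrides only the two hooks.
var fake = new Searcher('Fake Books');
fake.fetch = function (query) {
  return [query + ' Vol. 1', query + ' Vol. 2'];   // canned "page"
};
fake.parse = function (page) {
  var self = this;
  page.forEach(function (title) {
    self.emit('item', { title: title });
  });
};

var found = [];
fake.on('item', function (item) { found.push(item.title); });
fake.on('complete', function (s) { console.log(s.name + ' done:', found); });
fake.search('Node');
```

The skeleton never needs to know which site it is talking to, which is why adding a new merchant only takes two small functions.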
A searcher for Flipkart follows the same pattern:

    var Searcher = require('./searcher');

    var searcher = new Searcher({
        merchantName: 'Flipkart',
        merchantUrl: 'http://www.flipkart.com'
    });

    module.exports = searcher;

    searcher.getSearchUrl = function(query) {
        return this.merchantUrl + "/search-book" + '?query=' + query;
    };

    searcher.parseHTML = function(window) {
        var self = this;

        window.$('.search_result_item').each(function() {
            var item = window.$(this);

            var title = item.find('.search_result_title').text().trim().replace(/\n/g, "");
            var link = self.merchantUrl + item.find('.search_result_title').find("a").attr('href');
            var price = item.find('.search_results_list_price').text().trim().replace(/\n/g, "");

            self.onItem({
                title: title,
                link: link,
                price: price
            });
        });
    };
I have also written a Runner class to execute multiple searchers in parallel and collect the results into an array. You can find the entire source code here: https://github.com/anismiles/jsdom-based-screen-scraper Chill!
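For a rough idea of how such a runner can work, here is a minimal sketch. It is not the Runner class from the repository: runAll and makeFakeSearcher are hypothetical names, and the fake searchers emit canned items immediately instead of scraping anything. A real searcher would emit later, once its HTTP request returns, but the counting logic stays the same.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical stand-in for a real searcher: emits canned items right away.
function makeFakeSearcher(name, titles) {
  var searcher = new EventEmitter();
  searcher.search = function (query) {
    titles.forEach(function (title) {
      searcher.emit('item', { merchant: name, title: title + ' (' + query + ')' });
    });
    searcher.emit('complete', searcher);
  };
  return searcher;
}

// Starts every searcher, gathers items, and calls done once all have completed.
function runAll(searchers, query, done) {
  var results = [];
  var pending = searchers.length;
  if (pending === 0) { return done(results); }

  searchers.forEach(function (searcher) {
    searcher.on('item', function (item) { results.push(item); });
    searcher.on('complete', function () {
      pending -= 1;
      if (pending === 0) { done(results); }
    });
    searcher.search(query);
  });
}

var collected = null;
runAll(
  [makeFakeSearcher('A', ['Book 1']), makeFakeSearcher('B', ['Book 2', 'Book 3'])],
  'Salman',
  function (results) {
    collected = results;
    console.log('Collected ' + results.length + ' items');
  }
);
```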
What’s next? I am going to write about Node.js pretty feverishly, so stay tuned. How about a blog engine on Riak?