Scraping a page’s content using the node-readability module and Node.js

The following example shows how you can scrape a page’s contents and remove unnecessary markup (similar to http://www.readability.com/) by using the node-readability module for Node.js.

First, install the node-readability and sanitizer modules by running the following commands in your Terminal:

$ npm install node-readability
$ npm install sanitizer

Next, create a new JavaScript file, app.js, in the same working directory where you installed the Node modules above, and enter the following code:

#!/usr/bin/env node

// 3rd party modules.
var readability = require("node-readability"),
    sanitizer = require("sanitizer");

scraper("http://www.readability.com/about", function (data) {
    console.log("# %s #\n\n%s\n\n---", data.title, data.contents);
});

function scraper(url, callback) {
    readability.read(url, function(err, doc) {
        if (err) {
            throw err;
        }

        var obj = {
            "url": url,
            "title": doc.getTitle().trim(),
            "contents": stripHTML(doc.getContent() || "")
        };
        callback(obj);
    });
}

function stripHTML(html) {
    var clean = sanitizer.sanitize(html, function (str) {
        return str;
    });
    // Remove all remaining HTML tags.
    clean = clean.replace(/<(?:.|\n)*?>/gm, "");

    // RegEx to remove needless newlines and whitespace.
    // See: http://stackoverflow.com/questions/816085/removing-redundant-line-breaks-with-regular-expressions
    clean = clean.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/ig, "\n");

    // Return the final string, minus any leading/trailing whitespace.
    return clean.trim();
}

Finally, run the Node.js app by typing $ node ./app.js in your Terminal window. The output should look similar to the following:

# About — Readability #

About the Service
Readability is a free reading platform that aims to deliver a great reading experience wherever you are, and to provide a system to connect readers to the writers they enjoy.
* * * *
A Brief History
Readability started off as a simple, Javascript-based reading tool that turned any web page into a customizable reading view. It was released by Arc90 (as an Arc90 Lab experiment), a New York City-based design and technology shop, back in early 2009.
Since its release, Readability was an instant hit. It's used by legions of readers today to make the Web a more pleasant place to read. The original Readability codebase is embedded in a host of applications, including Apple's Safari 5 browser (the Safari Reader feature), the Amazon Kindle and popular iPad applications like Flipboard and Reeder.
* * * *
Readability Today
Today, our goal is simple: to deliver a great reading experience on every platform and provide an avenue for connecting readers and publishers on the Web.
* * * *
The Team
Readability is designed and built by the Readability team, headquartered in New York City. The project is fortunate enough to have an exceptional team of advisors: Roger Black, Jay Chakrapani, Sarah Chubb, Anil Dash, Paul Ford, Jeffrey MacIntyre, Karen McGrane, and Jeffrey Zeldman.
If you have any questions about Readability, don't hesitate to contact us.
Happy Reading!—The Readability Team

---

As you can see, all the unnecessary headers, footers, navigation and other content are removed, and only the actual page contents (minus a lot of additional whitespace) remain.

Our scraper() function takes two parameters: the URL of the page that we want to make readable, and the callback function to call once we’ve finished parsing that URL. In the example above we scrape the readability.com site and log the results to the console. Once readability.read() has parsed the specified URL, its callback is called and we create an obj object containing the URL of the page we scraped, the page’s title, and the parsed content of the page (via the node-readability module’s getTitle() and getContent() methods, respectively).
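The callback flow described above can be tried without touching the network. The following is only a sketch under a few assumptions: scrapeWith() and fakeRead() are hypothetical names that are not part of node-readability, and fakeRead() stands in for readability.read() by returning an object with the same getTitle() and getContent() methods used in app.js. Unlike scraper(), this variant passes any error to the callback (Node's error-first convention) instead of throwing it.

```javascript
// Sketch only: scrapeWith() and fakeRead() are hypothetical helpers.
// `read` is injected so the example runs offline; in app.js it would be
// readability.read.
function scrapeWith(read, url, callback) {
    read(url, function (err, doc) {
        if (err) {
            // Hand the error to the caller instead of throwing it.
            return callback(err);
        }
        callback(null, {
            url: url,
            title: doc.getTitle().trim(),
            contents: doc.getContent() || ""
        });
    });
}

// Hypothetical stand-in for readability.read(); calls back synchronously.
function fakeRead(url, cb) {
    cb(null, {
        getTitle: function () { return "  Example page  "; },
        getContent: function () { return "<p>Hello</p>"; }
    });
}

var result = null;
scrapeWith(fakeRead, "http://example.com/", function (err, data) {
    if (err) {
        console.error("Scrape failed:", err.message);
        return;
    }
    result = data;
    console.log("# %s #\n\n%s", data.title, data.contents);
});
```

Injecting the reader function keeps the object-building logic separate from the network call, which makes it easier to check what the callback receives.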

The stripHTML() function takes a single String parameter, html, and attempts to remove any HTML tags and as much unneeded whitespace (newlines and empty lines) as possible.
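To see what the two regular expressions do on their own, here is a small dependency-free sketch. It assumes a hypothetical helper, stripTagsAndBlankLines(), which repeats the two replace() steps from stripHTML() but skips the sanitizer pass, and it runs them against a made-up HTML snippet.

```javascript
// Hypothetical helper: the two regex steps from stripHTML(), without the
// sanitizer module, so it runs with no dependencies.
function stripTagsAndBlankLines(html) {
    // Remove anything that looks like an HTML tag.
    var clean = html.replace(/<(?:.|\n)*?>/gm, "");

    // Collapse runs of two or more line breaks (plus any whitespace that
    // follows them) into a single newline.
    clean = clean.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/g, "\n");

    // Drop leading/trailing whitespace.
    return clean.trim();
}

// Made-up sample markup for illustration.
var sample = "<div>\n  <h1>Title</h1>\n\n\n  <p>First <b>paragraph</b>.</p>\n</div>";

console.log(JSON.stringify(stripTagsAndBlankLines(sample)));
// prints "Title\nFirst paragraph."
```

Note that a regular expression like this is only a rough tag stripper; it is not a substitute for the sanitizer step when the input is untrusted.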
