Scraping websites with server-side jQuery and Node.js

Mahzkrieg
Nov 13, 2007
Node.js (CoffeeScript):

Code:
log = console.log
http = require "http"
$ = require "jquery"
{parse} = require "url"

# visits url, returns html response body
getHtml = (url, callback) ->
  {hostname, path} = parse url
  log "sending GET: #{url}"
  body = ""
  req = http.get {hostname, path, port: 80}, (res) ->
    res.setEncoding "utf8"
    res.on "data", (chunk) -> body += chunk
    res.on "end", -> callback body
  req.on "error", (err) -> log "GET #{url} failed: #{err.message}"

getHtml "http://danneu.com", (html) ->
  $page = $("body").append html     # the jquery module has an internal "body" to latch onto.
  $links = $page.find("a")
  log(link.href) for link in $links

For fun, here's what I came up with in Ruby and Python:

Ruby:

Code:
require "open-uri"
require "nokogiri"

doc = Nokogiri::HTML open("http://danneu.com")
links = doc.css "a"
urls = links.map { |link| link.attribute("href") }
puts urls

Python (code I wrote years ago):

Code:
import sys 
import urllib2 

if __name__ == "__main__":
  sys.path.append("./BeautifulSoup")
  from BeautifulSoup import BeautifulSoup

  url = "http://danneu.com"
  page = urllib2.build_opener().open(url)
  soup = BeautifulSoup(page)

  for link in soup.findAll("a"):
    print link.get("href")  # Tag objects expose attributes via get(), not .href
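That urllib2 script predates Python 3. For the curious, here's a rough sketch of a modern equivalent using only the standard library (urllib.request plus html.parser, so no BeautifulSoup install needed); the sample HTML in the demo is just illustrative:

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # used by scrape() below

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links

def scrape(url):
    # Network call -- same shape as the urllib2 original.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return extract_links(html)

if __name__ == "__main__":
    sample = '<p><a href="/about">about</a> <a href="http://example.com">ext</a></p>'
    print(extract_links(sample))  # ['/about', 'http://example.com']
```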

Of course, the Node version is asynchronous (the entire point of Node) and launches all the requests at once, while the synchronous Ruby/Python examples launch each request only after the previous one returns:

For 10 requests:
- Node.js: 20 seconds
- Ruby & Python: over a minute
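The gap above is concurrency, not language speed: N overlapped waits cost roughly one wait instead of N. A toy sketch of that arithmetic (requests simulated with asyncio.sleep, no real network involved):

```python
import asyncio
import time

async def fake_request(i, delay=0.05):
    # Stand-in for a network round trip: just wait, like an HTTP request would.
    await asyncio.sleep(delay)
    return i

async def fetch_all(n):
    # Launch all "requests" at once, like the Node example does.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = timed(fetch_all(10))
    # 10 overlapped 0.05s waits finish in roughly one 0.05s wait, not ~0.5s.
    print(len(results), elapsed)
```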

You'd use any of Ruby's or Python's great parallel HTTP libraries if that mattered, or even just launch a few threads manually. But numbers are always fun, especially when the comparison is nonsensical.

My main point was just to demonstrate how to launch an HTTP request with Node and use the simple jQuery module. I find myself doing ad hoc scraping from the command line all the time and was tired of looking up XPath selectors and shit. I already know jQuery. I'm also still in my latest CoffeeScript kick.
 


A friend of mine is lead developer of a relatively popular iPhone app, and implemented the backend with Node.js, using Redis for queues and AWS SimpleDB as the store; the whole thing was very elegant and, most importantly, small. I'm not convinced Node.js can scale well, and you kids who skipped C are overly manic about epoll, but there's a lot to like.
 
You'd use any of Ruby's or Python's great parallel HTTP libraries if that mattered, or even just launch a few threads manually. But numbers are always fun, especially when the comparison is nonsensical.

I wouldn't recommend launching threads manually for IO; it simply doesn't scale very well. You need something asynchronous / callback-based.

Here are a few off the top of my head for Ruby / Python.

Ruby:
EventMachine (Fast)
Typhoeus (Easy) - Link to a blog post of mine using Typhoeus, for anyone interested - link.

Python - I'm not actually sure about the speed differences between these two, but if I recall correctly, Twisted generally has a performance edge over gevent.
Twisted (Fast)
gevent (Easy)
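The callback-based style those libraries share can be sketched with just the standard library. This uses concurrent.futures with add_done_callback purely to make the callback shape concrete (the fetch is faked, and a thread pool is not how Twisted or gevent work internally):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    # Stand-in for an HTTP request; returns a fake response "body".
    return "body of " + url

results = []

def on_done(future):
    # Callback fired when a "request" completes -- the async style in a nutshell:
    # you hand over work, then react when it finishes, instead of blocking.
    results.append(future.result())

urls = ["http://danneu.com/" + str(i) for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    for url in urls:
        pool.submit(fake_fetch, url).add_done_callback(on_done)
# Leaving the with-block joins the workers, so every callback has run by here.

print(sorted(results))
```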
 
Code:
<?php
$start = microtime(true);

include('lib/Curl/CurlMaster.class.php');
include('lib/XPath/XPath.class.php');

$curl = new CurlMaster(10);
$curl->throttle = 10;

$curl->addCallback('success', 'wootResult');

for ($i = 0; $i < 10; $i++)
        $curl->getUrl('http://danneu.com');

$curl->processQueue();

function wootResult($obj)
{
        $x = new XPath();
        foreach ($x->queryHTMLGetAttributes('//a', $obj->getData(), 'href') as $atr)
                echo $atr."\n";
}

echo "Total time: ".(microtime(true) - $start)."\n";


[root@boost blehtest]# php test.php
/
/about-me
/projects
/posts/14-setting-up-mocha-testing-with-coffeescript-node-js-and-a-cakefile
/posts/13-generating-and-submitting-a-sitemap-xml-with-rails
/posts/12-a-list-of-some-helpful-rails-nuggets-snippets-things-to-remember
/posts/11-getting-burned-by-coffeescript-s-implicit-return
/posts/10-javascript-hoisting-and-coffeescript-pop-quiz
/posts/9-rails-3-2-markdown-pygments-redcarpet-heroku
/posts/8-scraping-a-blog-with-anemone-ruby-web-crawler-and-mongodb
/posts/7-keeping-that-free-heroku-dyno-alive-without-feeling-unethical
/posts/6-meteor-tutorial-for-fellow-noobs-adding-features-to-the-leaderboard-demo
/posts/4-darkstrap-css-a-dark-theme-for-twitter-bootstrap-2
http://heroku.com
*snip*
Total time: 0.65578603744507


Or if you prefer 100 requests:

[root@boost blehtest]# time php test.php | grep 'http://heroku.com' | wc -l
100

real 0m1.592s
user 0m0.174s
sys 0m0.257s
 
I don't know how to replicate that script. How can I install those PHP Gems? :P

I wouldn't recommend launching threads manually for IO; it simply doesn't scale very well. You need something asynchronous / callback-based.

Here are a few off the top of my head for Ruby / Python.

Ruby:
EventMachine (Fast)
Typhoeus (Easy) - Link to a blog post of mine using Typhoeus, for anyone interested - link.

I didn't really know who my audience would be, so the "benchmark" was meant to contrast the straightforward scrape script we all start out writing in scraping 101 with parallel requests. Scraping comes up all the time here.

Dchuk introduced me to Typhoeus the other year and I've been using it ever since. I also know he wrapped it with a crawler. I write Ruby 99% of the time, though this past month I took an intellectual vacation into something new.

A friend of mine is lead developer of a relatively popular iPhone app, and implemented the backend with Node.js, using Redis for queues and AWS SimpleDB as the store; the whole thing was very elegant and, most importantly, small. I'm not convinced Node.js can scale well, and you kids who skipped C are overly manic about epoll, but there's a lot to like.

I usually just import libev directly for my real web servers. Ignore these kids.
 
A friend of mine is lead developer of a relatively popular iPhone app, and implemented the backend with Node.js, using Redis for queues and AWS SimpleDB as the store; the whole thing was very elegant and, most importantly, small. I'm not convinced Node.js can scale well, and you kids who skipped C are overly manic about epoll, but there's a lot to like.

I've never been involved in the Node.js scene, but if I had to guess, they don't like using epoll directly because it's non-portable. Maybe they prefer libevent? If they also have an irrational fear of libevent then I dunno what to say.
 
I don't know how to replicate that script. How can I install those PHP Gems? :P

He's using curl_multi, DOMDocument, and DOMXPath. It's fairly easy to get those modules installed if they aren't already; curl_multi is part of the standard curl extension. You might have to write your own wrappers for them. I don't know if what he used is public or not.
 
I don't know how to replicate that script. How can I install those PHP Gems? :P



I didn't really know who my audience would be, so the "benchmark" was meant to contrast the straightforward scrape script we all start out writing in scraping 101 with parallel requests. Scraping comes up all the time here.

Dchuk introduced me to Typhoeus the other year and I've been using it ever since. I also know he wrapped it with a crawler. I write Ruby 99% of the time, though this past month I took an intellectual vacation into something new.



I usually just import libev directly for my real web servers. Ignore these kids.


Typhoeus is actually a Ruby wrapper around curl_multi, so it's ultimately doing the same thing as insomniac's script
 
He's using curl_multi, DOMDocument, and DOMXPath. It's fairly easy to get those modules installed if they aren't already; curl_multi is part of the standard curl extension. You might have to write your own wrappers for them. I don't know if what he used is public or not.

Incorrect

[root@boost Curl]# grep -R curl_multi *
[root@boost Curl]#

There are limitations with curl_multi when you start dealing with a significant number of connections.

Insomniac, would you mind sharing your CurlMaster class? Would really appreciate it (as would others, I assume).

Sorry, my classes are all private. I'm not trying to brag, just showing what PHP is capable of.
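Since the class itself is private, here's a rough sketch of the throttling behavior the thread describes (a cap on in-flight requests, like `$curl->throttle = 10`), done in Python with asyncio and simulated requests; CurlMaster's actual internals are unknown, so everything here is illustrative:

```python
import asyncio

async def fetch(url, sem, active, peaks):
    # Stand-in for a real request; asyncio.sleep plays the network wait.
    async with sem:                 # the throttle: at most N requests in flight
        active[0] += 1
        peaks.append(active[0])     # record concurrency so we can verify the cap
        await asyncio.sleep(0.01)
        active[0] -= 1
        return url

async def process_queue(urls, throttle=10):
    sem = asyncio.Semaphore(throttle)
    active, peaks = [0], []
    results = await asyncio.gather(*(fetch(u, sem, active, peaks) for u in urls))
    return results, max(peaks)

if __name__ == "__main__":
    urls = ["http://danneu.com"] * 100
    results, peak = asyncio.run(process_queue(urls))
    # peak never exceeds the throttle of 10, however many URLs are queued.
    print(len(results), peak)
```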
 
I've never been involved in the Node.js scene, but if I had to guess, they don't like using epoll directly because it's non-portable. Maybe they prefer libevent? If they also have an irrational fear of libevent then I dunno what to say.

Right on. Node uses libev, which is faster than libevent.

Ruby's EventMachine actually uses its own library.
 
Developers are the gods of a website

Design, development, and programming are the gods of a website. Without their incredible technique, a website cannot be made easy to access. Users need websites that are easy to access and operate, even if they don't know much about computers. And design is the beauty of a website, which is what attracts people.
 
Design, development, and programming are the gods of a website. Without their incredible technique, a website cannot be made easy to access. Users need websites that are easy to access and operate, even if they don't know much about computers. And design is the beauty of a website, which is what attracts people.

Keep upping that post count! Soon you'll be able to spam the forum with real links!