Scraping websites with server-side jQuery and Node.js

Mahzkrieg
Nov 13, 2007
Node.js (CoffeeScript):

Code:
log = console.log
http = require "http"
$ = require "jquery"
{parse} = require "url"

# visits url, returns html response body
getHtml = (url, callback) ->
  {hostname, path} = parse url
  log "sending GET: #{url}"
  body = ""
  req = http.get {hostname, path, port: 80}, (res) ->
    res.setEncoding "utf8"
    res.on "data", (chunk) -> body += chunk
    res.on "end", -> callback body
  req.on "error", (err) -> log "GET #{url} failed: #{err.message}"

getHtml "http://danneu.com", (html) ->
  $page = $("body").append html     # the jquery module has an internal "body" to latch onto.
  $links = $page.find("a")
  log(link.href) for link in $links

For fun, here's what I came up with in Ruby and Python:

Ruby:

Code:
require "open-uri"
require "nokogiri"

doc = Nokogiri::HTML open("http://danneu.com")
links = doc.css "a"
urls = links.map { |link| link.attribute("href") }
puts urls

Python (code I wrote years ago):

Code:
import sys 
import urllib2 

if __name__ == "__main__":
  sys.path.append("./BeautifulSoup")
  from BeautifulSoup import BeautifulSoup

  url = "http://danneu.com"
  page = urllib2.build_opener().open(url)
  soup = BeautifulSoup(page)

  for link in soup.findAll("a"):
    print link.get("href")  # Tag objects expose attributes via get(), not .href
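That urllib2 script predates Python 3. For the curious, here's a rough sketch of a modern equivalent using only the standard library (urllib.request plus html.parser, so no BeautifulSoup install needed); the sample HTML in the demo is just illustrative:

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # used by scrape() below

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links

def scrape(url):
    # Network call -- same shape as the urllib2 original.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return extract_links(html)

if __name__ == "__main__":
    sample = '<p><a href="/about">about</a> <a href="http://example.com">ext</a></p>'
    print(extract_links(sample))  # ['/about', 'http://example.com']
```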

Of course, the Node version is asynchronous (the entire point of Node) and launches all the requests at once, while the synchronous Ruby/Python examples launch each request only after the previous one returns:

For 10 requests:
- Node.js: 20 seconds
- Ruby & Python: over a minute
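The gap above is concurrency, not language speed: N overlapped waits cost roughly one wait instead of N. A toy sketch of that arithmetic (requests simulated with asyncio.sleep, no real network involved):

```python
import asyncio
import time

async def fake_request(i, delay=0.05):
    # Stand-in for a network round trip: just wait, like an HTTP request would.
    await asyncio.sleep(delay)
    return i

async def fetch_all(n):
    # Launch all "requests" at once, like the Node example does.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    results, elapsed = timed(fetch_all(10))
    # 10 overlapped 0.05s waits finish in roughly one 0.05s wait, not ~0.5s.
    print(len(results), elapsed)
```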

You'd use any of Ruby's or Python's great parallel HTTP libraries if that mattered, or even just launch a few threads manually. But numbers are always fun, especially when the comparison is nonsensical.

My main point was just to demonstrate how to launch an HTTP request with Node and use the simple jQuery module. I find myself doing ad hoc scraping from the command line all the time and was tired of looking up XPath selectors and shit. I already know jQuery. I'm also still in my latest CoffeeScript kick.
 


A friend of mine is lead developer of a relatively popular iPhone app, and implemented the backend with Node.js, using Redis for queues and AWS SimpleDB as the store; the whole thing was very elegant and, most importantly, small. I'm not convinced Node.js can scale well, and you kids who skipped C are overly manic about epoll, but there's a lot to like.
 
You'd use any of Ruby's or Python's great parallel HTTP libraries if that mattered, or even just launch a few threads manually. But numbers are always fun, especially when the comparison is nonsensical.

I wouldn't recommend launching threads manually for IO; it simply doesn't scale very well. You need something asynchronous / callback-based.

Here are a few off the top of my head for Ruby / Python.

Ruby:
EventMachine (Fast)
Typhoeus (Easy) - Link to a blog post of mine using Typhoeus, for anyone interested - link.

Python - I'm not actually sure about the speed differences between these two, but if I recall correctly, Twisted generally has a performance edge over gevent.
Twisted (Fast)
gevent (Easy)
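The callback-based style those libraries share can be sketched with just the standard library. This uses concurrent.futures with add_done_callback purely to make the callback shape concrete (the fetch is faked, and a thread pool is not how Twisted or gevent work internally):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    # Stand-in for an HTTP request; returns a fake response "body".
    return "body of " + url

results = []

def on_done(future):
    # Callback fired when a "request" completes -- the async style in a nutshell:
    # you hand over work, then react when it finishes, instead of blocking.
    results.append(future.result())

urls = ["http://danneu.com/" + str(i) for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    for url in urls:
        pool.submit(fake_fetch, url).add_done_callback(on_done)
# Leaving the with-block joins the workers, so every callback has run by here.

print(sorted(results))
```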
 
Code:
<?php
$start = microtime(true);

include('lib/Curl/CurlMaster.class.php');
include('lib/XPath/XPath.class.php');

$curl = new CurlMaster(10);
$curl->throttle = 10;

$curl->addCallback('success', 'wootResult');

for ($i = 0; $i < 10; $i++)
        $curl->getUrl('http://danneu.com');

$curl->processQueue();

function wootResult($obj)
{
        $x = new XPath();
        foreach ($x->queryHTMLGetAttributes('//a', $obj->getData(), 'href') as $atr)
                echo $atr."\n";
}

echo "Total time: ".(microtime(true) - $start)."\n";


[root@boost blehtest]# php test.php
/
/about-me
/projects
/posts/14-setting-up-mocha-testing-with-coffeescript-node-js-and-a-cakefile
/posts/13-generating-and-submitting-a-sitemap-xml-with-rails
/posts/12-a-list-of-some-helpful-rails-nuggets-snippets-things-to-remember
/posts/11-getting-burned-by-coffeescript-s-implicit-return
/posts/10-javascript-hoisting-and-coffeescript-pop-quiz
/posts/9-rails-3-2-markdown-pygments-redcarpet-heroku
/posts/8-scraping-a-blog-with-anemone-ruby-web-crawler-and-mongodb
/posts/7-keeping-that-free-heroku-dyno-alive-without-feeling-unethical
/posts/6-meteor-tutorial-for-fellow-noobs-adding-features-to-the-leaderboard-demo
/posts/4-darkstrap-css-a-dark-theme-for-twitter-bootstrap-2
http://heroku.com
*snip*
Total time: 0.65578603744507


Or if you prefer 100 requests:

[root@boost blehtest]# time php test.php | grep 'http://heroku.com' | wc -l
100

real 0m1.592s
user 0m0.174s
sys 0m0.257s
 
I don't know how to replicate that script. How can I install those PHP Gems? :P

I wouldn't recommend launching threads manually for IO; it simply doesn't scale very well. You need something asynchronous / callback-based.

Here are a few off the top of my head for Ruby / Python.

Ruby:
EventMachine (Fast)
Typhoeus (Easy) - Link to a blog post of mine using Typhoeus, for anyone interested - link.

I didn't really know who my audience would be, so the "benchmark" was meant to contrast the straightforward scrape script we all start out writing in scraping 101 with parallel requests. Scraping comes up all the time here.

Dchuk introduced me to Typhoeus the other year and I've been using it ever since. I also know he wrapped it with a crawler. I write Ruby 99% of the time, though this past month I took an intellectual vacation into something new.

A friend of mine is lead developer of a relatively popular iPhone app, and implemented the backend with Node.js, using Redis for queues and AWS SimpleDB as the store; the whole thing was very elegant and, most importantly, small. I'm not convinced Node.js can scale well, and you kids who skipped C are overly manic about epoll, but there's a lot to like.

I usually just import libev directly for my real web servers. Ignore these kids.
 
A friend of mine is lead developer of a relatively popular iPhone app, and implemented the backend with Node.js, using Redis for queues and AWS SimpleDB as the store; the whole thing was very elegant and, most importantly, small. I'm not convinced Node.js can scale well, and you kids who skipped C are overly manic about epoll, but there's a lot to like.

I've never been involved in the Node.js scene, but if I had to guess, they don't like using epoll directly because it's non-portable. Maybe they prefer libevent? If they also have an irrational fear of libevent then I dunno what to say.
 
I don't know how to replicate that script. How can I install those PHP Gems? :P

He's using curl_multi, DOMDocument, and DOMXPath. It's fairly easy to get those modules installed if they aren't already; curl_multi is part of the standard curl extension. You might have to write your own wrappers for them. I don't know if what he used is public or not.
 
I don't know how to replicate that script. How can I install those PHP Gems? :P



I didn't really know who my audience would be, so the "benchmark" was meant to contrast the straightforward scrape script we all start out writing in scraping 101 with parallel requests. Scraping comes up all the time here.

Dchuk introduced me to Typhoeus the other year and I've been using it ever since. I also know he wrapped it with a crawler. I write Ruby 99% of the time, though this past month I took an intellectual vacation into something new.



I usually just import libev directly for my real web servers. Ignore these kids.


Typhoeus is actually a Ruby wrapper around curl_multi, so it's ultimately doing the same thing as insomniac's script
 
He's using curl_multi, DOMDocument, and DOMXPath. It's fairly easy to get those modules installed if they aren't already; curl_multi is part of the standard curl extension. You might have to write your own wrappers for them. I don't know if what he used is public or not.

Incorrect

[root@boost Curl]# grep -R curl_multi *
[root@boost Curl]#

There are limitations with curl_multi when you start dealing with a significant number of connections.

Insomniac, would you mind sharing your CurlMaster class? Would really appreciate it (as would others, I assume).

Sorry, my classes are all private. I'm not trying to brag, just showing what PHP is capable of.
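Since the class itself is private, here's a rough sketch of the throttling behavior the thread describes (a cap on in-flight requests, like `$curl->throttle = 10`), done in Python with asyncio and simulated requests; CurlMaster's actual internals are unknown, so everything here is illustrative:

```python
import asyncio

async def fetch(url, sem, active, peaks):
    # Stand-in for a real request; asyncio.sleep plays the network wait.
    async with sem:                 # the throttle: at most N requests in flight
        active[0] += 1
        peaks.append(active[0])     # record concurrency so we can verify the cap
        await asyncio.sleep(0.01)
        active[0] -= 1
        return url

async def process_queue(urls, throttle=10):
    sem = asyncio.Semaphore(throttle)
    active, peaks = [0], []
    results = await asyncio.gather(*(fetch(u, sem, active, peaks) for u in urls))
    return results, max(peaks)

if __name__ == "__main__":
    urls = ["http://danneu.com"] * 100
    results, peak = asyncio.run(process_queue(urls))
    # peak never exceeds the throttle of 10, however many URLs are queued.
    print(len(results), peak)
```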
 
I've never been involved in the Node.js scene, but if I had to guess, they don't like using epoll directly because it's non-portable. Maybe they prefer libevent? If they also have an irrational fear of libevent then I dunno what to say.

Right on. Node uses libev, which is faster than libevent.

Ruby's EventMachine actually uses its own library.
 
Developers are the gods of a website

Design, development, and programming are the gods of a website. Without their incredible technique, a website cannot be made easy to access. Users need websites that are easy to access and operate, even if they don't know much about computers. And design is the beauty of a website, which is what attracts people.
 
Design, development, and programming are the gods of a website. Without their incredible technique, a website cannot be made easy to access. Users need websites that are easy to access and operate, even if they don't know much about computers. And design is the beauty of a website, which is what attracts people.

Keep upping that post count! Soon you'll be able to spam the forum with real links!