Best Webbot/Scraper Scripting Language


plepco
May 24, 2007
I wrote some kickass tools in PHP that just run really slow. So I got to thinking about speeding these particular scripts up. But there's only so much I can optimize the code. Which brings up this question: is there a faster scripting language than PHP?

Now, I realize it probably depends on the application. In this case, it's a spider/scraper. I am unaware of any method to do what I am doing other than to download each page and look for the data I want. When you do this for a hundred web pages, it takes a loooong time. So there might NOT be anything faster than the PHP script I already wrote.

Nevertheless, I'm interested in others' thoughts on scripting languages. I know a lot of webbots are written in Perl or PHP, and I've seen a few in Ruby. Let me know what you guys think.
 


Are you running many simultaneous processes, or forking, or using multi cURL? Those have worked fine for me. If you're doing one page at a time, it's going to be slow in any language.
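
If you want to see what the forking route can look like, here's a rough sketch, assuming you have the pcntl extension (command-line PHP only). The URL list is a made-up placeholder, not anything specific:

<?php
// Rough forking sketch: one child process per page fetch.
// Requires the pcntl extension (CLI only); URLs are placeholders.
$urls = array(
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
);

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("could not fork\n");
    } elseif ($pid == 0) {
        // Child process: fetch one page, do something with it, then exit.
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $page = curl_exec($ch);
        curl_close($ch);
        // ... parse or save $page here ...
        exit(0);
    }
    // Parent falls through and forks the next child right away.
}

// Parent: wait for every child to finish before moving on.
while (pcntl_waitpid(-1, $status) > 0) {
    // reap each child as it exits
}

One catch: each child gets its own copy of memory, so if the parent needs the results, the children have to write them to a file or database. That's one reason multi cURL is usually simpler for this kind of job.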
 
Well, I have to tell ya... I have no clue how to do this forking or multi cURL you speak of. That's why I love asking questions: I often learn cool new things.

Yeah, I might have 50 or 100 pages to do, and I do 'em one at a time, because I didn't know there was a faster method. I'm gonna do some searching and reading on this...
 
curl_multi should be fine for a small number of pages like that. You could probably run 50 requests at once and get all the pages in 30 seconds or less.
 
Wow, never heard of curl_multi. That sounds really good, because scraping speed is the bottleneck for me.
 
I can't find any example of curl_multi on the web.

Can anyone show a small example? Like searching 10 keywords simultaneously through Google?
 
Yes, go to php.net and look at the cURL documentation. You'll see a bunch of functions named curl_multi_*. That's all the stuff you need to know.

I am still trying to get this to work the way I want, but the order of events is basically like "regular" cURL:
1. Initialize a session
2. Set options
3. Execute the session
4. Close the session
5. Move on to parsing the scraped content however you want

The difference here is that you are doing this with multiple handles, so you can quickly become as confused as a termite in a yo-yo, as we used to say in Texas. (That is, if you have a lot of stuff going on at once.)

At the bottom of a few of those curl_multi function pages is an example script that demonstrates using curl_multi, BTW.
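
To save you some digging, here's a minimal sketch of that five-step pattern. The URLs and the 10-second timeout are placeholder values I made up, not anything from the manual:

<?php
// Minimal curl_multi sketch: fetch several pages in parallel.
// URLs and the timeout are placeholders.
$urls = array(
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
);

$mh = curl_multi_init();
$handles = array();

// Steps 1 and 2, once per URL: initialize an easy handle, set its
// options, and add it to the multi handle.
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Step 3: pump all the handles until nothing is still running.
$running = 0;
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh); // wait for network activity instead of busy-looping
    }
} while ($running > 0);

// Step 4: grab each page's content, then detach and close the handle.
$pages = array();
foreach ($handles as $i => $ch) {
    $pages[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

// Step 5: parse $pages however you want.

The curl_multi_select() call just blocks until one of the transfers has something to do, so the loop isn't chewing up CPU while the downloads run.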
 