Best Webbot/Scraper Scripting Language


plepco
May 24, 2007
I wrote some kickass tools in PHP that just run really slow. So I got to thinking about speeding these particular scripts up. But there's only so much I can optimize the code. Which brings up this question: is there a faster scripting language than PHP?

Now, I realize it probably depends on the application. In this case, it's a spider/scraper. I am unaware of any method to do what I am doing other than to download each page and look for the data I want. When you do this for a hundred web pages, it takes a loooong time. So there might NOT be anything faster than the PHP script I already wrote.

Nevertheless, I'm interested in others' thoughts on scripting languages. I know a lot of webbots are written in Perl or PHP, and I've seen a few in Ruby. Let me know what you guys think.
 


Are you running many simultaneous processes, or forking, or using multi cURL? Those have worked fine for me. If you're doing one page at a time, it's going to be slow in any language.
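
If you want to see what the forking route can look like, here's a rough sketch, assuming you have the pcntl extension (command-line PHP only). The URL list is a made-up placeholder, not anything specific:

<?php
// Rough forking sketch: one child process per page fetch.
// Requires the pcntl extension (CLI only); URLs are placeholders.
$urls = array(
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
);

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("could not fork\n");
    } elseif ($pid == 0) {
        // Child process: fetch one page, do something with it, then exit.
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $page = curl_exec($ch);
        curl_close($ch);
        // ... parse or save $page here ...
        exit(0);
    }
    // Parent falls through and forks the next child right away.
}

// Parent: wait for every child to finish before moving on.
while (pcntl_waitpid(-1, $status) > 0) {
    // reap each child as it exits
}

One catch: each child gets its own copy of memory, so if the parent needs the results, the children have to write them to a file or database. That's one reason multi cURL is usually simpler for this kind of job.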
 
Well, I have to tell ya... I have no clue how to do this forking or multi cURL you speak of. That's why I love asking questions: I often learn cool new things.

Yeah, I might have 50 or 100 pages to do, and I do 'em one at a time, because I didn't know there was a faster method. I'm gonna do some searching and reading on this...
 
curl_multi should be fine for a small number of pages like that. You could probably run 50 requests at once and get all the pages in 30 seconds or less.
 
Wow, never heard of curl_multi. That sounds really good, because scraping speed is the bottleneck for me.
 
I can't find any example of curl_multi on the web.

Can anyone show a small example? Like searching 10 keywords simultaneously through Google?
 
Yes, go to php.net and look at the cURL documentation. You'll see a bunch of functions named curl_multi_*. That's all the stuff you need to know.

I am still trying to get this to work the way I want, but the order of events is basically like "regular" cURL:
1. Initialize a session
2. Set options
3. Execute the session
4. Close the session
5. Move on to parsing the scraped content however you want

The difference here is that you are doing this with multiple handles, so you can quickly become as confused as a termite in a yo-yo, as we used to say in Texas. (That is, if you have a lot of stuff going on at once.)

At the bottom of a few of those curl_multi function pages is an example script that demonstrates using curl_multi, BTW.
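
To save you some digging, here's a minimal sketch of that five-step pattern. The URLs and the 10-second timeout are placeholder values I made up, not anything from the manual:

<?php
// Minimal curl_multi sketch: fetch several pages in parallel.
// URLs and the timeout are placeholders.
$urls = array(
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
);

$mh = curl_multi_init();
$handles = array();

// Steps 1 and 2, once per URL: initialize an easy handle, set its
// options, and add it to the multi handle.
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Step 3: pump all the handles until nothing is still running.
$running = 0;
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh); // wait for network activity instead of busy-looping
    }
} while ($running > 0);

// Step 4: grab each page's content, then detach and close the handle.
$pages = array();
foreach ($handles as $i => $ch) {
    $pages[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

// Step 5: parse $pages however you want.

The curl_multi_select() call just blocks until one of the transfers has something to do, so the loop isn't chewing up CPU while the downloads run.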
 