Long Scraping Process Problems

nvanprooyen

Dec 8, 2008
I've been writing this scraper in PHP for the past several months and have it pretty much collecting all the data I want it to. The problem I'm having is that the time it takes to go out and grab all of the data is pretty long - 5-6 hours for a complete "set" - and when I run a job that size, I almost never get a complete set of data back. Either the app crashes, the browser crashes, or something undetermined goes wrong. I'm closing my curl and MySQL sessions.

I've already started writing some error handling so I can pick up where I left off, but it still pisses me off when I get up in the morning expecting to see 1000 results and only have 200. I'm guessing that if I ran this from the command line I might not run into as many issues, but I still need to test that. Does anyone know how to pass POST variables using the CLI...because I have no idea. Any other suggestions? I know some of you do some pretty aggressive scraping, so any advice you can give me is much appreciated...
 


You can't "POST" to the command line; you pass arguments on the command line, then pick them up in the script with the $argv variable.

Code:
php /path/to/script.php argument1 argument2
Code:
<?php
  $scriptName = $argv[0];
  //will populate $scriptName with 'script.php'

  $argument1 = $argv[1];
  //will populate $argument1 with 'argument1' that was passed from the command line.

  $argument2 = $argv[2];
  //will populate $argument2 with 'argument2' that was passed from the command line.

?>
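If you want something closer to named POST-style values, getopt() also works from the CLI. A quick sketch - the option letters here are just examples:

Code:
<?php
  // Sketch only - 's' (start) and 'l' (limit) are example options.
  // Usage: php /path/to/script.php -s 0 -l 50
  $opts = getopt('s:l:');

  $start = isset($opts['s']) ? (int) $opts['s'] : 0;
  $limit = isset($opts['l']) ? (int) $opts['l'] : 50;
?>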
Good luck.
 
Use curl_multi or simply run multiple instances of the script that each grab a section of the content.
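
If you go the curl_multi route, the basic skeleton looks something like this - the URLs are just placeholders:

Code:
<?php
  // Placeholder URLs - swap in the pages your scraper actually needs.
  $urls = array(
      'http://example.com/page1',
      'http://example.com/page2',
  );

  $mh = curl_multi_init();
  $handles = array();

  foreach ($urls as $i => $url) {
      $ch = curl_init($url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_TIMEOUT, 30); // don't let one slow page hang the whole batch
      curl_multi_add_handle($mh, $ch);
      $handles[$i] = $ch;
  }

  // Run all the handles until every transfer has finished.
  $running = null;
  do {
      curl_multi_exec($mh, $running);
      curl_multi_select($mh);
  } while ($running > 0);

  // Collect the responses and clean up.
  $results = array();
  foreach ($handles as $i => $ch) {
      $results[$i] = curl_multi_getcontent($ch);
      curl_multi_remove_handle($mh, $ch);
      curl_close($ch);
  }
  curl_multi_close($mh);
?>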
 
It's not so much an issue of speed. In fact, I have it going intentionally slow so I don't get an IP ban from someone. I just need to figure out a way to run something reliably for that amount of time.
 
What I do is run a fake multi-thread setup in PHP: with 15 IPs on my server, I run 15 tasks at once.

But if yours is crashing, it could be something in your code - maybe something is using up too much memory, or you don't have a timeout set on your curl requests.
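
For the timeout part, something like this - the IP is just a placeholder for one of your server's addresses:

Code:
<?php
  // Sketch only - 192.0.2.10 stands in for one of your server's IPs.
  $ch = curl_init('http://example.com/page');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);       // give up on connections that never open
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);              // cap the whole request
  curl_setopt($ch, CURLOPT_INTERFACE, '192.0.2.10');  // bind the request to a specific outgoing IP
  $html = curl_exec($ch);
  curl_close($ch);
?>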

Have fun
 
Can you split it up into chunks? Instead of grabbing all 1000 results at once, maybe do 50 at a time, but have it run every 10 minutes? Long-running scripts in PHP can be a problem - possible to do, yes, but not really what PHP was designed for.
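
One way to set that up - just a rough sketch, the table and column names are made up - is to keep your URLs in a queue table and have cron fire the script every 10 minutes to work through the next 50:

Code:
<?php
  // Rough sketch - 'scrape_queue' and its columns are hypothetical.
  // Cron entry (every 10 minutes): */10 * * * * php /path/to/scrape_chunk.php
  $db = mysql_connect('localhost', 'user', 'pass');
  mysql_select_db('scraper', $db);

  // Grab the next 50 URLs that haven't been fetched yet.
  $res = mysql_query("SELECT id, url FROM scrape_queue WHERE fetched = 0 LIMIT 50");
  while ($row = mysql_fetch_assoc($res)) {
      $ch = curl_init($row['url']);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_TIMEOUT, 30);
      $html = curl_exec($ch);
      curl_close($ch);

      if ($html !== false) {
          // ... parse and store $html ...
          mysql_query("UPDATE scrape_queue SET fetched = 1 WHERE id = " . (int) $row['id']);
      }
  }
  mysql_close($db);
?>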
 

I came here to recommend this. Don't do massive bulk jobs in PHP like that; that's not what it's designed for. Break it up into individual chunks, then let cron trigger it all day long. That way you offload the long-running part to cron, something that was designed for it, instead of leaving PHP to do the whole job.

My other suggestion is to switch to something like Python or Ruby (Ruby's the shit) for long-running jobs, then let your PHP interface just access the data and trigger the scripts. They're much more reliable languages for these sorts of jobs.
 

I started teaching myself Ruby a while ago. It'll probably be a while before I get past the learning curve enough to duplicate what I already have. Great suggestions on the IPs, chunking the code up, and using cron to trigger it. I kind of need to rethink the whole process. Thanks everyone.
 
Hit me up if you want to discuss any Ruby stuff. I use it a lot and can probably share some code with you.

One day I want to get a Ruby thread going in the design and dev forum; just haven't gotten around to it.

PM me your GChat name, I lost all of my aliases after switching comps so I don't know who everyone is lol
 
curl_multi mother fucker.

As an alternative: break it into smaller sections and use a browser and AJAX to send out small groups of URLs to query. It kind of sucks having to leave a browser open, but sometimes it's just the easiest way...especially if you need to track partial progress.
 
Thanks Shady. Honestly never heard of curl_multi before today, but I'm a noob at this shit. Going to check it out. In hindsight, opening up 10 different curl sessions in different functions probably wasn't the smartest thing to do :)
 
SSH + screen = long scraping with PHP, no problems. screen means that if your connection goes down, you won't lose your running session.
 
Ah, yeah. You want to write containers for everything, as pre-configured as possible.
For example:
MyCurl
-getURL()
-getData()
-setURL($url)
-setRef($url)
-setPost($shouldPost, $array);
-setPost($shouldPost, $qstr);
-setUserAgent($name)
-getJavascriptVariable($name)
-sendRequest();
-setShowHeader($bool);
MyCurlMulti
-setURLs($array);//loads into MyCurl
-sendRequests();
-getData($connectionIndex);
-getAllData();//returns array of request data

That's just a brief example though. They have a habit of growing.
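
To make that concrete, a stripped-down version of the MyCurl idea might look like this - just a sketch, not anybody's actual class, and the defaults are arbitrary:

Code:
<?php
  // Stripped-down sketch of the container idea above - defaults are arbitrary.
  class MyCurl {
      private $url;
      private $post = false;
      private $postData;
      private $userAgent = 'Mozilla/5.0';
      private $data;

      public function setURL($url)        { $this->url = $url; }
      public function setUserAgent($name) { $this->userAgent = $name; }

      public function setPost($shouldPost, $postData) {
          $this->post = $shouldPost;
          $this->postData = $postData; // array or query string, curl accepts both
      }

      public function sendRequest() {
          $ch = curl_init($this->url);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
          curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
          curl_setopt($ch, CURLOPT_TIMEOUT, 30);
          if ($this->post) {
              curl_setopt($ch, CURLOPT_POST, true);
              curl_setopt($ch, CURLOPT_POSTFIELDS, $this->postData);
          }
          $this->data = curl_exec($ch);
          curl_close($ch);
      }

      public function getData() { return $this->data; }
  }

  // Usage:
  // $c = new MyCurl();
  // $c->setURL('http://example.com/page');
  // $c->sendRequest();
  // $html = $c->getData();
?>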