Programming a scraper

Status
Not open for further replies.

vizilla

New member
Jul 11, 2006
I'd like to program a script that would scrape content off a target site :)

Does anyone have a tutorial, or can anyone give me a general pointer? :rasta:
 


Well, just to get you started....

1. First you need to get the HTML code of that particular page, so find the PHP function that does that. file_get_contents() can do it, as long as allow_url_fopen is enabled.
2. Then, by looking at the HTML source, work out the boundaries (the markup just before and just after) of the content that interests you.
3. Finally, use a regular expression to extract that content and do whatever you want with it.
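
Roughly, the three steps above could look like this. It's only a sketch: the `<div id="content">` boundary, the example.com URL and the function name extractContent() are made up for illustration, and a real page will need its own pattern.

```php
<?php
// Steps 2 and 3: pull out whatever sits between two known boundaries.
// The boundary markers are hypothetical; check the real page source for yours.
function extractContent($html)
{
    // The /s flag lets "." match newlines; the lazy ".*?" stops at the first closing marker.
    $pattern = '/<div id="content">(.*?)<\/div>/s';
    if (preg_match($pattern, $html, $matches)) {
        return trim($matches[1]);
    }
    return null; // boundaries not found
}

// Step 1: get the HTML. For a live page it would be something like:
// $html = file_get_contents('http://www.example.com/page.html');
$html = '<html><body><div id="content"> Hello scraper </div></body></html>';
echo extractContent($html), "\n";
```

Note that a lazy match like this stops at the first closing `</div>`, so it will cut the content short if the block has nested divs. In that case pick a more distinctive end marker.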

Of course it depends on the target site. If, for example, you need to scrape a few articles from an articles site, you should figure out how to request them dynamically (usually the URL changes in a predictable way from one article to the next). Then run every article page through your regular expression to extract the actual article, and save them all to an output file.

But because your question is not very specific, I cannot help you further right now.
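
For the articles case, a rough sketch might look like the one below. The `article.php?id=N` URL pattern, the `<div class="article">` markers and both function names are assumptions for the example, not anything from a real site. The fetcher is passed in as a callback so you can use file_get_contents() or a curl wrapper.

```php
<?php
// Pull the article body out of one page (hypothetical markers).
function extractArticle($html)
{
    if (preg_match('/<div class="article">(.*?)<\/div>/s', $html, $m)) {
        return trim(strip_tags($m[1]));
    }
    return null;
}

// Request each article by id, extract it, and append it to one output file.
// $fetch is any callable that takes a URL and returns HTML, or false on failure.
function scrapeArticles($baseUrl, array $ids, $fetch, $outFile)
{
    $saved = 0;
    foreach ($ids as $id) {
        $html = call_user_func($fetch, $baseUrl . '?id=' . urlencode((string) $id));
        if ($html === false) {
            continue; // request failed, skip this one
        }
        $article = extractArticle($html);
        if ($article !== null) {
            file_put_contents($outFile, $article . "\n\n", FILE_APPEND);
            $saved++;
        }
        // sleep(1); // uncomment to pause between requests
    }
    return $saved; // number of articles written
}

// Usage (hypothetical site):
// scrapeArticles('http://www.example.com/article.php', range(1, 10), 'file_get_contents', 'articles.txt');
```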

Hope that helps.
 
Start with this for php.

Code:
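class spider
{
    // This class grabs the content from the sites
    private $curl;

    function setup()
    {
        // Keep cookies between requests
        $cookieJar = 'cookies.txt';
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        // Return the page as a string instead of printing it
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);

        // Older PHP versions refuse CURLOPT_FOLLOWLOCATION when open_basedir
        // or safe_mode is in effect, so only turn it on when neither is set.
        $safeMode = strtolower((string) ini_get('safe_mode'));
        if (ini_get('open_basedir') == '' && in_array($safeMode, array('', '0', 'off'), true)) {
            curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        }
    }

    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    function request()
    {
        // Returns the HTML, or false if the request failed
        return curl_exec($this->curl);
    }
}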

It's not about the scraper itself, it's what you do with the code once you've got it. I hope you know all of your PHP string functions, because you'll need them, as well as some regular expressions to strip out the parts you need. For directories and the like it should be easy enough.
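
To give an idea of the string-function side, here is a small helper of the kind you end up writing, plus a regex for pulling links. It's a sketch only: the name between() and the sample markup are made up for the example.

```php
<?php
// Return the text between $start and $end, or null if either marker is missing.
function between($haystack, $start, $end)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

$html = '<title>My Directory</title><a href="http://www.example.com/">Example</a>';
echo between($html, '<title>', '</title>'), "\n";

// A regex is handier when there are many matches, e.g. every link on a page.
preg_match_all('/<a\s+href="([^"]+)"[^>]*>(.*?)<\/a>/is', $html, $links, PREG_SET_ORDER);
foreach ($links as $link) {
    // $link[1] is the href, $link[2] is the link text
    echo $link[1], ' => ', strip_tags($link[2]), "\n";
}
```

The strpos()/substr() approach is fine for a single value with fixed markers. Once the markup varies or you want every match, a regular expression is usually less work.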
 
Easiest scraper ever:

Code:
$content = htmlspecialchars(file_get_contents(URL));

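This grabs the HTML of a given URL (URL here is a constant you define yourself) and escapes it, so when you echo $content you see the page source instead of the rendered page. Usually not as good as all the curl stuff, but it works for me (a beginner).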
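
If you want the one-liner to fail a bit more gracefully, something like this might do. It's a sketch: the function name fetchSource(), the timeout and the user agent string are arbitrary choices for the example.

```php
<?php
// Fetch a URL and return its HTML escaped for display, or null on failure.
function fetchSource($url)
{
    // Optional settings for http:// URLs; the values here are arbitrary.
    $context = stream_context_create(array(
        'http' => array('timeout' => 10, 'user_agent' => 'MyScraper/0.1'),
    ));
    // @ hides the warning file_get_contents() raises when the request fails
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        return null;
    }
    return htmlspecialchars($html);
}

// Usage (hypothetical URL):
// $content = fetchSource('http://www.example.com/');
// if ($content !== null) { echo '<pre>', $content, '</pre>'; }
```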
 
It's not about the scraper itself, it's what you do with the code once you've got it. I hope you know all of your PHP string functions, because you'll need them, as well as some regular expressions to strip out the parts you need. For directories and the like it should be easy enough.

That's the tutorial I would be more interested in seeing. :):):)
 