Programming a scraper

vizilla · Aug 28, 2007

I'd like to program a script that would scrape content off a target site.

Does anyone have a tutorial, or can anyone give me a general pointer? :rasta:

barman · Aug 28, 2007

oooff.com

DomainRealty · Aug 28, 2007

justfuckinggoogleit.com

psychoul · Aug 28, 2007

vizilla said:
I'd like to program a script that would scrape content off a target site.

Does anyone have a tutorial, or can anyone give me a general pointer? :rasta:

Well, just to get you started....

1. First you need to get the HTML of that particular page, so find the PHP function that does that. file_get_contents() will do it, as long as allow_url_fopen is enabled.
2. Then, by looking at the HTML source, work out the boundaries of the content that interests you (the markup just before and just after it).
3. Finally, use a regular expression to extract that content and do whatever you want with it.
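
The three steps above could be sketched roughly like this. It's only a minimal example, not the one true way: the function names fetch_html() and extract_between(), the sample page, and the boundary strings are all made up for illustration, and the fetch step assumes allow_url_fopen is on.

```php
<?php
// Step 1: download the raw HTML of a page (returns null on failure).
// Hypothetical helper; not called below so the sketch runs offline.
function fetch_html(string $url): ?string
{
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

// Steps 2 and 3: pull out whatever sits between two boundary strings.
// $start and $end are copied from the page source; preg_quote() makes
// sure any regex metacharacters in them are treated literally.
function extract_between(string $html, string $start, string $end): ?string
{
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    if (preg_match($pattern, $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// Made-up sample page, so the sketch runs without touching the network.
$sample = '<html><body><div id="nav">menu</div>'
        . '<div id="article"><h1>Title</h1><p>Body text.</p></div>'
        . '</body></html>';

$article = extract_between($sample, '<div id="article">', '</div>');
echo $article, "\n";
```

On a real site you would pass the result of fetch_html() instead of $sample. Note that the non-greedy (.*?) stops at the first closing boundary, so nested tags of the same kind will cut the match short.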

Of course, it depends on the target site. If, for example, you need to scrape a few articles from an articles site, you should figure out how to request them programmatically (for instance, from the pattern in their URLs). Then run every article page through your regular expression to extract the actual article, and save them all to an output file.
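
As a rough sketch of that loop: build the article URLs, fetch each one, run it through the regex, and append the result to a file. Everything specific here is an assumption for the example: the article.php?id=N URL layout, the div class, and the helper names build_article_urls() and scrape_articles(). The fetcher is passed in as a callable so the example can run against fake data.

```php
<?php
// Build the list of article URLs from a pattern. The URL layout
// (article.php?id=N) is invented for this example.
function build_article_urls(string $base, int $firstId, int $lastId): array
{
    $urls = [];
    for ($id = $firstId; $id <= $lastId; $id++) {
        $urls[] = $base . '?id=' . $id;
    }
    return $urls;
}

// Fetch each URL, run it through the regex, append the match to a file.
// $fetch is any callable that returns the HTML for a URL (or false/null).
// Returns the number of articles saved.
function scrape_articles(array $urls, callable $fetch, string $regex, string $outFile): int
{
    $saved = 0;
    foreach ($urls as $url) {
        $html = $fetch($url);
        if ($html === false || $html === null) {
            continue; // request failed, skip this one
        }
        if (preg_match($regex, $html, $m) === 1) {
            file_put_contents($outFile, trim($m[1]) . "\n", FILE_APPEND);
            $saved++;
        }
    }
    return $saved;
}

// Offline stand-in for file_get_contents(), so the sketch runs anywhere.
$fakeFetch = function (string $url) {
    return '<div class="article">Article from ' . $url . '</div>';
};

$urls    = build_article_urls('http://example.com/article.php', 1, 3);
$outFile = tempnam(sys_get_temp_dir(), 'scrape');
$count   = scrape_articles($urls, $fakeFetch, '/<div class="article">(.*?)<\/div>/s', $outFile);
```

To hit a real site, pass 'file_get_contents' (or your own cURL wrapper) as the $fetch argument instead of $fakeFetch.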

But because your question is not very specific, I cannot help you further right now.

Hope that helps.

Aequitas · Aug 28, 2007

Start with this for php.

Code:

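class spider
{
    // This class grabs the content from the sites

    // cURL handle, created in get()
    private $curl;

    // Apply the cURL options shared by every request
    function setup()
    {
        // Store and resend cookies so sessions survive between requests
        $cookieJar = 'cookies.txt';
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        // Return the page as a string instead of printing it
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);

        // Older PHP versions refuse CURLOPT_FOLLOWLOCATION when
        // open_basedir or safe_mode is enabled, so only set it when
        // both are off.
        $safeMode = strtolower((string) ini_get('safe_mode'));
        if (ini_get('open_basedir') == '' && in_array($safeMode, array('', '0', 'off'))) {
            curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        }
    }

    // Fetch $url and return its content (false on failure)
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    // Run the prepared request
    function request()
    {
        return curl_exec($this->curl);
    }
}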

It's not what you do with the scraper, it's what you do with the code once you get it. I hope you know all of your PHP string functions, because you'll need them, as well as some regular expressions to strip out the parts you need. For directories and shit it should be easy enough.
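
To give an idea of the string-function and regex side, here's a small sketch for a directory-style page: strpos()/substr() to cut out the listing, then preg_match_all() to collect the links. The helper names cut_between() and extract_links() and the sample page are invented for the example, and the link regex assumes double-quoted href attributes, which real pages won't always have.

```php
<?php
// Cut out the text between two markers using only string functions.
function cut_between(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Collect every link as [url, text]; handy for directory-style listings.
// Assumes href values are wrapped in double quotes.
function extract_links(string $html): array
{
    $links = [];
    preg_match_all('/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // strip_tags() drops any markup inside the link text
        $links[] = [$m[1], trim(strip_tags($m[2]))];
    }
    return $links;
}

// Made-up directory page for the example.
$page = '<ul id="listing">'
      . '<li><a href="http://example.com/a">Site <b>A</b></a></li>'
      . '<li><a class="x" href="http://example.com/b">Site B</a></li>'
      . '</ul><p>footer</p>';

$listing = cut_between($page, '<ul id="listing">', '</ul>');
$links   = extract_links($listing);
```

Cutting the page down to the listing first keeps the regex from picking up navigation and footer links.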

sam-i-am · Aug 29, 2007

Easiest scraper ever:

Code:

$content = htmlspecialchars(file_get_contents($url));

This grabs the HTML of a given URL ($url) and escapes it, so echoing $content shows the page source as text. Not as good as all the curl stuff usually, but works for me (a beginner).
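
A slightly more careful version of the same idea might look like the sketch below: it sets a timeout and a User-Agent through a stream context and returns null when the fetch fails. The function names fetch_page() and html_for_display(), the 10 second timeout, and the 'MyScraper/0.1' agent string are placeholders, not anything from the posts above, and fetching real URLs this way still needs allow_url_fopen.

```php
<?php
// Fetch a URL with a timeout and a User-Agent header; null on failure.
// The default timeout and agent string are placeholder values.
function fetch_page(string $url, int $timeout = 10, string $agent = 'MyScraper/0.1'): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeout,
            'user_agent' => $agent,
        ],
    ]);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Keep the raw HTML for parsing; escape only the copy you display.
function html_for_display(string $html): string
{
    return htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
}
```

Keeping the raw and escaped copies separate matters because a regex written against the page source won't match once the angle brackets have been turned into entities.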

sam-i-am · Aug 29, 2007

Aequitas said:
It's not what you do with the scraper, it's what you do with the code once you get it. I hope you know all of your PHP string functions, because you'll need them, as well as some regular expressions to strip out the parts you need. For directories and shit it should be easy enough.

That's the tutorial I would be more interested in seeing.

krazyjosh5 · Aug 29, 2007

regex on this one is a bitch. almost done, tho.

smaxor · Aug 29, 2007

vizilla, that's what my site, Oooff.com (Scraping and Posting your way to money on the Internet), is about, as barman mentioned. Hope it helps yah.
