PHP - Scraping Multiple Pages

Status
Not open for further replies.

elime
Oct 19, 2007
Hi guys,

I'm not even close to great with PHP; I mostly just tinker around when I get a chance. I'm hoping this is something super simple I'm just not catching on to.

So, quick question here about scraping. Most scripts I see (and ones I've successfully made/frankenscripted myself) focus on scraping a single page, which is fine, but what if you want a single script to scrape multiple pages?

Say I have a list of URLs in a text document, nothing massive, like 5-10. What would be the best way to go about scraping the HTML from each and dumping it all into a single, appended txt file?

What I've tried is importing the list as an array and then using foreach to get the HTML and fwrite it. Needless to say, I'm having some difficulties.

Yes, I know I should be getting comfortable with cURL, that's my next step.

Any tips or info on this would be great though, thanks!
 


Just concatenate the scrapes into the same variable, then dump it as you see fit.

example:

$scrape  = file_get_contents('hxxp://mysite.com/1.php');
$scrape .= file_get_contents('hxxp://mysite.com/2.php');
$scrape .= file_get_contents('hxxp://mysite.com/3.php');

file_put_contents('scrape.txt', $scrape);

If it's a line-separated file, you might have to manually add a line break after each scrape.
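For example (a minimal sketch; the filename and page contents here are just placeholders standing in for the real scrape results):

```php
<?php
// Sketch: collect each page into an array, then join with a newline
// so a line-separated output file stays line-separated.
// $pages stands in for the file_get_contents() results above.
$pages = [
    "<html>page one</html>",
    "<html>page two</html>",
];
file_put_contents('scrape.txt', implode("\n", $pages));
```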
 
Make your scraper a function that takes in a URL. Then do whatever looping is necessary to pass in all the URLs.
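A minimal sketch of that approach (scrapePage() and the filenames are made-up names for illustration; file_get_contents() also accepts local paths, which the demo uses instead of live URLs):

```php
<?php
// Sketch: wrap the single-page fetch in a function that takes a URL,
// then loop over however many URLs you have.
function scrapePage(string $url): string
{
    $html = @file_get_contents($url); // @ suppresses the warning on failure
    return $html === false ? '' : $html;
}

// Demo: local files stand in for real URLs here.
file_put_contents('page1.txt', '<p>one</p>');
file_put_contents('page2.txt', '<p>two</p>');

$all = '';
foreach (['page1.txt', 'page2.txt'] as $url) {
    $all .= scrapePage($url) . "\n";
}
file_put_contents('all.txt', $all);
```

Returning an empty string on failure keeps the loop simple; a real scraper would probably log or retry instead.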
 
If you want to load URLs from a txt file and scrape the HTML, try something like this

Code:
<?php

$file = "yourfile.txt"; // file that contains your links, one per line

// file() (not file_get_contents) returns an array of lines
$links = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// all your scraped HTML content will be appended to this file
$htmlfile = fopen("htmlfile.txt", 'a');

foreach ($links as $link) {
    $link = trim($link);
    $html = file_get_contents($link);
    fwrite($htmlfile, $html);
}

fclose($htmlfile);

?>
 
Take a look at the curl_multi functions: easy parallel requests with cURL, much faster than fetching the pages serially.


Here's a trio of classes and an example from some library functions that I've used:

Classes:
Code:
  class WebRequest {
      
      
      public $RequestHeader = array();
      public $UserAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7';
      public $Url;
      public $Html;
      public $ID;
      public $CookieJar = null;
      
      public function __construct( $ID, $Url, $HeaderArray = null  ) {
        if ($HeaderArray == null) {
            $this->InitStandardHeader();   
        } else {
            $this->RequestHeader = $HeaderArray;   
        }       
        $this->Url = $Url;
        $this->ID = $ID;
      }
      
      
      public function InitStandardHeader() {
          
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank. 
        
        $this->RequestHeader = $header;   
      }
       
  }


  class CurlMulti {
      
      
      public $TIMEOUT = 10;

      /*
      *    @param $WebRequestArray
      */
        public function ExecuteRequests($WebRequestArray) {
            $mh = curl_multi_init();

            foreach ($WebRequestArray as $i => $WR) {
                $conn[$i]=curl_init($WR->Url);

                curl_setopt($conn[$i], CURLOPT_USERAGENT, $WR->UserAgent);
                curl_setopt($conn[$i], CURLOPT_HTTPHEADER, $WR->RequestHeader);
                
                if ($WR->CookieJar != null) {
                    curl_setopt($conn[$i], CURLOPT_COOKIEJAR, $WR->CookieJar);
                }
                curl_setopt($conn[$i],CURLOPT_AUTOREFERER, true);   
                curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,1);//return data as string 
                curl_setopt($conn[$i],CURLOPT_FOLLOWLOCATION,1);//follow redirects
                curl_setopt($conn[$i],CURLOPT_MAXREDIRS,2);//maximum redirects
                curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,$this->TIMEOUT);//timeout
                curl_multi_add_handle ($mh,$conn[$i]);
            }

            // run the handles; curl_multi_select() blocks until there is
            // activity, which avoids busy-waiting at 100% CPU
            do {
                curl_multi_exec($mh, $active);
                if ($active) {
                    curl_multi_select($mh);
                }
            } while ($active);

            foreach ($WebRequestArray as $i => $WR) {                  
                   $WR->Html = curl_multi_getcontent($conn[$i]);
                   curl_multi_remove_handle($mh,$conn[$i]);
                   curl_close($conn[$i]);
            }
            curl_multi_close($mh);
            return $WebRequestArray;

        }
   
      
  }

  class HTMLStripper {
      
      
      // set of frankensteined regexes that strip all html from text
      public static function StripAllHTML($html) {
 
        $html = preg_replace('/document\\.write\\(.+?\\);/si', ' ', $html); 
        $search = array('%<\\s*script[^>]*?>.*?<\\s*/script\\s*>%si',  // Strip out javascript
         '@<\\s*style[^>]*?>.*?<\\s*/style\\s*>@siU',    // Strip style tags properly
         '@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
         '@<![\s\S]*?--[ \t\n\r]*>@');        // Strip multi-line comments including CDATA

        $html = preg_replace($search, '', $html);

        //$html = preg_replace('/<a\s+.*?href="([^"]+)"[^>]*>([^<]+)<\/a>/is', '\2', $html);                
        $html = strip_tags( $html );           



        
        $html = strtolower($html);
        
        $html = preg_replace('/&[a-z]{2,6};/', ' ', $html);
        $html = preg_replace('/\\n/', ' ', $html);
        $html = preg_replace('/-{2,}+/', ' ', $html);
        $html = preg_replace('/&#\\d+;/', ' ', $html);  
        $html = preg_replace('%\\s{2,}%', ' ', $html);  // shrink double spaces
        
        $punc =". , : ; ? ! ( ) = / \" \\ * _ > < | @ $ [ ] · +";
        $punc = explode(" ",$punc);
        foreach($punc as $value){
            $html = str_replace($value, " ", $html);
        }
        
        $html = preg_replace('%\'%', '', $html);  // strip apostrophes
        $html = preg_replace('%\\s{2,}%', ' ', $html);  // shrink double spaces again
      
        return $html;
      
      }   
      
      
      
  }
Usage:

Code:
$DEBUG = true;

// setup the urls we want to scrape:
$urls = array();
$urls[] = "http://www.google.com";
$urls[] = "http://www.yahoo.com";
$urls[] = "http://www.wickedfire.com";


// we'll append all HTML to this variable:
$allhtml = "";                
            
// BUILD OUT WEBREQUEST ARRAY
$wreqs = array();
foreach ($urls as $value) {
    $wr = new WebRequest($value, $value);
    $wreqs[] = $wr;  
}
// GET ALL USING CURL MULTI
$cm = new CurlMulti();
if ($DEBUG) printf("%s: %s pages requested<BR>",  time(), count($urls) );
$wreqs = $cm->ExecuteRequests($wreqs);
if ($DEBUG) printf("%s: %s pages fetched<BR>",  time(), count($urls) );
foreach ($wreqs as $value) {
    $string = $value->Html;
    $string = HTMLStripper::StripAllHTML($string);
    $allhtml .= " " . $string;            
}
 
Thanks Shaggs, ::StripAllHTML is just what I was looking for. I wonder why they/you didn't use strip_tags for the HTML part though. Anyway, great stuff.

It does use strip_tags about halfway through. IIRC, this function was the result of combining two other similar functions I found, plus some more regex stuff I added in. I had a giant blob of shitty HTML I was using as a test case (including some malformed HTML). strip_tags by itself wasn't working all that well, and when I finally got it all working, that function was the result.
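For the curious, the usual reason strip_tags alone isn't enough: it removes the tags but keeps the text inside script and style blocks, so JS and CSS source leaks into the "text". A minimal illustration (the sample HTML and patterns here are simplified stand-ins for what StripAllHTML does):

```php
<?php
// strip_tags() removes tags but keeps the contents of <script> and
// <style> elements, so the JS/CSS source survives as "text".
$html = '<p>Hello</p><script>var x = 1;</script><style>p{color:red}</style>';

echo strip_tags($html), "\n";   // "Hellovar x = 1;p{color:red}"

// Stripping those whole blocks first (as the regexes above do) fixes it:
$clean = preg_replace('%<script[^>]*>.*?</script>%si', '', $html);
$clean = preg_replace('%<style[^>]*>.*?</style>%si', '', $clean);
echo strip_tags($clean), "\n";  // "Hello"
```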
 