PHP - Scraping Multiple Pages

Status
Not open for further replies.

elime
Oct 19, 2007
Hi guys,

I'm not even close to great with PHP; I mostly just tinker around when I get a chance. I'm hoping this is something super simple I'm just not catching on to.

So, quick question here about scraping. Most scripts I see (and ones I've successfully made/frankenscripted myself) focus on scraping a single page, which is fine, but what if you want a single script to scrape multiple pages?

Say I have a list of URLs in a text document, nothing massive, like 5-10. What would be the best way to go about scraping the HTML from each and dumping it all into a single, appended txt file?

What I've tried is importing the list as an array and then using foreach to get the HTML and fwrite it. Needless to say, I'm having some difficulties.

Yes, I know I should be getting comfortable with cURL, that's my next step.

Any tips or info on this would be great though, thanks!
 


Just concatenate the scrapes into the same variable, then dump it as you see fit.

example:

$scrape  = file_get_contents('hxxp://mysite.com/1.php');
$scrape .= file_get_contents('hxxp://mysite.com/2.php');
$scrape .= file_get_contents('hxxp://mysite.com/3.php');

file_put_contents('scrape.txt', $scrape);

If it's a line-separated file, you might have to manually add a line break after each scrape.
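For example (a minimal sketch; the filename and page contents here are just placeholders standing in for the real scrape results):

```php
<?php
// Sketch: collect each page into an array, then join with a newline
// so a line-separated output file stays line-separated.
// $pages stands in for the file_get_contents() results above.
$pages = [
    "<html>page one</html>",
    "<html>page two</html>",
];
file_put_contents('scrape.txt', implode("\n", $pages));
```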
 
Make your scraper a function that takes in a URL. Then do whatever looping is necessary to pass in all the URLs.
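A minimal sketch of that approach (scrapePage() and the filenames are made-up names for illustration; file_get_contents() also accepts local paths, which the demo uses instead of live URLs):

```php
<?php
// Sketch: wrap the single-page fetch in a function that takes a URL,
// then loop over however many URLs you have.
function scrapePage(string $url): string
{
    $html = @file_get_contents($url); // @ suppresses the warning on failure
    return $html === false ? '' : $html;
}

// Demo: local files stand in for real URLs here.
file_put_contents('page1.txt', '<p>one</p>');
file_put_contents('page2.txt', '<p>two</p>');

$all = '';
foreach (['page1.txt', 'page2.txt'] as $url) {
    $all .= scrapePage($url) . "\n";
}
file_put_contents('all.txt', $all);
```

Returning an empty string on failure keeps the loop simple; a real scraper would probably log or retry instead.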
 
If you want to load URLs from a txt file and scrape the HTML, try something like this

Code:
<?php

$file = "yourfile.txt"; // file that contains your links, one per line

// file() (not file_get_contents) returns an array of lines
$links = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// all your scraped HTML content will be appended to this file
$htmlfile = fopen("htmlfile.txt", 'a');

foreach ($links as $link) {
    $link = trim($link);
    $html = file_get_contents($link);
    fwrite($htmlfile, $html);
}

fclose($htmlfile);

?>
 
Take a look at the curl_multi functions: easy parallel requests with cURL, much faster than fetching the pages serially.


Here's a trio of classes and an example from some library functions that I've used:

Classes:
Code:
  class WebRequest {
      
      
      public $RequestHeader = array();
      public $UserAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7';
      public $Url;
      public $Html;
      public $ID;
      public $CookieJar = null;
      
      public function __construct( $ID, $Url, $HeaderArray = null  ) {
        if ($HeaderArray == null) {
            $this->InitStandardHeader();   
        } else {
            $this->RequestHeader = $HeaderArray;   
        }       
        $this->Url = $Url;
        $this->ID = $ID;
      }
      
      
      public function InitStandardHeader() {
          
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank. 
        
        $this->RequestHeader = $header;   
      }
       
  }


  class CurlMulti {
      
      
      public $TIMEOUT = 10;

      /*
      *    @param $WebRequestArray
      */
        public function ExecuteRequests($WebRequestArray) {
            $mh = curl_multi_init();

            foreach ($WebRequestArray as $i => $WR) {
                $conn[$i]=curl_init($WR->Url);

                curl_setopt($conn[$i], CURLOPT_USERAGENT, $WR->UserAgent);
                curl_setopt($conn[$i], CURLOPT_HTTPHEADER, $WR->RequestHeader);
                
                if ($WR->CookieJar != null) {
                    curl_setopt($conn[$i], CURLOPT_COOKIEJAR, $WR->CookieJar);
                }
                curl_setopt($conn[$i],CURLOPT_AUTOREFERER, true);   
                curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,1);//return data as string 
                curl_setopt($conn[$i],CURLOPT_FOLLOWLOCATION,1);//follow redirects
                curl_setopt($conn[$i],CURLOPT_MAXREDIRS,2);//maximum redirects
                curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,$this->TIMEOUT);//timeout
                curl_multi_add_handle ($mh,$conn[$i]);
            }

            // run the handles; curl_multi_select() blocks until there is
            // activity, which avoids busy-waiting at 100% CPU
            do {
                curl_multi_exec($mh, $active);
                if ($active) {
                    curl_multi_select($mh);
                }
            } while ($active);

            foreach ($WebRequestArray as $i => $WR) {                  
                   $WR->Html = curl_multi_getcontent($conn[$i]);
                   curl_multi_remove_handle($mh,$conn[$i]);
                   curl_close($conn[$i]);
            }
            curl_multi_close($mh);
            return $WebRequestArray;

        }
   
      
  }

  class HTMLStripper {
      
      
      // set of frankensteined regexes that strip all html from text
      public static function StripAllHTML($html) {
 
        $html = preg_replace('/document\\.write\\(.+?\\);/si', ' ', $html); 
        $search = array('%<\\s*script[^>]*?>.*?<\\s*/script\\s*>%si',  // Strip out javascript
         '@<\\s*style[^>]*?>.*?<\\s*/style\\s*>@siU',    // Strip style tags properly
         '@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
         '@<![\s\S]*?--[ \t\n\r]*>@');        // Strip multi-line comments including CDATA

        $html = preg_replace($search, '', $html);

        //$html = preg_replace('/<a\s+.*?href="([^"]+)"[^>]*>([^<]+)<\/a>/is', '\2', $html);                
        $html = strip_tags( $html );           



        
        $html = strtolower($html);
        
        $html = preg_replace('/&[a-z]{2,6};/', ' ', $html);
        $html = preg_replace('/\\n/', ' ', $html);
        $html = preg_replace('/-{2,}+/', ' ', $html);
        $html = preg_replace('/&#\\d+;/', ' ', $html);  
        $html = preg_replace('%\\s{2,}%', ' ', $html);  // shrink double spaces
        
        $punc =". , : ; ? ! ( ) = / \" \\ * _ > < | @ $ [ ] · +";
        $punc = explode(" ",$punc);
        foreach($punc as $value){
            $html = str_replace($value, " ", $html);
        }
        
        $html = preg_replace('%\'%', '', $html);  // strip apostrophes
        $html = preg_replace('%\\s{2,}%', ' ', $html);  // shrink double spaces again
      
        return $html;
      
      }   
      
      
      
  }
Usage:

Code:
$DEBUG = true;

// setup the urls we want to scrape:
$urls = array();
$urls[] = "http://www.google.com";
$urls[] = "http://www.yahoo.com";
$urls[] = "http://www.wickedfire.com";


// we'll append all HTML to this variable:
$allhtml = "";                
            
// BUILD OUT WEBREQUEST ARRAY
$wreqs = array();
foreach ($urls as $value) {
    $wr = new WebRequest($value, $value);
    $wreqs[] = $wr;  
}
// GET ALL USING CURL MULTI
$cm = new CurlMulti();
if ($DEBUG) printf("%s: %s pages requested<BR>",  time(), count($urls) );
$wreqs = $cm->ExecuteRequests($wreqs);
if ($DEBUG) printf("%s: %s pages fetched<BR>",  time(), count($urls) );
foreach ($wreqs as $value) {
    $string = $value->Html;
    $string = HTMLStripper::StripAllHTML($string);
    $allhtml .= " " . $string;            
}
 
Thanks Shaggs, ::StripAllHTML is just what I was looking for. I wonder why they/you didn't use strip_tags for the HTML part though. Anyway, great stuff.

It does use strip_tags about halfway through. IIRC, this function was the result of combining two other similar functions I found, plus some more regex stuff I added in. I had a giant blob of shitty HTML I was using as a test case (including some malformed HTML). strip_tags by itself wasn't working all that well, and when I finally got it all working, that function was the result.
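For the curious, the usual reason strip_tags alone isn't enough: it removes the tags but keeps the text inside script and style blocks, so JS and CSS source leaks into the "text". A minimal illustration (the sample HTML and patterns here are simplified stand-ins for what StripAllHTML does):

```php
<?php
// strip_tags() removes tags but keeps the contents of <script> and
// <style> elements, so the JS/CSS source survives as "text".
$html = '<p>Hello</p><script>var x = 1;</script><style>p{color:red}</style>';

echo strip_tags($html), "\n";   // "Hellovar x = 1;p{color:red}"

// Stripping those whole blocks first (as the regexes above do) fixes it:
$clean = preg_replace('%<script[^>]*>.*?</script>%si', '', $html);
$clean = preg_replace('%<style[^>]*>.*?</style>%si', '', $clean);
echo strip_tags($clean), "\n";  // "Hello"
```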
 