How would I scrape every <a href> off this page?



I know you got it already but if it's a one-time thing and you don't want to go coding something up for it, you could use a bookmarklet like this one:

Code:
javascript:(function(){var as=document.getElementsByTagName("a"),str="<ol>",i;for(i=0;i<as.length;i++){str+='<li><a href="'+as[i].href+'">'+as[i].href+"</a></li>\n"}str+="</ol>";var w=window.open();w.document.write(str);w.document.close();})()

Paste that into the browser address bar while on the page and it will open a new tab listing every link. 2133 in this case :)

You could also modify the bookmarklet to show the links separated by a comma and not linked. That way you'd get a CSV of any page at the click of a button.
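If you'd rather have that comma-separated list outside the browser, here's a minimal Python sketch of the same idea using only the standard library; the sample HTML is made up for illustration, and you'd fetch a real page with urllib or requests instead:

```python
# Collect the href of every <a> tag with the stdlib HTML parser, then
# join the results with commas to get the CSV-style output mentioned above.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def scrape_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Placeholder page just to show the output format.
sample = '<p><a href="http://example.com/a">a</a> <a href="/b">b</a></p>'
print(",".join(scrape_links(sample)))  # http://example.com/a,/b
```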
 

That's a neat little trick...
 
If you like python: Scrapemark - Easy Python Scraping Library

Code:
import scrapemark

print scrapemark.scrape("""
        {*
                <div class='news-summary'>
                <h3><a href='{{ [links].url }}'>{{ [links].title }}</a></h3>
                <p>{{ [links].description }}</p>
                <li class='digg-count'>
                <strong>{{ [links].diggs|int }}</strong>
                </li>
                </div>
        *}
        """,
        url='http://digg.com/')
 
That PHP DOM parser is indeed a very useful piece of code, not only for parsing but also for manipulating HTML. Say you want to parse out the links and change them to something else; you can do:
Code:
$html = file_get_html('http://www.google.com/');
foreach ($html->find('a') as $element) {
    // add some code here to check which URLs you want to change
    // ...
    // change the url
    $element->href = "http://www.my-new-url/";
    $element->innertext = "new linktext";
}
echo $html; // we just print out the altered html code ... do with it what you like 8)
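For anyone not on PHP, the same rewrite-every-link idea can be sketched in Python with a regex substitution; regexes are fragile on messy HTML, and the replacement URL and link text below are just placeholders:

```python
# Swap the href and inner text of every simple <a ...>...</a> pair.
# A real DOM parser is safer, but this mirrors the quick-hack spirit above.
import re

def rewrite_links(html, new_url, new_text):
    """Replace the href and anchor text of every <a ...>...</a> pair."""
    pattern = re.compile(
        r'<a\b[^>]*\bhref=["\']?[^"\'\s>]*["\']?[^>]*>(.*?)</a>',
        re.I | re.S)
    return pattern.sub('<a href="%s">%s</a>' % (new_url, new_text), html)

page = '<p><a href="http://old.example/">old text</a></p>'
print(rewrite_links(page, "http://www.my-new-url/", "new linktext"))
# <p><a href="http://www.my-new-url/">new linktext</a></p>
```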
 
That PHP DOM parser is indeed a very useful piece of code, not only for parsing but also for manipulating HTML. Say you want to parse out the links and change them to something else; you can do:

Haha, I've been needing to do just this, and I wasn't even looking for it at the moment! (Awesome.)

I was also just thinking that this thread is a great example of a really helpful internet question/multiple-answer situation (weird for WickedFire :) ). I mean, ask, get three answers, plus someone does it for you and sends the file; that's fucken futuristic or something like that... :D
 
I'm amazed some of you PHP guys have never seen the Simple HTML DOM script.

I can say though, I have never seen a bookmarklet that could do this either, so + rep for that shit.
 
Problem's already solved, but here's my 2c.

Just scrape the page (file_get_contents or curl) into a variable (say $text)

PHP:
function extract_links($text) {
  preg_match_all('/<\s*a[^<>]*?href=[\'"]?([^\s<>\'"]*)[\'"]?[^<>]*>(.*?)<\/a>/si',
    $text,
    $match_array,
    PREG_SET_ORDER);
  $return = array() ;
  foreach ($match_array as $page) {
    $full_anchor = $page[0];
    $href = $page[1];
    $anchortext = $page[2];
    if ( (preg_match("/http:/i",$href)) && 
         (trim($href) != '') && 
         ($href[0]!= '/') ) {
      array_push($return,$page) ;
    }
  }
  
  return $return ;
}
Then run a print_r(extract_links($text)). The array will spit out the full <a ...>...</a> code, the link, and the anchor text for each link on the page.
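In case it helps anyone off PHP, here's a rough Python port of that extract_links() function, keeping the same regex and the same filtering (non-empty absolute http links only, skip root-relative ones); the sample page is made up:

```python
# Python port of the PHP extract_links(): finditer() plays the role of
# preg_match_all() with PREG_SET_ORDER, yielding one match per anchor.
import re

LINK_RE = re.compile(
    r'<\s*a[^<>]*?href=[\'"]?([^\s<>\'"]*)[\'"]?[^<>]*>(.*?)</a>',
    re.I | re.S)

def extract_links(text):
    """Return (full_anchor, href, anchor_text) tuples for each kept link."""
    results = []
    for m in LINK_RE.finditer(text):
        href = m.group(1)
        # same filter as the PHP version: non-empty, absolute, http
        if href.strip() and not href.startswith('/') and re.search('http:', href, re.I):
            results.append((m.group(0), href, m.group(2)))
    return results

sample = '<a href="http://example.com/">home</a> <a href="/relative">skip me</a>'
print(extract_links(sample))
```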