How would I scrape every <a href> off this page?



I know you got it already but if it's a one-time thing and you don't want to go coding something up for it, you could use a bookmarklet like this one:

Code:
javascript:(function(){var as=document.getElementsByTagName("a"),str="<ol>",i;for(i=0;i<as.length;i++){str+='<li><a href="'+as[i].href+'">'+as[i].href+"</a></li>\n"}str+="</ol>";var w=window.open();w.document.write(str);w.document.close();})()

Paste that into the browser address bar while on the page and it will open a new tab listing every link. 2133 in this case :)

You could also modify the bookmarklet to show the links separated by a comma and not linked. That way you'd get a CSV of any page at the click of a button.
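If you'd rather have that comma-separated list outside the browser, here's a minimal Python sketch of the same idea using only the standard library; the sample HTML is made up for illustration, and you'd fetch a real page with urllib or requests instead:

```python
# Collect the href of every <a> tag with the stdlib HTML parser, then
# join the results with commas to get the CSV-style output mentioned above.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def scrape_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Placeholder page just to show the output format.
sample = '<p><a href="http://example.com/a">a</a> <a href="/b">b</a></p>'
print(",".join(scrape_links(sample)))  # http://example.com/a,/b
```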
 

That's a neat little trick...
 
If you like python: Scrapemark - Easy Python Scraping Library

Code:
import scrapemark

print scrapemark.scrape("""
        {*
                <div class='news-summary'>
                <h3><a href='{{ [links].url }}'>{{ [links].title }}</a></h3>
                <p>{{ [links].description }}</p>
                <li class='digg-count'>
                <strong>{{ [links].diggs|int }}</strong>
                </li>
                </div>
        *}
        """,
        url='http://digg.com/')
 
That PHP DOM parser is indeed a very useful piece of code, not only for parsing but also for manipulating HTML. Say you want to parse out the links and change them to something else; you can do:
Code:
$html = file_get_html('http://www.google.com/');
foreach ($html->find('a') as $element) {
    // add some code here to check which URLs you want to change
    // ...
    // change the url
    $element->href = "http://www.my-new-url/";
    $element->innertext = "new linktext";
}
echo $html; // we just print out the altered html code ... do with it what you like 8)
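For anyone not on PHP, the same rewrite-every-link idea can be sketched in Python with a regex substitution; regexes are fragile on messy HTML, and the replacement URL and link text below are just placeholders:

```python
# Swap the href and inner text of every simple <a ...>...</a> pair.
# A real DOM parser is safer, but this mirrors the quick-hack spirit above.
import re

def rewrite_links(html, new_url, new_text):
    """Replace the href and anchor text of every <a ...>...</a> pair."""
    pattern = re.compile(
        r'<a\b[^>]*\bhref=["\']?[^"\'\s>]*["\']?[^>]*>(.*?)</a>',
        re.I | re.S)
    return pattern.sub('<a href="%s">%s</a>' % (new_url, new_text), html)

page = '<p><a href="http://old.example/">old text</a></p>'
print(rewrite_links(page, "http://www.my-new-url/", "new linktext"))
# <p><a href="http://www.my-new-url/">new linktext</a></p>
```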
 
That PHP DOM parser is indeed a very useful piece of code, not only for parsing but also for manipulating HTML. Say you want to parse out the links and change them to something else; you can do:

Haha, I've been needing to do just this, and I wasn't even looking for it at the moment! (Awesome.)

I was also just thinking that this thread is a great example of a really helpful internet question/multiple-answer situation (weird for WickedFire :) ). I mean, ask, get three answers, plus someone does it for you and sends the file; that's fucken futuristic or something like that... :D
 
I'm amazed some of you PHP guys have never seen the Simple HTML DOM script.

I can say though, I have never seen a bookmarklet that could do this either, so + rep for that shit.
 
Problem's already solved, but here's my 2c.

Just scrape the page (file_get_contents or curl) into a variable (say $text)

PHP:
function extract_links($text) {
  preg_match_all('/<\s*a[^<>]*?href=[\'"]?([^\s<>\'"]*)[\'"]?[^<>]*>(.*?)<\/a>/si',
    $text,
    $match_array,
    PREG_SET_ORDER);
  $return = array() ;
  foreach ($match_array as $page) {
    $full_anchor = $page[0];
    $href = $page[1];
    $anchortext = $page[2];
    if ( (preg_match("/http:/i",$href)) && 
         (trim($href) != '') && 
         ($href[0]!= '/') ) {
      array_push($return,$page) ;
    }
  }
  
  return $return ;
}
Then run a print_r(extract_links($text)). The array will spit out the full <a ...>...</a> code, the link, and the anchor text for each link on the page.
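In case it helps anyone off PHP, here's a rough Python port of that extract_links() function, keeping the same regex and the same filtering (non-empty absolute http links only, skip root-relative ones); the sample page is made up:

```python
# Python port of the PHP extract_links(): finditer() plays the role of
# preg_match_all() with PREG_SET_ORDER, yielding one match per anchor.
import re

LINK_RE = re.compile(
    r'<\s*a[^<>]*?href=[\'"]?([^\s<>\'"]*)[\'"]?[^<>]*>(.*?)</a>',
    re.I | re.S)

def extract_links(text):
    """Return (full_anchor, href, anchor_text) tuples for each kept link."""
    results = []
    for m in LINK_RE.finditer(text):
        href = m.group(1)
        # same filter as the PHP version: non-empty, absolute, http
        if href.strip() and not href.startswith('/') and re.search('http:', href, re.I):
            results.append((m.group(0), href, m.group(2)))
    return results

sample = '<a href="http://example.com/">home</a> <a href="/relative">skip me</a>'
print(extract_links(sample))
```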