Here's a scraper that, given a $kw, pulls the first page of Google results and scans every element with the "cite" tag. It then runs the URLs through a WHOIS check and saves the results. I'm getting bogus results from outside the sponsored links section, and I'm a bit lost on how to narrow the scrape down to just that area.
I'm new to DOMDocument and have been doing a bunch of reading, but this is still stumping me. One of my biggest questions is how to dump the node list so I can see what I'm working with at any given time. Next, I've read that XPath would be a better fit for this, but I don't understand how to build the queries. Any help there?
Lastly, what's the link to Google's adboard? Is it still active?
Thanks for taking a look, fellas.
Code:
if (strlen($kw) > 0) {
    // URL-encode the keyword so spaces and special characters survive the query string
    $site = "http://www.google.com/search?q=" . urlencode($kw);
    $html = file_get_contents($site);
    $dom = new DOMDocument();
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($html); // Throws a ton of errors if input is not validated...
    libxml_clear_errors();
    libxml_use_internal_errors(FALSE);
    $links = $dom->getElementsByTagName('cite');
    foreach ($links as $link) {
        // Dirty fix, some URLs return like "www.site.com - ", they're not ad blocks so I skip em
        if (substr($link->nodeValue, -3) != ' - ') {
            // A class from another file, runs a socks WHOIS query on :43 for data
            $whois = new Whois($link->nodeValue);
            if ($whois->data) {
                // Do something with the results...
                save_this($whois->domain, $whois->data);
            }
            // usleep()/sleep()-style function that works with times < 1sec
            nap(1);
        }
    }
    /*
    // Dont understand xpath yet...
    // TODO : fix query
    $xpath = new DOMXPath($dom);
    // loadHTML() leaves elements in no namespace, so no prefix is needed here
    $query = "//cite";
    $data = $xpath->query($query);
    foreach ($data as $link)
        echo $link->nodeValue . '<br>';
    */
}
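For the two questions about dumping the node list and building XPath queries, here is a rough sketch of what that could look like. The markup is made up for illustration: the "ads" and "results" ids and the "ad" and "result" classes are assumptions, not Google's real structure, so the container in the query would need to be swapped for whatever the actual page source uses. dump_nodes() is a hypothetical helper name; it only relies on DOMNode::getNodePath() and DOMDocument::saveHTML($node).

```php
<?php
// Hypothetical, simplified markup standing in for a results page.
// The ids and classes below are assumptions for illustration only.
$html = <<<HTML
<html><body>
<div id="ads">
  <div class="ad"><cite>www.example-ad-one.com</cite></div>
  <div class="ad"><cite>www.example-ad-two.com</cite></div>
</div>
<div id="results">
  <div class="result"><cite>www.organic-example.com - </cite></div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: print each node's XPath location and its markup,
// which shows exactly what a query or getElementsByTagName() matched.
function dump_nodes(DOMNodeList $nodes): void
{
    for ($i = 0; $i < $nodes->length; $i++) {
        $node = $nodes->item($i);
        echo $i, ': ', $node->getNodePath(), "\n";
        echo '   ', $node->ownerDocument->saveHTML($node), "\n";
    }
}

$xpath = new DOMXPath($dom);

// Every <cite> in the document (same set as getElementsByTagName('cite')).
$all = $xpath->query('//cite');

// Only <cite> elements nested anywhere inside the assumed ad container.
$ads = $xpath->query('//div[@id="ads"]//cite');

dump_nodes($ads);

// Collect the text of the narrowed-down nodes for the WHOIS step.
$adUrls = [];
foreach ($ads as $cite) {
    $adUrls[] = trim($cite->nodeValue);
}
```

The general shape of the query is "//container[condition]//target": the leading // searches the whole document, the bracketed predicate picks the container by attribute, and the second // matches the target at any depth beneath it. Running dump_nodes() on the unfiltered list first is a quick way to see which ancestor the sponsored entries share.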
