DOMDocument, DOMXPath, and google adboard questions...

SuperDave2U · Nov 14, 2009

Here's a scraper that, given a $kw, pulls the first page of Google results and scans everything inside a "cite" tag. It then runs the URLs through a WHOIS check and saves the results. I'm getting bogus results from outside the sponsored links section, and I'm a bit lost on how to narrow the scrape down to just that area.

I'm new to DOMDocument and doing a bunch of reading, but this is still stumping me. One of my biggest questions is how to dump the node list so I can see what I'm working with at any given time. Next, I've read that XPath would be better to use for this, but I don't understand how to build the queries. Any help there?

Lastly, what's the link to Google's adboard? Is it still active?

Code:

if(strlen($kw)>0) {
	$site 	= "http://www.google.com/search?q=" . urlencode($kw); // encode spaces and special chars in the keyword
	$html 	= file_get_contents($site);

	$dom = new DOMDocument();
	libxml_use_internal_errors(TRUE); 
	$dom->loadHTML($html); //Throws a ton of errors if input is not validated...
	libxml_use_internal_errors(FALSE);

	$links = $dom->getElementsByTagName('cite');
	foreach ($links as $link) {
		// Dirty fix, some URLs return like "www.site.com - ", they're not ad blocks so I skip em
		if (substr($link->nodeValue, -3)!= ' - ') {
			// A class from another file, runs a socks WHOIS query on :43 for data
			$whois = new Whois($link->nodeValue);
			if($whois->data)
				// Do something with the results...
				save_this($whois->domain, $whois->data); 
			// usleep()/sleep() wrapper that also works with times < 1 sec
			nap(1);
		}
	}

	/* 
	// Dont understand xpath yet... 
	// TODO : fix query
	$xpath = new DOMXPath($dom);
	$xpath->registerNamespace("html", "http://www.w3.org/1999/xhtml");
	$query = "//html:cite";
	$data = $xpath->query($query);
	foreach ($data as $link) 
		echo $link->nodeValue . '<br>';
	*/

}

Thanks for taking a look fellas
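
On the "how do I dump the node list" question, here is a minimal sketch of one way to do it. The markup is a made-up stand-in (not real Google output), and dump_nodes() is a hypothetical helper name, not part of the DOM extension. It prints each node's path in the tree plus its own markup. It assumes PHP 7+ for the type hints and PHP 5.3.6+ for saveHTML($node); on older versions saveXML($node) should do roughly the same job.

```php
<?php
// Hypothetical stand-in markup; a real results page will look different.
$html = '<html><body>'
      . '<ol class="nobr"><li><cite>www.example.com</cite></li></ol>'
      . '<div id="res"><cite>www.example.org - </cite></div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

// dump_nodes() is a made-up helper: one line per node in the list.
function dump_nodes(DOMNodeList $nodes): array
{
    $lines = [];
    foreach ($nodes as $i => $node) {
        $lines[] = sprintf(
            '[%d] %s => %s',
            $i,
            $node->getNodePath(),                  // where the node sits in the tree
            $node->ownerDocument->saveHTML($node)  // the node's own markup
        );
    }
    return $lines;
}

$lines = dump_nodes($dom->getElementsByTagName('cite'));
echo implode("\n", $lines), "\n";
```

Seeing the path for each "cite" makes it easier to spot which container the wanted ones share, and that container is what the query can then be narrowed to.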

SuperDave2U · Nov 14, 2009

Still confused on how DOMDocs work but I fixed the query, dropped the extra code, and this is working as I intended =]

Code:

if(strlen($kw)>0) {
	$site 	= "http://www.google.com/search?q=" . urlencode($kw); // encode spaces and special chars in the keyword
	$html 	= file_get_contents($site);

	$dom = new DOMDocument();
	libxml_use_internal_errors(TRUE); 
	$dom->loadHTML($html); 
	libxml_use_internal_errors(FALSE);

	$xpath = new DOMXPath($dom);
	$urls = $xpath->query("//ol[@class='nobr']//cite");
	foreach ($urls as $url) {
		$whois = new Whois($url->nodeValue);
		if($whois->data)
			save_this($whois->domain, $whois->data);
		nap(1);  
	}
}
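
For anyone else stuck on building the queries, here is a rough sketch of how the one above breaks down. "//ol[@class='nobr']" picks any "ol" whose class attribute is exactly "nobr", and "//cite" after it picks any "cite" nested inside. The markup below is an invented stand-in, and the "nobr" class is simply taken from the post above, so whether a real results page still uses it is an assumption. The last query is the namespaced form from the first post. As far as I can tell it comes back empty because loadHTML() does not put elements into the XHTML namespace.

```php
<?php
// Hypothetical stand-in markup: two "ad" cites in an ol.nobr, one cite elsewhere.
$html = '<html><body>'
      . '<ol class="nobr"><li><cite>ad.example.com</cite></li>'
      . '<li><cite>ad2.example.com</cite></li></ol>'
      . '<div id="res"><cite>organic.example.org - </cite></div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Every cite in the document (same set as getElementsByTagName('cite')).
$all = $xpath->query('//cite');

// Only cites nested inside an ol whose class attribute is exactly "nobr".
$ads = $xpath->query("//ol[@class='nobr']//cite");

// Looser match: still works if the ol carries more than one class.
$adsLoose = $xpath->query(
    "//ol[contains(concat(' ', normalize-space(@class), ' '), ' nobr ')]//cite"
);

// Relative query: grab the container first, then search inside it with ".//".
$list   = $xpath->query("//ol[@class='nobr']")->item(0);
$inList = $xpath->query('.//cite', $list);

// The namespaced attempt from the first post, kept here for comparison.
$xpath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
$namespaced = $xpath->query('//html:cite');

foreach ($ads as $cite) {
    echo trim($cite->nodeValue), "\n";
}
```

The exact-match predicate is the simplest to read, while the contains() version is the safer bet if the class attribute ever holds more than one class name.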

kblessinggr · Nov 14, 2009

You could just use "Simple HTML Dom Parser", as used in this simple Google scraping example: "Scraping Google Front Page Results" (KBeezie)
