Thanks to everyone who's got involved in this thread so far
I've not started writing code yet as I'm snowed under with bullshit client work, so for now I'm more concerned with making sure the data I'm going to be feeding into my tracker is clean (if I'm tracking bad data, this whole exercise is pointless), and with learning a bit more about MongoDB
So I've got a few findings to share for anyone following along or working on something similar
Google Search Strings
Before I started looking into this, I thought I had a usable search string that, when logged out of a Google account, would return a user-independent (country/IP etc.) set of results when accessed programmatically
Turns out I was wrong: even when logged out and browsing in incognito mode, Google was still detecting my location via IP and limiting results to just 10 URLs
From a few different sources (which I can't remember now) I've come up with the following string to use:
Code:
http://www.google.co.nz/search?gl=uk&pws=0&as_qdr=all&num=100&q=searchstring
This should work for countries OTHER THAN New Zealand. For NZ results, just use another Google TLD
Query string variables are as follows:
- gl=uk: Boost results from that locale - UK in this case
- pws=0: Personalisation off
- num=100: Number of results to return (only takes effect in combination with the following parameter)
- as_qdr=all: Limit results to all dates (i.e. doesn't change the result set, but allows 100 results to be displayed)
- q=searchstring: Search string
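To avoid typos when gluing those parameters together by hand, the URL can be built with the standard library. A minimal sketch, using the same example values as above (the `tld`/`gl` defaults are just my example, not anything special):

```python
from urllib.parse import urlencode

def build_search_url(query, tld="co.nz", gl="uk", num=100):
    """Build a depersonalised Google search URL from the parameters above."""
    params = {
        "gl": gl,         # boost results from this locale
        "pws": "0",       # personalisation off
        "as_qdr": "all",  # all dates - needed for num=100 to take effect
        "num": str(num),  # number of results to return
        "q": query,       # the search string itself
    }
    return "http://www.google.%s/search?%s" % (tld, urlencode(params))

print(build_search_url("searchstring"))
```

urlencode also takes care of escaping spaces and special characters in the search string, which is easy to get wrong when concatenating by hand.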
I'm going to do a bit more testing, but I think I'm at a point where results scraped from a search URL like the above should be as free from personalisation as possible
The reason to use a different TLD from the locale you want results for is that this seems to stop Google from auto-detecting your location by IP and displaying blended Google Places results for geo-contextual searches
If anyone has done more of this than me and can spot any glaring omissions, I'd appreciate a nudge in the right direction
Scraping SERPs and xpaths
Not sure if anyone will find this useful but thought I'd share anyway.
In the past, I've seen the xpath to extract URLs from SERPs given as the following:
Code:
//h3[@class='r']/a
I want my tracker to report on Google Places results too: not just the integrated ones but also the old-style 3/7-pack, which comes in a slightly different format
To include these, the above xpath can be updated to the following so it matches both h3 and h4 tags (h3 for regular organic and integrated Places results, h4 for non-integrated Places results):
Code:
//*[@class='r']/a
This pulls out all organic and Places links. To differentiate between the two types programmatically, you can just check the "tag" value for each element and then deal with each type as you want.
I did find that the above xpath also included video thumbnails, which I didn't want, so I'm now using the following which discounts any links found in the div with the "vresult" class:
Code:
//div[@class!='vresult']/*[@class='r']/a
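Putting the xpath and the tag check together with lxml.html, here's a rough sketch. The HTML snippet is a made-up stand-in for the SERP markup being discussed (h3.r for organic, h4.r for non-integrated Places, a div.vresult wrapper for the video thumbnails we want to skip), so treat the structure as illustrative, not as what Google actually serves:

```python
from lxml import html

# Hypothetical markup standing in for a scraped SERP
snippet = """
<div id="res">
  <div class="result"><h3 class="r"><a href="http://example.com/organic">Organic</a></h3></div>
  <div class="result"><h4 class="r"><a href="http://example.com/places">Places</a></h4></div>
  <div class="vresult"><h3 class="r"><a href="http://example.com/video">Video</a></h3></div>
</div>
"""

tree = html.fromstring(snippet)
# The xpath from above: skip anything inside a div with class "vresult"
links = tree.xpath("//div[@class!='vresult']/*[@class='r']/a")

results = []
for a in links:
    # Differentiate result types by the parent element's tag (h3 vs h4)
    kind = "organic" if a.getparent().tag == "h3" else "places"
    results.append((kind, a.get("href")))

print(results)
```

The video link is excluded by the `@class!='vresult'` predicate, and the remaining two come back labelled by type, ready to be stored separately.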
This is pretty basic stuff but I had no idea about it before I started looking into it, so maybe it'll help somebody out
Python Libraries
So far, I've been playing with the following and think I'll be using them when I finally get round to writing some code, so thanks to those who mentioned these:
- mechanize: Scraping
- lxml.html: Parsing
- pymongo: MongoDB
- sqlalchemy: MySQL
- bottle.py: Web
Thanks again for everyone's input