Coding a Python rank tracker



Good stuff OP - some of my most frequently used products are the result of threads like these. Would love to see something like this!
 
@chatmasta: I think you're right about only changing one thing at a time. I'm planning on doing some content scraping in the future too so I'm thinking it would be good to have a play with Mongo then

@j0hnsmith: Thanks, I've had a good look at Django already and I was going to use the ORM from that anyway so may as well use the full framework

@RoR fans: I've looked at Ruby and RoR, and there was nothing I could really fault about them other than the fact that I prefer the Python syntax, so that's what I plan to go with.

@kamired2: That's what I was thinking in terms of splitting the scraping and reporting. I had thought that the scraping would be controlled by some sort of queue system, and then the eventual front end would just process data already scraped.

Pretty sure he was talking about &num=100

^^^ Correct

I'll try and update the thread with anything that might be useful to others once I actually get started
 
Don't use Django to build a scraper!! That is ridiculous. Build the scraper and then if you need a web interface hook it up with Django. Honestly, if you don't need features like user authorization, sessions, etc. then create a web interface with Flask or Cherrypy.

Most of the job can be completed easily with pycurl and BeautifulSoup. Just get coding and you'll be fine.
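For what it's worth, a minimal sketch of what the pycurl + BeautifulSoup combination looks like (the URL and the bits being extracted are just placeholders, nothing specific to Google's markup):

Code:
# Tiny pycurl + BeautifulSoup sketch - URL and extracted bits are placeholders.
from io import BytesIO

import pycurl
from bs4 import BeautifulSoup

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/')
c.setopt(pycurl.WRITEDATA, buffer)  # collect the response body in memory
c.perform()
c.close()

soup = BeautifulSoup(buffer.getvalue(), 'html.parser')
for a in soup.find_all('a'):
    print(a.get('href'), a.get_text(strip=True))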
 

I would recommend bottle as a web framework; I find it needs the least code to get what you want.

Also, if you want to extract data from web pages, use lxml. It's by far the fastest solution out there (see the Beautiful Soup 4 benchmark). It can be a little clunky to get started with, and it took me some time to switch from BeautifulSoup (which arguably has the more Pythonic interface) to lxml, but it was definitely worth it.
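For anyone weighing the two up, here's roughly what the lxml.html side looks like (the HTML fragment is made up purely for illustration):

Code:
# Minimal lxml.html example - the HTML fragment is made up for illustration.
import lxml.html

html = '<div><h3 class="r"><a href="http://example.com/">Example result</a></h3></div>'
doc = lxml.html.fromstring(html)

# xpath-based extraction; cssselect also works if you prefer CSS selectors
for link in doc.xpath("//h3[@class='r']/a"):
    print(link.get('href'), link.text_content())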
 
bottle.py is sweet. I use it as an API layer. I run a lot of my python programs on their own dedicated boxes, so to send commands I need to interface with them through an API, and bottle is perfect for that.

Here's how I implement it, using pasteserver ("apt-get install python-paste" on Debian) and SQLite. The security scheme is described at the top and obviously relies on you being the only one who knows that your hashing method is md5. If you want to make it more secure, choose a different hashing scheme. But the basic idea is here.

Example API function is at the bottom of the code.

Code:
from bottle import run, route, PasteServer, debug, template, request, redirect, abort
from hashlib import md5
from random import random
import os
import sqlite3
import settings  # your own config module - settings.config.api_keys_db should hold the path to the SQLite key database

'''
	This API uses an encryption schema designed to prevent search bots or prying eyes
	from compromising the integrity of the system. Here is the process used:
	
	1. Client requests '/key'
	2. Server generates random text (hereby referred to as "<public key>") - cannot contain "/"
	3. Server generates <private key> with one way hash md5(<public key>)
	4. Server saves <private key> as row in api_keys.db->keys (SQLite)
	5. Server returns <public key> to client
	
	6. Client receives <public key>
	7. Client generates <private key> with one way hash md5(<public key>)
	
	8. Client requests /api/<private key>/<request>
	9. Server checks if <private key> is a row in api_keys.db->keys
		If <private key> exists as a row in keys table
			a. Server deletes <private key> from table
			b. Server redirects client to <request>
		If <private key> is not a row in the table
			a. Server redirects client to Error page
	
	The effect of this is that each API key has a lifetime of one request. The client must
	request a new API key for each request.
	
	*(Note on implementation): I wish this didn't have to be done by adding a parameter to every
		method, but unfortunately it was the only option without hacking bottle.py. I could not use
		a front controller as an interceptor, because if there is a more specific route, Bottle will match it.
'''

# Check if key is valid based on whether it appears in table. 
# If it finds it, it returns true and deletes row
def isValidKey(key):

	# Look for a matching key in table
	conn = sqlite3.connect(settings.config.api_keys_db)
	c = conn.cursor()
	query = "SELECT COUNT(*) FROM `private_keys` WHERE `private_key`=:private_key"
	num_matches = c.execute(query, {'private_key':key} ).fetchone()[0]
	c.close()
	conn.close()
	
	# Return false if the key does not exist
	if num_matches == 0:
		return False
	
	# It matches. We will return True, but first delete the key.
	conn = sqlite3.connect(settings.config.api_keys_db)
	c = conn.cursor()
	c.execute("DELETE FROM `private_keys` WHERE `private_key`=:private_key", {'private_key':key})
	c.close()
	conn.commit()
	conn.close()
	
	return True

# Simply returns 401 (not authorized) if key is invalid
def checkKey(key):
	if not isValidKey(key):
		abort(401)

# Keep search engines out
@route('/robots.txt')
def robots():
	return 'User-agent: *\nDisallow: /\nNoindex: /'

# For generating the key
@route('/key')
@route('/key/')
def generateKey():
	public_key = md5(str(random())).hexdigest() # Public key is 32 char md5 of random()
	private_key = md5(public_key).hexdigest() # Private key is a second md5 on public key
	
	# Save private_key to the database TODO: Put create table in a setup script
	conn = sqlite3.connect(settings.config.api_keys_db)
	conn.execute("CREATE TABLE IF NOT EXISTS `private_keys` (`private_key` char(100) NOT NULL PRIMARY KEY)")
	c = conn.cursor()
	c.execute("INSERT INTO private_keys (`private_key`) VALUES (:private_key)", {'private_key':private_key});
	conn.commit()
	c.close()
	conn.close()
	
	return public_key
	
# NOTE THAT URLS ARE CASE SENSITIVE
	
# PLACE ALL YOUR API FUNCTIONS BELOW HERE.

'''
	Every function should accept <key> as a parameter and call
	checkKey(key) before executing any code.
'''

# Example
@route('/api/:key/your/api/function')
@route('/api/:key/your/api/function/')
def yourApiFunction(key):
	checkKey(key)
	return "\n".["blah", "foo", "bar"]
 
Also, like someone already said, do NOT use Django for this. I see why you want to use its ORM layer because it's so nice and simple, but the ORM in Django is actually one of the worst parts of it. Queries are bloated and awful. If you are handling a lot of data it's going to be slow as fuck. Look into SQLAlchemy - same simplicity, but much faster because queries are optimized.
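If it helps, a minimal SQLAlchemy sketch - the table and column names here are made up purely for illustration, not a suggested schema:

Code:
# Minimal SQLAlchemy sketch - table/column names are made up for illustration.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Ranking(Base):
    __tablename__ = 'rankings'
    id = Column(Integer, primary_key=True)
    keyword = Column(String(255))
    url = Column(String(2048))
    position = Column(Integer)

engine = create_engine('sqlite:///rankings.db')  # swap in your MySQL connection string
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Ranking(keyword='blue widgets', url='http://example.com/', position=3))
session.commit()

for row in session.query(Ranking).filter_by(keyword='blue widgets'):
    print(row.position, row.url)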
 
Thanks to everyone who's got involved in this thread so far :)

I've not started writing code yet as I'm snowed under with bullshit client work, so for now I'm more concerned with making sure the data I'll be feeding into my tracker is clean (if I'm tracking bad data, this whole exercise is pointless) and with learning a bit more about MongoDB

So I've got a few findings to share for anyone following along or working on something similar

Google Search Strings

Before I started looking into this, I thought I had a search string that was usable, when logged out of a Google account, that would return a user-independent (country/IP etc) set of results when accessed programmatically

Turns out I was wrong: even when logged out and browsing in incognito mode, Google was still detecting my location via IP and limiting results to just 10 URLs

From a few different sources (which I can't remember now) I've come up with the following string to use:

Code:
http://www.google.co.nz/search?gl=uk&pws=0&as_qdr=all&num=100&q=searchstring

This should work for countries OTHER THAN New Zealand. For NZ results, just use another Google TLD.

Query string variables are as follows (there's a quick fetch sketch after the list):
  • gl=uk: Boost results from that locale - the UK in this case
  • pws=0: Personalisation off
  • num=100: Number of results to return (only works in combination with the next parameter)
  • as_qdr=all: Limit results to all dates (i.e. it doesn't change the result set, but it allows 100 results to be displayed)
  • q=searchstring: The search string
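
Putting all of that together, a rough sketch of building and fetching that URL with Python 3's standard library, just to keep it dependency-free (the user agent string and the delay are arbitrary placeholders, and mechanize or pycurl would work just as well):

Code:
# Rough sketch of building and fetching the search URL above with the standard library.
# The user agent and delay are arbitrary placeholders.
import time
import urllib.parse
import urllib.request

def fetch_serp(query, tld='co.nz', gl='uk'):
    params = {'gl': gl, 'pws': '0', 'as_qdr': 'all', 'num': '100', 'q': query}
    url = 'http://www.google.%s/search?%s' % (tld, urllib.parse.urlencode(params))
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req).read()
    time.sleep(10)  # be gentle between requests
    return html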

I'm going to do a bit more testing, but I think I'm at the point where a search URL like the above should be as free from personalisation as possible when scraped

The reason to use a different TLD from the locale you want results for is that this seems to stop Google from auto-detecting location based on IP and displaying blended Google Places results for geo-contextual searches

If anyone has done more of this than me and can spot any glaring omissions, I'd appreciate a nudge in the right direction

Scraping SERPs and xpaths

Not sure if anyone will find this useful but thought I'd share anyway.

In the past, I've seen the xpath to extract URLs from SERPs as the following:

Code:
//h3[@class='r']/a

I want my tracker to report on Google Places results too, and not just the integrated ones but also the old-style 3/7-pack results, which come in a slightly different format

To catch these, the above xpath can be updated to match both h3 and h4 tags (h3 for regular organic and integrated Places results, h4 for non-integrated Places results):

Code:
//*[@class='r']/a

This pulls out all organic and Places links. To differentiate between the two types programmatically, you can just check the tag of each element (h3 vs h4) and then deal with each type as you want.

I did find that the above xpath also picked up video thumbnails, which I didn't want, so I'm now using the following, which discounts any links found in a div with the "vresult" class:

Code:
//div[@class!='vresult']/*[@class='r']/a

This is pretty basic stuff but I had no idea about it before I started looking into it, so maybe it'll help somebody out
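
In case it's useful, here's that last xpath applied with lxml, splitting organic from Places results by tag. The sample HTML (and the "g" class on the wrapper divs) is made up for illustration, not real Google markup:

Code:
# Applying the xpath above with lxml and splitting organic (h3) from Places (h4)
# results by tag. The sample HTML is made up for illustration.
import lxml.html

sample = """
<div id="results">
  <div class="g"><h3 class="r"><a href="http://example.com/page">Organic result</a></h3></div>
  <div class="g"><h4 class="r"><a href="http://example.org/places">Places result</a></h4></div>
  <div class="vresult"><h3 class="r"><a href="http://example.net/video">Video thumbnail</a></h3></div>
</div>
"""
doc = lxml.html.fromstring(sample)

for link in doc.xpath("//div[@class!='vresult']/*[@class='r']/a"):
    result_type = 'places' if link.getparent().tag == 'h4' else 'organic'
    print(result_type, link.get('href'))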

Python Libraries

So far, I've been playing with the following and think I'll be using them when I finally get round to writing some code so thanks to those who mentioned these:

- mechanize: Scraping
- lxml.html: Parsing
- pymongo: MongoDB (quick sketch after this list)
- sqlalchemy: MySQL
- bottle.py: Web
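
And since Mongo keeps coming up, a tiny pymongo sketch of stashing a scraped result (the database/collection names and the document shape are placeholders, not a suggested schema):

Code:
# Tiny pymongo sketch - database/collection names and document shape are placeholders.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod on the default port
db = client['rank_tracker']

db.rankings.insert_one({
    'keyword': 'blue widgets',
    'url': 'http://example.com/',
    'position': 3,
    'result_type': 'organic',
    'checked_at': datetime.utcnow(),
})

for doc in db.rankings.find({'keyword': 'blue widgets'}).sort('checked_at', -1):
    print(doc['position'], doc['url'])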

Thanks again for everyone's input
 
Wow - differentiating the 3/7/10-pack (whatever it happens to be) and being able to mark it as such would be such a cool feature to have. One thing that really pisses me off about my current rank tracker is that if I'm ranking #1 organic, it will tell me I'm ranking #8 when there's a 7-pack, for example.

This thread is really making me want to do a rank tracker for myself.

Have you decided what you're going to do for an interface yet? Have you thought about just keeping it dirty and dumping the results to CSV?
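
For the quick-and-dirty route, a CSV dump really is only a few lines (the column names here are just an example):

Code:
# Quick-and-dirty CSV dump - column names are just an example.
import csv

results = [
    {'keyword': 'blue widgets', 'position': 3, 'url': 'http://example.com/'},
    {'keyword': 'blue widgets', 'position': 7, 'url': 'http://example.org/'},
]

with open('rankings.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['keyword', 'position', 'url'])
    writer.writeheader()
    writer.writerows(results)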