Creating my own Google ranking tracker script. Advice?



You're confusing java with ruby. Color blind?

 
What you use to track rankings daily and what you use to manage and display the data should be completely different apps. PHP would not be a good choice for running the rank tracking, in my opinion; I'd do Ruby or Python that feeds data into a database. The front end can be Rails/PHP/blah.

PHP + MySQL will get the job done fine. How do I know? I recently wrote an entire system to do this on a large scale across multiple SERPs. If you're smart about how you're doing it, you can do VERY large amounts of tracking, to the point that the overall speed of the language is probably not the defining constraint. I can't really say more than that, nor can I really offer much advice besides that.

My only real advice is that this is not a newbie-friendly project and you should start out with much, much smaller goals. Besides that, best of luck!
 

The core functionality isn't hard, and at worst, it's just sloppy and amateurish on the back end. Since this is just an educational venture for personal use and not a large scale production app, it already is a small project (although I'd like to make something that I could share in the Newbie Corner).

Right now, it takes a URL + CSV file of keywords and can return the respective positions. Building some scaffolding/login around that isn't a big project. I'll hop through the Django tutorial today after my last final and I'm sure I can integrate the function of my script with it this weekend.
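For context, the core of the script boils down to something like this. It's a rough Ruby sketch of the same logic; the h3.r a selector, the num=100 parameter, the file name, and the sleep range are all assumptions that may need tweaking.
Code:
require 'csv'
require 'cgi'
require 'open-uri'
require 'nokogiri'

site     = 'example.com'                       # the tracked URL (placeholder)
keywords = CSV.read('keywords.csv').flatten.compact

keywords.each do |keyword|
  html  = URI.open("https://www.google.com/search?q=#{CGI.escape(keyword)}&num=100").read
  links = Nokogiri::HTML(html).css('h3.r a').map { |a| a['href'] }
  rank  = links.index { |href| href.to_s.include?(site) }
  puts "#{keyword}: #{rank ? rank + 1 : 'NF'}"
  sleep 10 + rand(20)                          # pause between queries
end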

How do you learn how to smartly design web applications? Are there any good resources for best practices?
 
There's a DOM library for PHP that I really like. It's really not a bad language. Yes it lacks threading.

I've cheated the whole no threading thing by using a number of python threads that execute the PHP scripts I didn't want to rewrite. Is that ugly or is that ugly?

Guys, are we really talking about which language is "better" by comparing CPU usage?
 
I've replicated the same scrapers in PHP, Python, and Ruby in an effort to get a feel for their differences.

All three languages have great DOM libraries that work almost identically.
If you like one, you like them all.

In fact, I've built the website I spoke of in the OP in PHP (CodeIgniter), Python (Django), and I'm now working on the Ruby (Rails) version. They only have basic CRUD, user auth, and I have their scripts on practice timers (daily scrapes atm) that are simulating hundreds of websites with 20 keywords per website to see what's performing the best. Originally, I was going to compare the PHP vs Python scrapers on a large-volume bidaily scrape, but once I had the Python scraper working on a backend routine, I just copied it to my PHP website. Discussion in this thread got me curious, so I'll have to drop the PHP scrape in and see how it fares.

I choose you, Rails.

But I'm mostly looking forward to working with Ruby on Rails. The new Rails version 3 is great and Ruby/Rails just fits my mental workings better than Python/Django did. I'm building the Rails version of this website as I go through Beginning Rails 3 by Apress. A quarter of the way through the book and you feel confident you can make 3/4ths of any website.

QUESTION

I doubt anyone will read down this far, but I'm stumped at one point: How should I store the keywords and the daily scrape information for each one?

Here's what I was thinking.
Code:
 _______         ___________________
| Users |       |     Websites      |
|-----  |       |-------------------|
| id    |<--+   | id                |  
|_______|   +-->| user_id           |
                | url               |
                | keywords_csv_path |
                |___________________|
I couldn't think of a database solution that would scale unless I made a table for each website, but from what I've discerned, that's a lot of overhead.

Moreover, I'm probably overlooking the simplest solution.

But here's my idea: Each website for each user could build its own CSV file that stores its keywords. I'm limiting keywords to 20 per website during development for ease. The scraper will then add a row for each date it scrapes, recording the rank position of each keyword.

So the CSV would look like this:
Code:
Date      Keyword1 Keyword2 Keyword3
12/13     NF       18      
12/14     NF       17
12/15     99       17 
12/16     98       14
12/17     98       13       7
12/18     96       13       6 
12/19     95       12       7
12/20     95        9       5
...       ...      ...      ...
  • NF indicates it was "not found"
  • Keyword3 demonstrates a keyword added later

Any better ideas?

This gets pretty shitty should I, say, track Google, Yahoo, and Bing. Dealing with CSVs also sucks.
 
Users has_many :websites
Websites has_many :keywords
Keywords has_many :daily_rankings

Websites table has a user_id foreign key
Keywords table has a website_id foreign key
Daily_rankings table has a keyword_id foreign key
 
In the daily_rankings table, you create a row for each keyword rank check of the day.

So if you have a keyword "sunflower seeds" and you let it run for 3 days, you'd have 3 daily_rankings rows, with columns for timestamp, google_rank, yahoo_rank, bing_rank, backlinks, alexa, whatever.
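In Rails terms, that schema comes out to roughly this; the ranking columns follow the ones listed above, and the rest is just illustrative.
Code:
class User < ActiveRecord::Base
  has_many :websites
end

class Website < ActiveRecord::Base
  belongs_to :user
  has_many :keywords
end

class Keyword < ActiveRecord::Base
  belongs_to :website
  has_many :daily_rankings
end

class DailyRanking < ActiveRecord::Base
  belongs_to :keyword
end

# migration for the rankings table
class CreateDailyRankings < ActiveRecord::Migration
  def self.up
    create_table :daily_rankings do |t|
      t.integer :keyword_id
      t.integer :google_rank
      t.integer :yahoo_rank
      t.integer :bing_rank
      t.timestamps
    end
  end

  def self.down
    drop_table :daily_rankings
  end
end
Recording a day's check is then just keyword.daily_rankings.create!(:google_rank => 14, :yahoo_rank => 20, :bing_rank => 9).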
 
This has been a lot of fun. My dayjob and my pursuit of learning Rails have both sidetracked me off and on, but my last challenge is to ensure my application can scale; my main concern is evading Google's ban.

Right now, I'm using Beanstalkd to queue up the jobs (great screencast). A Rake task is scheduled daily, which runs through each keyword in the database and pushes it into Beanstalkd's queue to be processed in the background.
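Concretely, the enqueue side can stay tiny. This is a sketch assuming the Stalker gem (the thin Beanstalkd wrapper most Rails tutorials use); the rank.check job name, the task name, and scrape_google_rank are my own placeholders.
Code:
# lib/tasks/rankings.rake -- the task the daily schedule fires
namespace :rankings do
  desc 'Queue a rank check for every keyword'
  task :enqueue => :environment do
    Keyword.find_each do |keyword|
      Stalker.enqueue('rank.check', 'keyword_id' => keyword.id)
    end
  end
end

# jobs.rb -- the background worker, run with `stalk jobs.rb`
require File.expand_path('../config/environment', __FILE__)
require 'stalker'
include Stalker

job 'rank.check' do |args|
  keyword = Keyword.find(args['keyword_id'])
  # scrape_google_rank is a stand-in for the actual scraping code
  keyword.daily_rankings.create!(:google_rank => scrape_google_rank(keyword))
end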

Of course, my interest in creating something that scales isn't based on the thought that a lot of people will use it (I'll post it to WF's Newbie Corner once it's ready), but on my own interest in scalable designs.

I'm not too concerned about the service getting castrated by Google, since IP blocks are easy to evade, especially as I'm deploying to Heroku. But I don't want a bad scrape to compromise the day's ranking for a keyword and create holes in the data.

 

I'm sure that when Google does block you, the page changes and/or asks you to fill out a CAPTCHA - parse for it and stop the script if you find it.
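Something like this works for the "parse for it" part; the /sorry/ URL and the phrases being checked are just guesses at what the block page looks like, not guarantees.
Code:
require 'open-uri'

class GoogleBlockedError < StandardError; end

def fetch_serp(url)
  response = URI.open(url, 'User-Agent' => 'Mozilla/5.0')
  html = response.read
  if response.base_uri.to_s.include?('/sorry/') ||
     html.include?('unusual traffic') || html.include?('captcha')
    raise GoogleBlockedError, "blocked while fetching #{url}"
  end
  html
rescue OpenURI::HTTPError => e
  # a block often shows up as a non-200 status instead of a normal page
  raise GoogleBlockedError, "HTTP error (#{e.message}) while fetching #{url}"
end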
 
I throw an exception if the website document can't be opened (I'm blocked) or the SERP doesn't have links of the kind I'm crawling. In case I missed something, I made a meta-exception that checks that each run still fits the footprint of a working script (in the event that the script is producing erroneous/nil data but keeps processing).

If an exception is raised during a scrape job, it gets thrown into the retry queue and the entire scraping system takes a global nap.
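Roughly, the job ends up looking like this - still assuming Stalker (and that its :delay option is passed through to Beanstalkd), with GoogleBlockedError borrowed from the detection sketch above, ScrapeLooksWrongError standing in for the meta-exception, and arbitrary delay numbers.
Code:
class ScrapeLooksWrongError < StandardError; end

# the same 'rank.check' job as before, now with the rescue path
job 'rank.check' do |args|
  begin
    keyword = Keyword.find(args['keyword_id'])
    keyword.daily_rankings.create!(:google_rank => scrape_google_rank(keyword))
  rescue GoogleBlockedError, ScrapeLooksWrongError
    # push the job back onto the queue for later...
    Stalker.enqueue('rank.check', args, :delay => 15 * 60)
    # ...and make the whole worker take a global nap
    sleep 10 * 60
  end
end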

I tried to replicate a failed scrape by running at large volume, but I couldn't get the script to fail even after iterating through the same ten Google pages many, many times. I didn't really push it too hard, but given that my scraping is only a tiny fraction of the load Google handles, that was enough testing for my needs. I also think it's just unethical to run what amounts to a DDoS attack (despite how linear and weak it is) just for personal testing.

Interestingly, the only way I could reliably fail was when I introduced simultaneous workers to chomp at the scrape queue, but that obviously doesn't apply to my small little project.

I'd be interested in hearing how the commercial services like SEScout deal with real volume and high frequency beyond just spreading things out horizontally.

I won't even need a proxy unless I can't fit a linear scrape (with many sleeps) for all database keywords in a 24 hour period (which, again, is a non-issue).
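For a rough sense of that budget (the 30-second figure is just an example):
Code:
average_delay = 30                 # assumed average sleep between queries, in seconds
per_check     = 2 + average_delay  # ~2 seconds of fetch/parse plus the sleep
puts 86_400 / per_check            # => 2700 keyword checks in a 24-hour window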
 
Interesting thread :small-smiley-026:

I got to a point where I was spending a large amount of my day tracking rankings for all of my money sites. It was getting out of hand. So the obvious thing to do was look for a SERP tracker. Research research research, blah blah blah... typical conclusion for all of us developers: "I can write something better than all of this available trash". lol

Before SEscout was SEscout it was just a personal tool. One of the easiest web apps I've written. You should be able to build a basic scraper in any programming language within a (very short) day. If you can't, programming probably isn't your thing.

Having said the above, creating a SERP tracker to handle a sizable portfolio of keywords (anything above 1,000 keywords with hourly updates) was considerably harder. The issue of scale is important with any web app, but for a SERP tracker, scale means that whatever approach seems best when you're testing single keywords will probably fail once you're scraping 1k+ keywords an hour (and especially when you're scraping millions of pages an hour like SEscout does). Google will block your IP. Google will block your proxies. Google will block sequential IPs, so it's nearly pointless to buy a standard 100-proxy block from a vendor, since they are almost always sequential.

I'm obviously biased, but my advice is: STOP! lol
It's such a slippery slope. First you'll want daily updates, then twice a day, then hourly. You'll see your keywords and rankings in a nice table/chart, then you'll decide you want to see your PageRank there too... then backlinks... then Alexa. Then you'll want updates while you're away from the computer, so you'll write an email script, etc. Then one day you won't have any data... and you'll spend 4 hours troubleshooting to find that Google changed a div class, or they blocked one of your IP ranges, or your daemon tripped on an error and has been in an endless loop for 14 hours, or ....

Just sincere advice: if you want to create a SERP tracker, code... if you want to do SEO, do SEO... if you do both, it's inevitable that one project will suffer due to the other.

I was a successful full time SEO before SEscout. Now SEscout is my life :tongue2:
 
I wrote my own ranking tracker in Delphi (formerly Pascal). It supports Google, Bing, and Yahoo ranking results. All results are stored in a database, so you have a nice, easy-to-filter ranking history. Anyway, I don't think it's ready for the public, but if you are interested I will finish it and sell it for a small amount.

By the way, you should query Google really randomly. Not doing so is why some guys get banned after just a few queries.
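In code terms, "really randomly" mostly means no fixed order and no fixed interval - something like this (Ruby to match the rest of the thread; check_rank is a placeholder for the actual scrape):
Code:
keywords.shuffle.each do |keyword|    # no fixed order from run to run
  check_rank(keyword)                 # placeholder for the actual scrape
  sleep rand(20..90)                  # jittered pause instead of a constant one
end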
 
Thanks for the words of encouragement, backlinkgurus. :tongue2:

The part that's challenging me, like any creative endeavor should, is the territory I'm unfamiliar with because I'm learning. And that's the whole point of this project. Low-level stuff like scraping, user authentication, etc. is a breeze, but as soon as I looked into making something publicly available, then scale, performance, caching, and optimization enter the mix. And so far, those are brilliant challenges, each clever success giving you a surge of accomplishment.

With a basic programming intuition, it's incredible how fast you can learn anything else if you go at it on an "execute now, learn as I need it" basis.

In the past month, just because of this project, I've:
  • learned Ruby
  • learned Rails
  • learned how to use Git
  • learned how to use GitHub
  • learned the insides of Heroku
  • made 50 commits to Rails
  • made ~100 commits to open source software

It's been an incredible experience; my only regret is not diving into coding more back when I had 24 hours of free time every day and didn't need a dayjob. :stonedsmilie:

As for your advice:

I'm not really doing SEO or rank tracking with this project, so neither will occlude the other. It's motivated only by my interest in learning how to make stuff, and I picked a project that seemed interesting.

I was hoping you'd write something more interesting and technical, in keeping with this thread, rather than words of discouragement.

Evading Google is only an algorithmic challenge that can be solved algorithmically. Hell, by virtue of building this in Rails, it already has a beautiful RESTful API. It wouldn't be hard to offshore scraping scripts onto N free domains that update the mother database. This could be done almost infinitely to the point where one scraping bot is only scraping one keyword per day if you really wanted to.
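A sketch of what one of those offshored workers would post back to the mother app, assuming a standard Rails resource at /daily_rankings; the host, parameter names, my_keywords, and check_rank are all placeholders.
Code:
require 'net/http'
require 'uri'

def report_rank(keyword_id, position)
  uri = URI('http://mother-app.example.com/daily_rankings.json')
  Net::HTTP.post_form(uri,
    'daily_ranking[keyword_id]'  => keyword_id.to_s,
    'daily_ranking[google_rank]' => position.to_s)
end

# each cheap worker only handles its own slice of keywords
my_keywords.each { |kw| report_rank(kw[:id], check_rank(kw[:phrase])) }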