Fuck You Google

cardine

Jan 9, 2008
I have a script where a user enters something, and then in real time I scrape Google (unfortunately a part of Google that doesn't have an API yet, or else my job would be a lot easier).

Because of this, as you might expect, I've run into the "Automated Queries" issue with Google (after 20 or so people use the site, the error comes up). I figured this wouldn't be that big of a deal... I could just show Google's captcha to the person on my website, and then take whatever the user types in and give it back to Google.

So what happens is this:
PHP:
echo file_get_contents('http://google.com/something');

That call shows the content of this URL:
google.com/sorry/Captcha

So I scraped the ID of the captcha, showed the captcha to the user. Then when they hit submit I do the following:
PHP:
file_get_contents('http://google.com/sorry/Captcha?continue=http://google.com/something&id=8722769811594829024&captcha=troing&submit=I\'m+human!');
(And of course I substitute the real captcha ID and whatever the user entered for the captcha in place of the id and captcha parameters.)
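For reference, my ID scraping looks roughly like this (simplified; the exact patterns depend on what the sorry page's markup actually looks like, so treat the regexes as placeholders):
PHP:
// Pull the sorry page and fish out the captcha id and image URL.
// NOTE: these patterns assume a particular markup for the hidden
// id field and the captcha <img>; adjust to the real HTML.
$sorry = file_get_contents('http://google.com/sorry/Captcha');

if (preg_match('/name="id" value="(\d+)"/', $sorry, $m)) {
    $captchaId = $m[1];
}
if (preg_match('/<img src="([^"]+)"/', $sorry, $m)) {
    $captchaImageUrl = 'http://google.com' . $m[1];
}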

Unfortunately, the submit call above isn't working (when you echo the contents, you're still echoing the captcha page). When I iframe that same URL, it does work, meaning I've pulled the right URL. However, when I do file_get_contents it does not work. The problem with the iframe is that Google sees the request as coming from the user's IP address and not my server's. So when I later try to do file_get_contents and pull up the data I need to scrape, Google still sees my IP as blacklisted and I get the automated queries error.

So the question is this: how do I take Google's captcha (which the user solves for me) and post the answer back to Google in a way that gets my server off Google's blacklist?

Any suggestions? And if you give me a good explanation that I can get working, I'll send some $$ your way.
 


First, I would recommend using proxies to avoid the captcha issue in the first place.

Second, if that's not possible: the reason you're having the issue is that Google's captchas are dynamically generated per request. If you pull in the page with file_get_contents and then pull in the captcha image with a second, separate request, you're actually getting the captcha for the next page load instead of the one you just showed the user.

What you need to do is use cURL, capture the cookies that Google drops, and pass those back to Google on the captcha request. cURL won't download the images on the initial request, so if you pass the cookies from that request back to Google as you request the captcha image, Google won't know the difference and will think it was just a delayed page load.
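Here's a rough sketch of what I mean. The cookie jar is the important part; the URL and field names are just the ones cardine posted above, which I haven't verified myself:
PHP:
// One cookie jar shared across requests, so the captcha answer goes
// back to Google with the same cookies it set when serving the page.
$jar = tempnam(sys_get_temp_dir(), 'gcookies');

function fetch($url, $jar) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // save cookies Google drops
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // send them back next time
    // Keep the user agent identical across every request in the flow.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)');
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// 1. Load the page; Google drops its cookies and serves the captcha page.
$captchaPage = fetch('http://google.com/something', $jar);

// 2. Scrape the captcha id/image out of $captchaPage and show the
//    image to your user (see cardine's post above).
$captchaId  = '8722769811594829024'; // placeholder: scraped from the page
$userAnswer = 'troing';              // placeholder: what the user typed

// 3. Submit the answer WITH THE SAME COOKIE JAR.
$result = fetch('http://google.com/sorry/Captcha?' . http_build_query(array(
    'continue' => 'http://google.com/something',
    'id'       => $captchaId,
    'captcha'  => $userAnswer,
    'submit'   => "I'm human!",
)), $jar);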
 
Good advice from dchuk; the only thing I'd add is to remember to set the user agent string. It probably won't make much difference now, but it could in the future, and I've run into plenty of sites (think eBay at one time) that dicked around my bots if a decent user agent wasn't set.
 
Definitely. Everything should be identical between requests. Most scraping jobs rotate user agents (if you're being smart about it), so make sure you take that into account when requesting the captcha (i.e. pause the rotation for the captcha request).
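Something like this, as a sketch (the rotation pool here is just a stand-in for whatever rotation you already have):
PHP:
// Pool of user agents you normally rotate through.
$userAgents = array(
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
);

// Pin ONE user agent for the entire captcha round-trip: the initial
// page load, the captcha image download, and the answer submission
// should all send exactly the same string.
$pinnedUA = $userAgents[array_rand($userAgents)];

// ...then pass $pinnedUA as CURLOPT_USERAGENT on each of those
// requests instead of picking a fresh one per request.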
 
Thanks both of you for the advice.

The site I'm creating is one I think is very susceptible to the Digg/Slashdot effect. So I'm worried that if I relied on proxies alone, then as soon as I got a huge spike in traffic (200,000 visits in a day), even a network of 100 proxies wouldn't be enough to avoid Google's automated queries error. So ideally I'd like to use both: a proxy network, plus a way to handle the captchas in case Google does flag me.

I'll definitely take a look at doing exactly what you described with cURL; everything both of you said makes complete sense based on all of the troubleshooting I've done so far.

Also, if either of you (or anybody else reading this thread) would be interested in coding this up for ~$50, PM me and let me know (even if just so I could look at and learn from the code). If not, I think I'll be able to figure it out on my own :)
 
You may want to consider REST. Take a look at the Developer's Guide - Google AJAX Search API - Google Code.

If you are making large numbers of queries from your site to Google, you are going to need proxies; however, going with the JS version makes the originating IP address the visitor's, which solves your blacklist problem.
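As a rough example of the REST interface for web search (field names as I remember them from the docs; note that calling it from PHP like this still originates from your server's IP, so for the blacklist issue you'd want the JS version running in the visitor's browser):
PHP:
// Google AJAX Search API, REST interface (web search).
$q = urlencode('hello world');
$json = file_get_contents(
    'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=' . $q
);
$data = json_decode($json, true);

if (isset($data['responseData']['results'])) {
    foreach ($data['responseData']['results'] as $r) {
        echo $r['titleNoFormatting'] . ' - ' . $r['url'] . "\n";
    }
}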

Do they have this API for Google Products? That is what I'm scraping, and I read through all the Google APIs checking whether they had one for that and couldn't find anything.