Scrape Google without getting blocked?

Status
Not open for further replies.

dsiomtw

New member
Mar 12, 2007
1,495
30
0
End of the rainbow
I need to figure out the best way to do MASSIVE scraping of Google without getting blocked. I've looked into using anonymous proxy services such as anonymizer.com but the cost is prohibitive.

My next thought was to acquire 100+ IPs and set up my own proxy server that just rotates through all the IPs when sending requests. Not sure how easy it will be to acquire 100+ IPs, since you need to provide a valid reason for needing so many IPs these days...

Another thought was to literally set up 100 different hosting accounts at different providers, each with its own IP, and just program my system to route all the requests through these. The cost wouldn't be too bad, but setting up that many accounts and programming all the logins etc. into my system will be a real PITA.

Anyone have any ideas for me? There are a lot of different tools out there that do large scale scraping of Google - I'm just wondering how they do it. Thanks!
 


Maybe get a key for the API? Depends on what you are trying to get.

Or find one of G's 'partners' (like Ask, I think) who displays G search results, and scrape them? Less sure about that trick...
 
The Google API is gone. Hit Google at irregular intervals of 4 to 8 seconds and you should be fine.

Or start scraping, testing and using HTTP proxies and going through them. Proxies are such a boring but important part of BH.
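A minimal sketch of the "irregular intervals" idea, in Python. The function names `jittered_delay` and `paced` are made up for illustration, and the 4 to 8 second range is just the figure suggested above, not a known safe threshold.

```python
import random
import time


def jittered_delay(min_s=4.0, max_s=8.0, rng=random):
    """Return a random delay in seconds, uniform between min_s and max_s."""
    return rng.uniform(min_s, max_s)


def paced(items, min_s=4.0, max_s=8.0, sleep=time.sleep):
    """Yield items one at a time, sleeping a random interval between them."""
    for i, item in enumerate(items):
        if i:  # no wait before the first item
            sleep(jittered_delay(min_s, max_s))
        yield item
```

You would loop over `paced(list_of_queries)` and send one request per iteration; passing a different `sleep` callable makes it easy to dry-run without actually waiting.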
 
The problem with proxies is that they are sooooo slow. Maybe I am going about it the wrong way. Do serious BH'ers use paid proxies that are reasonably fast? I've messed with some of the free proxies before and they were always so incredibly slow, and it seems like you have to constantly find new ones every 5 minutes. Or is this not the case? And I know there are different types of proxies and many do reveal your true IP, right?

I need to bone up on this. Can you point me in the right direction in terms of doing some serious scraping using truly anonymous proxies? I'm off to Google but any input you can provide would be most appreciated. I don't mind paying for good proxies, or even paying for a good constantly updated anonymous proxy list, etc. Is there such a thing??
 
Forgot to ask specifically - does anyone have a source for one or more lists of truly anonymous proxies that are updated frequently? I assume there are sources for this, and I don't mind paying a subscription fee or whatever.

Thanks!
 
Proxies are important. Everyone has a different system, but I personally scrape pages with lists of proxies, like the botmaster one (they're the makers of xrumer) at hxxp://botmaster.ru/proxy/httplist.htm and this one here, though you have to sign up and write a script that logs in: hxxp://www.cspy.org/proxy-lists

Cron a daily job that scrapes these pages (or whatever pages you want). For each new proxy, have it request a PHP page on your server that returns the headers it received via PHP's getallheaders(), and use those to check the proxy's anonymity. Transparent proxies pass your IP in an X-Forwarded-For header or something similar. Anonymous proxies send proxy headers such as Via but don't give away your IP, and elite proxies don't pass anything at all.

My scripts then cycle through this list of proxies; if a proxy doesn't respond, the script deletes it and moves on to the next one. You'll get a fast proxy for a while, and when it's bust you move on to the next one.
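Roughly, the anonymity check and the "use it until it dies" rotation could look like the Python sketch below. Everything named here is an assumption for illustration: `classify_proxy`, `ProxyPool` and the `fetch_via` callable are made up, the list of proxy header names is not exhaustive, and the IP match only handles IPv4.

```python
import re


def classify_proxy(headers, real_ip):
    """Classify a proxy from the headers your own test page received.

    headers: dict of header name -> value, as seen by your server.
    real_ip: your true IPv4 address, as a string.
    """
    lowered = {k.lower(): str(v) for k, v in headers.items()}
    # Header names commonly added by proxies (assumed list, not exhaustive).
    proxy_headers = ("via", "x-forwarded-for", "forwarded", "proxy-connection")
    for value in lowered.values():
        # Compare whole dotted-number tokens so 1.2.3.4 != 11.2.3.45.
        if real_ip in re.findall(r"[\d.]+", value):
            return "transparent"  # your IP leaked in some header
    if any(name in lowered for name in proxy_headers):
        return "anonymous"  # admits to being a proxy, hides your IP
    return "elite"  # no visible trace of a proxy


class ProxyPool:
    """Stick with the first proxy until it fails, then delete it."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def fetch(self, url, fetch_via):
        """fetch_via(url, proxy) is a hypothetical callable that returns
        the response body, or raises OSError when the proxy is dead."""
        while self.proxies:
            proxy = self.proxies[0]
            try:
                return fetch_via(url, proxy)
            except OSError:
                self.proxies.pop(0)  # dead proxy: drop it, try the next
        raise RuntimeError("no working proxies left")
```

`fetch_via` is left as a parameter so you can plug in whatever HTTP client you use; timeouts and connection errors from the standard library are subclasses of `OSError`, which is why that is the exception caught here.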

whew, long post
 
Proxy 4 Free (Proxy List 1) is what I use for my spammers. I rotate to a new proxy every day, and for the most part about 80% of the "anonymous" ones on there are fast. I had one from Egypt the other day that was keeping up with my cable modem. :)

Anyway, just my 2 cents. Hope it helps.

PS - hitting sites with the Egypt one right now, still working (started last night) :)
 
Additional IPs are both cheap and easy to get from Softlayer if you've got a dedicated server with them. A full class C block is $128/month. Beats scraping for proxies.
 
What's your ARIN justification for them, though? "Scraping"?

You could say SSL, but that won't last too long if they actually follow up on ARIN justifications.
 
Google API doesn't provide the information that I need in an accurate way.

I would LOVE to just pay for a bunch of IPs (in fact I already have a dedicated server with Softlayer) but I can't imagine how to justify it. My best thought was that I would say I need them for "clients" - 1 for each client site that I will be hosting or whatever - but if they checked up on me just once I'd be screwed. Do they just not care?? The other thing is that I think I need IPs on a bunch of different class C blocks, otherwise G will catch on pretty quickly.

In regards to proxy lists, anyone know of a paid service out there? Like someone who aggregates all the various lists, merges them and removes duplicates and non-working proxies, etc. all in one place?
 
"Clients" is a very broad term. You could set up a bunch of 'business card' sites, and say that your unique selling proposition is that each site gets its own IP, so no worries about email or web traffic being banned because of some bad user on the same shared server.
 
I'm seconding seocracy... and what project is so worthless that you can't even buy a proxy? Think outside the box, or throw a few dollars into it and just buy a proxy service; if you're doing something worthwhile it will pay off.
 
APIs suck if you're going to do a ton of volume. Get a block of proxies and cycle through the Google data centers, and you should be fine.
 