Testing Some Code - Free Data Scraping For All

genetic

I am coding a scraping API for a service that I am going to launch. Development has been ongoing for three days so far.

Although the API front-end is not started yet, the back-end (processing) is in a late alpha stage and can scrape the source of any URL that doesn't require a POST request. In simple terms, it cannot (yet) scrape data from a page that requires form input.

I would like to test what I have implemented with some real scraping requests, as I'm not sure whether the tests I have come up with myself are too easy.

If anybody has any sites that they would like some data scraped from, I can do this for you (FREE!) as part of my testing.

The details I require are as follows -

Starting URL - The URL of "Page 1" of the scrape; subsequent URLs will be calculated by the script. If the URL needs variable parameters passed to it (e.g. example.com?example=parameter), then include the list of parameters that need to be used.

Required Data - Some sort of identifier for the data that should be scraped. For example, if the data required is an email address and it is labelled "E-Mail Addy" on the target website, then "E-Mail Addy" would be the required data. It is of course fine to have multiple pieces of required data (see the example below).
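
To make that concrete, a request might look something like this. This is just an illustration; the field names and URL are made up and are not the final API format:

<?php
// Illustrative scrape request only; field names and URL are invented,
// not the real API format.
$request = array(
    'starting_url'   => 'http://example.com/members?page=1',  // "Page 1" of the listing
    'url_parameters' => array('example' => 'parameter'),      // variable parameters, if any
    'required_data'  => array('First Name', 'E-Mail Addy'),   // labels of the data to extract
);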

One thing I will mention is that this is not an unlimited offer: I am not offering it to an unlimited number of people, nor am I offering to scrape an unlimited amount of data. I think 500 URLs is a reasonable limit per scrape. Remember, this is a fully hosted service.

Any questions, or requests for scrapes, just reply in this thread. If anything you want to scrape is non-public, I'm sure you know where the PM button is!
 


I love scraping. It's like licking your cousin's tits. You know it's wrong but it really tastes so sweet... Just can't help myself.

Anyway, can you scrape images too or just simple text?
 
It's all in PHP. I'm sure many will feel that it isn't the best language to do something like this in, but it's what I know best, and (so far) it seems to be working well.

Got any sites for me to scrape, dchuk? :P
 

Nah, I'm good; I have my own scrapers for when I need them (not knocking what you've done, just leaving a spot open for someone else who needs scraping done).

I wrote this in Ruby; you might want to play with it. Seems like it would be useful for what you're doing: http://github.com/dchuk/Arachnid
 
Several replies since I replied to dchuk.

@Bofu2U This is a lot more "logical" than that. For example, you want all the "First Name" entries from a site. You give me the URL of page one of the listing, and the script grabs every "First Name" entry throughout the listing, including entries on any pagination. More complicated scrapes are possible, but this is the basic idea (there's a rough sketch at the end of this post).

@xpathfucker "it's wrong but it really tastes so sweet..." <-- THIS. The script can scrape text, but if it is scraping images it can upload them to a specified FTP server or return their original URLs.

@Jake232 Eh?
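
To give a rough idea of the extraction step, here is a minimal standalone sketch using PHP's DOM extension. The URL and label come from the example above; the real back-end is more flexible than this:

<?php
// Minimal sketch: pull every value labelled "First Name" from one page.
// Assumes the site marks entries up as label/value sibling elements;
// real markup varies, so this is not the actual service code.
$html = file_get_contents('http://example.com/listing?page=1');

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from sloppy real-world HTML

$xpath = new DOMXPath($doc);
$query = '//*[normalize-space(text())="First Name"]/following-sibling::*[1]';
foreach ($xpath->query($query) as $node) {
    echo trim($node->textContent), "\n";
}
// Pagination is just this repeated for ?page=2, ?page=3, ... until no results.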
 

What if I give you one page with an example of the structure (and what I want, like First name) and tell you I want that information from every page on the entire site?
 
Ahh, it seems that Jake232 was talking about Arachnid by dchuk.

@dchuk Yeah, I appreciate that a lot of people will have their own solutions for any scraping requirements that they may have, but at the same time there are many who do not have access to this kind of resource. That is the gap in the market that I am looking to fill with this service. Thanks for sharing the code for Arachnid; I will have a look, and at the very least try to understand the approach you have taken. Hopefully I can learn something from your methods!
 
The script can scrape text, but if it is scraping images it can upload them to a specified FTP server or return their original URLs.

Ok cool. May I suggest you include a way to scrape pages that require POST requests (which is quite easy anyway), because that is where some powerful things can be done with PHP.
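
Something along these lines would do it (a bare-bones sketch; the URL and form fields are made up):

<?php
// Bare-bones POST via cURL; URL and form fields are invented for the example.
$ch = curl_init('http://example.com/search');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('q' => 'widgets')));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it
$html = curl_exec($ch);
curl_close($ch);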

No URLs to give you, but you sure have a nice offer. Have fun scraping. :drinkup:
 
What if I give you one page with an example of the structure (and what I want, like First name) and tell you I want that information from every page on the entire site?

This is exactly the objective I am working to achieve: nice and automated, define the URL to scrape, then grab the required data from the entire site so reliably that it can be handled by other scripts or even spun and used as content.
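
Very roughly, the crawl part looks like this. A heavily simplified sketch: no politeness delays, no error handling, and relative links are skipped rather than resolved:

<?php
// Heavily simplified crawl sketch: start at one URL, follow same-host links,
// and run the same labelled-data extraction on every page found.
$queue = array('http://example.com/');
$seen  = array();
$host  = parse_url($queue[0], PHP_URL_HOST);

while ($url = array_shift($queue)) {
    if (isset($seen[$url])) continue;
    $seen[$url] = true;

    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));
    $xpath = new DOMXPath($doc);

    // Same labelled-data extraction as the "First Name" sketch earlier.
    foreach ($xpath->query('//*[normalize-space(text())="First Name"]/following-sibling::*[1]') as $n) {
        echo trim($n->textContent), "\n";
    }

    // Queue absolute links that stay on the same host
    // (relative links would need resolving against $url first).
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $queue[] = $href;
        }
    }
}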
 
If you're using PHP, are you using cURL, raw sockets, or fopen?

If cURL, are you using an async/non-blocking version of it? How about polling/queueing?

Just curious; I've been testing a non-blocking multi-cURL script that has a queue...
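
The core of it is something like this, stripped right down (the real version tops the handle pool up from the queue as transfers finish):

<?php
// Stripped-down non-blocking fetch with curl_multi. A real version would
// feed new handles in from a queue as each transfer completes.
$urls = array('http://example.com/a', 'http://example.com/b');

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive every transfer concurrently instead of blocking on each in turn.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // sleep until there is activity on a handle
} while ($running > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch); // fetched body for this handle
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);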
 
Thanks for the PMs. I had an interesting eBay scraping request that has made me rethink the way I am doing things; basically, the system now caches pages that are being scraped. If anybody else has any scraping requests, then please send me the details and I will have a look!

@Bofu2U Sweet. I haven't started working on proxy support yet, but when I'm there I'll hit you up.

@eliquid I am using file_get_contents() for simple requests and cURL for requests where I need to send header information. I will have a look at asynchronous, non-blocking requests. The system does have a queue, and can be set to process a defined number of jobs per minute. (There's a simplified sketch of the fetch layer at the end of this post.)

@mattseh Thanks for sharing that, I will have a look.

@Berto Cheers for the PM, I will drop you a reply in a minute.
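
For anyone curious, the fetch layer is roughly along these lines. A simplified sketch only; the cache path and one-hour lifetime are for illustration, not the real service settings:

<?php
// Simplified fetch layer: file_get_contents() for plain GETs, cURL when
// custom headers are needed, plus a naive file cache so a page scraped
// for one job isn't re-fetched for the next.
function fetch($url, array $headers = array()) {
    $cacheFile = '/tmp/scrape_cache_' . md5($url); // illustrative path only
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < 3600) {
        return file_get_contents($cacheFile); // cache hit
    }

    if (empty($headers)) {
        $html = file_get_contents($url); // simple request, no custom headers
    } else {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
    }

    file_put_contents($cacheFile, $html);
    return $html;
}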