Scraping a website using AJAX

Jeff-DBA

Hey all,

I'm determined to create a bot that will latch on to this penny auction website (Auction | Cheap TV laptop ipod auctions - bid & win on Swoopo) and parse out data such as auction price, bidder, number of bids, bid frequency, etc. for each individual auction as it dynamically changes via AJAX, until the auction has ended.

My current thought is that I can use LiveHTTPHeaders to read the data as it changes, then use PHP to parse that data and make sense of it, but it changes about 5-7 times per second, continually. Does anyone know of a way I can latch onto this domain and store all of its header requests as they change for an extended period of time?

Every bot/scraper I've ever written has been executed via cron job on the server side and hasn't had to "watch" something dynamic. Any input or ideas that come to mind will be very helpful to me. Thanks in advance for any advice you all may have.
 


Ruby + Watir or Selenium is what I'd use to scrape the data, then parse it however you want.

They're both slow compared to server-side/CLI scripts.

You might also try Firebug instead of LiveHTTPHeaders.
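
If you go the Selenium route, something like this (Python bindings shown) is the general shape; it's only a sketch, and the auction URL plus the .price/.bidder CSS selectors are placeholders I made up, not the site's real markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()                  # needs Firefox + geckodriver installed
driver.get("http://example.com/auction/123")  # placeholder URL, not the real auction page
try:
    while True:
        # the page rewrites these elements via AJAX, so just keep re-reading them
        price = driver.find_element(By.CSS_SELECTOR, ".price").text    # assumed selector
        bidder = driver.find_element(By.CSS_SELECTOR, ".bidder").text  # assumed selector
        print(price, bidder)
        time.sleep(1)
finally:
    driver.quit()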
 
You'd want to use Python / Ruby etc., as they have threads, so you can have a thread per auction. In each thread, have an infinite while loop that grabs the data, parses it, then sleeps for a while.
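
In Python, that thread-per-auction loop might look like this; a sketch under assumptions only: FEED_URL and the comma-separated "time_left,bidder,price" format are hypothetical stand-ins for whatever the site's real AJAX endpoint returns.

import threading
import time
import urllib.request

# placeholder endpoint; the real URL/format must come from watching the site's AJAX call
FEED_URL = "http://example.com/status?auction_id={id}"

def poll_auction(auction_id, interval=2.0):
    """One thread per auction: fetch, parse, log, sleep, until the auction ends."""
    while True:
        raw = urllib.request.urlopen(FEED_URL.format(id=auction_id)).read().decode()
        time_left, bidder, price = raw.split(",")    # assumed comma-separated fields
        print(auction_id, time_left, bidder, price)  # swap in a DB insert here
        if int(time_left) <= 0:                      # auction over, let the thread die
            break
        time.sleep(interval)                         # throttle the polling

auction_ids = [101, 102, 103]  # normally scraped from the homepage first
threads = [threading.Thread(target=poll_auction, args=(a,)) for a in auction_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()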
 
You could take a look at the AJAX HTTP call and figure out the way it works in under 5 minutes. I just checked, and it makes a simple call with the auction IDs in the query string and returns a list of easily parseable fields for time left, bidder name, and current price.

Set up a web-based script to find the IDs on the homepage, then have another script keep requesting data about those IDs until they stop, and log it all to your database.
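
For the first script just described, something this simple could do; the homepage URL and the auction_id=NNN pattern are guesses for illustration, not the site's actual markup:

import re
import urllib.request

html = urllib.request.urlopen("http://example.com/").read().decode()  # placeholder homepage
auction_ids = sorted(set(re.findall(r"auction_id=(\d+)", html)))      # assumed URL pattern
print(auction_ids)  # hand these off to the polling script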
 
Thanks bigmoneyrob. Does Firebug provide the headers in a format that is easier to work with, or something? How much of a learning curve am I looking to overcome if I try something new such as Selenium for the scraping?

Firebug is a little less chaotic. I use LiveHTTPHeaders when I really need to dig deep and don't mind searching through a streaming clusterfuck of text to find what I need.

You should be able to pick up Selenium/Watir in a couple of hours.
 
You could take a look at the AJAX HTTP call and figure out the way it works in under 5 minutes. I just checked, and it makes a simple call with the auction IDs in the query string and returns a list of easily parseable fields for time left, bidder name, and current price.

Set up a web-based script to find the IDs on the homepage, then have another script keep requesting data about those IDs until they stop, and log it all to your database.

This is exactly what I was thinking initially :) Do you know how to set up the second script to keep requesting data about those IDs that I acquired with the first script? I was thinking of having a cron job continually refresh the script that calls for that data, but I can only execute a cron job once every minute, right?
 
This is exactly what I was thinking initially :) Do you know how to set up the second script to keep requesting data about those IDs that I acquired with the first script? I was thinking of having a cron job continually refresh the script that calls for that data, but I can only execute a cron job once every minute, right?

True. It would be more practical to build a control panel that uses AJAX requests to call the script, say, every 10 seconds (or however long you choose).
 
I'd execute a cron job every 10 minutes and have it start a loop that ends a few seconds before the next cron job starts. If you make it sleep for a while between iterations (say, 10 seconds), it'd work perfectly.

I have done this myself previously, so I can tell you it does work.
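
In Python that cron-plus-loop pattern might look like this; the 10-minute window, the 15-second safety margin, and the do_work() body are illustrative assumptions, not a fixed recipe:

import time

# crontab entry (every 10 minutes): */10 * * * * /usr/bin/python3 /path/to/poller.py
RUN_FOR = 10 * 60 - 15  # exit ~15 s before the next cron run, so two copies never overlap
SLEEP = 10              # seconds between polls

def do_work():
    pass  # fetch the auction data and log it to the database here

deadline = time.time() + RUN_FOR
while time.time() < deadline:
    do_work()
    time.sleep(SLEEP)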
 
You could take a look at the AJAX HTTP call and figure out the way it works in under 5 minutes. I just checked, and it makes a simple call with the auction IDs in the query string and returns a list of easily parseable fields for time left, bidder name, and current price.

Set up a web-based script to find the IDs on the homepage, then have another script keep requesting data about those IDs until they stop, and log it all to your database.

^^^^ This for the win! I *love* scraping AJAX-ified sites; the data is a joy to play with compared to sometimes pig-ugly HTML :)