Scraping a website using AJAX

Jeff-DBA

Hey all,

I'm determined to create a bot that will latch on to this penny auction website (Auction | Cheap TV laptop ipod auctions - bid & win on Swoopo) and parse out data such as auction price, bidder, number of bids, bid frequency, etc. for each individual auction as it dynamically changes via AJAX, until the auction has ended.

My current thought is that I can use LiveHTTPHeaders to read the data as it changes, then use PHP to parse that data and make sense of it, but it changes about 5-7 times per second, continually. Does anyone know of a way I can latch onto this domain and store all of its header requests as they change for an extended period of time?

Every bot/scraper I've ever written has been executed via cron job on the server side and hasn't had to "watch" something dynamic. Any input or ideas that come to mind will be very helpful to me. Thanks in advance for any advice you all may have.
 


Ruby + Watir or Selenium is what I'd use to scrape the data, then parse it however you want.

They're both slow compared to server-side/CLI scripts.

You might also try Firebug instead of LiveHTTPHeaders.
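
If you go the Selenium route, something like this (Python bindings shown) is the general shape; it's only a sketch, and the auction URL plus the .price/.bidder CSS selectors are placeholders I made up, not the site's real markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()                  # needs Firefox + geckodriver installed
driver.get("http://example.com/auction/123")  # placeholder URL, not the real auction page
try:
    while True:
        # the page rewrites these elements via AJAX, so just keep re-reading them
        price = driver.find_element(By.CSS_SELECTOR, ".price").text    # assumed selector
        bidder = driver.find_element(By.CSS_SELECTOR, ".bidder").text  # assumed selector
        print(price, bidder)
        time.sleep(1)
finally:
    driver.quit()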
 
You'd want to use Python / Ruby etc., as they have threads, so you can have a thread per auction. In each thread, have an infinite while loop that grabs the data, parses it, then sleeps for a while.
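
In Python, that thread-per-auction loop might look like this; a sketch under assumptions only: FEED_URL and the comma-separated "time_left,bidder,price" format are hypothetical stand-ins for whatever the site's real AJAX endpoint returns.

import threading
import time
import urllib.request

# placeholder endpoint; the real URL/format must come from watching the site's AJAX call
FEED_URL = "http://example.com/status?auction_id={id}"

def poll_auction(auction_id, interval=2.0):
    """One thread per auction: fetch, parse, log, sleep, until the auction ends."""
    while True:
        raw = urllib.request.urlopen(FEED_URL.format(id=auction_id)).read().decode()
        time_left, bidder, price = raw.split(",")    # assumed comma-separated fields
        print(auction_id, time_left, bidder, price)  # swap in a DB insert here
        if int(time_left) <= 0:                      # auction over, let the thread die
            break
        time.sleep(interval)                         # throttle the polling

auction_ids = [101, 102, 103]  # normally scraped from the homepage first
threads = [threading.Thread(target=poll_auction, args=(a,)) for a in auction_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()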
 
You could take a look at the AJAX HTTP call and figure out the way it works in under 5 minutes. I just checked, and it makes a simple call with the auction IDs in the query string and returns a list of easily parseable fields for time left, bidder name, and current price.

Set up a web-based script to find the IDs on the homepage, then have another script keep requesting data about those IDs until they stop, and log it all to your database.
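
For the first script just described, something this simple could do; the homepage URL and the auction_id=NNN pattern are guesses for illustration, not the site's actual markup:

import re
import urllib.request

html = urllib.request.urlopen("http://example.com/").read().decode()  # placeholder homepage
auction_ids = sorted(set(re.findall(r"auction_id=(\d+)", html)))      # assumed URL pattern
print(auction_ids)  # hand these off to the polling script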
 
Thanks bigmoneyrob. Does Firebug provide the headers in a format that is easier to work with, or something? How much of a learning curve am I looking to overcome if I try something new such as Selenium for the scraping?

Firebug is a little less chaotic. I use LiveHTTPHeaders when I really need to dig deep and don't mind searching through a streaming clusterfuck of text to find what I need.

You should be able to pick up Selenium/Watir in a couple of hours.
 
You could take a look at the AJAX HTTP call and figure out the way it works in under 5 minutes. I just checked, and it makes a simple call with the auction IDs in the query string and returns a list of easily parseable fields for time left, bidder name, and current price.

Set up a web-based script to find the IDs on the homepage, then have another script keep requesting data about those IDs until they stop, and log it all to your database.

This is exactly what I was thinking initially :) Do you know how to set up the second script to keep requesting data about those IDs that I acquired with the first script? I was thinking of having a cron job continually refresh the script that calls for that data, but I can only execute a cron job once every minute, right?
 
This is exactly what I was thinking initially :) Do you know how to set up the second script to keep requesting data about those IDs that I acquired with the first script? I was thinking of having a cron job continually refresh the script that calls for that data, but I can only execute a cron job once every minute, right?

True. It would be more practical to build a control panel that uses AJAX requests to call the script, say, every 10 seconds (or however long you choose).
 
I'd execute a cron job every 10 minutes and have it start a loop that ends a few seconds before the next cron job starts. If you make it sleep for a while between iterations (say, 10 seconds), it'd work perfectly.

I have done this myself previously, so I can tell you it does work.
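
In Python that cron-plus-loop pattern might look like this; the 10-minute window, the 15-second safety margin, and the do_work() body are illustrative assumptions, not a fixed recipe:

import time

# crontab entry (every 10 minutes): */10 * * * * /usr/bin/python3 /path/to/poller.py
RUN_FOR = 10 * 60 - 15  # exit ~15 s before the next cron run, so two copies never overlap
SLEEP = 10              # seconds between polls

def do_work():
    pass  # fetch the auction data and log it to the database here

deadline = time.time() + RUN_FOR
while time.time() < deadline:
    do_work()
    time.sleep(SLEEP)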
 
You could take a look at the AJAX HTTP call and figure out the way it works in under 5 minutes. I just checked, and it makes a simple call with the auction IDs in the query string and returns a list of easily parseable fields for time left, bidder name, and current price.

Set up a web-based script to find the IDs on the homepage, then have another script keep requesting data about those IDs until they stop, and log it all to your database.

^^^^ This for the win! I *love* scraping AJAX-ified sites; the data is a joy to play with compared to sometimes pig-ugly HTML :)