Web Crawler and Scraper


DavidR
New member · Aug 23, 2006
I'm a little over 4 hours into scraping about 4,000 RSS feeds with some software I wrote, and I'm 40% done. I'm only scraping the first page of each feed, too, at 15 "products" per page. Some of these feeds have 300+ pages. I'd need my own mini-Google to pull off scraping that much data!
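For the curious, here's roughly the shape of it. This is just a minimal sketch, not the actual code: the feed URLs, field names, and output filename are made up for illustration, and it assumes the feedparser library is installed.

```python
import csv
import feedparser  # third-party: pip install feedparser

# Hypothetical feed list; the real one had ~4,000 entries.
FEED_URLS = [
    "http://example.com/products/feed1.rss",
    "http://example.com/products/feed2.rss",
]

with open("products.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["feed", "title", "link"])
    for url in FEED_URLS:
        feed = feedparser.parse(url)      # fetch and parse one feed
        for entry in feed.entries[:15]:   # first page only: ~15 products
            writer.writerow([url, entry.get("title", ""), entry.get("link", "")])
```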

The file is already so big that I can't fit it into memory all at once. I'm sure the site I'm pulling the content from just loves me by now. If I'd had a little more foresight about how big the file would get, I would've broken it into chunks automatically. As it stands, I'll have to write more software that loads it in chunks just to break it apart into CSV files.
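In case anyone else ends up in the same spot, the chunking approach looks something like this. Again just a sketch under made-up assumptions: the chunk size and filenames are placeholders, and it assumes one tab-separated record per line.

```python
import csv

CHUNK_LINES = 100_000  # hypothetical chunk size; tune to your memory budget

def split_into_csv_chunks(path):
    """Stream the big file line by line and write fixed-size CSV chunks,
    so the whole file is never held in memory at once."""
    out, writer, chunk_num = None, None, 0
    with open(path, "r", encoding="utf-8", errors="replace") as big:
        for i, line in enumerate(big):
            if i % CHUNK_LINES == 0:          # time to start a new chunk
                if out:
                    out.close()
                chunk_num += 1
                out = open(f"chunk_{chunk_num:04d}.csv", "w", newline="")
                writer = csv.writer(out)
            # Assumes one record per line with tab-separated fields;
            # adjust the split to match the real file's layout.
            writer.writerow(line.rstrip("\n").split("\t"))
    if out:
        out.close()

split_into_csv_chunks("scraped_products.txt")  # hypothetical filename
```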

Just thought I'd share :D
 


The reason it's so slow is that I sacrificed performance for rapid development. If I'd written it in a faster language, it would've taken longer to implement. I just wanted to git 'er done, I suppose :D

I may rewrite it when I want to cast a wider net.
 
Hmmm... why do those figures sound familiar?

If your IP is either 66.33.208.* or 74.32.170.* and you're hitting a network of 17 large product-info sites, you should probably PM me.
If not... never mind... and good luck!
 