Web Crawler and Scraper


DavidR
New member · Aug 23, 2006
I'm a little over 4 hours into scraping about 4,000 RSS feeds with some software I wrote, and I'm 40% done. I'm only scraping the first page of each feed, too, at 15 "products" per page. Some of these feeds have 300+ pages. I'd need my own mini-Google to pull off scraping that much data!
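For the curious, here's roughly the shape of it. This is just a minimal sketch, not the actual code: the feed URLs, field names, and output filename are made up for illustration, and it assumes the feedparser library is installed.

```python
import csv
import feedparser  # third-party: pip install feedparser

# Hypothetical feed list; the real one had ~4,000 entries.
FEED_URLS = [
    "http://example.com/products/feed1.rss",
    "http://example.com/products/feed2.rss",
]

with open("products.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["feed", "title", "link"])
    for url in FEED_URLS:
        feed = feedparser.parse(url)      # fetch and parse one feed
        for entry in feed.entries[:15]:   # first page only: ~15 products
            writer.writerow([url, entry.get("title", ""), entry.get("link", "")])
```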

The file is already so big that I can't fit it into memory all at once. I'm sure the site I'm pulling the content from just loves me by now. If I'd had a little more foresight about how big the file would get, I would've broken it into chunks automatically. As it stands, I'll have to write more software that loads it in chunks just to break it apart into CSV files.
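In case anyone else ends up in the same spot, the chunking approach looks something like this. Again just a sketch under made-up assumptions: the chunk size and filenames are placeholders, and it assumes one tab-separated record per line.

```python
import csv

CHUNK_LINES = 100_000  # hypothetical chunk size; tune to your memory budget

def split_into_csv_chunks(path):
    """Stream the big file line by line and write fixed-size CSV chunks,
    so the whole file is never held in memory at once."""
    out, writer, chunk_num = None, None, 0
    with open(path, "r", encoding="utf-8", errors="replace") as big:
        for i, line in enumerate(big):
            if i % CHUNK_LINES == 0:          # time to start a new chunk
                if out:
                    out.close()
                chunk_num += 1
                out = open(f"chunk_{chunk_num:04d}.csv", "w", newline="")
                writer = csv.writer(out)
            # Assumes one record per line with tab-separated fields;
            # adjust the split to match the real file's layout.
            writer.writerow(line.rstrip("\n").split("\t"))
    if out:
        out.close()

split_into_csv_chunks("scraped_products.txt")  # hypothetical filename
```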

Just thought I'd share :D
 


The reason it's so slow is that I sacrificed performance for rapid development. If I'd written it in a faster language, it would've taken longer to implement. I just wanted to git 'er done, I suppose :D

I may rewrite it when I want to cast a wider net.
 
Hmmm... why do those figures sound familiar?

If your IP is either 66.33.208.* or 74.32.170.* and you're hitting a network of 17 large product-info sites, you should probably PM me.
If not... never mind... and good luck!
 