Hey guys, I have a scraping project about to start and would like to know which tool I should use. Here's what I'm thinking:
1. Browser-based scraper - Some of the sites rely on JavaScript, so the scraper has to be able to run it.
2. Pull data into a database - PREFERABLY connect to a remote MySQL database and compare against / dump into it directly - or save to a local database and upload / import that one.
3. Thinking Windows-based VPS to do the work since I trust Windows browsers more than Linux (although I'm a Linux user and more comfortable in Linux - am I wrong here?)
4. Random proxies from a file or database or somewhere
5. The XPath will be consistent, so that seems like money, but sometimes the pages have HTML errors, and last time I had to use a PHP library to clean up the HTML before the XPath would work. Might be an issue?
6. Already have a list of URLs so that part is easy. Will need to run a PHP-style urldecode() on them though, since they'll be a bit garbled.
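To make items 4 and 6 concrete, here's roughly what I have in mind, sketched in Python just for illustration (the same idea should carry over to Ruby or PHP). The helper names `decode_url`, `load_proxies` and `pick_proxy` are made up, and the one-proxy-per-line `host:port` file format is only an assumption on my part:

```python
import random
from urllib.parse import unquote_plus


def decode_url(raw):
    """Rough stand-in for PHP's urldecode(): percent-decode, '+' becomes a space."""
    return unquote_plus(raw)


def load_proxies(lines):
    """Collect proxies from an iterable of lines (e.g. an open file).

    Assumed format: one 'host:port' per line; blank lines and
    lines starting with '#' are skipped.
    """
    proxies = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            proxies.append(entry)
    return proxies


def pick_proxy(proxies):
    """Return one proxy chosen uniformly at random from the list."""
    return random.choice(proxies)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real URL list and proxy file.
    print(decode_url("http%3A%2F%2Fexample.com%2Fpage%3Fid%3D1"))
    sample = ["10.0.0.1:8080\n", "# disabled\n", "\n", "10.0.0.2:3128\n"]
    print(pick_proxy(load_proxies(sample)))
```

In real use I'd pass an open file handle to `load_proxies` and pick a fresh proxy per request (or per batch of requests), whichever scraper ends up driving the browser.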
I don't know Ruby, so whether I go with Watir or Selenium I'd have to learn the API either way. (I do know PHP/cURL, but that doesn't get me the JavaScript handling I need.)
Any input on which way I should lean? I'm an OSS guy.
All things being equal, I'd say learning Ruby has more value for my possible future needs.
THANKS!