selenium + python for bots

Hey guys, I've got a scraping project about to start and would like to know what tools I should use... Here's what I'm thinking:

1. Browser-based scraper - there are some sites with JavaScript that I need it to handle.

2. Pull data into a database - PREFERABLY connect and compare/dump to a remote MySQL database, or save to a local one and upload/import that.

3. Thinking of a Windows-based VPS to do the work, since I trust Windows browsers more than Linux ones (although I'm a Linux user and more comfortable in Linux - am I wrong here?)

4. Random proxies from a file or database or somewhere

5. XPath will be consistent, so that seems like money, but sometimes there are HTML errors, and last time I had to use a PHP library to clean the HTML before running XPath. Might that be an issue?

6. Already have a list of URLs, so that part is easy. I'll need to run a urldecode() PHP-style function on them, though, since they'll be a bit garbled (see the sketch below).
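
For reference, the Python equivalent would be something like this (assuming I go the Python route - the URL here is made up):

Code:
import urllib

# Hypothetical percent-encoded URL from the list.
url = 'http://example.com/search%3Fq%3Dred%20widgets'
print urllib.unquote(url)
# -> http://example.com/search?q=red widgets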

I don't know Ruby, so with Watir vs. Selenium I'd have to learn the API either way (I do know PHP/cURL, but I need JavaScript support).

Any input on what I should lean towards? I'm an OSS guy.

If all things are considered equal, I'd say learning Ruby has more value for my possible future needs.

THANKS!
 



1. Selenium / Splinter are great.
2. Django ORM / SQLAlchemy FTW (depending on whether you already have a database) - quick sketch after this list.
3. A headless Linux VPS is fine; Linode has been fine for scraping for me.
4. For proxies, rip ProxyManager from https://github.com/mattseh/python-web/ and use it with your browser scraper.
5. XPath in Splinter is great, though I often feed the page source into lxml's etree for the XPath syntax I'm more used to (sketch after this list). It's never choked on malformed HTML; Beautiful Soup has - don't use it.
6. Agreed, shouldn't be a problem.
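
For point 5, a quick sketch of what I mean - feed the page source into lxml and run your XPath over it (the broken HTML here is made up; lxml parses it without complaint):

Code:
import lxml.html

# Deliberately broken markup: unclosed tags, missing </body> and </html>.
source = '<html><body><div class="price">$9.99<p>unclosed'
doc = lxml.html.fromstring(source)

# Same XPath you'd use against the live page.
print doc.xpath('//div[@class="price"]/text()')
# -> ['$9.99']

And for point 2, a rough sketch of dumping rows into a remote MySQL database with SQLAlchemy (connection string and table are invented; assumes the MySQLdb driver is installed):

Code:
from sqlalchemy import create_engine, MetaData, Table, Column, String

# Hypothetical remote MySQL connection.
engine = create_engine('mysql://user:password@db.example.com/scraped')
metadata = MetaData()
pages = Table('pages', metadata,
              Column('url', String(255)),
              Column('title', String(255)))
metadata.create_all(engine)  # creates the table if it doesn't exist

# Insert one scraped row.
engine.execute(pages.insert(), url='http://example.com/', title='Example')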

vim is win, good choice!

You have my skype, feel free to ping me about anything man!
 
WHY FLASH NO WORK ON MY VPS MAAAN IT DRIVE ME CRAZY.

Flash support on Linux is very poor, sorry.
If you have a Debian-based distro, follow this:
Install the debian-multimedia keyring:

Code:
wget http://www.debian-multimedia.org/pool/main/d/debian-multimedia-keyring/debian-multimedia-keyring_2008.10.16_all.deb

dpkg -i debian-multimedia-keyring_2008.10.16_all.deb


If you are on a 64-bit machine, install the 32-bit libs:

Code:
aptitude install ia32-libs ia32-libs-libnss3 ia32-libs-libcurl3 libcurl3 nspluginwrapper


Install the Flash player for all browsers on the machine:

Code:
aptitude install flashplayer-mozilla
 

Thanks MattSeh! I still don't understand the value-add of Splinter over Selenium, besides maybe it being easier for noobs like me, but I'll go in that direction.

Glad to know headless Linux could end up working well. I'll want to start with a head so I can watch it progress... Maybe I'll just start coding this on my own machine and then free up my extra Linode for production.
 
I just told mattseh this, but I've been raving about Watir for Ruby here for like two years now... I'm glad more people are getting into botting, but damn, where's the Ruby love?
 

You can VNC and watch bots as they run on servers ;)
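
If you'd rather stay fully headless until you want to look, here's a sketch using pyvirtualdisplay, a Python wrapper around Xvfb (assumes Xvfb is installed) - the browser runs inside a virtual display on the server, and you can point x11vnc at that same display whenever you want to watch:

Code:
from pyvirtualdisplay import Display
from splinter import Browser

# Start a virtual framebuffer; the browser opens inside it, not on a real screen.
display = Display(visible=0, size=(1024, 768))
display.start()

browser = Browser('firefox')
browser.visit('http://example.com/')
print browser.title

browser.quit()
display.stop()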
 
Some of Ruby's most popular libraries are built to interact with headless browsers.

- capybara: a Rack testing driver that supports Selenium (capybara-webkit for WebKit).

It's a testing driver, so you just have to extract the API out of test objects.

Code:
  # Drives the browser through a hypothetical login form.
  def sign_in
    visit '/log_in'
    fill_in 'Login', :with => 'user@example.com'
    fill_in 'Password', :with => 'password'
    click_link 'Sign in'
  end

  sign_in

Also, you should be banned for posting unformatted code.
 
A mod of code from this thread, for grabbing captchas with Splinter:

Code:
import StringIO

from PIL import Image


def get_image_screenshot(browser, element):
    """Screenshot the whole page, then crop out just the given element."""
    # Reach through Splinter's wrapper to the underlying WebDriver element.
    element = element._element
    screenshot = browser.driver.get_screenshot_as_base64()
    file_like = StringIO.StringIO(screenshot.decode('base64'))
    img = Image.open(file_like)
    # Build the crop box from the element's position and size on the page.
    left = element.location['x']
    top = element.location['y']
    right = left + element.size['width']
    bottom = top + element.size['height']
    cropped_img = img.crop((left, top, right, bottom))
    ret_file_like = StringIO.StringIO()
    cropped_img.save(ret_file_like, format='PNG')
    ret_file_like.seek(0)
    return ret_file_like
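
Hypothetical usage, assuming you're already on a page with a captcha (the URL and selector are made up):

Code:
from splinter import Browser
from PIL import Image

browser = Browser('firefox')
browser.visit('http://example.com/signup')
captcha = browser.find_by_css('img.captcha').first  # made-up selector
png_file = get_image_screenshot(browser, captcha)
Image.open(png_file).save('captcha.png')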