selenium + python for bots

Hey guys, I've got a scraping project about to start and would like to know what tools I should use... Here's what I'm thinking:

1. Browser-based scraper - there are some sites with JavaScript that I need it to handle.

2. Pull data into a database - PREFERABLY connect and compare/dump to a remote MySQL database, or save to a local one and upload/import that.

3. Thinking of a Windows-based VPS to do the work, since I trust Windows browsers more than Linux ones (although I'm a Linux user and more comfortable in Linux - am I wrong here?)

4. Random proxies from a file or database or somewhere

5. XPath will be consistent, so that seems like money, but sometimes there are HTML errors, and last time I had to use a PHP library to clean the HTML before running XPath. Might that be an issue?

6. Already have a list of URLs, so that part is easy. I'll need to run a urldecode() PHP-style function on them, though, since they'll be a bit garbled (see the sketch below).
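
For reference, the Python equivalent would be something like this (assuming I go the Python route - the URL here is made up):

Code:
import urllib

# Hypothetical percent-encoded URL from the list.
url = 'http://example.com/search%3Fq%3Dred%20widgets'
print urllib.unquote(url)
# -> http://example.com/search?q=red widgets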

I don't know Ruby, so with Watir vs. Selenium I'd have to learn the API either way (I do know PHP/cURL, but I need JavaScript support).

Any input on what I should lean towards? I'm an OSS guy.

If all things are considered equal, I'd say learning Ruby has more value for my possible future needs.

THANKS!
 



1. Selenium / Splinter are great.
2. Django ORM / SQLAlchemy FTW (depending on whether you already have a database) - quick sketch after this list.
3. A headless Linux VPS is fine; Linode has been fine for scraping for me.
4. For proxies, rip ProxyManager from https://github.com/mattseh/python-web/ and use it with your browser scraper.
5. XPath in Splinter is great, though I often feed the page source into lxml's etree for the XPath syntax I'm more used to (sketch after this list). It's never choked on malformed HTML; Beautiful Soup has - don't use it.
6. Agreed, shouldn't be a problem.
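
For point 5, a quick sketch of what I mean - feed the page source into lxml and run your XPath over it (the broken HTML here is made up; lxml parses it without complaint):

Code:
import lxml.html

# Deliberately broken markup: unclosed tags, missing </body> and </html>.
source = '<html><body><div class="price">$9.99<p>unclosed'
doc = lxml.html.fromstring(source)

# Same XPath you'd use against the live page.
print doc.xpath('//div[@class="price"]/text()')
# -> ['$9.99']

And for point 2, a rough sketch of dumping rows into a remote MySQL database with SQLAlchemy (connection string and table are invented; assumes the MySQLdb driver is installed):

Code:
from sqlalchemy import create_engine, MetaData, Table, Column, String

# Hypothetical remote MySQL connection.
engine = create_engine('mysql://user:password@db.example.com/scraped')
metadata = MetaData()
pages = Table('pages', metadata,
              Column('url', String(255)),
              Column('title', String(255)))
metadata.create_all(engine)  # creates the table if it doesn't exist

# Insert one scraped row.
engine.execute(pages.insert(), url='http://example.com/', title='Example')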

vim is win, good choice!

You have my skype, feel free to ping me about anything man!
 
WHY FLASH NO WORK ON MY VPS MAAAN IT DRIVE ME CRAZY.

Flash support on Linux is very poor, sorry.
If you have a Debian-based distro, follow this:
Install the debian-multimedia keyring:

Code:
wget http://www.debian-multimedia.org/pool/main/d/debian-multimedia-keyring/debian-multimedia-keyring_2008.10.16_all.deb

dpkg -i debian-multimedia-keyring_2008.10.16_all.deb


If you are on a 64-bit machine, install the 32-bit libs:

Code:
aptitude install ia32-libs ia32-libs-libnss3 ia32-libs-libcurl3 libcurl3 nspluginwrapper


Install the Flash player for all browsers on the machine:

Code:
aptitude install flashplayer-mozilla
 

Thanks MattSeh! I still don't understand the value-add of Splinter over Selenium, besides maybe it being easier for noobs like me, but I'll go in that direction.

Glad to know headless Linux could end up working well. I'll want to start with a head so I can watch it progress... Maybe I'll just start coding this on my own machine and then free up my extra Linode for production.
 
I just told mattseh this, but I've been raving about Watir for Ruby here for like two years now... I'm glad more people are getting into botting, but damn, where's the Ruby love?
 

You can VNC and watch bots as they run on servers ;)
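
If you'd rather stay fully headless until you want to look, here's a sketch using pyvirtualdisplay, a Python wrapper around Xvfb (assumes Xvfb is installed) - the browser runs inside a virtual display on the server, and you can point x11vnc at that same display whenever you want to watch:

Code:
from pyvirtualdisplay import Display
from splinter import Browser

# Start a virtual framebuffer; the browser opens inside it, not on a real screen.
display = Display(visible=0, size=(1024, 768))
display.start()

browser = Browser('firefox')
browser.visit('http://example.com/')
print browser.title

browser.quit()
display.stop()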
 
Some of Ruby's most popular libraries are built to interact with headless browsers.

- capybara: a Rack testing driver that supports Selenium (capybara-webkit for WebKit).

It's a testing driver, so you just have to extract the API out of test objects.

Code:
  # Drives the browser through a hypothetical login form.
  def sign_in
    visit '/log_in'
    fill_in 'Login', :with => 'user@example.com'
    fill_in 'Password', :with => 'password'
    click_link 'Sign in'
  end

  sign_in

Also, you should be banned for posting unformatted code.
 
A mod of code from this thread, for grabbing captchas with Splinter:

Code:
import StringIO

from PIL import Image


def get_image_screenshot(browser, element):
    """Screenshot the whole page, then crop out just the given element."""
    # Reach through Splinter's wrapper to the underlying WebDriver element.
    element = element._element
    screenshot = browser.driver.get_screenshot_as_base64()
    file_like = StringIO.StringIO(screenshot.decode('base64'))
    img = Image.open(file_like)
    # Build the crop box from the element's position and size on the page.
    left = element.location['x']
    top = element.location['y']
    right = left + element.size['width']
    bottom = top + element.size['height']
    cropped_img = img.crop((left, top, right, bottom))
    ret_file_like = StringIO.StringIO()
    cropped_img.save(ret_file_like, format='PNG')
    ret_file_like.seek(0)
    return ret_file_like
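
Hypothetical usage, assuming you're already on a page with a captcha (the URL and selector are made up):

Code:
from splinter import Browser
from PIL import Image

browser = Browser('firefox')
browser.visit('http://example.com/signup')
captcha = browser.find_by_css('img.captcha').first  # made-up selector
png_file = get_image_screenshot(browser, captcha)
Image.open(png_file).save('captcha.png')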