selenium + python for bots

Hammi

Hey,

Selenium 2 seems to be a big step up from the old pre-webdriver selenium.

There are still a few issues with the python connector being unable to set firefox proxies, but I think that is being fixed soon.

If anyone wants to give it a shot with Python, you might find this useful. It's a little function that retrieves an item from the in-memory Firefox cache and saves it to the hard drive. You could probably edit it to send the binary data directly to your captcha solver of choice if you wanted.


[high=python]
import binascii
import uuid


def recover_file_from_cache(browser, key):
    '''
    @arg: browser - instance of selenium.webdriver.Firefox
    @arg: key - the url of the image

    @usage: recover_file_from_cache(browser, 'http://example.com/myjpeg.jpg')
    '''
    cachepath = ''.join(['about:cache-entry?client=HTTP&sb=1&key=', key])
    b = browser

    # open a new window to retrieve the cached image from
    b.execute_script('window.open()')
    main_window = browser.current_window_handle
    cache_window = [a for a in browser.window_handles if a != main_window][0]
    b.switch_to_window(cache_window)
    b.get(cachepath)

    # extract the hex dump from the page and rebuild the binary data
    representation = b.find_element_by_tag_name('pre').text
    cleandump = [a[11:73] for a in representation.strip().split('\n')]
    hs = ''.join(cleandump).replace(' ', '')  # hex string
    hb = binascii.a2b_hex(hs)  # hex to binary

    # replace with the path you want to save the captchas to
    f = open('/home/h/Desktop/captchas/' + str(uuid.uuid4().time_low) + '.jpg', 'wb')
    f.write(hb)  # write binary data to file
    f.close()

    # close the cache window and switch back to the main one
    b.close()
    b.switch_to_window(main_window)
[/high]
 


That's neat; I worked on something similar with Watir and Ruby a while back. What's the speed like on Selenium 2? I found that Watir would be slow as fuck from time to time.
 
It is similar to Watir from what I remember. The slowest part is opening/closing the browser.

It automatically handles multiple simultaneous instances of Firefox with separate profiles, which I found really nice. I'm hoping I can figure out how to update preferences (including the proxy) on a browser that is already open. Then I could keep a pool of browsers alive, and whenever I have a new task, set up cookies/proxies/user-agent and get started in about 1/20th of the time.
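
Something along these lines is what I have in mind (untested sketch; the pool class is made up, and clearing cookies is the only per-task reset the bindings support right now):

Code:
from selenium import webdriver


class BrowserPool(object):
    """Hypothetical pool: pay the slow open/close cost once,
    then hand out live browsers per task."""

    def __init__(self, size=3):
        self.idle = [webdriver.Firefox() for _ in range(size)]

    def acquire(self):
        return self.idle.pop()

    def release(self, browser):
        browser.delete_all_cookies()  # reset per-task state
        self.idle.append(browser)

    def shutdown(self):
        for browser in self.idle:
            browser.quit()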
 
Very nice! pip has completely failed to install selenium for me; I'll have to try when I get back to my main machine. Lots of possibilities!
 
Splinter

uplinked recommended that to me instead of Selenium and it's quite a bit easier to use. It still doesn't support setting a proxy, but you could probably write a default class to handle that via the about:config preferences in Firefox. It's on my list of things to do or I'd link to it.
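
For anyone curious, basic Splinter usage looks roughly like this (untested sketch; the URL and field names are placeholders):

Code:
from splinter import Browser

browser = Browser('firefox')
browser.visit('http://example.com/login')  # placeholder URL
browser.fill('username', 'me')             # fill inputs by name
browser.fill('password', 'secret')
browser.find_by_name('submit').first.click()
print browser.html                         # rendered source, scripts included
browser.quit()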
 
I'm working on a project at the moment that uses selenium; here's how I solved the captcha problem:

Code:
from PIL import Image
from selenium.webdriver import Firefox
import cStringIO


class FirefoxWithElementScreenShot(Firefox):
    def __init__(self, *args, **kwargs):
        super(FirefoxWithElementScreenShot, self).__init__(*args, **kwargs)
        self.execute_script("window.resizeTo(1280, 1024)")

    def get_screenshot_of_element_as_base_64(self, element):
        """
        Take a screenshot of just a single element and return a
        file-like object (e.g. cStringIO.StringIO).
        """
        # grab the whole window, then crop down to the element's box
        screenshot = self.get_screenshot_as_base64()
        file_like = cStringIO.StringIO(screenshot.decode('base64'))
        img = Image.open(file_like)
        left = element.location['x']
        top = element.location['y']
        right = left + element.size['width']
        bottom = top + element.size['height']
        cropped_img = img.crop((left, top, right, bottom))
        ret_file_like = cStringIO.StringIO()
        cropped_img.save(ret_file_like, format='png')
        ret_file_like.seek(0)  # rewind so callers can read from the start
        return ret_file_like
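Usage would look something like this (the element id is made up):

Code:
browser = FirefoxWithElementScreenShot()
browser.get('http://example.com/signup')               # placeholder URL
captcha = browser.find_element_by_id('captcha_image')  # hypothetical id
png = browser.get_screenshot_of_element_as_base_64(captcha)
Image.open(png).save('captcha.png')
browser.quit()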
The webdriver API is slender (to say the least). If you have an element, you can only get an attribute if you know its name; there's no way to list all the attributes of the current element. You can get the parent element but not children/descendants, so no tree walking (though further searches are scoped to the element, which suffices). You can also get the page source at any given time (after scripts have run) and parse it with lxml or similar, but I feel that shouldn't be necessary. Don't get me wrong though, it is a nice tool.
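
To illustrate both workarounds (untested; assumes browser is a live webdriver instance):

Code:
from lxml import html

# searches started from an element are scoped to its subtree
form = browser.find_element_by_tag_name('form')
fields = form.find_elements_by_tag_name('input')

# for anything the api won't give you (e.g. all attributes of an
# element), parse the live source with lxml instead
tree = html.fromstring(browser.page_source)
attributes = dict(tree.xpath('//form')[0].attrib)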
Splinter looks like it could be a little better, I'll check that out.

I haven't tried it yet, but I assume the Python equivalent of this Java works to add proxies:

Webdriver and proxy server for firefox - Stack Overflow
 
I've been learning Python for about 3-4 hours, and I think even I could manage to throw something together in Splinter. It looks pretty amazing, to be honest.

Might have a play with it in a day or two.
 
Btw, you can get proxies working easily by wrapping the proxy IP in an extra pair of double quotes, like "ip.goes.here.lol".

example would be:

Code:
from selenium import webdriver
from selenium.webdriver import FirefoxProfile

profile = FirefoxProfile()

profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy settings
profile.set_preference('network.proxy.http', '"your.proxy.ip.here"')  # note the doubled quotes
profile.set_preference('network.proxy.http_port', 8080)  # placeholder port
# ... set proxies for ssl as well

browser = webdriver.Firefox(profile)

p.s. How do you create code blocks in this forum?
 
This is exactly what I tried and it doesn't work at the moment. I think it's a bug in the latest selenium release, or maybe just with Firefox 5.
 
I'm running ff6.

Also notice the IP has double quotes inside of single quotes.

As a quick test you can go to the console and print out profile.userPrefs, then open that file and look at it. The proxy IP should be wrapped in "ip" within the javascript file; if it's not, it breaks the load.
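
Quick sketch of that test (untested; the pref value is a placeholder):

Code:
from selenium.webdriver import FirefoxProfile

profile = FirefoxProfile()
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.http', '"your.proxy.ip.here"')
# userPrefs is the path of the prefs file selenium generates;
# open it and check the ip came out wrapped in double quotes
print profile.userPrefs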

If it's saving right, I don't know what the problem is. I think I'm testing on ff4.
 
I'm loving Splinter so far. Not using proxies or anything, just scraping products from some sites; not as fast as gevent + urllib2, but a lot faster and easier to code. For most things, who cares if scraping a big site takes a day.
 
I also made myself a little bot to scrape some sites. The only problem I ran into was that I had to use Firefox/Chrome; I couldn't use a headless browser because the page I was scraping used some JS.

Is anyone aware of any headless browsers for Python that support JS, that are pretty simple to use? I'm having a hard time finding anything worthwhile.
 
Maybe htmlunit. Most of my stuff is one-off right now, so I don't mind a minimised Firefox on my machine. Check out Xvfb to run X apps (that's Firefox) without an actual display.
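
If you'd rather keep it in Python, the pyvirtualdisplay package wraps Xvfb; rough sketch (untested, assumes Xvfb and pyvirtualdisplay are installed):

Code:
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1280, 1024))  # backed by Xvfb
display.start()

browser = webdriver.Firefox()  # opens on the virtual display
browser.get('http://example.com')
print browser.title

browser.quit()
display.stop()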
 
Alright, I'll take a look.

On another note, maybe it's just my machine, but does your Firefox RAM usage increase with every navigated page when using Splinter? It's as if it doesn't free the memory used by the last page.

On my machine it just keeps increasing by around 2-3 MB for every single page change.

I got around it by doing the following inside my loop:

Code:
if int(i) % 10 == 0:
    browser.quit()       # drop the bloated instance
    browser = Browser()  # start a fresh one

It's probably a newbie way to do it, but it works. Once the browser has been reopened, the memory usage starts climbing all over again (hence opening a new browser every 10 page changes).
 
I am currently using 115 MB for that Firefox; it seems to be holding steady. Closing a browser will destroy the session, so it's not a great workaround. Make sure you are using the latest version of Firefox, I guess.
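
If you need the session to survive a restart, you could probably carry the cookies across at the selenium level (untested sketch; assumes browser is a live webdriver instance; Splinter wraps a webdriver, so the idea should translate):

Code:
from selenium import webdriver

cookies = browser.get_cookies()  # save before quitting
browser.quit()

browser = webdriver.Firefox()
browser.get('http://example.com')  # must be on the domain before adding cookies
for cookie in cookies:
    browser.add_cookie(cookie)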