selenium + python for bots

Hammi

Hey,

Selenium 2 seems to be a big step up from the old pre-webdriver selenium.

There are still a few issues with the python connector being unable to set firefox proxies, but I think that is being fixed soon.

If anyone wants to give it a shot with Python, you might find this useful. It's a little function that retrieves an item from the in-memory Firefox cache and saves it to the hard drive. You could probably edit it to send the binary data directly to your captcha solver of choice if you wanted.


[high=python]
import binascii
import uuid


def recover_file_from_cache(browser, key):
    '''
    @arg: browser - instance of selenium.webdriver.Firefox
    @arg: key - the url of the image

    @usage: recover_file_from_cache(browser, 'http://example.com/myjpeg.jpg')
    '''
    cachepath = ''.join(['about:cache-entry?client=HTTP&sb=1&key=', key])
    b = browser

    # open a new window to retrieve the cached image from
    b.execute_script('window.open()')
    main_window = browser.current_window_handle
    cache_window = [a for a in browser.window_handles if a != main_window][0]
    b.switch_to_window(cache_window)
    b.get(cachepath)

    # extract the hex dump from the page and rebuild the binary data
    representation = b.find_element_by_tag_name('pre').text
    cleandump = [a[11:73] for a in representation.strip().split('\n')]
    hs = ''.join(cleandump).replace(' ', '')  # hex string
    hb = binascii.a2b_hex(hs)  # hex to binary

    # replace with the path you want to save the captchas to
    f = open('/home/h/Desktop/captchas/' + str(uuid.uuid4().time_low) + '.jpg', 'wb')
    f.write(hb)  # write binary data to file
    f.close()

    # close the cache window and switch back to the main one
    b.close()
    b.switch_to_window(main_window)
[/high]
 


That's neat; I worked on something similar with Watir and Ruby a while back. What's the speed like on Selenium 2? I found that Watir would be slow as fuck from time to time.
 
It is similar to Watir from what I remember. The slowest part is opening/closing the browser.

It automatically handles multiple simultaneous instances of Firefox with separate profiles, which I found really nice. I'm hoping I can figure out how to update preferences (including the proxy) on a browser that is already open. Then I could keep a pool of browsers alive, and whenever I have a new task, set up cookies/proxies/user-agent and get started in about 1/20th of the time.
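
Something along these lines is what I have in mind (untested sketch; the pool class is made up, and clearing cookies is the only per-task reset the bindings support right now):

Code:
from selenium import webdriver


class BrowserPool(object):
    """Hypothetical pool: pay the slow open/close cost once,
    then hand out live browsers per task."""

    def __init__(self, size=3):
        self.idle = [webdriver.Firefox() for _ in range(size)]

    def acquire(self):
        return self.idle.pop()

    def release(self, browser):
        browser.delete_all_cookies()  # reset per-task state
        self.idle.append(browser)

    def shutdown(self):
        for browser in self.idle:
            browser.quit()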
 
Very nice! pip has completely failed to install selenium for me; I'll have to try when I get back to my main machine. Lots of possibilities!
 
Splinter

uplinked recommended that to me instead of Selenium and it's quite a bit easier to use. It still doesn't support setting a proxy, but you could probably write a default class to handle that via the about:config preferences in Firefox. It's on my list of things to do or I'd link to it.
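
For anyone curious, basic Splinter usage looks roughly like this (untested sketch; the URL and field names are placeholders):

Code:
from splinter import Browser

browser = Browser('firefox')
browser.visit('http://example.com/login')  # placeholder URL
browser.fill('username', 'me')             # fill inputs by name
browser.fill('password', 'secret')
browser.find_by_name('submit').first.click()
print browser.html                         # rendered source, scripts included
browser.quit()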
 
I'm working on a project at the moment that uses selenium; here's how I solved the captcha problem:

Code:
from PIL import Image
from selenium.webdriver import Firefox
import cStringIO


class FirefoxWithElementScreenShot(Firefox):
    def __init__(self, *args, **kwargs):
        super(FirefoxWithElementScreenShot, self).__init__(*args, **kwargs)
        self.execute_script("window.resizeTo(1280, 1024)")

    def get_screenshot_of_element_as_base_64(self, element):
        """
        Take a screenshot of just a single element and return a
        file-like object (e.g. cStringIO.StringIO).
        """
        # grab the whole window, then crop down to the element's box
        screenshot = self.get_screenshot_as_base64()
        file_like = cStringIO.StringIO(screenshot.decode('base64'))
        img = Image.open(file_like)
        left = element.location['x']
        top = element.location['y']
        right = left + element.size['width']
        bottom = top + element.size['height']
        cropped_img = img.crop((left, top, right, bottom))
        ret_file_like = cStringIO.StringIO()
        cropped_img.save(ret_file_like, format='png')
        ret_file_like.seek(0)  # rewind so callers can read from the start
        return ret_file_like
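Usage would look something like this (the element id is made up):

Code:
browser = FirefoxWithElementScreenShot()
browser.get('http://example.com/signup')               # placeholder URL
captcha = browser.find_element_by_id('captcha_image')  # hypothetical id
png = browser.get_screenshot_of_element_as_base_64(captcha)
Image.open(png).save('captcha.png')
browser.quit()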
The webdriver API is slender (to say the least). If you have an element, you can only get an attribute if you know its name; there's no way to list all the attributes of the current element. You can get the parent element but not children/descendants, so no tree walking (though further searches are scoped to the element, which suffices). You can also get the page source at any given time (after scripts have run) and parse it with lxml or similar, but I feel that shouldn't be necessary. Don't get me wrong though, it is a nice tool.
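
To illustrate both workarounds (untested; assumes browser is a live webdriver instance):

Code:
from lxml import html

# searches started from an element are scoped to its subtree
form = browser.find_element_by_tag_name('form')
fields = form.find_elements_by_tag_name('input')

# for anything the api won't give you (e.g. all attributes of an
# element), parse the live source with lxml instead
tree = html.fromstring(browser.page_source)
attributes = dict(tree.xpath('//form')[0].attrib)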
Splinter looks like it could be a little better, I'll check that out.

I haven't tried it yet, but I assume the Python equivalent of this Java works to add proxies:

Webdriver and proxy server for firefox - Stack Overflow
 
I've been learning Python for about 3-4 hours, and I think even I could manage to throw something together in Splinter. It looks pretty amazing, to be honest.

Might have a play with it in a day or two.
 
Btw, you can get proxies working easily by wrapping the proxy IP in an extra pair of double quotes, like "ip.goes.here.lol".

example would be:

Code:
from selenium import webdriver
from selenium.webdriver import FirefoxProfile

profile = FirefoxProfile()

profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy settings
profile.set_preference('network.proxy.http', '"your.proxy.ip.here"')  # note the doubled quotes
profile.set_preference('network.proxy.http_port', 8080)  # placeholder port
# ... set proxies for ssl as well

browser = webdriver.Firefox(profile)

p.s. How do you create code blocks in this forum?
 
This is exactly what I tried and it doesn't work at the moment. I think it's a bug in the latest selenium release, or maybe just with Firefox 5.
 
I'm running ff6.

Also notice the IP has double quotes inside of single quotes.

As a quick test you can go to the console and print out profile.userPrefs, then open that file and look at it. The proxy IP should be wrapped in "ip" within the javascript file; if it's not, it breaks the load.
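
Quick sketch of that test (untested; the pref value is a placeholder):

Code:
from selenium.webdriver import FirefoxProfile

profile = FirefoxProfile()
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.http', '"your.proxy.ip.here"')
# userPrefs is the path of the prefs file selenium generates;
# open it and check the ip came out wrapped in double quotes
print profile.userPrefs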

If it's saving right, I don't know what the problem is. I think I'm testing on ff4.
 
I'm loving Splinter so far. Not using proxies or anything, just scraping products from some sites; not as fast as gevent + urllib2, but a lot faster and easier to code. For most things, who cares if scraping a big site takes a day.
 
I also made myself a little bot to scrape some sites. The only problem I ran into was that I had to use Firefox/Chrome; I couldn't use a headless browser because the page I was scraping used some JS.

Is anyone aware of any headless browsers for Python that support JS, that are pretty simple to use? I'm having a hard time finding anything worthwhile.
 
Maybe htmlunit. Most of my stuff is one-off right now, so I don't mind a minimised Firefox on my machine. Check out Xvfb to run X apps (that's Firefox) without an actual display.
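
If you'd rather keep it in Python, the pyvirtualdisplay package wraps Xvfb; rough sketch (untested, assumes Xvfb and pyvirtualdisplay are installed):

Code:
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1280, 1024))  # backed by Xvfb
display.start()

browser = webdriver.Firefox()  # opens on the virtual display
browser.get('http://example.com')
print browser.title

browser.quit()
display.stop()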
 
Alright, I'll take a look.

On another note, maybe it's just my machine, but does your Firefox RAM usage increase with every navigated page when using Splinter? It's as if it doesn't free the memory used by the last page.

On my machine it just keeps increasing by around 2-3 MB for every single page change.

I got around it by doing the following inside my loop:

Code:
if int(i) % 10 == 0:
    browser.quit()       # drop the bloated instance
    browser = Browser()  # start a fresh one

It's probably a newbie way to do it, but it works. Once the browser has been reopened, the memory usage starts climbing all over again (hence opening a new browser every 10 page changes).
 
I am currently using 115 MB for that Firefox; it seems to be holding steady. Closing a browser will destroy the session, so it's not a great workaround. Make sure you are using the latest version of Firefox, I guess.
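
If you need the session to survive a restart, you could probably carry the cookies across at the selenium level (untested sketch; assumes browser is a live webdriver instance; Splinter wraps a webdriver, so the idea should translate):

Code:
from selenium import webdriver

cookies = browser.get_cookies()  # save before quitting
browser.quit()

browser = webdriver.Firefox()
browser.get('http://example.com')  # must be on the domain before adding cookies
for cookie in cookies:
    browser.add_cookie(cookie)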