selenium + python for bots

I also made myself a little bot to scrape some sites. The only problem I ran into was that I had to drive a full Firefox / Chrome window, because the page I was scraping relied on JS and I couldn't find a headless browser that handled it.

Is anyone aware of any headless browsers for Python that support JS and are pretty simple to use? I'm having a hard time finding anything worthwhile.

JavaScript support in HtmlUnit is poor. Try pyvirtualdisplay (see Corey Goldberg's post "Python - Taking Browser Screenshots With No Display (Selenium/Xvfb)") or PyPhantomJS, http://dev.umaclan.com/projects/pyphantomjs (let me know if you have any success with that one). I don't think either could be described as 'simple to use' :)
 


I'm just on the pyvirtualdisplay site now; I came across it when searching for Xvfb. It doesn't seem too complicated as far as I can tell.

On the other hand, I found PyPhantomJS this morning and didn't even attempt to get it working. With only around 10 hours of experience in Python, I'll leave that to somebody else.

Going to try out pyvirtualdisplay later though, I'll let you know how I get on with that :)
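
For reference, a minimal sketch of the pyvirtualdisplay approach suggested above, assuming Xvfb, Firefox, selenium and pyvirtualdisplay are all installed. The function name fetch_title_headless is made up for illustration, and the imports sit inside the function only so the file still loads on a machine that lacks those packages.

```python
def fetch_title_headless(url, width=1024, height=768):
    """Load `url` in a real Firefox running inside an Xvfb virtual display."""
    # Deferred imports (an illustrative choice): both packages are third-party.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    # visible=0 asks pyvirtualdisplay for an invisible Xvfb display.
    display = Display(visible=0, size=(width, height))
    display.start()
    try:
        browser = webdriver.Firefox()
        try:
            browser.get(url)
            return browser.title
        finally:
            browser.quit()  # always close Firefox, even on errors
    finally:
        display.stop()      # always tear down the virtual display


# Usage (needs Xvfb and Firefox on the machine):
#   print(fetch_title_headless("http://example.com"))
```

Firefox still runs in full here, so JS works as usual; only the screen is virtual.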
 
My Firefox instance is currently using 115 MB and seems to be holding steady. Closing the browser destroys the session, so that's not a great workaround. Make sure you are using the latest version of Firefox, I guess.

I'm on FF 5, although I'm currently running this on OS X. I might try it on my Windows machine later to see how it runs there.
 
A Linux VirtualBox machine would be best, I think. I'm sure most of these tools are developed and used mainly on Linux.
 
That's just how firefox is, it's bloated as fuck.

Alright, I'll take a look.

On another note, maybe it's just my machine, but does your Firefox RAM usage increase with every navigated page when using Splinter? It's as if it doesn't free the memory used by the previous page.

On my machine it keeps increasing by around 2-3 MB for every single page change.

I got around it by doing the following inside my loop:

Code:
# Restart the browser every 10 pages so Firefox's memory use starts over.
if int(i) % 10 == 0:
    browser.quit()
    browser = Browser()

It's probably a newbie way to do it, but it works. Once the browser has been reopened, memory usage starts over and climbs again, which is why I open a new browser every 10 page changes.
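
The same workaround could be wrapped up roughly like this. It is only a sketch: scrape_with_recycling and its parameters are made-up names, and make_browser is assumed to be any zero-argument callable (such as Splinter's Browser class) that returns an object with visit(), html and quit().

```python
def scrape_with_recycling(urls, make_browser, recycle_every=10):
    """Visit each URL, replacing the browser after every `recycle_every` pages."""
    urls = list(urls)
    pages = []
    browser = make_browser()
    try:
        for count, url in enumerate(urls, start=1):
            browser.visit(url)
            pages.append(browser.html)
            # Recycle after every Nth page, but not after the very last one.
            if count % recycle_every == 0 and count < len(urls):
                browser.quit()
                browser = make_browser()
    finally:
        browser.quit()  # close whichever browser is still open
    return pages


# Usage with Splinter:
#   from splinter import Browser
#   pages = scrape_with_recycling(my_urls, Browser)
```

Counting from 1 avoids a pointless restart on the very first page. As mentioned earlier in the thread, quitting the browser also throws away the session, so this only suits sites that don't need a login kept alive.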
 
Has anyone had a problem with pulling browser.page_source (browser.html in Splinter) and sometimes getting a WebDriverException? I'm having real trouble pinning down the cause.
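
This doesn't explain the cause, but if the exception really is intermittent, one possible stopgap is to retry the read a few times. A sketch only; the retry helper and its parameters are hypothetical names, not part of Selenium or Splinter.

```python
import time


def retry(action, exceptions, attempts=3, delay=1.0):
    """Call `action()` until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of tries: let the caller see the real error
            time.sleep(delay)  # give the browser a moment before retrying


# Usage with Splinter (assumes `browser` already exists):
#   from selenium.common.exceptions import WebDriverException
#   html = retry(lambda: browser.html, (WebDriverException,))
```

If the error comes back on every attempt, the last exception is re-raised unchanged, so the traceback is still there for debugging.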
 
Selenium 2.3 is out, and all the profile preference setting bugs are fixed, so proxies work if they're not password protected. If they are, you need to set the passwords manually and then copy the temporary profile folder for future use.
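
For the non-password-protected case, setting a proxy through profile preferences might look something like the sketch below. The two helper functions are made up for illustration; the preference names are the standard Firefox about:config proxy settings, and the host and port in the usage note are placeholders.

```python
def proxy_preferences(host, port):
    """Return Firefox preferences for a manual HTTP/HTTPS proxy."""
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": host,
        "network.proxy.http_port": int(port),
        "network.proxy.ssl": host,
        "network.proxy.ssl_port": int(port),
    }


def apply_preferences(profile, prefs):
    """Copy each preference onto an object exposing set_preference(name, value)."""
    for name, value in prefs.items():
        profile.set_preference(name, value)
    return profile


# Usage with Selenium 2.x (assumption: newer releases pass the profile differently):
#   from selenium import webdriver
#   profile = apply_preferences(webdriver.FirefoxProfile(),
#                               proxy_preferences("127.0.0.1", 8080))
#   browser = webdriver.Firefox(firefox_profile=profile)
```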
 
Has anyone ever had Flash crash using Selenium (well, Splinter, which uses Selenium)? This is a site that just happens to use Flash. It works fine on my desktop. On the server without Flash installed, it blocks waiting for user input to install Flash; with Flash installed, Flash crashes. Both machines run Ubuntu 11.04.
 
p.s.

PHP may be a memory hog, but it does not use 125 MB per instance of cURL. Isn't there a way around this?
 
there once was a gay webmaster named uplinked
who liked to spin right-round the butt-rink
but that account's fucked, i can't log in, shucks!
and now here i am under a different monicker making you think i was about to rhyme but then disappointing you

WHY FLASH NO WORK ON MY VPS MAAAN IT DRIVE ME CRAZY.