Python Scraping

Rage9

Banned
Jan 7, 2008
Wrote my first scraper in Python today, using requests and lxml (xpath).

Feels good, man. I am a convert. Covered a ton of ground in a short time. Thanks, mattseh.
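For anyone who wants to see what a requests + lxml scraper looks like, here is a minimal sketch. It is not the code from the post above. The function names, the sample HTML, and the XPath expressions are all made up for illustration, and the network call is kept apart from the parsing so the parsing can be tried offline.

```python
import requests
from lxml import html


def run_xpath(page_source, xpath="//a/@href"):
    """Parse an HTML string (or bytes) and return whatever the XPath matches."""
    tree = html.fromstring(page_source)
    return tree.xpath(xpath)


def scrape(url, xpath="//a/@href"):
    """Fetch a page with requests and run an XPath query against it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return run_xpath(response.content, xpath)


# Made-up sample page (an assumption for the demo), so no network is needed.
SAMPLE = """
<html><body>
  <h1>Demo</h1>
  <a href="/one">One</a>
  <a href="/two">Two</a>
</body></html>
"""

if __name__ == "__main__":
    print(run_xpath(SAMPLE))                 # every link target on the page
    print(run_xpath(SAMPLE, "//h1/text()"))  # text of the first-level heading
```

Against a live site you would call scrape() with a real URL and an XPath written for that site's markup.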

 


Requests and NLTK are what got me into Python. Three or four lines of code can get shit done on the first try, where it used to take me a couple of hours of trial and error in PHP and cURL.
 
Right? But on the plus side, I know the shit out of cURL now, lol. It was sort of a bitch to get everything installed on Windows, but so worth it in terms of speed. XPath for the win.
 
If you insist on keeping Windows as your primary OS, at least VirtualBox yourself a 512 MB Ubuntu instance, share a folder with your host OS, ssh into Ubuntu to launch Python, and use your Windows share to edit the code. This is what I do from OS X; frankly, the Python experience on Linux is a league above what any other platform provides. Getting packages and dependencies is rarely more than a matter of "pip install _____", and when it's any more complicated, "apt-get install ____-dev" is what you need the rest of the time.

Welcome to the dark side :) Soon we start the MVC enema.
 
BeautifulSoup kicks ass when you're extracting info from HTML. Check it out.

I did look into it and was going to use it, but it just seemed more confusing than it needed to be. I had dug through some threads here and saw the use of XPath. The only issue was getting it set up, and although I can't compare it directly to BeautifulSoup, I would say XPath is dead simple when you get down to it.

Too much of a hassle, really. I used to keep both an Ubuntu Linux and a Windows development environment, and it just became too much of a hassle. You can do pretty much any Linux stuff on Windows if you're crafty enough. :) Not that long ago I thought about running a Linux environment again (dual boot with Windows), but I fucking hate Ubuntu's Unity, and Mint was giving me major issues. So fuck that. I'm sure it'd work fine from a virtual machine, but I like to immerse myself in it.

Although it would be nice if Python were a bit more Windows-friendly, once you figure out how to install the proper dependencies on Windows it's not that bad (albeit getting there can be a bitch). Most of the time you can just pip install ____, but on occasion packages need to be compiled because they have C counterparts. As I found out, you can normally find pre-compiled versions to install instead, since it's seemingly impossible to install gcc properly and get Python to compile the binaries (you'd think they'd have solved this by now). For those looking for such a resource, check out Python Extension Packages for Windows by Christoph Gohlke. It's a sizable list of pre-compiled libraries, but it doesn't cover everything.

If anyone needs to know how to set up lxml on Windows, I'm happy to write up a guide. It's fairly simple once you figure it out.

Also, thanks mattseh, will look into those; they look useful.
 
You guys spin my head around with how talented you are with this stuff. Always interesting to me.
 
rage, any tips/tricks for someone coming from PHP who is mostly a hack coder/linear-style guy?

I'm looking for advice other than just programming. Things like getting your head around OOP and shit.
 
rage, you could get a $20 / mo linode for python scraping, works well.

Or just, ya know, not, and save $20. Just because you can, doesn't mean you should or need to. ;)

Yeah, want to learn fast? Download other people's stuff and look at how they do it. I'm talking whole site scripts. Then come up with an idea and set out to do it. Bang your head a lot; it hurts, but after you figure it out it'll stick. Don't overthink OOP: classes are nothing more than a bunch of functions wrapped in a wrapper. If you want to learn how to properly set stuff up on Linux, work exclusively in Linux. After a couple of months you'll have it down.
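To make the "bunch of functions wrapped in a wrapper" point concrete for someone coming from linear PHP, here is a rough sketch of the same logic written both ways. The names (new_queue, add_url, UrlQueue) and the example URL are invented for illustration. The only real difference is where the state lives.

```python
# Linear/procedural style: plain functions that are handed the state they work on.
def new_queue():
    return {"pending": [], "seen": set()}


def add_url(queue, url):
    """Queue a URL unless it was already seen; return True if it was added."""
    if url in queue["seen"]:
        return False
    queue["seen"].add(url)
    queue["pending"].append(url)
    return True


# Class style: the same function, but the state rides along as `self`.
class UrlQueue:
    def __init__(self):
        self.pending = []
        self.seen = set()

    def add(self, url):
        """Queue a URL unless it was already seen; return True if it was added."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.pending.append(url)
        return True


if __name__ == "__main__":
    q = new_queue()
    add_url(q, "http://example.com/a")
    add_url(q, "http://example.com/a")  # duplicate, ignored
    print(q["pending"])

    uq = UrlQueue()
    uq.add("http://example.com/a")
    uq.add("http://example.com/a")  # duplicate, ignored
    print(uq.pending)
```

Both versions end up with one pending URL. The class just saves you from passing the dictionary around by hand.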
 
I dunno, I just don't see the point in using languages that are already written... they're so bloated. I just write my own programming languages when I need to scrape a site. Sure it takes a while, but by the end I have a programming language that can pick my boogers.
 
I mean, yeah, that's the way to go if you can. ;)
 
Do you use Scrapy (the Python scraping framework)?

I don't know why I'm even replying to this, but no. Scrapy looks way more complex than it needs to be, at least for most scraping applications.

For hardcore scraping, maybe, but 95%+ of scraping jobs will not fall into that category.
 
Congrats on that. Why don't you use some handier off-the-shelf software for scraping instead of writing it from scratch?