XPath Questions Thread

mattseh
Apr 6, 2009
From speaking to a few people, it seems a lot of the people who read this forum have heard of XPath, but either don't know how to get started or don't see why it's such a good idea. So if you have any questions, ask them here. I'm gonna show how to do it in Python (using https://github.com/mattseh/python-web ), but it should be similar for all languages.

Dchuk doesn't know it yet, but he has volunteered some Ruby code samples showing how to download a page and grab data from it. Someone should do the same for PHP.

First example:

Get all imgur links from reddit.com homepage.
Code:
page = web.grab('http://www.reddit.com/')
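If you don't have python-web handy, the same fetch-and-parse can be sketched with lxml directly (which, as far as I know, is what python-web wraps; the `grab` name here is mine, not part of the library):

```python
import lxml.html

# a rough stand-in for web.grab(): fetch a URL and parse it in one step
# (my naming; matt's library may differ)
def grab(url):
    return lxml.html.parse(url).getroot()

# offline demo of the same idea, parsing a raw HTML string instead:
page = lxml.html.fromstring(
    '<html><body><a href="http://imgur.com/a.jpg">pic</a></body></html>')
print(page.xpath('//a/@href'))  # ['http://imgur.com/a.jpg']
```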
Now that we have the page, it's time to figure out the xpath we need.
Picking a random imgur link, we right click it and inspect with firebug:
[screenshot: inspecting an imgur link with Firebug]

The basic xpath to get all links on a page is
Code:
//a/@href
But that will give us all sorts of crap, so let's be more specific. If we use
Code:
//p[@class="title"]/a/@href
we only get links that are inside a <p> tag with the class "title", so now we only have links from stories.
We only care about links to the domain imgur.com, so we can check if the string "imgur.com" is inside the link:
Code:
//p[@class="title"]/a[contains(@href, "imgur.com")]/@href
Some links point to the image's page rather than the image itself, so we can also make sure the link contains one of the extensions .jpg, .gif, or .png:
Code:
//p[@class="title"]/a[contains(@href, "imgur.com") and (contains(@href, ".png") or contains(@href, ".jpg") or contains(@href, ".gif"))]/@href
To run this xpath on our downloaded page we do
Code:
print page.xpath('//p[@class="title"]/a[contains(@href, "imgur.com") and (contains(@href, ".png") or contains(@href, ".jpg") or contains(@href, ".gif"))]/@href')
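If you can't run that against live reddit, here's the same final expression end to end over a canned snippet (pure lxml, markup invented to mimic reddit's layout, so treat it as a sketch):

```python
import lxml.html

html = '''
<p class="title"><a href="http://i.imgur.com/cat.jpg">cat</a></p>
<p class="title"><a href="http://imgur.com/gallery/abc">album page</a></p>
<p class="title"><a href="http://i.imgur.com/dog.png">dog</a></p>
<p class="title"><a href="http://example.com/pic.jpg">offsite pic</a></p>'''

page = lxml.html.fromstring(html)
# imgur.com links are kept only if they point at an actual image file
links = page.xpath('//p[@class="title"]/a[contains(@href, "imgur.com") and '
                   '(contains(@href, ".png") or contains(@href, ".jpg") or '
                   'contains(@href, ".gif"))]/@href')
print(links)  # ['http://i.imgur.com/cat.jpg', 'http://i.imgur.com/dog.png']
```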
An example of it in action:
[screenshot: the script printing the list of imgur image links]

If anyone has any specific requests that they can post publicly, I'll show the process of constructing the XPath to get the data they need.


Wait, what are we demonstrating? I'd port your example to Ruby, but it's basically the exact same thing except for how to load and parse a page. We can do other examples though; I accidentally set my alarm an hour early today so I'm ready to fuckin rock.
 
Alright, I'll make a deal with you guys: based on reading about what my library does here (https://github.com/dchuk/Arachnid), come up with a simple scraper idea and I'll code it and dump the code here with comments to explain it. It will use XPath and all that jazz.

An example would be something like what matt posted above (get all imgur links from reddit) or maybe something like all nsfw links from reddit too for some fun.
 
How about a script where you give it a bunch of domains, it crawls all of them, and only outputs pages with < 50 EXTERNAL links?
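The per-page check there is mostly one XPath plus a hostname comparison. A rough sketch of just that filter (crawling left out; the function name and threshold handling are mine, assuming lxml):

```python
import lxml.html
from urllib.parse import urlparse

def external_link_count(html, page_url):
    # a link counts as external if it's absolute and points at a different host
    host = urlparse(page_url).netloc
    doc = lxml.html.fromstring(html)
    return sum(1 for href in doc.xpath('//a/@href')
               if urlparse(href).netloc and urlparse(href).netloc != host)

html = '<a href="/about">internal</a> <a href="http://other.com/">external</a>'
print(external_link_count(html, 'http://mysite.com/'))  # 1
```

A crawler would then just keep pages where `external_link_count(...) < 50`. Note this treats relative links as internal, which is usually what you want.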