From speaking to a few people, it seems a lot of the people who read this forum have heard of xpath, but either don't know how to get started, or don't get why it's such a good idea. So any questions, please ask here. I'm gonna show how to do it in python (using https://github.com/mattseh/python-web ), but it should be similar for all languages.
Dchuk doesn't know yet, but he has volunteered some ruby code samples for how to download a page and grab code samples from it. Someone should do it for php as well.
First example:
Get all imgur links from reddit.com homepage.
Now that we have the page, it's time to figure out the xpath we need.
Picking a random imgur link, we right click it and inspect with firebug:
The basic xpath to get all links on a page is
But that will give us all sorts of crap, so let's be more specific. If we use
We only get links that are within the p tag with the class "title", so now we only have links from stories.
We only care about links to the domain imgur.com, so we can check if the string "imgur.com" is inside the link:
Some links are to the image's page, not the image itself, so we can make sure that the image link contains the extensions jpg gif or png
To run this xpath on our downloaded page we do
An example of it in action:
If anyone has any specific requests that they can out publicly, I'll show the process of constructing the xpath to get the data they need.
Dchuk doesn't know yet, but he has volunteered some ruby code samples for how to download a page and grab code samples from it. Someone should do it for php as well.
First example:
Get all imgur links from reddit.com homepage.
Code:
page = web.grab('http://www.reddit.com/')
Picking a random imgur link, we right click it and inspect with firebug:

The basic xpath to get all links on a page is
Code:
//a/@href
Code:
//p[@class="title"]/a/@href
We only care about links to the domain imgur.com, so we can check if the string "imgur.com" is inside the link:
Code:
//p[@class="title"]/a[contains(@href, "imgur.com")]/@href
Code:
//p[@class="title"]/a[contains(@href, "imgur.com") and (contains(@href, ".png") or contains(@href, ".jpg") or contains(@href, ".gif"))]/@href
Code:
print page.xpath('//p[@class="title"]/a[contains(@href, "imgur.com") and (contains(@href, ".png") or contains(@href, ".jpg") or contains(@href, ".gif"))]/@href')

If anyone has any specific requests that they can out publicly, I'll show the process of constructing the xpath to get the data they need.
