XPath Questions Thread

mattseh
Apr 6, 2009
From speaking to a few people, it seems a lot of the people who read this forum have heard of XPath, but either don't know how to get started or don't see why it's such a good idea. So if you have any questions, ask them here. I'm gonna show how to do it in Python (using https://github.com/mattseh/python-web ), but it should be similar for all languages.

Dchuk doesn't know it yet, but he has volunteered some Ruby code samples showing how to download a page and grab data from it. Someone should do the same for PHP.

First example:

Get all imgur links from reddit.com homepage.
Code:
page = web.grab('http://www.reddit.com/')
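If you don't have python-web handy, the same fetch-and-parse can be sketched with lxml directly (which, as far as I know, is what python-web wraps; the `grab` name here is mine, not part of the library):

```python
import lxml.html

# a rough stand-in for web.grab(): fetch a URL and parse it in one step
# (my naming; matt's library may differ)
def grab(url):
    return lxml.html.parse(url).getroot()

# offline demo of the same idea, parsing a raw HTML string instead:
page = lxml.html.fromstring(
    '<html><body><a href="http://imgur.com/a.jpg">pic</a></body></html>')
print(page.xpath('//a/@href'))  # ['http://imgur.com/a.jpg']
```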
Now that we have the page, it's time to figure out the xpath we need.
Picking a random imgur link, we right click it and inspect with firebug:
[screenshot: inspecting an imgur link with Firebug]

The basic xpath to get all links on a page is
Code:
//a/@href
But that will give us all sorts of crap, so let's be more specific. If we use
Code:
//p[@class="title"]/a/@href
we only get links that are inside a <p> tag with the class "title", so now we only have links from stories.
We only care about links to the domain imgur.com, so we can check if the string "imgur.com" is inside the link:
Code:
//p[@class="title"]/a[contains(@href, "imgur.com")]/@href
Some links point to the image's page rather than the image itself, so we can also make sure the link contains one of the extensions .jpg, .gif, or .png:
Code:
//p[@class="title"]/a[contains(@href, "imgur.com") and (contains(@href, ".png") or contains(@href, ".jpg") or contains(@href, ".gif"))]/@href
To run this xpath on our downloaded page we do
Code:
print page.xpath('//p[@class="title"]/a[contains(@href, "imgur.com") and (contains(@href, ".png") or contains(@href, ".jpg") or contains(@href, ".gif"))]/@href')
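If you can't run that against live reddit, here's the same final expression end to end over a canned snippet (pure lxml, markup invented to mimic reddit's layout, so treat it as a sketch):

```python
import lxml.html

html = '''
<p class="title"><a href="http://i.imgur.com/cat.jpg">cat</a></p>
<p class="title"><a href="http://imgur.com/gallery/abc">album page</a></p>
<p class="title"><a href="http://i.imgur.com/dog.png">dog</a></p>
<p class="title"><a href="http://example.com/pic.jpg">offsite pic</a></p>'''

page = lxml.html.fromstring(html)
# imgur.com links are kept only if they point at an actual image file
links = page.xpath('//p[@class="title"]/a[contains(@href, "imgur.com") and '
                   '(contains(@href, ".png") or contains(@href, ".jpg") or '
                   'contains(@href, ".gif"))]/@href')
print(links)  # ['http://i.imgur.com/cat.jpg', 'http://i.imgur.com/dog.png']
```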
An example of it in action:
[screenshot: the script printing the list of imgur image links]

If anyone has any specific requests that they can post publicly, I'll show the process of constructing the XPath to get the data they need.


Wait, what are we demonstrating? I'd port your example to Ruby, but it's basically the exact same thing except for how to load and parse a page. We can do other examples though; I accidentally set my alarm an hour early today so I'm ready to fuckin rock.
 
Alright, I'll make a deal with you guys: based on reading about what my library does here (https://github.com/dchuk/Arachnid), come up with a simple scraper idea and I'll code it and dump the code here with comments to explain it. It will use XPath and all that jazz.

An example would be something like what matt posted above (get all imgur links from reddit) or maybe something like all nsfw links from reddit too for some fun.
 
How about a script where you give it a bunch of domains, it crawls all of them, and only outputs pages with < 50 EXTERNAL links?
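The per-page check there is mostly one XPath plus a hostname comparison. A rough sketch of just that filter (crawling left out; the function name and threshold handling are mine, assuming lxml):

```python
import lxml.html
from urllib.parse import urlparse

def external_link_count(html, page_url):
    # a link counts as external if it's absolute and points at a different host
    host = urlparse(page_url).netloc
    doc = lxml.html.fromstring(html)
    return sum(1 for href in doc.xpath('//a/@href')
               if urlparse(href).netloc and urlparse(href).netloc != host)

html = '<a href="/about">internal</a> <a href="http://other.com/">external</a>'
print(external_link_count(html, 'http://mysite.com/'))  # 1
```

A crawler would then just keep pages where `external_link_count(...) < 50`. Note this treats relative links as internal, which is usually what you want.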