Extracting title and article from a webpage

mattseh · Nov 29, 2009

Anyone got a method / code (preferably Python) for extracting the title and article from an arbitrary webpage?

jryan21 · Nov 29, 2009

Look for DOM parsers for Python; here's the one I use for PHP:

PHP: DOM - Manual
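
On the Python side, the standard library's html.parser module can do a similar job without extra installs. A minimal sketch, assuming the title you want is whatever sits in the page's <title> element; the names TitleParser and extract_title are made up for illustration:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the <title> text with whitespace collapsed ('' if there is none)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling extract_title(page_source) on a downloaded page returns the title text, or an empty string if the page has no <title>.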

gutterseo · Nov 30, 2009

Should be very easy even without a DOM parser. Have a look at stripos in PHP.
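
Python has no direct stripos, but a case-insensitive literal search gets you the same thing. A rough sketch, assuming you already know the markers that surround what you want (the function name between is hypothetical):

```python
import re


def between(html, open_marker, close_marker):
    """Return the text between two literal markers, matched case-insensitively.

    Returns None if either marker is missing. The markers are escaped, so they
    are treated as plain strings, not as regex patterns.
    """
    opener = re.compile(re.escape(open_marker), re.IGNORECASE).search(html)
    if opener is None:
        return None
    # Look for the closing marker only after the end of the opening one.
    closer = re.compile(re.escape(close_marker), re.IGNORECASE).search(html, opener.end())
    if closer is None:
        return None
    return html[opener.end():closer.start()].strip()
```

For example, between(page_source, "<title>", "</title>") pulls out the title regardless of how the tags are capitalised. It only works when the opening marker has no attributes, which is where a real parser starts to pay off.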

Deliguy · Nov 30, 2009

What's the footprint for the article text, and is the title of the article the same as what's in the title tags?

You can modify the last few lines of http://www.bluehatseo.com/wp-content/uploads/2006/11/crawlercgi.txt right under the CLOSE(OUTF); to put the title and article into a flat file.

mattseh · Nov 30, 2009

No footprints; this could be any page.

I'm thinking some generic regexes and some analysis should work. Title is nearly done (grabbing h tags' contents). Will post code when I've got something good.
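
In the meantime, here is one way the "grab the h tags' contents" idea could look. This is only a sketch, not the code mentioned above: the rule of picking the first heading of the highest rank present is an assumption, and the names heading_candidates and guess_title are invented for the example.

```python
import re
from html import unescape

# Matches <h1>...</h1> through <h6>...</h6>, with optional attributes.
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.IGNORECASE | re.DOTALL)
# Used to strip nested tags such as <a> or <span> from the heading text.
TAG_RE = re.compile(r"<[^>]+>")


def heading_candidates(html):
    """Return (level, text) pairs for every non-empty heading, in document order."""
    found = []
    for level, inner in HEADING_RE.findall(html):
        text = " ".join(unescape(TAG_RE.sub("", inner)).split())
        if text:
            found.append((int(level), text))
    return found


def guess_title(html):
    """Assumed heuristic: first heading of the highest rank (h1 before h2, ...)."""
    candidates = heading_candidates(html)
    if not candidates:
        return None
    # min() returns the first of several equal-ranked headings.
    return min(candidates, key=lambda pair: pair[0])[1]
```

Regexes like this will trip over malformed markup, so treat the result as a guess to compare against the <title> tag, not as a final answer.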

7figures · Nov 30, 2009

Best way I've found is to scrape 2 pages from that site which both use the same template. Take a blog, for example: every post has the same layout, right? Scrape 2 of them, then compare. The content that appears on both pages will be the menu, footer, etc., and the content that is unique will generally be the content you want. This is easy to do if you grab each line of HTML and put it into an array.
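
A small Python sketch of that comparison, assuming both pages have already been downloaded into strings and that the template markup is line-for-line identical on both (the function name unique_lines is made up):

```python
def unique_lines(page_html, other_html):
    """Return the lines of page_html that do not appear anywhere in other_html.

    Lines are compared after stripping surrounding whitespace, so indentation
    differences between the two pages do not matter.
    """
    shared = {line.strip() for line in other_html.splitlines()}
    kept = []
    for line in page_html.splitlines():
        stripped = line.strip()
        # Skip blank lines and anything the other page also contains.
        if stripped and stripped not in shared:
            kept.append(stripped)
    return kept
```

Running unique_lines(post_a, post_b) should leave mostly the article from post_a. Things like per-page ad IDs or "related posts" blocks will also show up as unique, so some cleanup afterwards is probably needed.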

mattseh · Nov 30, 2009

Very good point, 7figures. I have a title scraper working for most sites. Some blue hat site doesn't work, though: what idiot has their article titles inside a <p>?
