Extracting title and article from a webpage

mattseh

Apr 6, 2009
Anyone got a method / code (preferably Python) for extracting the title and article from an arbitrary webpage?
 


Should be very easy even without a DOM parser. Have a look at stripos in PHP, which does a case-insensitive substring search.
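Since the original question asked for Python, here is a rough sketch of the same stripos-style idea: find the opening and closing tag by case-insensitive string search and slice out what sits between them. The function name extract_between is made up for illustration, and this assumes the page has a single, well-formed title element.

```python
import string

# ASCII-only lowercasing keeps the string the same length, so offsets
# found in the lowered copy are valid in the original text too.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def extract_between(page, open_tag, close_tag):
    """Return the text between the first open_tag and the next close_tag.

    open_tag is given without its closing '>' (for example '<title') so
    that tags carrying attributes still match. Returns None if not found.
    """
    lowered = page.translate(_ASCII_LOWER)
    start = lowered.find(open_tag.lower())
    if start == -1:
        return None
    # Skip past the end of the opening tag, including any attributes.
    start = lowered.find(">", start)
    if start == -1:
        return None
    start += 1
    end = lowered.find(close_tag.lower(), start)
    if end == -1:
        return None
    return page[start:end].strip()
```

Calling extract_between(page, "<title", "</title>") gives the contents of the title tag. It will be fooled by things like a commented-out title, so treat it as a starting point only.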
 
No footprints, this could be any page :) I'm thinking some generic regexes and some analysis should work. Title is nearly done (grabbing the contents of the h tags). Will post code when I've got something good.
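This is not the code promised above, just one guess at what "grabbing h tags' contents" with generic regexes might look like. The names heading_candidates and guess_title are invented for the example, and picking the first highest-ranked heading (h1 before h2, and so on) is an assumption about which heading is the title.

```python
import html
import re

# Matches <h1>...</h1> through <h6>...</h6>, with optional attributes.
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.IGNORECASE | re.DOTALL)
# Matches any tag, used to drop markup nested inside a heading.
TAG_RE = re.compile(r"<[^>]+>")


def heading_candidates(page):
    """Return (level, text) pairs for every non-empty heading, in page order."""
    candidates = []
    for match in HEADING_RE.finditer(page):
        level = int(match.group(1))
        text = html.unescape(TAG_RE.sub("", match.group(2)))
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            candidates.append((level, text))
    return candidates


def guess_title(page):
    """Pick the first heading with the lowest level number, or None."""
    candidates = heading_candidates(page)
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])[1]
```

Regexes on HTML break on nested or malformed markup, and a page that keeps its article title outside the h tags will need a different heuristic.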
 
Best way I've found is to scrape 2 pages from that site which both use the same template. A blog for example, every post has the same layout, right? Scrape 2 of them, then compare. The content that appears on both pages will be the menu, footer and so on, and the content that is unique will generally be the content you want. This is easy to do if you grab each line of HTML and put it into an array.
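A minimal Python sketch of that comparison, assuming both pages are already downloaded as strings and that the template puts shared markup on identical lines. The name unique_lines is made up for the example.

```python
def unique_lines(page_a, page_b):
    """Return the lines of page_a that do not appear anywhere in page_b.

    Lines are compared after stripping surrounding whitespace, and blank
    lines are dropped. Order from page_a is preserved.
    """
    lines_a = [line.strip() for line in page_a.splitlines()]
    shared = {line.strip() for line in page_b.splitlines()}
    return [line for line in lines_a if line and line not in shared]
```

Whatever survives is mostly the article, plus anything else that differs per page (dates, comment counts, the title tag). If the whole page comes back as one or two huge lines, it would need splitting on tags first.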
 
Very good point, 7figures. I have a title scraper working for most sites. Some blue hat site doesn't work, what idiot has their article titles inside a <p>? ;)