Removing excess junk from scraped content

Status
Not open for further replies.

ozonew4m

Jul 6, 2007
Has anybody got a PHP function, a PHP snippet, or any good ideas for removing the final bits of whitespace and formatting left over from scraped content?

So if I was left with something like this:

Code:
<p></p>
 
 
<br>  <br>\n       <p></p>
 
<br>
<br>
<br>
<br>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. </p>
 
\n        \n    <br>   <p><p /><p>
 
\n\n\n\n   
 
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut <br>  \n aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. <br>


How can I get rid of all that excess whitespace, newlines, etc. yet still keep some formatting in the text?

Any PHP ideas, please?
 


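One function is strip_tags(), which removes HTML tags.

To kill the whitespace and line breaks:

// Replace tabs and newlines with a space so adjacent words do not run together.
$clean = str_replace(array("\t", "\n"), " ", $clean);
// Collapse runs of spaces into a single space.
$clean = preg_replace('/ {2,}/', ' ', $clean);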
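For what it's worth, here is a minimal sketch of the same idea as one function. The name collapse_whitespace() is made up for this example, and it assumes you want to keep <p> and <br> while dropping every other tag; adjust the allowed-tags string to taste.

```php
<?php
// Hypothetical helper, not from the thread: strip most tags and squeeze whitespace.
function collapse_whitespace(string $html): string
{
    // Keep only paragraph and line-break tags; every other tag is dropped.
    $text = strip_tags($html, '<p><br>');
    // Turn tabs and newlines into spaces so words on adjacent lines stay separated.
    $text = str_replace(array("\t", "\r", "\n"), ' ', $text);
    // Collapse runs of spaces into a single space.
    $text = preg_replace('/ {2,}/', ' ', $text);
    return trim($text);
}

echo collapse_whitespace("<div>Lorem\n\tipsum</div>  <p>dolor</p>");
```

Replacing with a space rather than an empty string is a deliberate choice here: it avoids gluing the last word of one line to the first word of the next.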
 
Yeah, I already know that, but I would like to try to keep some of the formatting, so

Code:
\n        \n    <br>   <p><p /><p>

\n\n\n\n

could possibly become a single
Code:
\n
or
Code:
<br>
or
Code:
<p />

I was thinking of first converting them all into a certain character, then replacing multiples of that character with a single new line, or something like that.

Am I making sense? :updown:
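The "convert everything to one marker, then collapse the repeats" idea could be sketched roughly like this. It assumes the \n in the sample are real newline characters and uses the newline itself as the marker; normalise_breaks() is a name invented for the example.

```php
<?php
// Hypothetical helper: map every break-like tag to "\n", then collapse repeats.
function normalise_breaks(string $html): string
{
    // Step 1: turn <br>, <br />, <p>, </p> and <p /> into a newline marker.
    // The \b keeps tags such as <pre> or <param> from matching.
    $text = preg_replace('#<\s*/?\s*(?:br|p)\b[^>]*>#i', "\n", $html);
    // Step 2: any run of whitespace containing at least one newline becomes a single newline.
    $text = preg_replace('/\s*\n\s*/', "\n", $text);
    return trim($text);
}

echo normalise_breaks("\n   \n  <br>   <p><p /><p>\n\n\nLorem ipsum <br>  \n dolor");
```

Once everything is a single newline, nl2br() or a simple str_replace() can turn the markers back into whichever tag you prefer.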
 
Gotcha.

My only suggestion then would be to use very tight parsing: strip all the junk and formatting, then add the formatting back in a standard way so that everything is uniform.
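One way to read "strip it all, then add it back uniformly" is sketched below. This is only an assumption about what that could look like: every <br> or <p> variant is treated as a paragraph boundary, all other markup is dropped, and each surviving line is wrapped in <p>...</p>. The function name rebuild_paragraphs() is invented for the example.

```php
<?php
// Hypothetical helper: throw away the scraped formatting and rebuild uniform paragraphs.
function rebuild_paragraphs(string $html): string
{
    // Treat every <br> or <p> variant as a paragraph boundary.
    $text = preg_replace('#<\s*/?\s*(?:br|p)\b[^>]*>#i', "\n", $html);
    // Drop whatever other markup is left.
    $text = strip_tags($text);
    $paragraphs = array();
    foreach (explode("\n", $text) as $line) {
        // Squeeze inner whitespace and skip lines that end up empty.
        $line = trim(preg_replace('/\s+/', ' ', $line));
        if ($line !== '') {
            $paragraphs[] = '<p>' . $line . '</p>';
        }
    }
    return implode("\n", $paragraphs);
}

echo rebuild_paragraphs("<p></p>\n <br>  <br>\n<p>First  <b>bold</b> text.</p>\n\n<br>Second<br>");
```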
 
Strip everything except <p> and </p> individually, and also remove the whitespace by killing each line that has nothing on it.

When you're done with the above:

$clean = str_replace("<p></p>", "", $clean);

And you've got a well-formatted article.
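Put together, those three steps might look something like the sketch below. The function name keep_paragraphs_only() is made up, and the last step uses preg_replace() instead of str_replace() so that paragraphs containing only whitespace are caught as well; that is a small extension of the suggestion, not part of it.

```php
<?php
// Hypothetical helper: keep <p> tags only, drop blank lines, remove empty paragraphs.
function keep_paragraphs_only(string $html): string
{
    // Keep only <p> and </p>; every other tag is removed.
    $text = strip_tags($html, '<p>');
    // Kill each line that has nothing on it but whitespace.
    $lines = array_filter(array_map('trim', explode("\n", $text)), 'strlen');
    $text = implode("\n", $lines);
    // Remove empty paragraphs, allowing whitespace between the tags.
    return preg_replace('#<p>\s*</p>\s*#i', '', $text);
}

echo keep_paragraphs_only("<p></p>\n \n<div><p>Lorem <b>ipsum</b></p></div>\n\n<p> </p>\n<p>dolor</p>");
```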
 
I would say it depends on the exact format you usually find the articles in. Look for patterns within patterns: if you see an advertisement pattern with extra formatting around it that is not part of the article, extract that too.
 
I tend to convert all empty paragraphs to <br> tags, then use preg_replace to change all multiple br tags to single tags (preg_replace( '|(<br\s*/?>\s*)+|i', '<br>', $input) ).

I also trim out all \n and put them back in, and run some trim() and the like. It's really a method of approximation: knowing what's going in and knowing what I want to come out.
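As a rough sketch of that routine: empty paragraphs become <br>, runs of <br> collapse to one, and a newline is put back after each surviving break. The name squash_breaks() is invented here, and the patterns assume the breaks are written as <br>, <br/> or <br />, possibly with whitespace between them.

```php
<?php
// Hypothetical helper: empty paragraphs -> <br>, then multiple <br> -> single <br>.
function squash_breaks(string $input): string
{
    // Convert empty paragraphs (<p></p>, <p> </p>, <p />) to <br>.
    $out = preg_replace('#<p\s*/>|<p>\s*</p>#i', '<br>', $input);
    // Collapse any run of <br> tags, with optional whitespace between them, into one,
    // and put a single newline back after it.
    $out = preg_replace('#(?:<br\s*/?>\s*)+#i', "<br>\n", $out);
    return trim($out);
}

echo squash_breaks("<p></p>\n<br>  <br>\n<p />\n<br />Lorem<br><br>ipsum");
```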
 

Aha... that's what I was looking for: something to replace multiples with a single one. Thanks, +rep
Code:
preg_replace('|(<br\s*/?>\s*)+|i', '<br>', $input)

Take some PHP code and then do some stuff to get rid of the things you don't want. Then it will be fixed.

Yeah, now why didn't I think of that :rolleyes:



Thanks for the help, people ;)
 