PHP Regex Question - Can You Solve It?

Status
Not open for further replies.

plepco

New member
May 24, 2007
New Orleans
www.massindexer.com
I am scraping content from a site, and for the most part things are going okay. 'Cept I have a problem with a portion of the scraped content - I want to remove part of it and keep part of it.

Here's the portion I am dealing with:
Original text:
HTML:
<span class="prefix">DATE:</span> May 25, 2006   <span class="prefix">AUTHOR:</span> <a href="http://samplesite.com/authors/Jack" target="_blank">JACK</a> </p> <p><span class="prefix">JACK SAYS:</span> “blah blah blah etc.”</p></div>

What I want to do is strip everything out so I end up with simply "JACK SAYS: 'blah blah blah'" and none of the other junk.

One reason for the difficulty is that I have like 100 entries that have different dates, different AUTHOR info, etc. and I want to do the same thing to each entry.

(I would post what I've tried already except none of it has worked out.) If you can solve this it will be much appreciated.
 


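preg_match('/<p><span class="prefix">(.+?):.+?(&.+?)</', $subject, $match);

This assumes the quotes are still entities (&ldquo; and &rdquo;) in the raw HTML you scraped; the forum just renders them as curly quotes. That should return

$match[1] = "JACK SAYS";
$match[2] = "&ldquo;blah blah blah etc.&rdquo;";

If you don't want the quotes...
preg_match('/<p><span class="prefix">(.+?):.+?&ldquo;(.+?)&/', $subject, $match);

and you get -
$match[1] = "JACK SAYS";
$match[2] = "blah blah blah etc.";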

hope that answers your question....
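Since the original question mentions around 100 entries, here is a minimal sketch of the same idea applied to every entry at once with preg_match_all. The two sample entries, the &ldquo;/&rdquo; entities, and the variable names are assumptions for illustration, not the real scraped data.

```php
<?php
// Hypothetical sample: two entries in the same layout as the original post.
// The &ldquo;/&rdquo; entities are an assumption about the raw HTML.
$subject = <<<'HTML'
<div><p><span class="prefix">DATE:</span> May 25, 2006 <span class="prefix">AUTHOR:</span> <a href="http://samplesite.com/authors/Jack" target="_blank">JACK</a> </p> <p><span class="prefix">JACK SAYS:</span> &ldquo;blah blah blah etc.&rdquo;</p></div>
<div><p><span class="prefix">DATE:</span> June 2, 2006 <span class="prefix">AUTHOR:</span> <a href="http://samplesite.com/authors/Jill" target="_blank">JILL</a> </p> <p><span class="prefix">JILL SAYS:</span> &ldquo;something else&rdquo;</p></div>
HTML;

// Group 1: the "NAME SAYS" label; group 2: the text between the quote entities.
$pattern = '/<p><span class="prefix">([^<:]+ SAYS):<\/span>\s*&ldquo;(.*?)&rdquo;/is';

$lines = [];
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Build "JACK SAYS: 'blah blah blah etc.'" for each entry.
        $lines[] = $m[1] . ": '" . $m[2] . "'";
    }
}

echo implode("\n", $lines), "\n";
```

With PREG_SET_ORDER each element of $matches is one entry, so the loop does not depend on how many entries the page holds.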
 
This is untested, but it should be very close to working. And I bet there is a faster regex to use than (.*?), but I couldn't see all the variations on the text so I had to use that.

PHP:
<?php

$contents = '<span class="prefix">DATE:</span> May 25, 2006   <span class="prefix">AUTHOR:</span> <a href="http://samplesite.com/authors/Jack" target="_blank">JACK</a> </p> <p><span class="prefix">JACK SAYS:</span> “blah blah blah etc.”</p></div>';

// get author: group 1 is the name in the URL, group 2 is the link text
$regex = "/<span class=\"prefix\">AUTHOR:<\/span> <a href=\"http:\/\/samplesite\.com\/authors\/(.*?)\" target=\"_blank\">(.*?)<\/a>/ims";
if (!preg_match($regex, $contents, $matches)) {
    exit('author not found');
}
$author = $matches[2];

// get dialogue - this assumes that the link text is formatted exactly as the name is
// preg_quote() keeps any regex metacharacters in the name from breaking the pattern
$regex = "/<p><span class=\"prefix\">" . preg_quote($author, '/') . " SAYS:<\/span> (.*?)<\/p><\/div>/ims";
if (!preg_match($regex, $contents, $matches)) {
    exit('dialogue not found');
}
$says = $matches[1];

// if you want to convert entities to their characters, uncomment this line:
#$says = html_entity_decode($says, ENT_QUOTES, 'UTF-8');

// if you want to STRIP out all entities, uncomment this line:
#$says = preg_replace("/&([A-Za-z]+);/i", '', $says);

echo $author . ' says: ' . $says . '<br />';

?>
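On the point about something faster than (.*?): one common alternative is a negated character class. This is only a sketch and has not been benchmarked. It assumes the captured URL part never contains a double quote and the link text never contains a "<", and the shortened sample string is made up for the example.

```php
<?php
// Shortened, hypothetical sample in the same layout as the snippet above.
$contents = '<span class="prefix">AUTHOR:</span> <a href="http://samplesite.com/authors/Jack" target="_blank">JACK</a> </p> <p><span class="prefix">JACK SAYS:</span> some text</p></div>';

// [^"]* cannot run past the closing quote and [^<]* cannot run past the next tag,
// so the engine consumes each run in one pass instead of re-testing the rest of
// the pattern after every single character, as a lazy (.*?) does.
// The ~ delimiter avoids having to escape every forward slash.
$regex = '~<span class="prefix">AUTHOR:</span> <a href="http://samplesite\.com/authors/([^"]*)" target="_blank">([^<]*)</a>~i';

$slug = null;
$author = null;
if (preg_match($regex, $contents, $matches)) {
    $slug = $matches[1];   // name as it appears in the URL
    $author = $matches[2]; // link text
}

echo $slug . ' / ' . $author . "\n";
```

The trade-off is that a negated class stops at the first excluded character, so it fails to match if the text can legitimately contain one.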
 
+rep to both of yahs. Thanks guys...

Ironically I ended up figuring it out for myself a couple minutes before reading both of your replies.

I did this:
PHP:
 $document=preg_replace('/DATE:(.+)<br.*><\/p>/i', '', $document);
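One thing to watch with that pattern is that (.+) and .* are greedy. The sketch below uses a stripped-down, made-up document with two entries on one line (the real markup will differ) to compare the pattern as posted with a lazy variant. The variant is a suggestion, not what was actually used.

```php
<?php
// Hypothetical, simplified document: two entries on a single line.
$document = 'DATE: May 25, 2006 AUTHOR: JACK<br /></p><p>JACK SAYS: one</p>'
          . 'DATE: June 2, 2006 AUTHOR: JILL<br /></p><p>JILL SAYS: two</p>';

// Pattern as posted: the greedy (.+) runs from the first DATE: to the LAST <br ...></p>,
// so everything in between, including the first entry's text, is removed.
$greedy = preg_replace('/DATE:(.+)<br.*><\/p>/i', '', $document);

// Lazy variant: .+? stops at the first <br, and [^>]* stays inside that one tag,
// so each entry is trimmed separately.
$lazy = preg_replace('/DATE:.+?<br[^>]*><\/p>/i', '', $document);

echo $greedy, "\n"; // <p>JILL SAYS: two</p>
echo $lazy, "\n";   // <p>JACK SAYS: one</p><p>JILL SAYS: two</p>
```

If every entry sits on its own line, the posted pattern behaves the same as the lazy one, because "." does not match a newline without the s modifier.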

(Even more ironic than THAT was that I had that stupid song by Alanis Morissette stuck in my head because I heard it in a store yesterday. You know the one about "rain on your wedding day" being "ironic"? That song makes no sense. That's why I listen to Screeching Weasel and Social Distortion. They make so much more sense.)
 
Glad you figured it out. Here is one of the better tips/tricks I've read about regex for some good reading and notes.
Blackhat SEO » Blog Archive » Regex tips and tricks

I'm currently trying to learn Regex and don't wanna spend money on a book so thanks for the link. I've been reading so many of these kinda tutorials over the past couple of days....

Seems like regex is one of the more important and useful things to learn. Besides just learning PHP, I think regex is the next thing to learn. It's impossible to do so many of the things I want to do without it.
 
You bet. PHP.net is a good resource.

 