Regex question, PHP

tomaszjot

Dec 22, 2009
I have this HTML and need to grab the price:

Code:
<div class="somediv">                                                                                                                                    <p class="pricediv">  
               £33.11             </p> 
                                             </div>

Price is 33.11 (as you probably see).

I tried to build a regex to get it, with no success. I'm totally rubbish with regex, so I've got a few questions:

1. Should I treat the whitespace between lines in some special way? Like use \n or something to go to the next line?

2. What if the price div is wrapped with <noscript> tag, should it change my approach?

3. I built very simple regex to get price:

Code:
$regex = '/.*?<div class="somediv "><p class="pricediv">(.+?)<\/p<\/div>.*?/';
It doesn't work, of course. I've built working regexes for some other pages, but I'm stuck on this one.

Any help much appreciated.
 


Given the available information, I'd be inclined to just be lazy:

$regex = '/£([0-9\.]*)\s/';
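For what it's worth, here is a rough sketch of how that pattern could be used with preg_match. The $html string is just the markup from the question with the whitespace shortened, and the variable names are made up for illustration. It also assumes the page and the script use the same encoding for the £ sign.

```php
<?php
// Sample markup from the question (whitespace shortened, line breaks as \n).
$html = "<div class=\"somediv\"> <p class=\"pricediv\">  \n   £33.11   </p>\n </div>";

// The "lazy" pattern: a pound sign, then digits/dots, then any whitespace.
$regex = '/£([0-9\.]*)\s/';

$price = null;
if (preg_match($regex, $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] is the captured number.
    $price = $matches[1];
}

echo $price, "\n";
```

Note that the * quantifier would also accept a bare "£" followed by a space and give an empty capture, so checking that $price is not empty is probably a good idea on real pages.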
 
I would solve this with an existing tool, namely PHP Simple HTML DOM Parser.

It looks like once you grab the page in question you may be able to just do something like this:

$ret = $html->find('div[id=foo]');


Unless you had a good reason not to do this (e.g. you're trying to write very lean code), using this DOM parser would be the way I'd go about it.
 
If you want to use the regex, you'll need the 's' modifier (PCRE_DOTALL) so that the . also matches newlines:
Code:
'/regex/s'
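A minimal sketch of the difference, using the markup from the question. The pattern here is one possible repair of the original attempt (closing > restored on the tags, \s* between them), not the only way to write it, and the variable names are hypothetical.

```php
<?php
// Markup from the question, with the line breaks written out as \n.
$html = "<div class=\"somediv\">      <p class=\"pricediv\">  \n"
      . "               £33.11             </p> \n"
      . "                             </div>";

// \s* soaks up whitespace between the tags; (.+?) grabs the paragraph body.
$pattern = '<div class="somediv">\s*<p class="pricediv">(.+?)<\/p>\s*<\/div>';

// Without the s modifier the dot stops at the newline, so nothing matches.
$withoutS = preg_match('/' . $pattern . '/', $html);

// With the s modifier the dot also matches newlines.
$withS = preg_match('/' . $pattern . '/s', $html, $m);

// Strip everything except digits and the decimal point from the capture.
$price = $withS ? preg_replace('/[^0-9.]/', '', $m[1]) : null;

var_dump($withoutS, $withS, $price);
```

The \s class already matches newlines on its own, so the whitespace between tags needs no special treatment. Only the dot needs the modifier.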
 
PHP Simple HTML DOM Parser is a nice tool, but it tends to break down on big pages that are heavy with JavaScript and the like. In those cases use XPath. It has a bit of a learning curve, but it's well worth it if you're going to do any serious scraping.

PHP: DOMXPath - Manual
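Something along these lines should work with PHP's built-in DOM extension. Treat it as a sketch: the $html string stands in for the fetched page, and the XPath expression assumes the class attribute is exactly "pricediv". Only the digits are pulled out of the node text, so the encoding of the £ sign doesn't matter here.

```php
<?php
// Stand-in for the fetched page: the markup from the question.
$html = '<div class="somediv"> <p class="pricediv">' . "\n   £33.11   </p>\n </div>";

// Real-world HTML is rarely valid, so collect libxml warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find every <p> whose class attribute is exactly "pricediv".
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p[@class="pricediv"]');

$price = null;
if ($nodes !== false && $nodes->length > 0) {
    // textContent holds the paragraph text, surrounding whitespace included.
    $text = $nodes->item(0)->textContent;
    // Keep only the numeric part, e.g. 33.11.
    if (preg_match('/[0-9]+(?:\.[0-9]+)?/', $text, $m)) {
        $price = $m[0];
    }
}

echo $price, "\n";
```

If the paragraph carries several classes, contains(@class, "pricediv") would be the looser test to try.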
 
I would leave out the div and just look for the p and the pound sign. The DOM route is probably your best bet for the paragraph.