Question

I am trying to use PHP cURL and preg_match to extract the price from the HTML page linked below. I expect this code to output 4,550, but for some reason I get:

 Notice: Undefined offset: 1 in C:\wamp\www\test.php on line 22

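I think the pattern is correct, because it works if I put the HTML itself in a variable and escape the double quotes. Also, if I output the result (echo $result;), it shows the HTML correctly fetched from the Foxtons website, so I can't figure out why the whole thing fails. I need to get this working, and I would appreciate it if you could tell me why this notice is generated and why my current script does not work.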

$url = "http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717";
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1); 
$result = curl_exec($ch);
curl_exec($ch);
curl_close($ch);
$result2 = str_replace('"', '\"', $result);

$tagname1= ");</script>
    ";
 $tagname2= "</noscript> 
    per month</a>";

$pattern = "/$tagname1(.*?)$tagname2/";
preg_match($pattern, $result, $matches);
$prices = $matches[1];

print_r($prices);

?>

Answer 1

I rewrote the script a bit, because there is more than one <noscript> element on the page. You need to use preg_match_all to find all the matches, not just the first one.



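$url = "http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);          // do not include response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body as a string instead of printing it
$result = curl_exec($ch);                     // fetch the page once
curl_close($ch);

// Non-greedy (.*?) so that each <noscript> element is matched separately.
preg_match_all("/<noscript>(.*?)<\/noscript>/", $result, $matches);
print_r($matches);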

Output



Array
(
    [0] => Array
        (
            [0] => £1,050
            [1] => 4,550
        )

    [1] => Array
        (
            [0] => £1,050
            [1] => 4,550
        )

)

I tried this on my machine and it works. Let me know if it works for you.
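
As for the notice itself: preg_match leaves $matches empty when nothing matches (or when the pattern fails to compile), so reading $matches[1] raises "Undefined offset: 1". The marker strings in the question also contain ")" and "/", which are special inside a "/"-delimited pattern unless they are escaped. Below is a minimal sketch of one way to guard against both, assuming PHP 7.1 or later; the helper name extractBetween and the sample markup are hypothetical and only stand in for the real page.

```php
<?php
// Hypothetical sample standing in for the fetched page; the real markup may differ.
$html = "<script>doSomething();</script>\n    <noscript>4,550</noscript> \n    per month</a>";

/**
 * Return the text between two literal markers, or null when nothing matches.
 * (Illustrative helper, not part of the original script.)
 */
function extractBetween(string $haystack, string $start, string $end): ?string
{
    // preg_quote escapes regex metacharacters and, via its second argument,
    // the "/" delimiter. The "s" modifier lets "." match newlines as well.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';

    // preg_match returns 1 on a match, 0 on no match and false on error,
    // so $matches[1] is only read when a match was actually found.
    if (preg_match($pattern, $haystack, $matches) !== 1) {
        return null;
    }
    return trim($matches[1]);
}

$price = extractBetween($html, '<noscript>', '</noscript>');
echo $price === null ? "no match\n" : $price . "\n";
```

Returning null instead of indexing $matches blindly makes the "no match" case explicit, which is usually easier to debug than a notice.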

Answer 2

Don't use regex to parse HTML. Use an HTML DOM parser instead, such as PHP Simple HTML DOM Parser.

include("simple_html_dom.php") ;

$html = file_get_html("http://www.foxtons.co.uk/search?bedrooms_from=0&property_id=727717");

foreach($html->find('noscript') as $noscript)
{

    echo $noscript->innertext."<br>";
}

This echoes the inner text of each noscript element on the page.
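
If adding a third-party file is not an option, the same idea can be sketched with PHP's built-in DOM extension. This swaps DOMDocument in for Simple HTML DOM; it assumes the dom extension is enabled, and the function name noscriptTexts and the sample markup are hypothetical. In practice $html would be the string returned by curl_exec().

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$html = '<html><body>'
      . '<div><noscript>1,050</noscript> per week</div>'
      . '<div><noscript>4,550</noscript> per month</div>'
      . '</body></html>';

/**
 * Collect the trimmed text of every <noscript> element in an HTML string.
 * (Illustrative helper, not part of the original answer.)
 */
function noscriptTexts(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; buffer libxml warnings
    // instead of letting them be printed, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('noscript') as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

print_r(noscriptTexts($html));
```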
