Question

I am trying to scrape IMDb with the following code.

$url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
$html = new simple_html_dom();
$html->load(str_replace('&nbsp;','',$data = get_data($url)));

foreach($html->find('#left') as $total_movies)
{
    $content = $total_movies->plaintext;
    if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
    {
        print_r($matches);
    }
    echo $content."<br>";
}

get_data() is just a cURL function I wrote.

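The problem is that preg_match does not work. I don't know why, since the same thing works when used as shown here, and $content contains the text that the code above searches for.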

$content = "1-50 of 101 titles.";
if(preg_match("/(?<total>[0-9,]+) titles/",$content,$matches))
print_r($matches);

Answer 1

The source on the site is actually:

<div id="left">
1-50 of 564,592
titles.
</div>

Note the \n between the number and "titles.": it must either be stripped from the text or allowed for in your pattern.
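As a minimal sketch of those two options (the sample string is an assumption modeled on the markup shown above, not live IMDb output), you could either let the pattern accept any whitespace before "titles" with \s+, or collapse whitespace first and keep the original pattern:

```php
<?php
// Sample text modeled on the #left div above; the newline sits between the number and "titles."
$content = "\n1-50 of 564,592\ntitles.\n";

// Option 1: let the pattern accept any whitespace (including "\n") before "titles".
if (preg_match('/(?<total>[0-9,]+)\s+titles/', $content, $matches)) {
    echo $matches['total'], "\n"; // 564,592
}

// Option 2: collapse whitespace runs to a single space, then reuse the original pattern.
$normalized = trim(preg_replace('/\s+/', ' ', $content));
if (preg_match('/(?<total>[0-9,]+) titles/', $normalized, $matches2)) {
    echo $matches2['total'], "\n"; // 564,592
}
```

Option 1 is probably the smaller change, since it leaves $content untouched for the later echo.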

Here is a way to achieve the goal without using any extra libraries.

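    <?php
    $url = "http://www.imdb.com/search/title?languages=en|1&explore=year";
    $temp = file_get_contents($url);

    $xml = new DOMDocument();
    @$xml->loadHTML($temp); // @ suppresses warnings from malformed HTML

    $matchs = array();
    foreach($xml->getElementsByTagName('div') as $div) {
        if($div->getAttribute('id')=='left'){
            // Capture only the number after "of", so the newline before "titles." does not matter
            if(preg_match("#of ([0-9,]+)#",$div->nodeValue,$match)){
                $matchs[]=preg_replace('/[^0-9]/', '', $match[1]);
            }
        }
    }

    // Print only if the div was found and the pattern matched
    if(!empty($matchs)){
        echo number_format($matchs[0]); //564,592
    }

    ?>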
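The same idea can be tried offline. The sketch below is not part of the original answer: it swaps the loop over every div for a DOMXPath query, and it uses an inline HTML string (an assumption standing in for the downloaded page) so that it runs without network access.

```php
<?php
// Inline stand-in for the downloaded page (an assumption, so the sketch runs offline).
$html = '<html><body><div id="left">' . "\n1-50 of 564,592\ntitles.\n" . '</div></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from imperfect markup

// Select the div with id="left" directly instead of looping over every div.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="left"]')->item(0);

$total = null;
if ($node !== null && preg_match('#of\s+([0-9,]+)#', $node->nodeValue, $m)) {
    $total = (int) str_replace(',', '', $m[1]); // "564,592" -> 564592
}

echo $total === null ? "no match\n" : number_format($total) . "\n"; // 564,592
```

Keeping $total as an integer and formatting it only for output makes it easier to reuse the count, for example to compute the number of result pages.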

PHP regular expressions in the simple_html_dom library
