Question

Possible duplicate:
How to parse and process HTML with PHP?

Hello, I have scraped a web page:

  <div class="col blue">
        <img  src="/en/media/Dentalscreenresized.jpg" />
        <h4>This is line i want to scrap</h4>
        <p class="date">12 Sep
            <span class="year">2012</span></p>
        <p>13 people were diagnosed with oral cancer after last year&rsquo;s Mouth Cancer Awareness Day. Ring 021-4901169 to arrange for a free screening on the 19th September.</p>
        <p class="readmore"><a href="/en/news/abcd.html">Read More</a></p>
        <p class="rightreadmore"><a href="http://www.xyz.ie/en/news/">See all News&nbsp;&nbsp;&nbsp;</a></p>
    </div>

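Now I want to display the content of the <h4> tag inside the div with class="col blue". I read online that this can be done with regular expressions, but I am not familiar with them... please help.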

Answer 1

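Do not use regular expressions to parse HTML. It may seem harder at first, but use libraries and dedicated solutions instead. You can find plenty of "don't use regex" answers out there.

I recommend Simple HTML DOM, which is simple to use.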

    <?php
    // Include the library first, or str_get_html() will cause a fatal error.
    // This assumes simple_html_dom.php is in the same folder as this PHP file.
    include('simple_html_dom.php');
    $html = str_get_html('<div class="col blue">
            <img  src="/en/media/Dentalscreenresized.jpg" />
            <h4>This is line i want to scrap</h4>
            <p class="date">12 Sep
                <span class="year">2012</span></p>
            <p>13 people were diagnosed with oral cancer after last year&rsquo;s Mouth Cancer Awareness Day. Ring 021-4901169 to arrange for a free screening on the 19th September.</p>
            <p class="readmore"><a href="/en/news/abcd.html">Read More</a></p>
            <p class="rightreadmore"><a href="http://www.xyz.ie/en/news/">See all News&nbsp;&nbsp;&nbsp;</a></p>
        </div>
    ');

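    // The classes are on the <div>, not on the <h4>, so select the h4 inside div.col.blue.
    $h4 = $html->find('div.col.blue h4');
    ?>

Now $h4 contains all h4 elements found inside a div with the classes col and blue.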

Answer 2

Well, as always in life, there are two options here (I assume the scraped page's content is stored in the $content variable):

~~The way of Cthulhu~~ The regex way:

$matches = array();
preg_match_all('#<div class="col blue">.+?<h4>([^<]+)#is', $content, $matches);
var_dump($matches[1]);

The DOM parsing way:

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXpath($dom);
$elements = $xpath->query('//div[@class="col blue"]/h4');
foreach ($elements as $el) {
   var_dump($el->textContent);
}

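The real question, of course, is which way to choose.

The first option is short, concise, and overall very tempting. I admit I'd use it once, twice, or (pony he comes) even more, if I knew that the HTML I work with would always be normalized and that I could cope with its structure suddenly changing in unpredictable ways.

The second option is slightly longer and may look too generic. Yet, in my opinion, it is more flexible and more resilient to changes in the source.

For example, consider what happens if some of the "blue" divs in the source HTML appear without an <h4> element. To work correctly under that condition, the regular expression would have to become much more complicated. And the XPath query? It won't change, not even a bit.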
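To make that last point concrete, here is a minimal sketch that runs both approaches on a made-up page where the first "blue" div has no <h4>. The sample markup (including the "col green" div) and the variable names $regexHeadings and $xpathHeadings are assumptions for illustration only, not part of the original page.

```php
<?php
// Hypothetical sample: the first "col blue" div has no <h4>,
// a "col green" div sits in between, and the last "col blue" div has one.
$content = <<<HTML
<div class="col blue"><p>No heading here</p></div>
<div class="col green">
  <h4>Green heading</h4>
</div>
<div class="col blue">
  <h4>Blue heading</h4>
</div>
HTML;

// Regex way: the lazy ".+?" is free to run past the first div's closing tag,
// so it grabs the next <h4> it meets, whichever div that belongs to.
$matches = array();
preg_match_all('#<div class="col blue">.+?<h4>([^<]+)#is', $content, $matches);
$regexHeadings = $matches[1];

// DOM way: the query only follows real parent/child relationships.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$xpathHeadings = array();
foreach ($xpath->query('//div[@class="col blue"]/h4') as $el) {
    $xpathHeadings[] = trim($el->textContent);
}

print_r($regexHeadings); // includes the heading of the green div by mistake
print_r($xpathHeadings); // only the heading that is really inside a blue div
```

On this input the unchanged XPath query still returns only the blue heading, while the regex would need extra logic to stop at the end of each div.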

Answer 3

Do not use regular expressions to parse/scrape information from HTML; try something like PHP's built-in DOM parser.
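As a rough illustration of what that could look like with the built-in DOM extension alone (no XPath), here is a small sketch. The markup is trimmed from the question, and the $titles variable is just a name chosen for this example.

```php
<?php
// Sample markup trimmed from the question.
$html = '<div class="col blue">
    <img src="/en/media/Dentalscreenresized.jpg" />
    <h4>This is line i want to scrap</h4>
    <p class="readmore"><a href="/en/news/abcd.html">Read More</a></p>
</div>';

libxml_use_internal_errors(true); // real-world HTML is often invalid; keep warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$titles = array();
foreach ($dom->getElementsByTagName('div') as $div) {
    // Exact match on the class attribute, as in the question's markup.
    if ($div->getAttribute('class') !== 'col blue') {
        continue;
    }
    // Collect the text of every <h4> nested inside this div.
    foreach ($div->getElementsByTagName('h4') as $h4) {
        $titles[] = trim($h4->textContent);
    }
}

print_r($titles);
```

Note that the exact string comparison on the class attribute is an assumption that the page always writes the classes as "col blue" in that order.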

Answer 4

Use DOM and XPath. Put your HTML data in $html.

$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xmlElements = simplexml_import_dom($dom);

$divs = $xmlElements->xpath("//div[@class='col blue']");
foreach($divs as $div)
{
  $heading = $div->h4;
  var_dump($heading);
}
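The var_dump() above prints a SimpleXMLElement object rather than plain text. If only the heading text is needed, one possible variation is to cast the element to a string, as in the sketch below. The sample markup is trimmed from the question, $headings is a name chosen for this example, and libxml_use_internal_errors() is used here in place of the @ operator.

```php
<?php
// Sample markup trimmed from the question.
$html = '<div class="col blue">
    <h4>This is line i want to scrap</h4>
    <p class="date">12 Sep <span class="year">2012</span></p>
</div>';

libxml_use_internal_errors(true); // alternative to suppressing warnings with @
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html);
libxml_clear_errors();

$xmlElements = simplexml_import_dom($dom);

$headings = array();
foreach ($xmlElements->xpath("//div[@class='col blue']") as $div) {
    // Skip divs that have no <h4> child at all.
    if (isset($div->h4)) {
        // Casting a SimpleXMLElement to string yields its text content.
        $headings[] = trim((string) $div->h4);
    }
}

print_r($headings);
```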

Additional note:

Don't use regular expressions to parse/scrape info from HTML. It's a bad technique.

How to get specific data from a scraped web page in PHP
