Question

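I'm putting together a quick script to scrape a page for some results, and I'm having trouble figuring out how to ignore whitespace and newlines in my regex.

For example, here is how the page presents a result in HTML: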

<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>

How would I change the following regex so that it ignores the spaces and newlines:

$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';

Any help would be appreciated. An explanation of why you did what you did would be greatly appreciated too!

Answer 1

You don't need me to remind you that you are playing with fire by trying to use regex on HTML. Anyway, to answer your question, you can use this regex:

$regex='#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';
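
As a rough sketch of how the `\s*` tokens behave, the snippet below runs a pattern of this shape against the sample markup from the question. The variable names `$captured` and `$matches`, and the lazy `(.*?)` group, are choices made for this illustration rather than something the question requires.

```php
<?php
// Sample markup from the question, with indentation and newlines between tags.
$html = <<<EOF
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
EOF;

// \s* matches zero or more whitespace characters (spaces, tabs, newlines),
// so the tags may be adjacent or separated by any amount of whitespace.
// Using # as the delimiter avoids escaping the / in closing tags.
// Modifier s lets . match newlines; modifier i makes the match case-insensitive.
// (.*?) is lazy, so it stops at the first </p> (an assumption for this sketch).
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

$captured = null;
if (preg_match($regex, $html, $matches)) {
    // $matches[1] holds the contents of the first capture group.
    $captured = $matches[1];
}
echo $captured, "\n";
```

The pattern has no `^` anchor, so it can match anywhere in a larger page rather than only at the very start of the string.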

Update: here is DOM parser based code that gets what you want:

$html = <<< EOF
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
for($i=0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $val = $node->nodeValue;
    echo "$val\n"; // prints: I need to capture this text.
}

Now please don't use regex to parse HTML in your code.
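
If the captured text itself may be padded with whitespace or broken across lines, XPath's `normalize-space()` function can tidy it inside the query. The following is a minimal sketch under assumptions: the markup is a hypothetical variant of the question's sample (wrapped in a `<table>` so the fragment is valid HTML, with the paragraph text split over two lines), and `$text` is just an illustrative name.

```php
<?php
// Hypothetical variant of the sample markup: the text is padded with
// whitespace and split across two lines.
$html = <<<EOF
<table><tr>
<td class="things">
    <div class="stuff">
        <p>
            I need to capture
            this text.
        </p>
    </div>
</td>
</tr></table>
EOF;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// normalize-space() strips leading and trailing whitespace and collapses
// internal runs of whitespace (including newlines) into single spaces.
// evaluate() returns a plain string for a string-valued expression.
$text = $xpath->evaluate("normalize-space(//td[@class='things']/div[@class='stuff']/p)");
echo $text, "\n";
```

Whitespace between the tags never needs handling here, because the parser builds the element tree before the query runs.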

Answer 2

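SimpleHTMLDomParser lets you grab the contents of a selected div, or of elements such as <p>, <h1>, <img>, and so on.

This might be a quicker way to achieve what you are trying to do.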

Answer 3

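The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

The bottom line is that HTML is not a regular language, so regex is not a good fit. You have variations in whitespace, potentially unclosed tags (who is to say the HTML you are scraping will always be correct?), among other challenges.

Instead, use PHP's DomDocument, impress your friends, AND do it the right way every time: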
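
To make that fragility concrete, here is a small sketch comparing a whitespace-tolerant regex with a DOM query on two renderings of the same result. Both renderings are hypothetical (wrapped in a `<table>` so the fragments are valid HTML); the second differs only in quoting style, an extra `id` attribute, and tag case, all of which HTML permits. The names `$variants`, `$regexHits`, and `$domHits` are illustrative.

```php
<?php
// Two hypothetical renderings of the same result row.
$variants = [
    '<table><tr><td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td></tr></table>',
    "<table><tr><TD id=\"r1\" class='things'><div class='stuff' ><p>I need to capture this text.</p></div></TD></tr></table>",
];

// Tolerates whitespace between tags, but nothing else about the markup.
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

$regexHits = [];
$domHits = [];
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
foreach ($variants as $html) {
    // Did the regex find the result in this rendering?
    $regexHits[] = preg_match($regex, $html) === 1;

    // Did the DOM query find exactly one matching <p>?
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
    $domHits[] = $nodes->length === 1;
}
libxml_clear_errors();

var_dump($regexHits, $domHits);
```

The regex should only find the first rendering, while the DOM query should find both, since the parser normalizes tag case and attribute quoting before the query is evaluated.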

    // create a new DOMDocument
    $doc = new DOMDocument();

    // load the string into the DOM
    $doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');

    // since we are working with HTML fragments here, remove <!DOCTYPE 
    $doc->removeChild($doc->firstChild);            

    // likewise remove <html><body></body></html> 
    $doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

    $contents = array();
    //Loop through each <p> tag in the dom and grab the contents
    // if you need to use selectors or get more complex here, consult the documentation
    foreach($doc->getElementsByTagName('p') as $paragraph) {
        $contents[] = $paragraph->textContent;
    } 

    print_r($contents);

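Documentation

PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
PHP's DomElement - http://www.php.net/manual/en/class.domelement.php

This PHP extension is regarded as "standard" and is usually already installed on most web servers, with no third-party scripts or libraries required. Enjoy!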

How to regex scrape HTML and ignore whitespace and newlines in the code?
