Question

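I want to extract the description from a page whose description is in the format below. I believe my code is right, but I can't get it to work.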

$file_string = file_get_contents('');

preg_match('/<div class="description">(.*)<\/div>/i', $file_string, $descr);
$descr_out = $descr[1];

echo $descr_out; 


<div class="description">
<p>some text here</p>
</div>

Answer 1

It looks like you need to turn on single-line mode in your regular expression. Add the s modifier to the pattern:

preg_match('/<div class="description">(.*)<\/div>/si', $file_string, $descr);

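Single-line mode allows the . character to match newline characters. Without it, .* will not match the newlines between the opening and closing div tags.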
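One thing to watch for once the s modifier is on: .* is greedy, so on a page with more than one div it runs to the last closing tag. Here is a minimal sketch comparing the three variants. The two-block sample input is made up for illustration, and the lazy .*? quantifier is my addition, not part of the pattern above.

```php
<?php
// Hypothetical sample input: two description blocks, each spanning several lines.
$html = <<<HTML
<div class="description">
<p>some text here</p>
</div>
<div class="description">
<p>more text</p>
</div>
HTML;

// Without the s modifier, "." stops at newlines, so the pattern finds nothing.
$withoutS = preg_match('/<div class="description">(.*)<\/div>/i', $html, $m1);

// With s, the greedy .* runs all the way to the LAST </div> in the input.
preg_match('/<div class="description">(.*)<\/div>/si', $html, $greedy);

// With s and a lazy .*?, the match stops at the FIRST </div> instead.
preg_match('/<div class="description">(.*?)<\/div>/si', $html, $lazy);

var_dump($withoutS);        // number of matches without the s modifier
echo trim($lazy[1]), "\n";  // contents of the first description block only
```

The lazy version still stops too early if the description itself contains a nested div, which is one more reason a regex is fragile here.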

Answer 2

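I suggest using the DOMDocument class and XPath to extract arbitrary pieces from HTML documents. Regex-based solutions are very brittle when the input changes (extra attributes added in odd places, whitespace, and so on), and the DOM approach stays more readable in more complex scenarios.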

$html = '<html><body><div class="description"><p>some text here</p></div></body></html>';
// or you could fetch external sites 
// $html = file_get_contents('http://example.com');

$doc = new DOMDocument();
// suppress the parse warnings that real-world HTML frequently triggers
libxml_use_internal_errors(true);
$doc->loadHTML($html);
// restore normal error reporting now that the document is parsed and stored in $doc
libxml_use_internal_errors(false);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//div[@class="description"]') as $el) {
    var_dump($el->textContent);
}
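Note that @class="description" only matches when the attribute is exactly that string, so a div with class="description featured" would be skipped. A possible variation, assuming you want to match description as one of several classes, is sketched below. The extra featured class and the $texts array are hypothetical and only there to illustrate the point.

```php
<?php
// Hypothetical input: the div carries a second class, which @class="description" would miss.
$html = '<html><body><div class="description featured"><p>some text here</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();             // discard the collected warnings

$xpath = new DOMXPath($doc);

// Match "description" as one whole token in the space-separated class list.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " description ")]';

$texts = [];
foreach ($xpath->query($query) as $el) {
    // textContent returns the text of the node and its descendants, without tags
    $texts[] = trim($el->textContent);
}

print_r($texts);
```

Padding the class list with spaces before calling contains() keeps the query from also matching unrelated classes such as description-long.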

Can't preg_match the following. What am I doing wrong?
