Question

I want to process HTML from a web page and extract the paragraphs that meet my criteria. The regex flavor is PHP.

Here is the sample web page HTML:

<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>

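A first regex finds everything between the <div class="special"> and </div> tags and puts it into a capture group or variable for use in the next step. The next step is where I'm having trouble. I can't for the life of me write a regex that captures each piece of text between <p> and </p>.

I have tried /<p>(.+?)<\/p>/s, which returns: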

<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>

I would like each paragraph returned separately as an item in an array. The non-greedy ? doesn't seem to work. Any suggestions?

Answer 1

You have to escape the slash in the closing p tag.

So it would be

/<p>(.+?)<\/p>/s
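
To show how that pattern can return each paragraph as its own array item, here is a minimal sketch using preg_match_all. It assumes well-formed input like the sample in the question; the variable names $html and $paragraphs are only illustrative.

```php
<?php
// Sample markup copied from the question.
$html = <<<HTML
<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>
HTML;

// preg_match_all collects every match of the pattern.
// With the default flags, $matches[1] holds the first capture group of each match.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);
$paragraphs = $matches[1];

print_r($paragraphs);
```

With the sample above, $paragraphs ends up with three entries, one per paragraph, without the surrounding tags.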

Answer 2

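So stupid! The regex is flawless. All of the regexes are flawless. The problem was in the input. The HTML file I was processing had the following structure, which is what kept the regex from working.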

<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
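
As a rough illustration of why this structure defeats the pattern, the sketch below runs the same regex over it. The lazy quantifier stops at the first </p>, which only appears at the very end, so everything collapses into a single match. The names $broken and $count are just for this example.

```php
<?php
// Malformed input: the <p> tags are nested and only closed at the end.
$broken = "<p>Some interesting text I would like to extract\n"
        . "<p>More interesting text I would like to extract\n"
        . "<p>Even more interesting text I would like to extract</p></p></p>";

// Same pattern as in the question.
$count = preg_match_all('/<p>(.+?)<\/p>/s', $broken, $matches);

var_dump($count);      // int(1): only one match is found
print_r($matches[1]);  // the single capture still contains the inner <p> tags
```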

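I used var_dump(htmlfile.html) to look at the HTML page I was receiving, but my browser processed it, so I wasn't seeing the raw data. I was able to get the raw data and find my mistake by using: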

include 'filename.php'; 
file_put_contents('filename.php', $data);

Now I know never to trust my browser to show me the raw data again!
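
Another way to inspect raw markup in a browser, which is not what I did above but should have the same effect, is to escape it before printing so the browser shows the tags instead of interpreting them. A minimal sketch, assuming the fetched HTML is in a variable called $data (the sample string here is made up):

```php
<?php
// Hypothetical stand-in for the HTML that was fetched.
$data = '<p>One<p>Two</p></p>';

// htmlspecialchars turns <, > and & into entities, so the browser
// displays the markup literally rather than rendering it.
$escaped = htmlspecialchars($data);

echo '<pre>' . $escaped . '</pre>';
```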

How do I match multiple paragraphs with a regular expression?
