Question

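I am extracting data from an HTML document that has several <p...>data</p> elements on the same line. I want to extract the data in each paragraph onto its own line. How can I do this? I looked at this answer, but the problem is that it specifies the end with a single character and cannot use a set of characters.

Example: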

<p...> data1 <b>imp</b> data2 </p>

should give me data1 imp data2, but instead it stops at data1 when it hits the < of the bold tag.

Edit: Here is another example: a paragraph containing "Aniket Choudhary to Warner, SIX, .. and Warner makes the most of the free-hit." should give me exactly that text.

Answer 1

Assuming GNU grep (Mac users have BSD grep, with which this will not work):

grep -Poz '<p[^>]*>\K[\S\s]*?(?=<\/p>)'

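This finds <p...> and then "forgets" it because of \K. It then matches lazily until it reaches </p>. If your <p>...</p> blocks are large, this will take a long time to finish.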
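To make the effect of the lazy quantifier concrete, here is a minimal sketch that compares it with a greedy one on a single line. It assumes GNU grep built with PCRE support; the function names and the sample line are hypothetical, and -z is left out because the input has no newlines inside a paragraph.

```shell
# Hypothetical helpers; both need GNU grep with PCRE support (-P).
lazy_paragraphs() {
  # .*? stops at the first </p>, so each paragraph is its own match.
  printf '%s\n' "$1" | grep -Po '<p[^>]*>\K.*?(?=</p>)'
}

greedy_paragraphs() {
  # .* runs to the last </p> on the line, merging the paragraphs.
  printf '%s\n' "$1" | grep -Po '<p[^>]*>\K.*(?=</p>)'
}

# Sample input made up for illustration: two paragraphs on one line.
line='<p class="c">data1 <b>imp</b> data2</p><p>data3</p>'
lazy_paragraphs "$line"     # prints two lines, one per paragraph
greedy_paragraphs "$line"   # prints one line spanning both paragraphs
```

The lazy version keeps the inner <b> tag in its output, because only the surrounding <p...> and </p> are excluded from the match.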

The -o flag makes grep return "only" the matched text you want.

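The -z flag is used so that grep does not stop at the end of each line; instead, it treats each input as running until a null byte is found. If your text contains newlines between <p> and </p>, this should still find it.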
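One practical consequence of -z is that GNU grep also terminates each output match with a null byte rather than a newline. The following sketch, assuming GNU grep with PCRE support, pipes the result through tr to get one match per line, although a match that itself contains a newline will still span several lines. The function name and the sample input are hypothetical.

```shell
# Hypothetical helper; needs GNU grep with PCRE support (-P).
extract_paragraphs() {
  # -z reads the input as one NUL-terminated record, so a match may span lines.
  # Each match is written with a trailing NUL, which tr turns into a newline.
  grep -Poz '<p[^>]*>\K[\S\s]*?(?=<\/p>)' | tr '\0' '\n'
}

# Sample input made up for illustration: the first paragraph spans two lines.
printf '<p>line one\nline two</p><p>second</p>\n' | extract_paragraphs
```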

Note: <p>...stuff here...<p>this here</p>...more here...</p> will return

...stuff here...<p>this here

because it cannot test whether the first <p> contains a nested <p>.

Using grep to extract multiple matches in a line
