Question

I have a large amount of HTML.

With this:

~<div>(?:.*?)<a[\s]+[^>]*?href[\s]?=[\s"\']+(#_ftnref([0-9]+))["\']+.*?>(?:[^<]+|.*?)?</a>(.*?)</div>~si

I capture this:

<div> </div><hr align="left" size="1" width="33%" /><div><p><a title="" href="#_ftnref1">[1]</a> This is not to suggest that there are only two possible arguments to be made in support of  blah blah <em>blah</em>.</p></div>

But! I want this:

<div><p><a title="" href="#_ftnref1">[1]</a> This is not to suggest that there are only two possible arguments to be made in support of  blah blah <em>blah</em>.</p></div>

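Can you help?

PS: (?: ), as opposed to ( ), is used to avoid capturing text. I did this on purpose because I want the returned $matches array to be consistent across several different regexes not mentioned in this post.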

Answer 1

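If lazy matching with .*? does not work, you need to come up with an exclusion pattern. A lazy quantifier only prefers the shortest match from a given starting position; it will still expand past a </div> if that is the only way for the rest of the pattern to succeed.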

(?:(?!</div>).)*

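For example, this only matches within a single div: it consumes one character at a time and stops as soon as it reaches any contained </div>.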
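As a rough sketch of how the exclusion pattern could be dropped into the regex from the question: the code below uses Python's `re` module instead of the original preg functions, on the assumption that these constructs behave the same way in both. The names `HTML`, `ORIGINAL`, `TEMPERED` and `first_match` are made up for this illustration. Only the first `(?:.*?)` is replaced; the rest of the pattern is left as posted.

```python
import re

# Sample input taken from the question.
HTML = (
    '<div> </div><hr align="left" size="1" width="33%" />'
    '<div><p><a title="" href="#_ftnref1">[1]</a> This is not to suggest '
    'that there are only two possible arguments to be made in support of  '
    'blah blah <em>blah</em>.</p></div>'
)

FLAGS = re.S | re.I  # equivalent of the ~...~si modifiers

# Pattern from the question: the lazy .*? after <div> can still cross </div>.
ORIGINAL = re.compile(
    r'''<div>(?:.*?)<a[\s]+[^>]*?href[\s]?=[\s"']+(#_ftnref([0-9]+))["']+.*?>'''
    r'''(?:[^<]+|.*?)?</a>(.*?)</div>''',
    FLAGS,
)

# Same pattern, but the first wildcard refuses to step over a </div>.
TEMPERED = re.compile(
    r'''<div>(?:(?!</div>).)*?<a[\s]+[^>]*?href[\s]?=[\s"']+(#_ftnref([0-9]+))["']+.*?>'''
    r'''(?:[^<]+|.*?)?</a>(.*?)</div>''',
    FLAGS,
)


def first_match(pattern, text):
    """Return the full text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_match(ORIGINAL, HTML))   # starts at the empty first <div>
    print(first_match(TEMPERED, HTML))   # starts at the footnote <div>
```

The group is still non-capturing, so the numbering of the captured groups stays the same as in the original pattern.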

Alternatively, a length constraint can serve as a workaround:

(?:.{0,20})
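A small sketch of the length-constraint variant, again in Python's `re` rather than the original preg functions, with a shortened footnote text. `bounded_pattern` and its `max_gap` parameter are hypothetical names for this illustration. It also shows why this is only a workaround: the limit has to be large enough for the markup between `<div>` and `<a`, but smaller than the distance from an unrelated earlier `<div>`.

```python
import re

# Shortened version of the input from the question.
HTML = (
    '<div> </div><hr align="left" size="1" width="33%" />'
    '<div><p><a title="" href="#_ftnref1">[1]</a> Footnote text.</p></div>'
)
WANTED = '<div><p><a title="" href="#_ftnref1">[1]</a> Footnote text.</p></div>'


def bounded_pattern(max_gap):
    """Build the question's regex with at most max_gap characters between <div> and <a."""
    return re.compile(
        r'<div>(?:.{0,' + str(max_gap) + r'})'
        r'''<a[\s]+[^>]*?href[\s]?=[\s"']+(#_ftnref([0-9]+))["']+.*?>'''
        r'''(?:[^<]+|.*?)?</a>(.*?)</div>''',
        re.S | re.I,
    )


def first_match(pattern, text):
    """Return the full text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for gap in (2, 20, 200):
        print(gap, first_match(bounded_pattern(gap), HTML))
```

With this input, a limit of 20 yields the wanted div, a limit of 2 is too small to get past the `<p>` and matches nothing, and a limit of 200 is loose enough that the match starts at the empty first div again.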
