Question

I have built this regex:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)

The first group captures every link in the HTML. The negative lookahead that follows it excludes links that appear inside a tag as an attribute value and links that appear as the content of a tag.
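To make that behaviour concrete, here is a minimal JavaScript sketch. The sample markup and the variable names are assumptions made up for illustration; they are not taken from the regex101 example.

```javascript
// Pattern from the question, with the g flag so every match is returned.
const urlOutsideTags = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)/g;

// Hypothetical sample markup.
const html =
  'See http://free.example/a <b>x</b> ' +
  '<a href="http://attr.example/">http://inside.example/</a> ' +
  '<p>http://para.example/</p>';

// Collect capture group 1 (the URL) from every match.
const found = Array.from(html.matchAll(urlOutsideTags), (m) => m[1]);
console.log(found); // → [ 'http://free.example/a' ]
```

Only the URL in plain text survives: the attribute value is rejected by the first alternative, and the content of both <a> and <p> is rejected by the second.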

I would like only <a> tags to be excluded, so one solution could be to change just the last alternative to:

[^<>]*?<\/a>

But this breaks as soon as there are nested tags, for example <b></b> inside <a>.
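A small sketch of the nested-tag problem, assuming the modified alternative is simply swapped into the original pattern (the sample markup is hypothetical):

```javascript
// Variant from the question: only content directly followed by </a> is excluded.
const onlyAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)/g;

// Hypothetical sample markup.
const sample =
  '<p>http://para.example/</p> ' +
  '<a href="http://attr.example/">http://plain.example/</a> ' +
  '<a href="http://attr2.example/">http://nested.example/ <b>bold</b></a>';

const urls = Array.from(sample.matchAll(onlyAnchors), (m) => m[1]);
console.log(urls); // → [ 'http://para.example/', 'http://nested.example/' ]
```

The URL inside <p> is now matched as intended, but so is the one inside the second <a>: [^<>]*? cannot cross the opening <b>, so the lookahead never reaches </a>.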

Here is the example I am working on: https://regex101.com/r/lM3hC5/6. It should match all URLs except those inside <a> tags (there should be 10 matches).

Negative lookaheads are still tricky for me. I thought the following should work, but it does not:

(?!<a.+?<\/a>)

https://regex101.com/r/hT1cG5/1
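A possible explanation, sketched in JavaScript: a lookahead is tested only at the position where the URL match ends, so (?!<a.+?<\/a>) only asks whether the text immediately after the URL starts with "<a". The pattern below assumes the lookahead was appended to the URL group from the first regex; the exact pattern behind the link above may differ, and the sample strings are made up.

```javascript
// Assumed combination: URL group from the question plus the lookahead that was tried.
const tried = /((https?|ftps?):\/\/[^"<\s]+)(?!<a.+?<\/a>)/g;

// The text right after the URL is "</a>", which does not start with "<a",
// so the negative lookahead succeeds and the URL is still matched.
const insideAnchor = '<a href="#">http://inside.example/</a>';
const r1 = Array.from(insideAnchor.matchAll(tried), (m) => m[1]);
console.log(r1); // → [ 'http://inside.example/' ]

// The lookahead only rejects a position where "<a" follows immediately;
// the engine then backtracks one character and matches a shortened URL.
const beforeAnchor = 'http://x.example/<a href="#">t</a>';
const r2 = Array.from(beforeAnchor.matchAll(tried), (m) => m[1]);
console.log(r2); // → [ 'http://x.example' ]
```

In neither case does the lookahead know that the URL sits between <a ...> and </a>; it only sees what comes after the current position.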

These are the most recent discussions that helped me:

Answer 1

It turns out that the best solution is probably the following:

((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

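It looks like a negative lookahead only works as intended here when it starts with a quantified pattern rather than a literal string. A lookahead is tested only at the position where the URL match ends, so a leading literal such as <a has to appear immediately after the URL, while a leading quantifier can scan ahead. For this case we can effectively only work by backtracking.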

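As before, we first make sure that nothing inside an HTML tag gets matched. The second alternative then works back from </a to the first " symbol: the URL is rejected when </a can be reached without crossing a double quote (" is not a valid URL character, whereas < and > do occur in nested tags).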
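A minimal sketch of how the pattern from this answer behaves on nested tags. The sample markup and variable names are assumptions for illustration only.

```javascript
// Pattern from the answer: reject URLs inside a tag, or followed by </a
// with no double quote in between.
const outsideAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical sample markup with a nested <b> inside <a>.
const markup =
  '<p>http://para.example/</p> ' +
  '<a href="http://attr.example/">http://nested.example/ <b>bold</b></a> ' +
  'http://free.example/';

const kept = Array.from(markup.matchAll(outsideAnchors), (m) => m[1]);
console.log(kept); // → [ 'http://para.example/', 'http://free.example/' ]
```

The URL in <p> is kept because the scan towards </a is stopped by the quote in href="...", while the URL inside <a> is dropped even though a <b> element sits between it and </a>.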

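Nested tags inside <a> tags are now handled correctly as well. Of course, the code is not perfect, but it should work for almost any simple HTML markup. You may need to be a little careful about the following:

putting quotation marks inside <a> tags;
do not use this algorithm on <a> tags without any attributes (placeholders);
and you may want to avoid multiple nested tags/lines, unless the URL inside the <a> tag comes after any double quotes.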
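Two of the caveats above can be reproduced with a short sketch. The helper name findUrls and the sample strings are hypothetical; only the pattern comes from this answer.

```javascript
// Hypothetical helper wrapping the pattern from this answer.
function findUrls(html) {
  const re = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;
  return Array.from(html.matchAll(re), (m) => m[1]);
}

// Baseline: the quote in href="#" stops the scan, so the outer URL is kept.
const withAttr = findUrls('http://before.example/ <a href="#">http://inside.example/</a>');
console.log(withAttr); // → [ 'http://before.example/' ]

// <a> without attributes: no quote stops the scan, so the URL before the
// tag is wrongly rejected as well.
const noAttr = findUrls('http://before.example/ <a>http://inside.example/</a>');
console.log(noAttr); // → []

// Quotes inside <a>: the quote hides </a from the lookahead, so the inner
// URL is wrongly matched.
const quoted = findUrls('<a href="#">http://inside.example/ "quoted" text</a>');
console.log(quoted); // → [ 'http://inside.example/' ]
```

Both failures follow from using the double quote as the only boundary for the scan towards </a.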

Here is a nice, messy example (the last match should not be found, but it is):

https://regex101.com/r/pC0jR7/2

Unfortunately, this lookahead does not work: (?!<a.*?<\/a>)
