Question

I have the following text/HTML:

Hello ! You should check this link : http://google.com
And this link too : <a href="http://example.com">http://example2.com</a>

I want a regular expression that captures the URLs in my text so that I can wrap them in <a> tags. I came up with the following regex:

var REG_EXP = /[^">]((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)[^"<]/gi;

But my regex also catches http://example.com and http://example2.com, which are already part of an <a> element. I don't know how to improve it to avoid that.

Answer 1

See this answer: https://stackoverflow.com/a/4217452/1795220. HTML like <a href="http://example.com">http://example2.com</a> is definitely not correct.

Answer 2

This might fit your needs:

(?<!href=")(http://[a-z0-9]++(?:[.-:/?&=][a-z0-9]+)++)(?!</a>)

Note that the URL pattern I use is very simple and permissive:

http://[a-z0-9]+(?:[.-:/?&=][a-z0-9]+)+

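(?<!href=") means "not preceded by href=""
(?!</a>) means "not followed by </a>"
++ is called a possessive quantifier. It never gives back what it has matched, so the engine cannot shorten the URL by a character just to satisfy the (?!</a>) check.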

Then just replace each match with <a href="$1">$1</a>, as in this example.
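
Since the question uses JavaScript, here is a rough sketch of how the pattern above could be ported. JavaScript regular expressions have no possessive quantifiers, so this sketch emulates them with a lookahead plus a backreference, (?=(X))\1, and it assumes an engine with lookbehind support (ES2018). The names URL_OUTSIDE_ANCHOR and linkify are made up for illustration. Also note that [.-:/?&=] is kept as written in the answer: .-: is a character range (., /, the digits, and :), so a literal hyphen is not matched.

```javascript
// Sketch of a JavaScript port, not a definitive implementation.
// (?<!href=")   : the URL must not be preceded by href="
// (?=(...))\1   : capture the longest URL in a lookahead, then consume it
//                 via the backreference; the engine cannot backtrack into
//                 it, which mimics the possessive quantifiers.
// (?!<\/a>)     : the URL must not be followed by </a>
const URL_OUTSIDE_ANCHOR =
  /(?<!href=")(?=(http:\/\/[a-z0-9]+(?:[.-:\/?&=][a-z0-9]+)+))\1(?!<\/a>)/gi;

// Hypothetical helper: wrap every bare URL in an <a> element.
function linkify(text) {
  return text.replace(URL_OUTSIDE_ANCHOR, '<a href="$1">$1</a>');
}

const sample =
  'Hello ! You should check this link : http://google.com\n' +
  'And this link too : <a href="http://example.com">http://example2.com</a>';

console.log(linkify(sample));
```

With the sample text from the question, only the bare http://google.com gets wrapped; the existing <a> element on the second line is left alone.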

Don't expect too much from regular expressions for this kind of job, though; this is not what they are made for.
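
One way to rely less on lookarounds, offered only as a sketch: let a single regex match whole <a>...</a> elements and any other tag first, return those untouched, and wrap only the URLs found in between. The names TOKEN and linkifyOutsideTags are hypothetical, and the URL part (https?:// followed by anything that is not whitespace, <, > or ") is an assumption that is even more permissive than the pattern above.

```javascript
// Alternatives are tried left to right at each position:
//   1. a complete <a ...>...</a> element  -> left untouched
//   2. any other tag                      -> left untouched
//   3. a bare URL (capture group 1)       -> wrapped in <a>
const TOKEN = /<a\b[^>]*>[\s\S]*?<\/a>|<[^>]+>|(https?:\/\/[^\s<>"]+)/gi;

// Hypothetical helper: linkify only the URLs that sit outside of tags.
function linkifyOutsideTags(html) {
  return html.replace(TOKEN, (match, url) =>
    // url is undefined when alternative 1 or 2 matched
    url === undefined ? match : `<a href="${url}">${url}</a>`
  );
}

const sampleHtml =
  'Hello ! You should check this link : http://google.com\n' +
  'And this link too : <a href="http://example.com">http://example2.com</a>';

console.log(linkifyOutsideTags(sampleHtml));
```

This is still regex-based and has the usual limits: it does not handle nested or malformed markup, it does not HTML-escape the URL, and a URL followed directly by punctuation (for example a full stop at the end of a sentence) would swallow that character.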

Find URLs in text, ignoring HTML tags
