Question

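I need a proper regex in C# that detects plain-text URLs (http/https/ftp/ftps) in a string and makes them clickable by wrapping each one in an anchor tag with the same URL. I have already written a regex pattern; the code is attached below.

However, if the input string already contains a clickable URL, the code below wraps it in another anchor tag. For example, the existing substring <a href='ftp://www.abc.com'>ftp://www.abc.com</a> in sContent below gets another anchor tag around it when the code runs. Is there any way to fix this?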

        string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

        Regex regx = new Regex("(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

        MatchCollection matches = regx.Matches(sContent);

        foreach (Match match in matches)
        {
            sContent = sContent.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
        }

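Also, I would like a regex that makes email addresses clickable using "mailto" links. I can write that myself, but the double-anchor-tag problem described above would show up there too.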

Answer 1

Try this

Regex regx = new Regex("(?<!(?:href='|>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

It works for your example.

(?<!(?:href='|>)) is a negative lookbehind, which means the pattern matches only if it is not immediately preceded by "href='" or ">".
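To see the lookbehind in isolation, here is a minimal sketch. The URL part of the pattern is deliberately simplified (a scheme followed by anything that is not whitespace, an angle bracket or an apostrophe), and the class and method names are made up for the illustration; only the lookbehind is taken from the regex above.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Assumption: simplified URL body ([^\s<>']+), not the original character class.
    // The lookbehind rejects a candidate that directly follows href=' or >.
    static readonly Regex Url = new Regex(
        @"(?<!href='|>)(?:https?|ftps?)://[^\s<>']+", RegexOptions.IgnoreCase);

    // Returns the URLs that are not already an href value or anchor text.
    public static string[] FindUnlinked(string input)
    {
        MatchCollection matches = Url.Matches(input);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    static void Main()
    {
        string s = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
        foreach (string url in FindUnlinked(s))
        {
            Console.WriteLine(url);
        }
        // prints:
        // ftp://www.abc.com
        // http://www.abc2.com
    }
}
```

Only the third ftp://www.abc.com is reported: the first is preceded by href=' and the second by >.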

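See lookarounds on regular-expressions.info, and especially the zero-width negative lookbehind assertion on msdn.

See something similar on Regexr. I had to remove the alternation from the lookbehind there, but .NET should be able to handle it.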

Update

To make sure that (possible) cases like "<p>ftp://www.def.com</p>" are handled correctly, I improved the regex

Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

The lookbehind (?<!(?:href='|<a[^>]*>)) now checks that the match is preceded neither by "href='" nor by a tag starting with "<a".

The output for the test string

ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com

with this expression is

ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p><a href='ftp://www.def.com'>ftp://www.def.com</a></p> abbbbb <a href='http://www.ghi.com'>http://www.ghi.com</a>

Answer 2

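I noticed in your sample test string that if a link is repeated, e.g. ftp://www.abc.com appears in the string again and is already linked once, the result is that the existing link gets double-anchored. The regex you already have, as well as the one provided by @stema, will work, but you need to change how you replace the matches in the sContent variable.

The following code sample should give you what you want: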

string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

MatchCollection matches = regx.Matches(sContent);

// Walk the matches from last to first so that the indexes of earlier matches
// stay valid while the string grows.
for (int i = matches.Count - 1; i >= 0; i--)
{
    string newURL = "<a href='" + matches[i].Value + "'>" + matches[i].Value + "</a>";

    // Replace only this occurrence, by position, rather than every equal substring.
    sContent = sContent.Remove(matches[i].Index, matches[i].Length).Insert(matches[i].Index, newURL);
}
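For comparison, here is a rough sketch of why String.Replace misbehaves here, next to an alternative that lets Regex.Replace do the positional substitution instead of a hand-written index loop. The URL body is simplified to [^\s<>']+ and the names ReplaceDemo, NaiveLinkify and Linkify are made up for the illustration; $0 in the replacement string stands for the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplaceDemo
{
    // Assumption: simplified URL body ([^\s<>']+) instead of the original character class.
    static readonly Regex Url = new Regex(
        @"(?<!href='|<a[^>]*>)(?:https?|ftps?)://[^\s<>']+", RegexOptions.IgnoreCase);

    // String.Replace rewrites every equal substring, including the already linked ones.
    public static string NaiveLinkify(string input)
    {
        string result = input;
        foreach (Match m in Url.Matches(input))
        {
            result = result.Replace(m.Value, "<a href='" + m.Value + "'>" + m.Value + "</a>");
        }
        return result;
    }

    // Regex.Replace rewrites only the matched positions and does not rescan its own output.
    public static string Linkify(string input)
    {
        return Url.Replace(input, "<a href='$0'>$0</a>");
    }

    static void Main()
    {
        string s = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
        Console.WriteLine(NaiveLinkify(s)); // the existing href value gets an anchor nested inside it
        Console.WriteLine(Linkify(s));      // only the two unlinked URLs are wrapped
    }
}
```

Both approaches find the same two matches; the difference is only in how the replacement is applied.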

Answer 3

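I know I'm late to this party, but there are several problems with the regex that the existing answers don't address. First and most annoyingly, there's that forest of backslashes. If you use C#'s verbatim strings, you don't have to do all that double escaping. And anyway, most of the backslashes aren't needed in the first place.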

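Second, there's this bit: ([\\w+?\\.\\w+])+. The square brackets form a character class, and everything inside them is treated either as a literal character or as a class shorthand like \w. But getting rid of the square brackets isn't enough to make it work. I suspect this is what you were after: \w+(?:\.\w+)+.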
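A quick way to convince yourself of the difference is to anchor both versions and try them on a few strings. This is only an illustrative sketch; the class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // The bracketed version: one or more characters, each being a word character, '+', '?' or '.'.
    static readonly Regex Loose = new Regex(@"^([\w+?\.\w+])+$");

    // The suggested version: a word, then one or more ".word" parts.
    static readonly Regex Strict = new Regex(@"^\w+(?:\.\w+)+$");

    public static bool LooseMatch(string s) { return Loose.IsMatch(s); }
    public static bool StrictMatch(string s) { return Strict.IsMatch(s); }

    static void Main()
    {
        Console.WriteLine(LooseMatch("+?.."));         // True: every character is in the class
        Console.WriteLine(StrictMatch("+?.."));        // False
        Console.WriteLine(LooseMatch("www.abc.com"));  // True
        Console.WriteLine(StrictMatch("www.abc.com")); // True
        Console.WriteLine(StrictMatch("www"));         // False: at least one ".word" part is required
    }
}
```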

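Third, the quantifiers at the end of the regex - ]*)? - are mismatched. The * can already match zero or more characters, so there's no point in making the enclosing group optional as well. On top of that, this kind of arrangement can cause severe performance degradation. See this page for details.

There are some other, minor problems, but I won't go into them right now. Here's the new and improved regex: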

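@"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))"

The negative lookahead - (?!(?>[^<>]*)(>|</a>)) - is what prevents matches inside tags or in the content of anchor elements. The atomic group (?>...) keeps the engine from backtracking into [^<>]*. It's still pretty crude, though. There are several areas, like inside <script> elements, where you don't want it to match but it does. But trying to cover all the possibilities would lead to a mile-long regex.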
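Here is a small sketch of how that regex could be plugged into Regex.Replace. The lookahead is written with an atomic group, (?>[^<>]*), since .NET's regex engine has no possessive-quantifier syntax. The class name Linkifier, the method name Linkify and the single-quoted href in the replacement are choices made for this illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

static class Linkifier
{
    // (?n) turns on explicit capture, so the plain parentheses do not create groups.
    static readonly Regex Url = new Regex(
        @"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))");

    // Wraps every URL that is not inside a tag or anchor content; $0 is the whole match.
    public static string Linkify(string input)
    {
        return Url.Replace(input, "<a href='$0'>$0</a>");
    }

    static void Main()
    {
        string s = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
        Console.WriteLine(Linkify(s));
        // prints:
        // ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abbbbb <a href='http://www.abc2.com'>http://www.abc2.com</a>
    }
}
```

Because the lookahead only inspects what follows the URL, a stray > later in plain text will also suppress a match, which is part of the crudeness mentioned above.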

Answer 4

Check out: Detect email in text using regex and Regex URL Replace, ignore Images and existing Links. Just swap in the regex for links; it will never replace a link inside a tag, only links in the content.

http://html-agility-pack.net/?z=codeplex

Something like:

string textToBeLinkified = "... your text here ...";
const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(textToBeLinkified);

// SelectNodes returns null when nothing matches, so guard before iterating.
var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
    }
}
string linkifiedText = doc.DocumentNode.OuterHtml;

Regex problem when making plain-text URLs clickable
