Question

I need a regular expression that extracts URLs starting with "http://", "https://", or "www." from an HTML string. However, a URL that appears inside an <a href=...> attribute should be ignored.

I tried the regex @"\b(?:https?://|www\.)\S+\b", but it still picks up the href strings:

var input = "<a href='//www.facebook.com'>www.facebook.com</a><br><br>https://www.amazon.in/<br><br><a href='http://www.google.com'>Testlink</a><br><br>https://in.yahoo.com<img src ='dev.salesrep.ly/Utility/GetLogoV1?CID=xx6&&PID=ukjh&&SID=4a9' height = 0 width = 0>www.ssd.com";

foreach (Match match in Regex.Matches(input, @"\b(?:https?://|www\.)\S+\b"))
{
    Console.WriteLine(match.Value);
}

Expected output

https://www.amazon.in 
https://in.yahoo.com
www.ssd.com

Observed output

https://www.amazon.in
https://in.yahoo.com<img src ='dev.salesrep.ly/Utility/GetLogoV1?CID=xx6&&PID=ukjh&&SID=4a9

Answer 1

Regex is not the right tool for parsing HTML because, well, HTML is not a regular language. A better solution is probably to use an HTML parser such as HtmlAgilityPack. That said...

Option 1: You can make the extraction work by first removing every <a> tag from the input

var input = "issuetesing www.abc.in sdsd <br> <a href='www.facebook.com'>www.facebook.com</a> <br>https://www.flipart.com";

var inputWithoutHyperlinks = Regex.Replace(input, @"<a .*?(/>|</a>)", "");

This removes both <a href='whatever'>whatever</a> and the short form <a href='whatever'/>.

Then search only in this new string.

foreach (Match match in Regex.Matches(inputWithoutHyperlinks, @"\b(?:https?://|www\.)[\w+\.]+"))
{
    Console.WriteLine(match.Value);
}

Result:

www.abc.in
https://www.flipart.com

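Note that I changed your URL-matching regex a little so that it stops at the first character that is not a word character, a plus sign, or a dot (there are certainly better URL regexes).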
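As a variation on Option 1, the two steps (drop the <a> elements, then search) can be folded into a single pass: let the regex consume whole <a>...</a> elements and any other tag without capturing them, and capture only URL-like tokens found in the remaining text. The following is a minimal sketch, not a robust HTML handler. The class name UrlExtractor, the method Extract, and the group name "url" are made up for illustration, and the URL part simply stops at whitespace, angle brackets, or quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlExtractor
{
    // Alternatives are tried left to right at each position:
    //   1. a whole <a ...>...</a> element  (consumed, not captured)
    //   2. any other tag with its attributes (consumed, not captured)
    //   3. a URL-like token in plain text   (captured in the "url" group)
    private static readonly Regex Pattern = new Regex(
        @"<a\b[^>]*>.*?</a>|<[^>]+>|(?<url>\b(?:https?://|www\.)[^\s<>'""]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the URL-like tokens that are outside tags and outside <a> elements.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match match in Pattern.Matches(html))
        {
            // Only the third alternative sets the "url" group.
            if (match.Groups["url"].Success)
                urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        var input = "issuetesing www.abc.in sdsd <br> <a href='www.facebook.com'>www.facebook.com</a> <br>https://www.flipart.com";

        foreach (var url in Extract(input))
            Console.WriteLine(url);
        // Prints:
        // www.abc.in
        // https://www.flipart.com
    }
}
```

Because tags are consumed as a whole, attribute values such as src='...' are skipped as well, not only href. The sketch keeps a trailing slash if the URL has one, and a self-closing <a .../> followed later by a </a> would be swallowed up to that closing tag, so malformed or nested markup is better left to a real parser.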

Option 2: But maybe you want something slightly different: collect all URL-like substrings and then remove those that also occur as the value of an <a href='...'>?

var input = "issuetesing www.abc.in sdsd <br> <a href='www.facebook.com'>www.facebook.com</a> <br>https://www.flipart.com";

var allFoundUrls = new HashSet<string>(Regex.Matches(input, @"\b(?:https?://|www\.)[\w+\.]+").Cast<Match>().Select(m => m.Value));
var allHrefs = new HashSet<string>(Regex.Matches(input, @"(?<=<a\s+href=['""])[^'""]+").Cast<Match>().Select(m => m.Value));

Console.WriteLine("Found hrefs:");
foreach (var url in allHrefs)
    Console.WriteLine("  {0}", url);
    
Console.WriteLine("Found URLs:");
foreach (var url in allFoundUrls)
    Console.WriteLine("  {0}", url);

Console.WriteLine("Found URLs without href:");
foreach (var url in allFoundUrls.Except(allHrefs))
    Console.WriteLine("  {0}", url);

Output:

Found hrefs:
  www.facebook.com
Found URLs:
  www.abc.in
  www.facebook.com
  https://www.flipart.com
Found URLs without href:
  www.abc.in
  https://www.flipart.com
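
For comparison, the HTML-parser route mentioned at the top of this answer could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced (an external dependency, so this does not compile with the base class library alone), and it reuses a deliberately simple URL pattern that is only an illustration. The idea is to let the parser find the text nodes that have no <a> ancestor and to run the URL regex over those text nodes only, so attribute values are never searched.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // external dependency: HtmlAgilityPack NuGet package

class Program
{
    static void Main()
    {
        var input = "issuetesing www.abc.in sdsd <br> <a href='www.facebook.com'>www.facebook.com</a> <br>https://www.flipart.com";

        var doc = new HtmlDocument();
        doc.LoadHtml(input);

        // Text nodes that are not inside an <a> element.
        // Attribute values (href, src, ...) are not text nodes, so they are skipped.
        var textNodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
        if (textNodes == null)
            return; // SelectNodes returns null when nothing matches

        foreach (var node in textNodes)
        {
            // Simple illustrative URL pattern: stops at whitespace, angle brackets, or quotes.
            foreach (Match match in Regex.Matches(node.InnerText, @"\b(?:https?://|www\.)[^\s<>'""]+"))
                Console.WriteLine(match.Value);
        }
    }
}
```

The regex is still used here, but only to recognize URLs inside plain text; deciding what is markup and what is text is left to the parser.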
