Question

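How can I replace all href tags in a string, for example:

<a href="http://thedomain.com/about">Link Title</a> and <a href="http://anotherlink.com">Another Link</a>

...with the tag content followed by the URL in brackets:

Link Title [http://thedomain.com/about] and Another Link [http://anotherlink.com]

Uppercase A HREF and uppercase /A should also be allowed.

This will be used to reformat hyperlinks when sending plain-text email.

RegEx is fine. Similar: Replace Hyperlink with Plain-Text URL Using REGEX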

Answer 1

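This C# regex and replacement pattern worked in my tests using Expresso. The regex options specify case-insensitive matching, as you asked, and also ignore pattern whitespace, which I left in the pattern to make it easier to read.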

using System;
using System.Text.RegularExpressions;

string inputText = "your text here";
string rx = "<a\\s+ .*? href\\s*=\\s*(?:\"|') (?<url>.*?) (?:\"|') .*?> (?<anchorText>.*?) \\</a>";
Regex regex = new Regex( rx, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );
string regexReplace = "${anchorText} [${url}]";

string result = regex.Replace( inputText, regexReplace );
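As a quick check, the pattern above can be wrapped in a small helper and run against the sample from the question. This is only a sketch: the class name `AnchorRewriter` and the method `Rewrite` are hypothetical, and the pattern is the same one written as a verbatim string (without the backslash before `</a>`, which should not be needed).

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorRewriter
{
    // Same pattern as in the answer; spaces inside it are ignored
    // because of RegexOptions.IgnorePatternWhitespace.
    static readonly Regex Anchor = new Regex(
        @"<a\s+ .*? href\s*=\s*(?:""|') (?<url>.*?) (?:""|') .*?> (?<anchorText>.*?) </a>",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Replaces each matched anchor with "anchor text [url]".
    public static string Rewrite(string html)
    {
        return Anchor.Replace(html, "${anchorText} [${url}]");
    }
}
```

Calling `AnchorRewriter.Rewrite` on the question's example, with the second link written as `<A HREF="http://anotherlink.com">Another Link</A>`, should give `Link Title [http://thedomain.com/about] and Another Link [http://anotherlink.com]`.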

Answer 2

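Full replacement

After some back and forth, I am posting this solution. Use it or not; it is mostly here for my own current or future reference. Surprisingly, the tag-att-val part covers nearly every use case. Even so, parsing HTML with a regex is not recommended. If you do use one, it should be reasonably accurate, and this one is.

A C# code sample can be found here - http://ideone.com/TBxXm
It was debugged in VS2008 against the source of the CNN.com page, and a working copy was then pasted to ideone for a permanent link.

Here is the regex, lightly commented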

<a 
  (?=\s) 

  # Optional preliminary att-vals (should prevent overruns)
  (?:[^>"']|"[^"]*"|'[^']*')*?

  # HREF, the attribute we're looking for
  (?<=\s) href \s* =

     # Quoted attr value (only)
     # (?> \s* (['"]) (.*?) \1 )
     # ---------------------------------------
     # Or,
     # Unquoted attr value (only)
     # (?> (?!\s*['"]) \s* ([^\s>]*) (?=\s|>) )
     # ---------------------------------------
     # Or,

  # Quoted/unquoted attr value (empty-unquoted value is allowed)
  (?: (?>             \s* (['"]) (?<URL>.*?)     \1       )
    | (?> (?!\s*['"]) \s*        (?<URL>[^\s>]*) (?=\s|>) )   
  )

  # Optional remaining att-vals
  (?> (?:".*?"|'.*?'|[^>]?)+ )

  # Non-terminated tag
  (?<!/)
>
(?<TEXT>.*?)
</a \s*>

And here it is as it appears in the C# source

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;


namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = @"
               <a asdf = href=  >BLANK</a>
               <a href= a""'tz target=_self >ATZ</a>
               <a href=/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1 target=""_self"">Last missing U.S. soldier in Iraq ID'd</a>
               <a id=""weatherLocBtn"" href=""javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);""><span>Go</span></a>
               <a href=""javascript:CNN_handleOverlay('profile_signin_overlay')"">Log in</a>
               <a no='href' here> NOT FOUND </a>
               <a this href= is_ok > OK </a>
            ";
            string regex = @"
               <a 
                 (?=\s) 
                 (?:[^>""']|""[^""]*""|'[^']*')*?
                 (?<=\s) href \s* =
                 (?: (?>              \s* (['""]) (?<URL>.*?)     \1       )
                   | (?> (?!\s*['""]) \s*         (?<URL>[^\s>]*) (?=\s|>) )   
                 )
                 (?> (?:"".*?""|'.*?'|[^>]?)+ )
                 (?<!/)
               >
               (?<TEXT>.*?)
               </a \s*>
            ";
            string output = Regex.Replace(input, regex, "${TEXT} [${URL}]",
                                RegexOptions.IgnoreCase |
                                RegexOptions.Singleline |
                                RegexOptions.IgnorePatternWhitespace);

            Console.WriteLine(input+"\n------------\n");
            Console.WriteLine(output);
        }
    }
}

With output

           <a asdf = href=  >BLANK</a>
           <a href= a"'tz target=_self >ATZ</a>
           <a href=/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1 target="_self">Last missing U.S. soldier in Iraq ID'd</a>
           <a id="weatherLocBtn" href="javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);"><span>Go</span></a>
           <a href="javascript:CNN_handleOverlay('profile_signin_overlay')">Log in</a>
           <a no='href' here> NOT FOUND </a>
           <a this href= is_ok > OK </a>

------------

           BLANK []
           ATZ [a"'tz]
           Last missing U.S. soldier in Iraq ID'd [/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1]
           <span>Go</span> [javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);]
           Log in [javascript:CNN_handleOverlay('profile_signin_overlay')]
           <a no='href' here> NOT FOUND </a>
            OK  [is_ok]
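
The output above still contains nested markup (`<span>Go</span>`) and the padding around ` OK `. For a plain-text email, one option is to clean the captured text in a `MatchEvaluator` instead of using a replacement string. The following is a minimal sketch that reuses the pattern from this answer; the names `LinkFormatter` and `ToPlainText` are hypothetical, and the tag-stripping regex `<[^>]+>` is a deliberately crude assumption.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkFormatter
{
    // Pattern copied from the answer above (pattern whitespace is ignored).
    const string AnchorPattern = @"
        <a
          (?=\s)
          (?:[^>""']|""[^""]*""|'[^']*')*?
          (?<=\s) href \s* =
          (?: (?>              \s* (['""]) (?<URL>.*?)     \1       )
            | (?> (?!\s*['""]) \s*         (?<URL>[^\s>]*) (?=\s|>) )
          )
          (?> (?:"".*?""|'.*?'|[^>]?)+ )
          (?<!/)
        >
        (?<TEXT>.*?)
        </a \s*>";

    public static string ToPlainText(string html)
    {
        return Regex.Replace(html, AnchorPattern, m =>
        {
            // Drop tags nested inside the anchor text, then trim the padding.
            string text = Regex.Replace(m.Groups["TEXT"].Value, "<[^>]+>", "").Trim();
            return text + " [" + m.Groups["URL"].Value + "]";
        },
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    }
}
```

With this variation, an input such as `<a this href= is_ok > OK </a>` should come out as `OK [is_ok]` rather than ` OK  [is_ok]`.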

Cheers!

Answer 3

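Trying to parse HTML with a regular expression is generally not a good idea, because of how complex a regex has to be to cover every possible case. Of course, if you only need to parse a small string, it may be acceptable.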
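
To make the risk concrete, here is a small sketch of how a simple pattern can overrun. It uses the short pattern from Answer 1 on an input where an anchor without `href` comes first; the class name `OverrunDemo` and the test input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class OverrunDemo
{
    // The short pattern from Answer 1, written as a verbatim string.
    static readonly Regex Anchor = new Regex(
        @"<a\s+ .*? href\s*=\s*(?:""|') (?<url>.*?) (?:""|') .*?> (?<anchorText>.*?) </a>",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    public static string Rewrite(string html)
    {
        return Anchor.Replace(html, "${anchorText} [${url}]");
    }
}
```

For the input `<a name="top">Top</a> <a href="http://x">X</a>`, the lazy `.*?` after `<a\s+` keeps expanding past the first anchor until it finds `href` in the second one, so the whole string is treated as a single match and the result is `X [http://x]`. The text `Top` is silently lost.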

A better option is to use a parser instead: http://roberto.open-lab.com/2010/03/04/a-html-sanitizer-for-c/

See also the answer here: RegEx match open tags except XHTML self-contained tags

Edit

OK, here is one way to do it with HtmlAgilityPack:

using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.Load(@"c:\test.html");

        // Select every <a> element that has an href attribute
        var listofHyperLinkTags = from hyperlinks in htmlDoc.DocumentNode.Descendants()
                                  where hyperlinks.Name == "a" &&
                                        hyperlinks.Attributes["href"] != null
                                  select new
                                  {
                                      Address = hyperlinks.Attributes["href"].Value,
                                      LinkTitle = hyperlinks.InnerText
                                  };

        // Print each link as "title [url]"
        foreach (var linkDetail in listofHyperLinkTags)
            Console.WriteLine(linkDetail.LinkTitle + " [" + linkDetail.Address + "]");

        Console.Read();
    }
}

If LINQ is not an option, use an XPath expression

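var anchorTags = htmlDoc.DocumentNode.SelectNodes("//a");

// SelectNodes returns null when nothing matches
if (anchorTags != null)
{
    foreach (var tag in anchorTags)
    {
        // process each anchor tag here
    }
}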

If you want to modify the document, use the following (there may be a better way)

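// Inside the loop over the anchor tags: swap each anchor for a text node.
// ReplaceChild keeps the link text at its original position in the parent.
var parentNode = tag.ParentNode;

HtmlNode node = htmlDoc.CreateTextNode(
    tag.InnerText + " [" + tag.GetAttributeValue("href", "") + "]");

parentNode.ReplaceChild(node, tag);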
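
Since the question starts from a string rather than a file, the pieces above could be combined roughly as follows. This is a sketch under assumptions: it needs the HtmlAgilityPack NuGet package, the names `HtmlLinkFlattener` and `Flatten` are hypothetical, and it swaps each anchor for a text node via `ReplaceChild` so the text stays where the link was.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the base class library

static class HtmlLinkFlattener
{
    public static string Flatten(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return html;

        foreach (HtmlNode a in anchors)
        {
            string url = a.GetAttributeValue("href", "");
            // InnerText drops nested tags such as <span>; Trim removes padding.
            HtmlNode text = doc.CreateTextNode(a.InnerText.Trim() + " [" + url + "]");
            a.ParentNode.ReplaceChild(text, a);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the parser normalizes element and attribute names, uppercase `<A HREF=...>` links should be handled without any extra option.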
