
Question

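I have a string that looks like this:

4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE

I'm trying to search the string with a regular expression, find all the character codes, and replace each one with the actual character it represents. For my example string, I want to end up with a string that looks like this:

4000 BCE–5000 BCE and 600 CE–650 CE

I tried to write it in code, but I can't figure out what to write: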

string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";

listof?datatype matches = search through `line` and find all the matches to  "&#.*?;"

foreach (?datatype match in matches){
    int extractedNumber = Convert.ToInt32(Regex.(/*extract the number that is between the &# and the ?*/));

    //convert the number to ascii character
    string actualCharacter = (char) extractedNumber + "";

    //replace character code in original line
    line = Regex.Replace(line, match, actualCharacter); 
}

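Edit

My original string actually has some HTML in it and looks like this:

4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>

I used line = Regex.Replace(note, "<.*?>", string.Empty); to strip the tags, but apparently, according to one of the most popular questions on SO, "RegEx match open tags except XHTML self-contained tags", you really shouldn't use RegEx to remove HTML.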

Answer 1

How about doing it in a delegate replacement?

Edit: As a side note, this is a good regex for removing all tags and script blocks:

<(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

C#:

string line = @"4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
Regex RxCode = new Regex(@"&#([0-9]+);");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        return "" + (char)Convert.ToInt32( match.Groups[1].Value);
    }
);
Console.WriteLine( lineNew );

Output:

4000 BCE-5000 BCE and 600 CE-650 CE

Edit: If you also need to handle the hex form, that can be done as well.

 #  @"&\#(?:([0-9]+)|x([0-9a-fA-F]+));"

 &\#
 (?:
      ( [0-9]+ )                    # (1)
   |  x
      ( [0-9a-fA-F]+ )              # (2)
 )
 ;

C#:

Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");
string lineNew = RxCode.Replace(
    line,
    delegate( Match match ) {
        return match.Groups[1].Success ? 
            "" + (char)Convert.ToInt32( match.Groups[1].Value ) :
            "" + (char)Int32.Parse( match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
    }
);
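
One caveat with the (char) cast used above: a char holds a single UTF-16 code unit, so a reference to a code point above U+FFFF (for example &#128512;) does not survive the cast. Below is a minimal sketch that handles those as well, assuming char.ConvertFromUtf32 is an acceptable substitute for the cast. The names EntityDecoder and DecodeNumericEntities are made up for illustration, and leaving invalid references untouched is a design assumption, not something the answer specifies.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EntityDecoder
{
    // Same pattern as above: decimal (&#8211;) or hexadecimal (&#x2013;) references.
    private static readonly Regex RxCode =
        new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");

    // Hypothetical helper: replaces numeric character references with their characters.
    public static string DecodeNumericEntities(string input)
    {
        return RxCode.Replace(input, match =>
        {
            int codePoint;
            bool parsed = match.Groups[1].Success
                ? int.TryParse(match.Groups[1].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out codePoint)
                : int.TryParse(match.Groups[2].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out codePoint);

            // char.ConvertFromUtf32 throws for values outside 0..0x10FFFF and for
            // surrogates (0xD800..0xDFFF), so such references are left as they are.
            bool valid = parsed
                && codePoint >= 0 && codePoint <= 0x10FFFF
                && (codePoint < 0xD800 || codePoint > 0xDFFF);

            return valid ? char.ConvertFromUtf32(codePoint) : match.Value;
        });
    }
}
```

Calling EntityDecoder.DecodeNumericEntities(line) on the example string replaces both &#8211; references with an en dash, just as the delegate versions do.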

Answer 2

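You do not need any regex to convert XML entity references to literal strings.

Solution 1: XML-valid input

Here is a solution that assumes your input is valid XML.

Add the using System.Xml; namespace and use this method: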

public string XmlUnescape(string escaped)
{
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerXml = escaped;
    return node.InnerText;
}

Use it like this:

var output1 = XmlUnescape("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");

Result:

4000 BCE–5000 BCE and 600 CE–650 CE.
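
Because XmlUnescape parses its argument as an XML fragment, input that is not well-formed (a bare & or an unclosed tag, for instance) makes the InnerXml setter throw an XmlException. Here is a small sketch of a guarded variant, assuming you would rather get a success flag than an exception. TryXmlUnescape and XmlEntityHelper are hypothetical names, not part of System.Xml.

```csharp
using System.Xml;

public static class XmlEntityHelper
{
    // Hypothetical helper: returns true and the unescaped text when the input is a
    // well-formed XML fragment; returns false and hands back the input unchanged otherwise.
    public static bool TryXmlUnescape(string escaped, out string result)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            XmlNode node = doc.CreateElement("root");
            node.InnerXml = escaped; // throws XmlException on malformed input
            result = node.InnerText;
            return true;
        }
        catch (XmlException)
        {
            result = escaped;
            return false;
        }
    }
}
```

A false return value is the signal to fall back to an approach that tolerates invalid XML.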

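Solution 2: Invalid XML input with HTML/XML entities

If you cannot use XmlDocument with your strings because they contain invalid XML syntax, you can use the following method, which relies on HttpUtility.HtmlDecode to convert only the entities that are known HTML and XML entities: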

// Requires: using System.Text.RegularExpressions; using System.Web;
public string RevertEntities(string test)
{
    // Declare a regex (better to initialize it once as a static field for performance)
    Regex rxHttpEntity = new Regex(@"(&[#\w]+;)");
    string last_res = string.Empty; // holds the string as it was after the previous pass
    while (rxHttpEntity.IsMatch(test)) // while the input has something like &#101; or &nbsp;
    {
        // Replace every occurrence of the first matched entity reference with its literal value (&amp; => &)
        string entity = rxHttpEntity.Match(test).Value;
        test = test.Replace(entity, HttpUtility.HtmlDecode(entity.ToLower()));
        if (last_res == test) // Check whether this pass changed the string
            break; // If not, stop processing (some entities, like &ourgreatcompany;, are unsupported)
        else
            last_res = test; // Otherwise, keep checking for entities
    }
    return test;
}

Use it like this:

var output2 = RevertEntities("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
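
HtmlDecode can also be applied to the whole string in a single call, since it leaves text that is not a recognized entity untouched. Here is a sketch assuming .NET Framework 4.0 or later (or .NET Core), where System.Net.WebUtility.HtmlDecode is available without a reference to System.Web. The wrapper names HtmlEntityHelper and DecodeEntities are made up here.

```csharp
using System.Net;

public static class HtmlEntityHelper
{
    // Hypothetical wrapper: decodes named and numeric entities in one pass.
    // Unrecognized references such as &ourgreatcompany; are left as they are.
    public static string DecodeEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

Unlike the loop in RevertEntities, this decodes only one level: a double-escaped &amp;lt; becomes &lt; rather than <. Whether that is desirable depends on how the input was produced.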

Solution 3: HtmlAgilityPack and HtmlEntity.DeEntitize

Download and install HtmlAgilityPack through Manage NuGet Packages for Solution, and use this code to get all the text:

public string getCleanHtml(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}

Then use:

var txt = "4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>";
var clean = getCleanHtml(txt);

Result:

4000 BCE–5000 BCE and 600 CE–650 CE

doc.DocumentNode.InnerText.Substring(doc.DocumentNode.InnerText.IndexOf("\n")).Trim();

You can use LINQ with HtmlAgilityPack, download pages (using var webGet = new HtmlAgilityPack.HtmlWeb(); var doc = webGet.Load(url);), and so on. Best of all, there are no entities left to handle manually.

Regex to replace all ASCII character codes with actual characters
