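I have a string that looks like:
4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE
I am trying to use a regular expression to search the string, find all the character codes, and replace each one with the actual character it stands for. For my example string, I want to end up with a string that looks like:
4000 BCE–5000 BCE and 600 CE–650 CE
I tried to write this in code, but I can't figure out what to write: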
string line = "4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
listof?datatype matches = search through `line` and find all the matches to "&#.*?;"
foreach (?datatype match in matches){
int extractedNumber = Convert.ToInt32(Regex.(/*extract the number that is between the &# and the ;*/));
//convert the number to ascii character
string actualCharacter = (char) extractedNumber + "";
//replace character code in original line
line = Regex.Replace(line, match, actualCharacter);
}
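My original string actually contains some HTML and looks like:
4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>
I used line = Regex.Replace(note, "<.*?>", string.Empty); to strip the <small> tags, but apparently, according to one of the most popular questions on SO, "RegEx match open tags except XHTML self-contained tags", you really shouldn't use regular expressions to strip HTML.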
Answer 0 (score: 2)
How about doing this with a delegate replacement? Edit: as a side note, here is a good regex for removing all tags and script blocks:
<(?:script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
C#:
string line = @"4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE";
Regex RxCode = new Regex(@"&#([0-9]+);");
string lineNew = RxCode.Replace(
line,
delegate( Match match ) {
return "" + (char)Convert.ToInt32( match.Groups[1].Value);
}
);
Console.WriteLine( lineNew );
Output:
4000 BCE-5000 BCE and 600 CE-650 CE
Edit: if you also need the hex form, that can be handled as well.
# @"&\#(?:([0-9]+)|x([0-9a-fA-F]+));"
&\#
(?:
( [0-9]+ ) # (1)
| x
( [0-9a-fA-F]+ ) # (2)
)
;
C#:
Regex RxCode = new Regex(@"&#(?:([0-9]+)|x([0-9a-fA-F]+));");
string lineNew = RxCode.Replace(
line,
delegate( Match match ) {
return match.Groups[1].Success ?
"" + (char)Convert.ToInt32( match.Groups[1].Value ) :
"" + (char)Int32.Parse( match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
}
);
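One caveat with the `(char)` cast above is that it only covers code points up to U+FFFF; anything larger is silently truncated. The following is a minimal sketch of the same delegate-replacement idea using `char.ConvertFromUtf32` instead. The class name `EntityDecoder` and method name `Decode` are made up for illustration, and leaving invalid references untouched is an assumed policy, not something the question specifies.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EntityDecoder
{
    // Matches decimal (&#8211;) and hexadecimal (&#x2013;) numeric character references.
    private static readonly Regex NumericEntity =
        new Regex(@"&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));");

    public static string Decode(string input)
    {
        return NumericEntity.Replace(input, match =>
        {
            bool isDecimal = match.Groups[1].Success;
            string digits = isDecimal ? match.Groups[1].Value : match.Groups[2].Value;
            NumberStyles style = isDecimal ? NumberStyles.None : NumberStyles.AllowHexSpecifier;

            int codePoint;
            bool parsed = int.TryParse(digits, style, CultureInfo.InvariantCulture, out codePoint);

            // Assumed policy: keep the reference as-is when it is not a valid
            // Unicode scalar value (out of range, overflow, or a surrogate).
            bool valid = parsed && codePoint >= 0 && codePoint <= 0x10FFFF
                         && (codePoint < 0xD800 || codePoint > 0xDFFF);

            // ConvertFromUtf32 returns a surrogate pair for code points above U+FFFF.
            return valid ? char.ConvertFromUtf32(codePoint) : match.Value;
        });
    }

    public static void Main()
    {
        Console.WriteLine(Decode("4000 BCE&#8211;5000 BCE and 600 CE&#x2013;650 CE"));
    }
}
```

Using `TryParse` rather than `Convert.ToInt32` means an absurdly long digit run does not throw; the reference is just left in place.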
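Answer 1 (score: 1)
You don't need any regex to convert XML entity references to literal strings.
Here is a solution that assumes your input is valid XML.
Add the using System.Xml; namespace and use this method: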
public string XmlUnescape(string escaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerXml = escaped;
return node.InnerText;
}
Use it like this:
var output1 = XmlUnescape("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
Result:
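If you cannot use XmlDocument with your strings because they contain invalid XML syntax, you can use the following method, which uses HttpUtility.HtmlDecode to convert only the entities that are known HTML and XML entities: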
// Requires: using System.Text.RegularExpressions; using System.Web;
public string RevertEntities(string test)
{
    // Declare a regex (for better performance, initialize it once as a field/property of a static class)
    Regex rxHttpEntity = new Regex(@"(&[#\w]+;)");
    string last_res = string.Empty; // holds the string as it was after the previous iteration
    while (rxHttpEntity.IsMatch(test)) // while the input still contains something that looks like an entity reference
    {
        // Replace the first matched entity reference with its literal value (e.g. &amp; => &)
        test = test.Replace(rxHttpEntity.Match(test).Value, HttpUtility.HtmlDecode(rxHttpEntity.Match(test).Value.ToLower()));
        if (last_res == test) // Check whether this iteration changed the string
            break; // If not, stop processing (there are unsupported entities like &ourgreatcompany;)
        else
            last_res = test; // Otherwise, keep checking for entities
    }
    return test;
}
Use it like this:
var output2 = RevertEntities("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
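If a reference to System.Web is not available, a similar result can probably be had by decoding the whole string in one call with `System.Net.WebUtility.HtmlDecode`, which handles named, decimal, and hexadecimal references. This is a sketch of an alternative, not the method used in this answer; the class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Decodes every entity reference that WebUtility recognizes, in a single pass.
    public static string DecodeEntities(string input)
    {
        return WebUtility.HtmlDecode(input);
    }

    public static void Main()
    {
        string decoded = DecodeEntities("4000 BCE&#8211;5000 BCE and 600 CE&#8211;650 CE.");
        Console.WriteLine(decoded);
    }
}
```

Because the whole string is decoded at once, no loop or regex is needed, and the entity text is not lower-cased before decoding.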
Use the Manage NuGet Packages for Solution dialog to download and install HtmlAgilityPack, then use this code to get all the text:
public string getCleanHtml(string html)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}
Then use it like this:
var txt = "4000 <small>BCE</small>&#8211;5000 <small>BCE</small> and 600 <small>CE</small>&#8211;650 <small>CE</small>";
var clean = getCleanHtml(txt);
Result:
doc.DocumentNode.InnerText.Substring(doc.DocumentNode.InnerText.IndexOf("\n")).Trim();
You can use LINQ with HtmlAgilityPack, download pages (with var webGet = new HtmlAgilityPack.HtmlWeb(); var doc = webGet.Load(url);), and so on. Best of all, there are no entities left to handle manually.