I have an XML file that looks like this:
<root>
</ some junk that is removed in regex
<element>
<other>This is some text<other>
</element>
<element>
<something>H<sub>2</sub>0</somthing>
</element>
<?? more junt that is removed by the regex
<element>
<else>more<sub>-</sub>text
</element>
</root>
I have the following code that runs through the XML file line by line and does some cleanup:
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public void Main()
{
    string filename = @"C:\InnerTags.xml";
    string config = @"C:\RegexConfig.xml";
    string outputfn = @"C:\output.xml";

    // Load the pattern/replacement pairs from the config file.
    XmlDocument xdoc = new XmlDocument();
    xdoc.Load(config);
    XmlElement xmlRoot = xdoc.DocumentElement;
    XmlNodeList xmlNodes = xmlRoot.SelectNodes("/root/line");

    using (FileStream fs = File.Open(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    using (StreamWriter writer = new StreamWriter(outputfn))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // Apply every configured replacement to the current line.
            string output = line;
            foreach (XmlNode node in xmlNodes)
            {
                string pattern = node["pattern"].InnerText;
                string replacement = node["replacement"].InnerText;
                output = Regex.Replace(output, pattern, replacement);
            }

            // Skip lines that the replacements emptied out.
            if (output.Length > 0)
            {
                writer.WriteLine(output);
            }
        }
        // No explicit writer.Close() needed: the using block disposes the writer.
    }
}
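This is meant to clean up the file and remove some junk lines.
I have now found that the file also contains a lot of HTML tags, such as sub, sup, and so on.
I would like to modify this script so that it encodes known HTML tags, such as the ones in this list: https://msdn.microsoft.com/en-us/library/system.web.ui.htmltextwritertag(v=vs.110).aspx
while still leaving the XML tags intact.
So the output would be: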
<root>
</ some junk that is removed in regex
<element>
<other>This is some text<other>
</element>
<element>
<something>H&lt;sub&gt;2&lt;/sub&gt;0</somthing>
</element>
<?? more junt that is removed by the regex
<element>
<else>more&lt;sub&gt;-&lt;/sub&gt;text
</element>
</root>
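But to stress it again: I don't want just these two tags. I want any tag from the list, so italic, bold, br, and so on.
How can I achieve this?
If I simply encode the file line by line, it also encodes the XML tags, which is even worse.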
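
For illustration, here is a minimal sketch of one way this might be approached: build a single regex from a whitelist of HTML tag names and HTML-encode only the matches, so the XML tags are left alone. The class name `HtmlTagEncoder`, the method `EncodeKnownTags`, and the short tag list are hypothetical and not part of my script; the list would have to be extended to cover the tags on the linked page (possibly from `Enum.GetNames(typeof(HtmlTextWriterTag))`, assuming a reference to System.Web is available). It also assumes that no genuine XML element shares a name with a whitelisted HTML tag.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTagEncoder
{
    // Hypothetical whitelist: extend it with the tags you actually need.
    private static readonly string[] KnownHtmlTags =
        { "sub", "sup", "i", "b", "br", "em", "strong", "u", "span" };

    // Matches an opening, closing, or self-closing tag whose name is in the whitelist.
    // The lookahead (?=[\s/>]) requires the name to end right there, so "b" does not
    // match "<body>" and "em" does not match "<element>". [^<>]* allows attributes.
    private static readonly Regex KnownTagRegex = new Regex(
        @"</?(?:" + string.Join("|", KnownHtmlTags) + @")(?=[\s/>])[^<>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // HTML-encodes only the whitelisted tags; everything else is returned unchanged.
    public static string EncodeKnownTags(string line)
    {
        return KnownTagRegex.Replace(line, m => WebUtility.HtmlEncode(m.Value));
    }
}
```

In the loop above, this could be called once per line after the configured replacements, for example `output = HtmlTagEncoder.EncodeKnownTags(output);`. Because only the matched tag text is passed to `WebUtility.HtmlEncode`, tags such as `<element>` or `<something>` are not touched.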