Encoding HTML tags in a file

Date: 2016-02-19 12:04:23

Tags: c# html

I have an XML file that looks like this:

<root>
</ some junk that is removed in regex
    <element>
        <other>This is some text</other>
    </element>
    <element>
        <something>H<sub>2</sub>0</something>
    </element>
<?? more junk that is removed by the regex
    <element>
        <else>more<sub>-</sub>text</else>
    </element>
</root>

I have the following code that runs through the XML file and does some cleanup.

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public void Main()
{
    string filename = @"C:\InnerTags.xml";
    string config = @"C:\RegexConfig.xml";
    string outputfn = @"C:\output.xml";

    // Load the pattern/replacement pairs once and build each Regex up front,
    // instead of constructing it again for every input line.
    XmlDocument xdoc = new XmlDocument();
    xdoc.Load(config);
    var rules = new List<KeyValuePair<Regex, string>>();
    foreach (XmlNode node in xdoc.SelectNodes("/root/line"))
    {
        rules.Add(new KeyValuePair<Regex, string>(
            new Regex(node["pattern"].InnerText),
            node["replacement"].InnerText));
    }

    using (FileStream fs = File.Open(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    using (StreamWriter writer = new StreamWriter(outputfn))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string output = line;

            // Apply every configured replacement, in order.
            foreach (var rule in rules)
            {
                output = rule.Key.Replace(output, rule.Value);
            }

            // Skip lines that the replacements emptied out.
            if (output.Length > 0)
            {
                writer.WriteLine(output);
            }
        }
        // No explicit Close() needed: the using block disposes the writer.
    }
}

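This cleans up the file and removes some junk lines.

I have now found that the file also contains a lot of HTML tags, such as sub, sup, and so on.

I would like to modify this script so that it encodes known HTML tags, such as the ones in this list: https://msdn.microsoft.com/en-us/library/system.web.ui.htmltextwritertag(v=vs.110).aspx

while still preserving the XML tags.

So the output would be: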

<root>
</ some junk that is removed in regex
    <element>
        <other>This is some text</other>
    </element>
    <element>
        <something>H&#x2082;0</something>
    </element>
<?? more junk that is removed by the regex
    <element>
        <else>more&#x2014;text</else>
    </element>
</root>
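
The `&#x2082;` in the expected output is the numeric character reference for SUBSCRIPT TWO (U+2082). A minimal sketch of that kind of conversion is shown here, assuming only `<sub>` elements whose content is ASCII digits need handling. The class and method names are hypothetical. Other content, such as the `<sub>-</sub>` case, is deliberately left untouched and would need its own mapping.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SubscriptConverter
{
    // Matches <sub>...</sub> only when the content is one or more ASCII digits.
    private static readonly Regex SubDigits = new Regex(
        @"<sub>([0-9]+)</sub>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Replaces each digit-only <sub> element with numeric character references.
    public static string ToCharacterReferences(string line)
    {
        return SubDigits.Replace(line, m =>
        {
            var sb = new StringBuilder();
            foreach (char digit in m.Groups[1].Value)
            {
                // U+2080..U+2089 are SUBSCRIPT ZERO..SUBSCRIPT NINE.
                int codePoint = 0x2080 + (digit - '0');
                sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
            }
            return sb.ToString();
        });
    }
}
```

Calling `SubscriptConverter.ToCharacterReferences(output)` on each line before `writer.WriteLine(output)` would be one place to hook this into the existing loop.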

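But again, I don't want to handle just these two tags. I want any tag in that list, so italic, bold, br, and so on.

How can I achieve this?

If I try to encode the file line by line, it will also encode the XML tags, which is worse.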
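
One possible way to avoid encoding the XML tags along with the HTML ones is to match only tags whose name appears in a whitelist and HTML-encode just those matches. The sketch below assumes that "encode" means escaping the angle brackets (`<sub>` becomes `&lt;sub&gt;`). The class name, method name, and the short tag list are hypothetical. The list could be extended with whichever names from the linked `HtmlTextWriterTag` enumeration are needed, provided none of them is also a legitimate XML element name in the file.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTagEncoder
{
    // Hypothetical whitelist of HTML tag names to encode; extend as needed.
    private static readonly string[] KnownHtmlTags =
        { "sub", "sup", "i", "b", "br", "em", "strong" };

    // Matches an opening, closing, or self-closing tag whose name is in the list.
    // The lookahead requires whitespace, '/' or '>' right after the name, so
    // <something> or <else> are not mistaken for <s...> or <em>.
    private static readonly Regex KnownTagRegex = new Regex(
        @"</?(?:" + string.Join("|", KnownHtmlTags) + @")(?=[\s/>])[^<>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // HTML-encodes only the whitelisted tags; every other tag is left as-is.
    public static string EncodeKnownHtmlTags(string line)
    {
        return KnownTagRegex.Replace(line, m => WebUtility.HtmlEncode(m.Value));
    }
}
```

In the existing loop this could be applied as `output = HtmlTagEncoder.EncodeKnownHtmlTags(output);` after the configured replacements have run, so the junk-removal patterns still see the original text.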

0 answers:

No answers yet