Question

Consider the following XML document:

<?xml version="1.0" encoding="iso-8859-1" ?>
<a>
    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
    </b>
    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
        <c4 description="abc123"" />    
        <c5 description="bbbasdasdbc123" /> 
        <c6 description="cccbasdasdc123" /> 
    </b>
    <b>
        <c1 description="abcaslkjkl123" weight="10" />
    </b>
</a>

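As it stands, this XML document is invalid; Firefox points to the offending position: Line 12, Column 27, i.e. the extra double quote. The double quote itself is not the issue. The cause could be anything that makes the XML document invalid.

The point is that when I try to load the XML document an error occurs, from which I know the line number and column. After that I have no choice but to mark the file as being in error and do something with it later.

What I would like to do is remove the <b> node that encloses the offending line (or extract it for later processing of the error)

i.e. remove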

    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
        <c4 description="abc123"" />    
        <c5 description="bbbasdasdbc123" /> 
        <c6 description="cccbasdasdc123" /> 
    </b>

leaving

<?xml version="1.0" encoding="iso-8859-1" ?>
<a>
    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
    </b>
    <b>
        <c1 description="abcaslkjkl123" weight="10" />
    </b>
</a>

The XML could be quite large, <= 100Mb

I looked into the following, which led me to end up using File.ReadLines(sourceXMLFile).Take(...) etc.

How to read a text file reversely with iterator in C#

Get last 10 lines of very large text file > 10GB

https://msdn.microsoft.com/en-us/library/w5aahf2a%28v=vs.110%29.aspx

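Pre-validating the XML against a schema is not an option (http://www.codeguru.com/csharp/csharp/cs_data/xml/article.php/c6737/Validation-of-XML-with-XSD.htm).

I have thought about how to solve this, given that I know the offending line number, and came up with this: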

    public void ProcessXMLFile(string sourceXMLFile, string errorFile)
    {
        XmlDocument xmlDocument = new XmlDocument();

        string outputFile1 = @"c:\temp\f1.txt";
        string outputFile2 = @"c:\temp\f2.txt";

        string soughtOpeningNode = "<b>";
        string soughtClosingNode = "</b>";

        string firstPart = "";
        string secondPart = "";
        int lastNode = 0;
        int firstNode = 0;


        try
        {
            xmlDocument.Load(sourceXMLFile);
        }
        catch (XmlException ex)
        {
            int offendingLineNumber = ex.LineNumber;

            // Create the first part of the file that comprises everything up to and including the line that caused the error
            using (StreamWriter f1 = new StreamWriter(outputFile1))
            {
                firstPart = string.Join("\r\n", File.ReadLines(sourceXMLFile).Take(offendingLineNumber));
                f1.WriteLine(firstPart);
                lastNode = firstPart.LastIndexOf(soughtOpeningNode);
            }

            // Create the file that contains the remainder of the original file starting after the line number that caused the error
            using (StreamWriter f2 = new StreamWriter(outputFile2))
            {
                secondPart = string.Join("\r\n", File.ReadLines(sourceXMLFile).Skip(offendingLineNumber));
                f2.WriteLine(secondPart);
                firstNode = secondPart.IndexOf(soughtClosingNode);
            }

            // Create the XML file without the node whose child caused the error...
            using (StreamWriter d1 = new StreamWriter(sourceXMLFile))
            {
                d1.WriteLine(firstPart.Substring(0, lastNode));
                d1.WriteLine(secondPart.Substring(firstNode + soughtOpeningNode.Length + 1));
            }

            // Write the node that contained the offending line number for later processing
            using (StreamWriter d1 = new StreamWriter(errorFile, true))
            {
                d1.WriteLine(firstPart.Substring(lastNode));
                d1.WriteLine(secondPart.Substring(0, firstNode + soughtClosingNode.Length + 1));
            }

            File.Delete(outputFile1);
            File.Delete(outputFile2);

            ProcessXMLFile(sourceXMLFile, errorFile);
        }
    }

Called with:

ProcessXMLFile(@"c:\temp\myBigFile.xml", @"c:\temp\myBigFile-errors.txt");

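My questions are:

This works, but is there a better way?
When processing an XML file (c70Mb) that contains many errors, it eventually runs out of memory (Task Manager shows memory usage on a 16Gb m/c creeping ever upward to 99%).
Even if I force the routine to run, memory stays at 99% and only drops once VS2010 is stopped, so how can I make it more memory-efficient?

Pointers would be appreciated.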


Answer 1

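This seems like a dodgy thing to attempt. Generally, if an XML file is not well-formed, you cannot read it as an XML file. The line and column that appear in the error message do not necessarily tell you "this is where the error is"; they only tell you the position at which the XML parser could no longer make sense of the file and gave up.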
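To illustrate the point about reported positions, here is a minimal sketch using `XmlDocument.LoadXml` and the `LineNumber` / `LinePosition` properties of `XmlException`. The class name `ParserPositionDemo`, the helper `FindFailurePosition`, and the sample text are made up for this illustration. In the sample, the mistake is the unclosed `<c1>` on line 2, but the parser should only notice a problem when it reaches the mismatched closing tag further down.

```csharp
using System;
using System.Xml;

public static class ParserPositionDemo
{
    // Hypothetical helper: returns (line, column) where the parser gave up,
    // or null if the text is well-formed.
    public static Tuple<int, int> FindFailurePosition(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            // This is where parsing stopped, not necessarily where the mistake is.
            return Tuple.Create(ex.LineNumber, ex.LinePosition);
        }
    }

    public static void Main()
    {
        // The <c1> start tag on line 2 is never closed.
        string xml = "<a>\n  <c1>\n  <c2 />\n</a>";

        Tuple<int, int> pos = FindFailurePosition(xml);
        Console.WriteLine(pos == null
            ? "well-formed"
            : "parser gave up at line " + pos.Item1 + ", column " + pos.Item2);
    }
}
```

Removing the `<b>` block around the reported line is therefore only safe when the kind of error guarantees that the mistake and the reported position fall in the same block.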

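So at best, you are handling a subset of the errors that could occur in an XML file. In your case, you may know what kind of errors to expect (e.g. data in an element that has not been encoded properly), in which case trying to remove the enclosing element may make sense, but you should still fix the code that creates the input file.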

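Now to address your specific question: your code seems to do this in a reasonable way, but if you know exactly what type of error to expect (e.g. the doubled quote in the example), then you could search the file for those specific things, rather than repeatedly trying to parse it as XML and handling the resulting error.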
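A rough sketch of that search-based approach follows. It is not a general solution and rests on assumptions taken from the sample only: every `<b>` and `</b>` sits on its own line, `<b>` elements do not nest, and the known error can be recognised from a single line. `BlockFilter`, `Split`, and the `isSuspect` predicate are hypothetical names introduced here. The file is read once, and only the current `<b>` block is held in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BlockFilter
{
    // Copies lines to 'clean', diverting every <b>...</b> block that contains
    // a line flagged by 'isSuspect' to 'rejected'.
    // Returns the number of rejected blocks.
    public static int Split(IEnumerable<string> lines, TextWriter clean,
                            TextWriter rejected, Func<string, bool> isSuspect)
    {
        var block = new List<string>();   // lines of the <b> block being read
        bool inBlock = false;
        bool blockIsSuspect = false;
        int rejectedCount = 0;

        foreach (string line in lines)
        {
            string trimmed = line.Trim();

            if (!inBlock && trimmed == "<b>")
            {
                inBlock = true;
                blockIsSuspect = false;
                block.Clear();
            }

            if (!inBlock)
            {
                clean.WriteLine(line);    // outside any <b>: pass through
                continue;
            }

            block.Add(line);
            if (isSuspect(line)) blockIsSuspect = true;

            if (trimmed == "</b>")
            {
                TextWriter target = blockIsSuspect ? rejected : clean;
                foreach (string buffered in block) target.WriteLine(buffered);
                if (blockIsSuspect) rejectedCount++;
                inBlock = false;
            }
        }

        if (inBlock)
        {
            // Unterminated block at end of input: treat it as rejected.
            foreach (string buffered in block) rejected.WriteLine(buffered);
            rejectedCount++;
        }
        return rejectedCount;
    }

    public static void Main()
    {
        string[] sample =
        {
            "<a>",
            "    <b>", "        <c1 description=\"abc123\" />", "    </b>",
            "    <b>", "        <c4 description=\"abc123\"\" />", "    </b>",
            "</a>"
        };
        var clean = new StringWriter();
        var rejected = new StringWriter();

        // Assumed pattern: a doubled quote. Note that this would also flag a
        // legitimately empty attribute such as description="".
        int count = Split(sample, clean, rejected, l => l.Contains("\"\""));

        Console.WriteLine(count + " block(s) rejected");
        Console.Write(clean.ToString());
    }
}
```

For a real file, the `lines` argument could be `File.ReadLines(sourceXMLFile)` and the two writers could be `StreamWriter` instances over a new output file and the error file, so the source is never loaded as a whole.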

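As for memory usage, do you still have the problem when you make a Release build and run it outside the debugger? I found memory usage steadily increasing under the debugger, presumably because garbage collection is not performed as aggressively, but it stays stable when I run a Release build.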
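One possible contributor, offered as an assumption rather than a diagnosis: `ProcessXMLFile` calls itself from inside the `catch` block, and under the debugger local variables are typically kept alive until the method returns, so every nested call may still be holding its own `firstPart` and `secondPart` strings. The sketch below replaces the recursion with a loop and checks well-formedness with a streaming `XmlReader` instead of building an `XmlDocument`. `RetryLoop`, `FindFirstErrorLine`, `Repair`, and the `removeBlockAt` callback are hypothetical names; the callback stands in for whatever code removes the block around the reported line.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;

public static class RetryLoop
{
    // Returns the line at which parsing failed, or 0 if the file is well-formed.
    public static int FindFirstErrorLine(string path)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                while (reader.Read()) { }   // stream through, keeping nothing
            }
            return 0;
        }
        catch (XmlException ex)
        {
            return ex.LineNumber;
        }
    }

    // Repeats check-and-remove in a loop, so no state from earlier passes
    // stays on the call stack. Returns the number of removals performed.
    public static int Repair(string path, Action<string, int> removeBlockAt, int maxPasses)
    {
        for (int pass = 0; pass < maxPasses; pass++)
        {
            int errorLine = FindFirstErrorLine(path);
            if (errorLine == 0) return pass;
            removeBlockAt(path, errorLine);
        }
        return maxPasses;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[]
        {
            "<a>", "<c1 d=\"x\"\" />", "<c2 d=\"y\" />", "</a>"
        });

        // Toy stand-in for the real removal logic: drops only the reported line.
        int removed = Repair(path, (p, line) =>
        {
            string[] kept = File.ReadAllLines(p)
                                .Where((text, index) => index != line - 1)
                                .ToArray();
            File.WriteAllLines(p, kept);
        }, 10);

        Console.WriteLine(removed + " removal(s) performed");
        File.Delete(path);
    }
}
```

The `maxPasses` limit is there so that a removal step which fails to make progress cannot loop forever.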

How to remove the invalid XML node from an XML document if one of its attributes contains invalid data
