XmlReader appears to ignore changes made to the buffer?

Asked: 2015-08-11 19:50:10

Tags: c# streamreader xmlreader

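Maybe my understanding of what should happen is wrong, so hopefully someone can correct my thinking here.

I'm trying to process a number of large XML files that are routinely sent to us with bad characters (0x1A) embedded in the text. Unfortunately, it is our customers who send the files, so no matter how nicely we ask them to make the files valid XML, they consider it our problem.

In the end I wrote a subclass of StreamReader, as follows: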

public class CleanTextReader : StreamReader
{
    private readonly ILog _logger;

    public CleanTextReader(Stream stream, ILog logger) : base(stream)
    {
        this._logger = logger;
    }

    public CleanTextReader(Stream stream) : this(stream, LogManager.GetLogger<CleanTextReader>())
    {
        //nothing to do here.
    }
    public override int Read(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.Read(buffer, index, count);

            var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();

            Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("Read(char[], int, int)", ex);
            throw;
        }
    }

    public override int ReadBlock(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.ReadBlock(buffer, index, count);
            var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();
            Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("ReadBlock(char[], int, int)", ex);
            throw;
        }
    }

    public override string ReadToEnd()
    {
        var chars = new char[4096];
        int len;
        var sb = new StringBuilder(4096);
        while ((len = Read(chars, 0, chars.Length)) != 0)
        {
            sb.Append(chars, 0, len);
        }
        return sb.ToString();
    }
}

...and then I create the XmlReader like this:

using (var theCleanser = new CleanTextReader(myStreamedInput))
using (var theReader = XmlReader.Create(theCleanser))
{
    ...
    // do stuff with theReader
}

I have a unit test like this:

    [TestMethod]
    public void CleanTextReaderCleans0X1A()
    {
        //arrange
        var originalString = "The quick brown fox jumped over the lazy dog.";
        var badChars = new string(new[] {(char) 0x1a});
        var concatenated = originalString.Replace("jumped", badChars + "jumped" + badChars);

        //act
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(concatenated)))
        {
            using (var reader = new CleanTextReader(stream))
            {
                var newString = reader.ReadToEnd().Trim().Replace("  ", " ");
                //assert
                Assert.IsTrue(originalString.Equals(newString));
            }
        }
    }

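...and it passes.

But when I try to parse an XML file that contains 0x1A characters, I still get System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line XX, position XX.

Digging into CleanTextReader, I stepped through the Read(char[], int, int) method, since that seems to be the one XmlReader calls. The original buffer contains the illegal character but filteredBuffer does not, and after Buffer.BlockCopy() runs, neither buffer nor filteredBuffer contains the special character.

Also worth noting: the line number and position reported are not those of the first instance of the invalid character but of the second, so it sees the first one and corrects it, but only once.

So I'm scratching my head here. How is XmlReader getting hold of the special character? Is it reading from the buffer before control returns from the method? How do I fix this?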
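
One possible explanation, offered as a sketch rather than a confirmed diagnosis: Buffer.BlockCopy takes its offsets and its count in bytes, not in array elements, and a char is two bytes. The call in Read therefore copies back only the first count / 2 cleaned characters, and it starts at position 0 rather than at index. If XmlReader calls Read with a non-zero index (an assumption about its internals), invalid characters later in the buffer would pass through untouched, which would be consistent with the "fixed once, then fails" symptom. The class name InPlaceCleanTextReader below is hypothetical; it replaces invalid characters in place, touching only the range that was just read.

```csharp
using System.IO;
using System.Xml;

// Hypothetical variant of CleanTextReader (logging omitted for brevity).
public class InPlaceCleanTextReader : StreamReader
{
    public InPlaceCleanTextReader(Stream stream) : base(stream)
    {
    }

    public override int Read(char[] buffer, int index, int count)
    {
        // Number of chars actually read; may be fewer than count.
        int read = base.Read(buffer, index, count);

        // Touch only buffer[index .. index + read): the chars just read.
        for (int i = index; i < index + read; i++)
        {
            if (!XmlConvert.IsXmlChar(buffer[i]))
            {
                buffer[i] = ' ';
            }
        }
        return read;
    }
}
```

Other members such as Read(), ReadLine() and ReadToEnd() are not overridden in this sketch, so it only covers callers that go through Read(char[], int, int). It would be passed to XmlReader.Create in the same way as CleanTextReader.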

Update

As requested, here is the stack trace I get:

"System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line 84, position 38.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
   at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
   at System.Xml.XmlTextReaderImpl.ParseText()
   at System.Xml.XmlTextReaderImpl.ParseElementContent()
   at System.Xml.XmlTextReaderImpl.Read()
   at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
   at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
   at System.Xml.Linq.XElement.ReadElementFrom(XmlReader r, LoadOptions o)
   at System.Xml.Linq.XNode.ReadFrom(XmlReader reader)
   at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilityElements>d__2b.MoveNext() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel\\Loader.cs:line 138
   at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilities>d__18.MoveNext() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel\\Loader.cs:line 71
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at MyCompany.Importers.GroupEligibilityModel.Test.LoadingTests.GroupEligibilityFileWithBadCharactersProperlyCleansed() in c:\\Projects\\MyCompanyHealth\\MyCompany.Importers\\MyCompany.Importers.GroupEligibilityModel\\MyCompany.Importers.GroupEligibilityModel.Test\\LoadingTests.cs:line 118"   string

0 answers:

No answers