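Perhaps my understanding of what should happen is wrong, so I'm hoping someone can correct my thinking here.
I'm trying to process a lot of large XML files that are regularly sent to us with bad characters (0x1A) embedded in the text. Unfortunately, the files come from our clients, and no matter how nicely we ask them to make the files valid XML, they consider it our problem.
In the end I wrote a subclass of StreamReader, as follows: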
public class CleanTextReader : StreamReader
{
    private readonly ILog _logger;

    public CleanTextReader(Stream stream, ILog logger) : base(stream)
    {
        this._logger = logger;
    }

    public CleanTextReader(Stream stream) : this(stream, LogManager.GetLogger<CleanTextReader>())
    {
        //nothing to do here.
    }

    public override int Read(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.Read(buffer, index, count);
            var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();
            Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("Read(char[], int, int)", ex);
            throw;
        }
    }

    public override int ReadBlock(char[] buffer, int index, int count)
    {
        try
        {
            var rVal = base.ReadBlock(buffer, index, count);
            var filteredBuffer = buffer.Select(x => XmlConvert.IsXmlChar(x) ? x : ' ').ToArray();
            Buffer.BlockCopy(filteredBuffer, 0, buffer, 0, count);
            return rVal;
        }
        catch (Exception ex)
        {
            this._logger.Error("ReadBlock(char[], int, int)", ex);
            throw;
        }
    }

    public override string ReadToEnd()
    {
        var chars = new char[4096];
        int len;
        var sb = new StringBuilder(4096);
        while ((len = Read(chars, 0, chars.Length)) != 0)
        {
            sb.Append(chars, 0, len);
        }
        return sb.ToString();
    }
}
...and then I use it with an XmlReader like this:
using (var theCleanser = new CleanTextReader(myStreamedInput))
using (var theReader = XmlReader.Create(theCleanser))
{
    ...
    // do stuff with theReader
}
I have a unit test like this:
[TestMethod]
public void CleanTextReaderCleans0X1A()
{
    //arrange
    var originalString = "The quick brown fox jumped over the lazy dog.";
    var badChars = new string(new[] {(char) 0x1a});
    var concatenated = originalString.Replace("jumped", badChars + "jumped" + badChars);
    //act
    using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(concatenated)))
    {
        using (var reader = new CleanTextReader(stream))
        {
            // each bad character becomes a space, so collapse the resulting double spaces
            var newString = reader.ReadToEnd().Trim().Replace("  ", " ");
            //assert
            Assert.IsTrue(originalString.Equals(newString));
        }
    }
}
...and this passes.
But when I try to parse an XML file that contains 0x1A characters, I still get a System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line XX, position XX.
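Digging into CleanTextReader, I stepped through the Read(char[], int, int) method, since that appears to be the one XmlReader hits. The original buffer contains the illegal character but filteredBuffer does not, and after Buffer.BlockCopy() runs, neither buffer nor filteredBuffer contains the special character.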
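Also worth noting: the line number and position in the exception do not point to the first instance of the invalid character but to the second, so the reader appears to see the first one and correct it, but only once.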
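One thing that may be worth checking against the "only once" observation: as far as the Buffer.BlockCopy documentation describes it, the offsets and the count are measured in bytes, not array elements. A char is two bytes, so passing a char count would copy only the first half of that many chars. The sketch below is a stand-alone illustration of that behavior, not the reader from above; `BlockCopyByteCountDemo` and `CopyWithCharCount` are hypothetical names.

```csharp
using System;

public static class BlockCopyByteCountDemo
{
    // Hypothetical helper: passes a char count as Buffer.BlockCopy's last
    // argument, which BlockCopy interprets as a number of bytes.
    public static char[] CopyWithCharCount(char[] source, int count)
    {
        var destination = new char[source.Length];
        Buffer.BlockCopy(source, 0, destination, 0, count);
        return destination;
    }

    public static void Main()
    {
        char[] source = "abcdefgh".ToCharArray();

        // 8 is meant as "8 chars" but is treated as 8 bytes, i.e. 4 chars.
        char[] copied = CopyWithCharCount(source, source.Length);
        Console.WriteLine(new string(copied, 0, 4)); // prints "abcd"
        Console.WriteLine(copied[4] == '\0');        // prints "True": never written

        // Array.Copy counts elements, so all 8 chars are copied.
        var full = new char[source.Length];
        Array.Copy(source, 0, full, 0, source.Length);
        Console.WriteLine(new string(full));         // prints "abcdefgh"
    }
}
```

If that is what happens inside Read, bad characters in the first half of the requested range would be replaced while later ones would pass through untouched. The short string in the unit test would fit entirely inside the copied half, which might explain why that test passes.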
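So I'm scratching my head here. How is XmlReader getting the special character? Is it reading from the buffer before control returns from the method? How can I fix this?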
Update
As requested, here is the stack trace I'm getting:
System.Xml.XmlException: '', hexadecimal value 0x1A, is an invalid character. Line 84, position 38.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
   at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
   at System.Xml.XmlTextReaderImpl.ParseText()
   at System.Xml.XmlTextReaderImpl.ParseElementContent()
   at System.Xml.XmlTextReaderImpl.Read()
   at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
   at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
   at System.Xml.Linq.XElement.ReadElementFrom(XmlReader r, LoadOptions o)
   at System.Xml.Linq.XNode.ReadFrom(XmlReader reader)
   at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilityElements>d__2b.MoveNext() in c:\Projects\MyCompanyHealth\MyCompany.Importers\MyCompany.Importers.GroupEligibilityModel\MyCompany.Importers.GroupEligibilityModel\Loader.cs:line 138
   at MyCompany.Importers.GroupEligibilityModel.Loader.<GetGroupEligibilities>d__18.MoveNext() in c:\Projects\MyCompanyHealth\MyCompany.Importers\MyCompany.Importers.GroupEligibilityModel\MyCompany.Importers.GroupEligibilityModel\Loader.cs:line 71
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at MyCompany.Importers.GroupEligibilityModel.Test.LoadingTests.GroupEligibilityFileWithBadCharactersProperlyCleansed() in c:\Projects\MyCompanyHealth\MyCompany.Importers\MyCompany.Importers.GroupEligibilityModel\MyCompany.Importers.GroupEligibilityModel.Test\LoadingTests.cs:line 118