Question

I have several XML files to process. A sample file looks like this:

  <DOC>
  <DOCNO>2431.eng</DOCNO>
  <TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
  <DESCRIPTION>view of several pools with steaming water; people, houses and 
   trees behind it, and a mountain range in the distant background;</DESCRIPTION>
   <NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to
   "Baños  del Inca" (baths of the Inka) with the arrival of the Spaniards . 
   Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
   <LOCATION>Cajamarca, Peru</LOCATION>
   </DOC>

When I read it with MATLAB's xmlread() function, I get the following error:

    [Fatal Error] 2431.eng:3:29: Invalid byte 2 of 4-byte UTF-8 sequence.
    ??? Java exception occurred:
    org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

    Error in ==> xmlread at 98
    parseResult = p.parse(fileName);

Any suggestions on how to solve this?

Answer 1

The sample you posted works fine.

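As the error message suggests, I think your actual file is incorrectly encoded. Remember that not every possible byte sequence is a valid UTF-8 sequence: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences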

A quick way to check is to open the file in Firefox. If the XML file has an encoding problem, you will see an error message like this:

XML Parsing Error: not well-formed
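
As an alternative to the browser check, the raw bytes can be tested directly. The following is a minimal sketch in Python (used here purely for illustration, outside MATLAB); the helper name first_invalid_utf8_offset is hypothetical. It reports the offset of the first byte that cannot be decoded as UTF-8, or None if the data is valid.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first offending byte.
        return err.start
    return None


# "Baños" stored as Latin-1: the byte 0xF1 at offset 2 is not valid UTF-8 here.
print(first_invalid_utf8_offset(b"Ba\xf1os"))               # 2
# The same word stored as UTF-8 decodes cleanly.
print(first_invalid_utf8_offset("Baños".encode("utf-8")))   # None
```

For a file on disk, the bytes would be read in binary mode, for example with open(path, "rb").read(), where path is whatever file you want to check.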

EDIT:

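So I took a look at the file: your problem is that XML parsers treat files without an <?xml ... ?> declaration line as UTF-8, but your file appears to be encoded as ISO-8859-1 (Latin-1) or Windows-1252 (CP-1252) instead.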

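For example, the SAX parser chokes on the following token: Baños. The character ñ, "the letter n with a tilde" (U+00F1), has a different representation in each of the two encodings: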

In ISO-8859-1, it is represented as one byte: 0xF1
In UTF-8, it is represented as two bytes: 0xC3 0xB1
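
These two representations are easy to confirm. A small Python sketch (Python is only an illustration here, not part of the original MATLAB setup) that encodes U+00F1 both ways:

```python
# Encode the single character U+00F1 (ñ) in both encodings and compare the bytes.
ch = "\u00f1"

latin1_bytes = ch.encode("iso-8859-1")
utf8_bytes = ch.encode("utf-8")

print(latin1_bytes.hex())  # f1
print(utf8_bytes.hex())    # c3b1
```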

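Although UTF-8 was designed to be backward compatible with ASCII, the character ñ lies outside the 7-bit ASCII range, in the so-called extended ASCII range, and every character there is represented by two or more bytes in UTF-8.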

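So when the substring ño, stored in Latin-1 as 11110001 01101111, is interpreted as UTF-8, the parser sees the first byte and treats it as the start of a 4-byte UTF-8 sequence of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The second byte, 01101111, does not match the 10xxxxxx continuation pattern, so the parser raises an error: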

org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
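
The bit-pattern reasoning behind that message can be sketched in a few lines of Python (illustrative only; the function names utf8_sequence_length and is_continuation are hypothetical, and the lead-byte check is deliberately simplified, ignoring rules such as overlong forms or the upper limit on lead bytes):

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte (0 if none)."""
    if lead_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    return 0      # continuation byte or not a lead byte


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte & 0b11000000 == 0b10000000


raw = "ño".encode("iso-8859-1")            # bytes 0xF1 0x6F
announced = utf8_sequence_length(raw[0])   # 0xF1 = 11110001 announces 4 bytes
second_ok = is_continuation(raw[1])        # 0x6F = 01101111 is not 10xxxxxx

print(announced, second_ok)  # 4 False

# Python's own UTF-8 decoder rejects the same bytes.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```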

The bottom line: always include an XML declaration! In your case, add the following line at the beginning of all your files:

<?xml version="1.0" encoding="ISO-8859-1"?>

Or better yet, modify the program that generates these files so that it writes this line itself.
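
If there are many files, the line can be added by a script instead of by hand. Below is a rough Python sketch (illustrative only; the helper name add_xml_declaration is hypothetical, and it assumes the data really is ISO-8859-1). It prepends the declaration to raw bytes that lack one and then checks the result with Python's standard XML parser rather than with xmlread.

```python
import xml.dom.minidom

# Assumption: the files are ISO-8859-1, as diagnosed above.
DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'


def add_xml_declaration(data: bytes) -> bytes:
    """Prepend the declaration unless the data already carries one."""
    if data.lstrip().startswith(b"<?xml"):
        return data
    return DECLARATION + data


# A shortened stand-in for the sample file, stored as Latin-1 bytes.
sample = "<DOC><TITLE>Baños del Inca</TITLE></DOC>".encode("iso-8859-1")
fixed = add_xml_declaration(sample)

# With the declaration in place, the parser decodes 0xF1 as ñ.
doc = xml.dom.minidom.parseString(fixed)
title = doc.getElementsByTagName("TITLE")[0].firstChild.data
print(title)  # Baños del Inca
```

To apply this to real files, each one would be read and rewritten in binary mode, so that the existing bytes are left untouched and only the declaration is added.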

After this change, MATLAB (or really Java) should be able to read the XML file correctly:

>> doc = xmlread('2431.eng');
>> doc.saveXML([])
ans =
<?xml version="1.0" encoding="UTF-16"?>
<DOC>
<DOCNO>annotations/02/2431.eng</DOCNO>
<TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
<DESCRIPTION>view of several pools with steaming water; people, houses and trees behind it, and a mountain range in the distant background;</DESCRIPTION>
<NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to "Baños del Inca" (baths of the Inka) with the arrival of the Spaniards . Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
<LOCATION>Cajamarca, Peru</LOCATION>
<DATE>October 2002</DATE>
<IMAGE>images/02/2431.jpg</IMAGE>
<THUMBNAIL>thumbnails/02/2431.jpg</THUMBNAIL>
</DOC>

(Note: apparently, once MATLAB has read the file, it re-encodes it internally as UTF-16.)
