Java: MalformedByteSequenceException (XML)

Time: 2009-12-09 03:00:27

Tags: java xml utf-8 text

I'm trying to parse XML using this class. When I feed it a simple file, it works fine.

<testData>
    <text>
        odp
    </text>
</testData>

Here is my main:

public static void main(String[] args) { 
    Xml train = new Xml(args[0], "trainingData");
    Xml test = new Xml(args[1], "testData");
}

However, when I use a file whose contents I copied and pasted from MSFT Office OneNote, I get this error:

Exception in thread "main" java.lang.RuntimeException: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at odp.compling.Xml.rootElement(Xml.java:41)
    at odp.compling.Xml.<init>(Xml.java:61)
    at odp.compling.ParseTreeAnalysis2.main(ParseTreeAnalysis2.java:10)
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at odp.compling.Xml.rootElement(Xml.java:33)
    ... 2 more

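What is causing this? I edited the offending XML file in Notepad++ and changed the encoding to UTF-8. That turned the accented characters and special quotes into some strange characters, which I edited out. Did I not convert it correctly?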

(I don't know anything about text encoding formats, in case you couldn't tell.)

1 answer:

Answer 0: (score: 1)

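Your file is not correctly encoded as UTF-8, but your parser expects UTF-8. (An XML parser assumes UTF-8 when the file has neither a byte order mark nor an encoding declaration.) The message "Invalid byte 1 of 1-byte UTF-8 sequence" means the parser hit a byte that cannot start a UTF-8 character, which is typical of leftover bytes from a single-byte Windows encoding.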
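If it turns out the file really is in a legacy single-byte encoding, one possible workaround is to decode the bytes yourself and hand the parser a character stream. This is only a sketch: the class name XmlWithCharset and its parse method are made up for illustration and are not part of your Xml class, and the charset to pass in is a guess you would have to confirm (text pasted from Office on Windows is often windows-1252, but I can't tell that from here).

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the asker's code.
class XmlWithCharset {
    /**
     * Parses XML from a byte stream using an explicitly chosen charset.
     * Because the parser receives a Reader (characters, not bytes), it does
     * not try to decode the bytes as UTF-8 itself.
     */
    static Document parse(InputStream bytes, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(bytes, charset);
        return builder.parse(new InputSource(reader));
    }
}
```

For example, XmlWithCharset.parse(new FileInputStream(args[1]), Charset.forName("windows-1252")) would try the file as windows-1252. The cleaner long-term fix is still to re-save the file as real UTF-8 (or add a matching encoding declaration to the XML), so the file and the parser agree without extra code.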

To help pin down the problem, it would be useful if you could post a hexdump of the file.
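If you don't have a hexdump tool handy, something like the following could locate the first offending byte and print the bytes around it in hex. It is a rough sketch, not a definitive tool: the class name Utf8Check and both method names are invented here, and it reads the whole file into memory, which is fine for small test files only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of the asker's code.
class Utf8Check {
    /** Returns the offset of the first byte that is not valid UTF-8, or -1 if everything decodes. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // position is left at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1;            // all input consumed without errors
            }
            out.clear();              // output buffer full: discard decoded chars and continue
        }
    }

    /** Formats the bytes from offset - radius to offset + radius as space-separated hex. */
    static String hexAround(byte[] data, int offset, int radius) {
        int from = Math.max(0, offset - radius);
        int to = Math.min(data.length, offset + radius + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            if (i > from) sb.append(' ');
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int bad = firstInvalidOffset(data);
        if (bad < 0) {
            System.out.println("File is valid UTF-8");
        } else {
            System.out.println("First invalid byte at offset " + bad + ": " + hexAround(data, bad, 8));
        }
    }
}
```

Run it with the path to the problem file as the only argument. A lone byte such as 93 or 94 in the output would suggest windows-1252 curly quotes that were never converted, but that is only a hint; the hex output is what would let someone confirm the actual encoding.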