I've run into a problem reading XML with Java. I'm using javax.xml.parsers.*; specifically, for a given InputStream stream, I do the following:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
org.w3c.dom.Element docElem = db.parse(stream).getDocumentElement();
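My files are normally UTF-8 encoded, although they don't actually contain any non-ASCII characters. Even so, the encoding is declared as <?xml version="1.0" encoding="UTF-8" ?>. The catch is that some of the XML files are quite large, so I usually compress them with gzip file.xml. I obtain the InputStream with the following method, which picks the stream type based on the file name's extension: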
private static InputStream getInputStream(File file) throws IOException {
    String extension = "";
    String fileName = file.getName();
    int i = fileName.lastIndexOf('.');
    if (i > 0) {
        extension = fileName.substring(i + 1);
    }
    InputStream stream = new FileInputStream(file);
    if ("gz".equals(extension)) {
        return new GZIPInputStream(stream);
    } else {
        if (!"xml".equals(extension)) {
            LOGGER.warning(String.format("Unknown extension: %s, assuming plain XML", extension));
        }
        return stream;
    }
}
The trouble is that when I run the code above on a file with the gz extension, I get the following exception:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:691)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1743)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1614)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1652)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:205)
When I decompress the file with gunzip before reading the XML, the problem does not occur. What am I doing wrong here?
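To help narrow this down, a small diagnostic sketch could dump the first few bytes that the parser actually receives. The class name StreamPeek and the method hexPrefix are hypothetical helpers, not part of my original code. A plain XML file with a declaration would normally start with 3c 3f 78 6d 6c ("<?xml"), while 1f 8b is the gzip magic number. Seeing the latter on the stream returned by getInputStream would suggest the data is still compressed at that point, for instance because the file was gzipped twice. That is only one possible explanation, not a confirmed cause.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical diagnostic helper; not part of the original code.
final class StreamPeek {

    private StreamPeek() {
    }

    /**
     * Reads up to n bytes from the stream and returns them as
     * space-separated lowercase hex, e.g. "3c 3f 78 6d 6c".
     * The bytes are consumed, so use a fresh stream for parsing afterwards.
     */
    static String hexPrefix(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        // A single read() may return fewer bytes than requested
        // (GZIPInputStream often does), so loop until n bytes or EOF.
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < total; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", buf[i] & 0xff));
        }
        return sb.toString();
    }
}
```

For example, logging StreamPeek.hexPrefix(getInputStream(file), 8) for one of the failing .gz files, and then opening a new stream for the actual parse, would show whether the parser is being handed XML text or something else.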