Reading gzipped XML with Java

Asked: 2015-02-04 12:21:58

Tags: java xml utf-8 gzip

I have run into a problem when reading XML with Java. I am using javax.xml.parsers.*, and for a given InputStream stream I do the following:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

org.w3c.dom.Element docElem = db.parse(stream).getDocumentElement();

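My files are normally UTF-8 encoded, although they do not actually contain any non-ASCII characters. The encoding is nevertheless declared as <?xml version="1.0" encoding="UTF-8" ?>. Some of the XML files are fairly large, so I usually compress them with gzip file.xml. I obtain the InputStream with the following method, which picks the stream type from the file name's extension: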

private static InputStream getInputStream(File file) throws IOException {
    String extension = "";
    String fileName = file.getName();

    int i = fileName.lastIndexOf('.');
    if (i > 0) {
        extension = fileName.substring(i+1);
    }

    InputStream stream = new FileInputStream(file);

    if("gz".equals(extension)) {
        return new GZIPInputStream(stream);
    }
    else {
        if(!"xml".equals(extension)) {
            LOGGER.warning(String.format("Unknown extension: %s, assuming plain XML", extension));
        }

        return stream;
    }
}
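The method above trusts the file name. As a cross-check, a minimal sketch that decides from the content instead, by looking for the two gzip magic bytes (0x1f 0x8b) at the start of the stream. The class name GzipSniffer and the helpers openMaybeGzipped and gzip are hypothetical names introduced for this illustration; they are not part of the code in the question.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper class, for illustration only.
public class GzipSniffer {

    // Returns a decompressing stream if the data starts with the gzip
    // magic bytes 0x1f 0x8b; otherwise returns the buffered stream unchanged.
    public static InputStream openMaybeGzipped(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the position before peeking
        int b1 = in.read();
        int b2 = in.read();
        in.reset();            // rewind so the caller sees the full stream
        if (b1 == 0x1f && b2 == 0x8b) {
            return new GZIPInputStream(in);
        }
        return in;
    }

    // Compresses a byte array in memory (used here only to build test input).
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        // Parse the same document once compressed and once uncompressed.
        for (byte[] input : new byte[][] { gzip(xml), xml }) {
            try (InputStream stream = openMaybeGzipped(new ByteArrayInputStream(input))) {
                String tag = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(stream).getDocumentElement().getTagName();
                System.out.println(tag);
            }
        }
    }
}
```

With this approach a misnamed file (for example, gzip data saved under a .xml name) would still be decompressed, at the cost of buffering the stream once more.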

The problem is that when I use the snippet above on a file with the gz extension, I get the following exception:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
  at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:691)
  at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
  at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1743)
  at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1614)
  at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1652)
  at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
  at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
  at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
  at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
  at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
  at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
  at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:205)

When I decompress the file with gunzip before reading the XML, the problem does not occur. What am I doing wrong here?
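The exception says the parser met a byte that is not valid UTF-8 at the very start of the document, so one way to narrow this down is to look at the first bytes the parser would actually receive. The sketch below is only a diagnostic aid under that assumption, not a confirmed explanation of the failure. It decompresses a stream and prints the leading bytes in hex: "3c 3f 78 6d 6c" is the text "<?xml", while "1f 8b" would mean the decompressed data is itself still gzip data. The class name GzipPeek and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical diagnostic helper, for illustration only.
public class GzipPeek {

    // Decompresses the stream and returns up to n leading bytes as hex text.
    public static String headHex(InputStream gzipped, int n) throws IOException {
        try (InputStream in = new GZIPInputStream(gzipped)) {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break;     // stream ended before n bytes were available
                }
                total += r;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < total; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02x", buf[i] & 0xff));
            }
            return sb.toString();
        }
    }

    // Compresses a byte array in memory (used here only to build sample input).
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        // Compressed once: the decompressed data starts with "<?xml".
        System.out.println(headHex(new ByteArrayInputStream(gzip(xml)), 5));       // 3c 3f 78 6d 6c
        // Compressed twice: one layer of decompression still yields gzip data.
        System.out.println(headHex(new ByteArrayInputStream(gzip(gzip(xml))), 2)); // 1f 8b
    }
}
```

To apply it to a real file, pass a FileInputStream for the .gz file to headHex. If the output begins with "3c 3f 78 6d 6c", the decompressed data looks like XML, and the stream that reaches db.parse is worth checking next.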

0 answers:

No answers yet.