Reading gzipped XML in Scala

Time: 2016-08-16 19:32:31

Tags: xml scala

When I try to read an xml.gz file into Scala, I get the following error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:701)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:567)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1896)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1761)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1799)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:156)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:41)
at scala.xml.XML$.loadXML(XML.scala:60)
at scala.xml.factory.XMLLoader$class.loadFile(XMLLoader.scala:50)
at scala.xml.X

This is the code I am using:

import scala.xml.XML 
val xml = XML.loadFile("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz") 

I have more than 2,000 xml.gz files to read. Is there an efficient way to do this? Thanks a lot!

1 answer:

Answer 0: (score: 1)

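On the outside, a .xml.gz file is not XML; it is gzip. XML.loadFile passes the compressed bytes straight to the parser, which fails when it tries to decode them as UTF-8. Decompress with GZIPInputStream as the file is being read: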

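import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.xml.{Elem, XML}

// Decompress the gzip stream on the fly and feed the decoded bytes to the XML parser.
def loadXmlGz(filename: String): Elem = {
  val in = new GZIPInputStream(new FileInputStream(filename))
  try XML.load(in) finally in.close() // close the stream once parsing is done
}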

val xml = loadXmlGz("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz")
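
For the 2,000-file case in the question, the same idea can be applied lazily over a directory so that only one document is parsed at a time. The following is a minimal sketch, not a definitive implementation: the helper names readXmlGz and loadAllXmlGz are made up for illustration, it assumes all the files sit directly in a single directory, and it assumes the scala-xml library is on the classpath.

```scala
import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.xml.{Elem, XML}

// Hypothetical helper: parse one gzipped XML file, closing both streams
// even if the file is not valid gzip or the XML fails to parse.
def readXmlGz(file: File): Elem = {
  val raw = new FileInputStream(file)
  try {
    val gz = new GZIPInputStream(new BufferedInputStream(raw))
    try XML.load(gz) finally gz.close()
  } finally raw.close()
}

// Hypothetical helper: lazily yield (file name, parsed document) for every
// *.xml.gz file found directly inside `dir`. Nothing is parsed until the
// iterator is consumed, so documents are loaded one at a time.
def loadAllXmlGz(dir: File): Iterator[(String, Elem)] = {
  // listFiles() returns null when `dir` is not a readable directory.
  val files = Option(dir.listFiles()).getOrElse(Array.empty[File])
  files.iterator
    .filter(f => f.isFile && f.getName.endsWith(".xml.gz"))
    .map(f => f.getName -> readXmlGz(f))
}
```

With these definitions, something like loadAllXmlGz(new File("/home/vagrant/miniprojects/spark/allVotes")).foreach { case (name, doc) => ... } would process the files one by one. Returning an Iterator rather than a List is a deliberate choice here: it avoids holding all 2,000 parsed documents in memory at once.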