I have tried everything I can think of to parse the dblp.xml.gz file from here using Java's SAXParser (note: the file is over 400MB compressed). No matter what I do, I get this error:
Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
...
The core of my code is as follows:
SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxParserFactory.newSAXParser();
DBLPHandler handler = new DBLPHandler(); // custom handler
InputStream instream = new GZIPInputStream(new FileInputStream(input));
InputSource insource = new InputSource(instream);
insource.setEncoding("ISO-8859-1");
saxParser.parse(insource, handler); // error thrown from here
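I have tried various solutions, including:
- calling setEncoding on the InputSource;
- wrapping the stream in an InputStreamReader with ISO-8859-1 (and UTF-8) set as the encoding;
- preprocessing the file from ISO-8859-1 to UTF-8 (which fails in the same way):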
InputStream gzis = new GZIPInputStream(new FileInputStream(input));
InputStreamReader ir = new InputStreamReader(gzis,"ISO-8859-1");
BufferedReader br = new BufferedReader(ir);
OutputStreamWriter osw = new OutputStreamWriter(new GZIPOutputStream(new FileOutputStream(output)), "UTF-8");
PrintWriter pw = new PrintWriter(new BufferedWriter(osw));
String line = null;
int read = 0;
while((line = br.readLine())!=null){
if(read==0){
// replace the declared encoding in the XML header
line = line.replaceAll("ISO-8859-1", "UTF-8");
}
read ++;
pw.println(line);
}
pw.close();
br.close();
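For reference, the InputStreamReader attempt from the list above can be sketched roughly as follows. This is a minimal, in-memory illustration only: the class name ReaderSourceSketch, its helper methods, and the tiny sample document are hypothetical stand-ins, and the sketch does not reproduce the failure seen with the real dblp.xml.gz. It only shows the shape of handing the parser a character stream, so that the Reader, not the parser, decodes the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch class; not part of the original code.
class ReaderSourceSketch {
    /** Gzips a small ISO-8859-1 document in memory (a stand-in for the real file). */
    static byte[] gzipLatin1(String xml) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(xml.getBytes(StandardCharsets.ISO_8859_1));
        }
        return buffer.toByteArray();
    }

    /** Parses gzipped bytes through a Reader and returns the collected character data. */
    static String parseText(byte[] gzipped) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.ISO_8859_1)) {
            // With a character stream, the Reader decodes the bytes, not the parser.
            parser.parse(new InputSource(reader), handler);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>Universit\u00e9</a>";
        System.out.println(parseText(gzipLatin1(xml)));
    }
}
```

On the real file, the same wiring would use new FileInputStream(input) in place of the in-memory byte array.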
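No matter what I try, I still get the same error. For what it's worth, running file -i on the original decompressed file reports the charset as us-ascii. A hex dump (od -x) of the original decompressed file gives: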
0000000 3f3c 6d78 206c 6576 7372 6f69 3d6e 3122
0000020 302e 2022 6e65 6f63 6964 676e 223d 5349
0000040 2d4f 3838 3935 312d 3f22 0a3e 213c 4f44
...
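To make the dump easier to read, here is a small sketch that turns the od -x words back into text. It assumes the dump was produced on a little-endian machine, so each 16-bit word holds two file bytes in swapped order; the class name OdHexDecoder is hypothetical. Under that assumption, the three lines above decode to the XML declaration with encoding="ISO-8859-1", followed by a newline and the start of `<!DO`.

```java
// Hypothetical helper; assumes od -x output from a little-endian machine.
class OdHexDecoder {
    /** Decodes the offset-prefixed 16-bit hex words printed by `od -x` into text. */
    static String decode(String dump) {
        StringBuilder text = new StringBuilder();
        for (String line : dump.split("\n")) {
            String[] fields = line.trim().split("\\s+");
            // fields[0] is the octal offset; the remaining fields are 4-digit hex words.
            for (int i = 1; i < fields.length; i++) {
                int word = Integer.parseInt(fields[i], 16);
                text.append((char) (word & 0xFF));        // low byte comes first in the file
                text.append((char) ((word >> 8) & 0xFF)); // high byte comes second
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
            "0000000 3f3c 6d78 206c 6576 7372 6f69 3d6e 3122",
            "0000020 302e 2022 6e65 6f63 6964 676e 223d 5349",
            "0000040 2d4f 3838 3935 312d 3f22 0a3e 213c 4f44");
        System.out.println(decode(dump));
    }
}
```

All of the bytes shown here are plain ASCII, which fits the us-ascii report from file -i for this part of the file; the dump says nothing about bytes further in.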
Any help with this frustrating problem would be greatly appreciated!