Parsing ISO-8859-1 (DBLP dump) with SAXParser

Asked: 2018-07-27 23:06:00

Tags: utf-8 character-encoding xml-parsing saxparser iso-8859-1

I have tried everything I can think of to parse the dblp.xml.gz file from here with Java's SAXParser (note: the file is over 400 MB compressed). No matter what I do, I get this error:

Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
...

The core of my code is as follows:

SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxParserFactory.newSAXParser();
DBLPHandler handler = new DBLPHandler(); // custom handler
InputStream instream = new GZIPInputStream(new FileInputStream(input));
InputSource insource = new InputSource(instream);
insource.setEncoding("ISO-8859-1");
saxParser.parse(insource, handler); // error thrown from here

I have tried a number of variations, including:

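  • using the decompressed file
  • with and without setEncoding
  • with and without the intermediate InputSource
  • with and without an InputStreamReader set to ISO-8859-1 (and to UTF-8) as the encoding
  • using separate code to convert the file to UTF-8 first (see below)
  • ...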
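
For reference, the InputStreamReader variant from the list above looks roughly like the sketch below. It is not my actual code: the class name ReaderVariant, the method parseLatin1, and the tiny in-memory document are stand-ins for illustration only. When an InputSource carries a character stream, the parser reads characters from it directly, so the byte decoding is done by the InputStreamReader rather than by the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderVariant {

    // Decodes the bytes as ISO-8859-1 via a Reader, parses the result,
    // and returns all character data seen by the handler.
    static String parseLatin1(InputStream in) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1)) {
            // The InputSource wraps a character stream, not a byte stream.
            parser.parse(new InputSource(reader), handler);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document; the real input is the DBLP dump.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
                   + "<author>Jos\u00e9</author>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseLatin1(new ByteArrayInputStream(latin1)));
    }
}
```

On a small document without a DOCTYPE this variant parses non-ASCII ISO-8859-1 bytes without complaint, which is why the failure on the full dump is so puzzling.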

ISO-8859-1 -> UTF-8 preprocessing (fails in the same way):

    InputStream gzis = new GZIPInputStream(new FileInputStream(input));
    InputStreamReader ir = new InputStreamReader(gzis,"ISO-8859-1");
    BufferedReader br = new BufferedReader(ir);

    OutputStreamWriter osw = new OutputStreamWriter(new GZIPOutputStream(new FileOutputStream(output)), "UTF-8");
    PrintWriter pw = new PrintWriter(new BufferedWriter(osw));

    String line = null;

    int read = 0;
    while((line = br.readLine())!=null){
        if(read==0){
            // replace the declared encoding in the XML header
            line = line.replaceAll("ISO-8859-1", "UTF-8");
        }
        read ++;

        pw.println(line);
    }

    pw.close();
    br.close();

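No matter what I try, I still get the same error. For what it's worth, file -i on the original decompressed file tells me the charset is us-ascii. A hex dump (od -x) of the original decompressed file gives me: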

0000000 3f3c 6d78 206c 6576 7372 6f69 3d6e 3122
0000020 302e 2022 6e65 6f63 6964 676e 223d 5349
0000040 2d4f 3838 3935 312d 3f22 0a3e 213c 4f44
...
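
To make the dump readable, here is a small sketch that turns od -x lines back into text. It assumes od ran on a little-endian machine, so the low byte of each 16-bit word comes first in the file; the class name OdWordDecoder and the method decode are made up for this illustration.

```java
public class OdWordDecoder {

    // Decodes one line of `od -x` output into text, assuming little-endian
    // 16-bit words. The first field is the file offset and is skipped.
    static String decode(String odLine) {
        String[] fields = odLine.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < fields.length; i++) {
            int word = Integer.parseInt(fields[i], 16);
            out.append((char) (word & 0xFF));         // low byte: first in the file
            out.append((char) ((word >> 8) & 0xFF));  // high byte: second in the file
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] dump = {
            "0000000 3f3c 6d78 206c 6576 7372 6f69 3d6e 3122",
            "0000020 302e 2022 6e65 6f63 6964 676e 223d 5349",
            "0000040 2d4f 3838 3935 312d 3f22 0a3e 213c 4f44"
        };
        StringBuilder header = new StringBuilder();
        for (String line : dump) {
            header.append(decode(line));
        }
        System.out.print(header);
    }
}
```

Under that assumption the three lines decode to <?xml version="1.0" encoding="ISO-8859-1"?>, a newline, and the start of <!DO, so the file declares ISO-8859-1 and its first 48 bytes are plain ASCII.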

Any help with this frustrating problem would be greatly appreciated!

0 answers:

No answers