Question

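I have downloaded the XML dump of the Stack Overflow site. While transferring the dump into a MySQL database, I keep running into the following error: Got an exception: Character reference "some character set like &#x10" is an invalid XML character.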

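I used UltraEdit (it is an 800 MB file) to remove some of these characters from the file, but each time I remove one invalid character set and rerun the parser, I get the error again, identifying more invalid characters. Any suggestions on how to solve this?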

Cheers,

J

Answer 1

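The set of characters allowed in XML is defined by the Char production of the XML specification. #x10 is not one of them. If the Stack Overflow dump contains such characters, then it is not well-formed XML.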

Alternatively, you are reading the XML with the wrong character encoding.
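
As a rough illustration of the allowed ranges, here is a minimal sketch of a validity check based on the Char production of XML 1.0. The class and method names (XmlChars, isValidXmlChar) are hypothetical and not part of any library or of the answer above.

```java
final class XmlChars {

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD      // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar(0x10)); // false: the character from the error message
        System.out.println(isValidXmlChar(0x0A)); // true: line feed is allowed
    }
}
```

Note that a character reference such as &#x10; is subject to the same rule as a literal character, which is why the parser rejects it.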

Answer 2

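Which dump are you using? The first version had problems (not just invalid characters, but also < appearing where it shouldn't), but they should have been fixed in the second dump.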

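For what it's worth, I fixed the invalid characters in the original file using two regex replacements. Replace "&#x0[12345678BCEF];" and "?" each with "?", treating them as regular expressions, of course.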
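
The same idea can be sketched in Java for files too large to edit comfortably by hand. This is not the answerer's exact pair of patterns: it is a generalization that matches any numeric character reference and replaces it with "?" only when the referenced code point is outside the XML 1.0 Char ranges. The names CharRefCleaner, replaceInvalidRefs, and isValidXmlChar are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefCleaner {

    // Matches hexadecimal (&#x10;) and decimal (&#16;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]{1,6}|[0-9]{1,7});");

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every character reference to an invalid code point with "?". */
    static String replaceInvalidRefs(String line) {
        Matcher m = CHAR_REF.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body);
            // Keep valid references untouched; substitute "?" for invalid ones.
            String replacement = isValidXmlChar(cp) ? m.group() : "?";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceInvalidRefs("a&#x10;b&#xA;c&#x0B;d")); // a?b&#xA;c?d
    }
}
```

For an 800 MB dump, calling replaceInvalidRefs on each line read from a BufferedReader and writing the result straight to an output file avoids loading the whole file into memory.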

Answer 3

You should convert the file to UTF-8. I develop in Java; below is my conversion routine.

import java.io.*;
import java.nio.charset.StandardCharsets;

public String FileUTF8Cleaner(File xmlfile) {

    String out = xmlfile + ".utf8";
    File outFile = new File(out);
    // Remove a leftover output file from a previous run, if there is one.
    if (outFile.exists()) {
        System.out.println("### File conversion process ### Deleting utf8 file");
        outFile.delete();
        System.out.println("### File conversion process ### Deleting utf8 file [DONE!]");
    }

    System.out.println("### File conversion process ### Converting file");
    // The input is decoded with the platform default charset, as in the original;
    // the output is written explicitly as UTF-8.
    try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(xmlfile)));
         Writer w = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(outFile), StandardCharsets.UTF_8))) {

        String strLine;
        while ((strLine = br.readLine()) != null) {
            // \p{Cc} matches every control character, so tabs are removed too.
            w.write(strLine.replaceAll("\\p{Cc}", ""));
            w.write("\n");
        }
        System.out.println("### File conversion process ### Converting file [DONE!]");

    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println("### File conversion process ### Processing file : " + xmlfile.getAbsolutePath() + " [DONE!]");
    return out;

}

Sax Invalid XML Character Exception
