Question

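I am downloading XML files from an FTP server. I have to prepare them for my SAX parser, which means removing the BOM bytes and encoding the content as UTF-8. But somehow this does not work for every file.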

Here is the code for my functions:

public static void copy(File src, File dest){

    try {
        byte[] data = Files.readAllBytes(src.toPath());

        writeAsUTF8(dest, skipBom(data));

    } catch (IOException e) {
        e.printStackTrace();
    }
}


private static void writeAsUTF8(File out, byte[] data){

    try {

        FileOutputStream outStream = new FileOutputStream(out);
        OutputStreamWriter outUTF = new OutputStreamWriter(outStream,"UTF8");

        outUTF.write(new String(data, "UTF8"));
        //outUTF.write(new String(data));
        outUTF.flush();
        outStream.close();
        outUTF.close();
    }
    catch(Exception ex){
        ex.printStackTrace();
    }
}

private static byte[] skipBom(byte[] data){

    int skipBytes = getBomSize(data);

    byte[] tmp = new byte[data.length - skipBytes];

    for(int x = 0; x < tmp.length; x++){
        tmp[x] = data[x + skipBytes];
    }

    return tmp;
}

Any idea what I am doing wrong?

Answer 1

Simplify.

    writeAsUTF8(dest, data);



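try {
    // The BOM is U+FEFF; encoded as UTF-8 it is the three bytes EF BB BF
    int bomLength = "\uFEFF".getBytes(StandardCharsets.UTF_8).length;
    if (data.length < bomLength
            || !new String(data, 0, bomLength, StandardCharsets.UTF_8).equals("\uFEFF")) {
        bomLength = 0; // no BOM present, keep every byte
    }
    // try-with-resources closes the stream even if the write fails
    try (FileOutputStream outStream = new FileOutputStream(out)) {
        outStream.write(data, bomLength, data.length - bomLength);
    }
}
catch(Exception ex){
    ex.printStackTrace();
}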

This checks whether a BOM (U+FEFF) is present. It would be simpler to read everything into a String:

String xml = new String(data, StandardCharsets.UTF_8);
xml = xml.replaceFirst("^\uFEFF", "");
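
As a self-contained variant of the same idea, here is a minimal sketch that assumes the data really is UTF-8. The class name BomStripper and the method decodeWithoutBom are made up for illustration. Java's UTF-8 decoder does not drop the BOM, so it shows up as the first character of the String and can be cut off there:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original code.
class BomStripper {

    // Decodes the bytes as UTF-8 and drops a leading BOM (U+FEFF) if present.
    static String decodeWithoutBom(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of the BOM, followed by "<a/>"
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(decodeWithoutBom(withBom)); // prints "<a/>"
    }
}
```

A plain startsWith check avoids the regex and leaves input without a BOM untouched.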

Using a Charset instead of a String encoding name means one exception fewer to catch: UnsupportedEncodingException (an IOException).

Detecting the XML encoding:

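// Decode as ISO-8859-1 first: it maps every byte to a character, so decoding never fails
String xml = new String(data, StandardCharsets.ISO_8859_1);
String encoding = xml.replaceFirst(
        "(?s)^.*<\\?xml.*encoding=([\"'])([\\w-]+)\\1.*\\?>.*$",
        "$2");

// If nothing matched, replaceFirst returns the input unchanged: fall back to UTF-8
if (encoding.equals(xml)) {
    encoding = "UTF-8";
}
xml = new String(data, encoding); // throws UnsupportedEncodingException for an unknown name
xml = xml.replaceFirst("^\uFEFF", "");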
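
The detection step can also be wrapped into a small helper. This is only a sketch under a few assumptions: the names XmlEncodingSniffer, detect and decode are invented here, only the first 200 bytes are inspected (an arbitrary limit), and it only works for ASCII-compatible encodings, so a UTF-16 file would not be recognized. An unknown or illegal encoding name falls back to UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
class XmlEncodingSniffer {

    // Matches encoding="..." or encoding='...' inside the XML declaration.
    private static final Pattern DECLARATION = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*([\"'])([\\w.-]+)\\1");

    // Returns the charset named in the XML declaration, or UTF-8 as a fallback.
    static Charset detect(byte[] data) {
        // ISO-8859-1 maps every byte to one char, so this decode cannot fail.
        int length = Math.min(data.length, 200); // assumed limit for the prolog
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARATION.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(2));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes with the detected charset and drops a leading BOM (U+FEFF).
    static String decode(byte[] data) {
        String xml = new String(data, detect(data));
        return xml.startsWith("\uFEFF") ? xml.substring(1) : xml;
    }
}
```

Returning a Charset rather than a name means the final decode needs no checked exception.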

Answer 2

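Why remove the BOM bytes at all? You only need to read the file into a String using the file's encoding, and then write that String to a file using UTF-8 encoding.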
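
A minimal sketch of that read-then-write approach, assuming the source encoding is already known. The names Transcoder and transcodeToUtf8 are invented for illustration. One caveat: Java's UTF-8 decoder keeps a leading BOM in the String as U+FEFF, so the sketch drops it explicitly before writing:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original code.
class Transcoder {

    // Reads src using srcCharset and writes the text to dest as UTF-8 without a BOM.
    static void transcodeToUtf8(Path src, Charset srcCharset, Path dest) throws IOException {
        String text = new String(Files.readAllBytes(src), srcCharset);
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1); // the decoder kept the BOM as a character
        }
        Files.write(dest, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

This loads the whole file into memory, which should be fine for typical XML downloads but not for very large files.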

Answer 3

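I can't figure out what is wrong with your code. I ran into the same problem some time ago and used the code below. First, the following function reads the file while skipping the first character, which is the BOM once the bytes are decoded as UTF-8. Of course, this only makes sense if you are sure that all of your files have a BOM.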

public byte[] load (File inputFile, int lines) throws Exception {

    try (BufferedReader reader
        = new BufferedReader(
            new InputStreamReader(
                new FileInputStream(inputFile), StandardCharsets.UTF_8)))
    {
        // Discard the Byte Order Mark: decoded as UTF-8 it is the single character U+FEFF
        reader.read();

        String line = null;
        int lineCount = 0;

        StringBuilder builder = new StringBuilder();
        // Read at most `lines` lines, normalizing line endings to "\n"
        while( lineCount < lines && (line = reader.readLine()) != null ) {
            lineCount += 1;
            builder.append(line).append("\n");
        }

        // Return inside the try block, where builder is still in scope
        return builder.toString().getBytes(StandardCharsets.UTF_8);
    }
}

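You can rewrite the function above so that it writes the data back to another file in UTF-8. I occasionally use the following method to convert a file on disk from ISO-8859-1 to UTF-8: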

public static void convertToUTF8 (Path p) throws Exception {
    Path docPath = p;
    Path docPathUTF8 = docPath;

    InputStreamReader in = new InputStreamReader(new FileInputStream(docPath.toFile()), StandardCharsets.ISO_8859_1);

    CharBuffer cb = CharBuffer.allocate(100 * 1000 * 1000);
    int c = -1;

    while ( (c = in.read()) != -1 ) {
        cb.put((char) c);
    }
    in.close();

    OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(docPathUTF8.toFile()), StandardCharsets.UTF_8);

    char[] x = new char[cb.position()];
    System.arraycopy(cb.array(), 0, x, 0, x.length);

    out.write(x);
    out.flush();
    out.close();
}
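
If the 100-million-character buffer is a concern, the same conversion can be done in a streaming fashion. This is a sketch, not a drop-in replacement: the names StreamingConverter and convert are invented, and unlike the method above it must write to a different file than it reads from, because both files are open at the same time:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original code.
class StreamingConverter {

    // Copies src to dest, decoding with `from` and encoding as UTF-8.
    // src and dest must be different files.
    static void convert(Path src, Path dest, Charset from) throws IOException {
        try (Reader in = Files.newBufferedReader(src, from);
             Writer out = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            // Copy in fixed-size chunks so memory use does not grow with file size
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

Both streams are closed by the try-with-resources statement, even if reading or writing fails partway through.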

Download XML, remove the BOM and encode as UTF-8
