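I am downloading XML files from an FTP server and have to prepare them for my SAX parser. To do that, I need to strip the BOM bytes and encode each file as UTF-8. But somehow this does not work for every file.
Here is the code of my two functions: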
public static void copy(File src, File dest){
    try {
        byte[] data = Files.readAllBytes(src.toPath());
        writeAsUTF8(dest, skipBom(data));
    } catch (IOException e) {
        e.printStackTrace();
    }
}
private static void writeAsUTF8(File out, byte[] data){
    try {
        FileOutputStream outStream = new FileOutputStream(out);
        OutputStreamWriter outUTF = new OutputStreamWriter(outStream,"UTF8");
        outUTF.write(new String(data, "UTF8"));
        //outUTF.write(new String(data));
        outUTF.flush();
        outStream.close();
        outUTF.close();
    }
    catch(Exception ex){
        ex.printStackTrace();
    }
}
private static byte[] skipBom(byte[] data){
    int skipBytes = getBomSize(data);
    byte[] tmp = new byte[data.length - skipBytes];
    for(int x = 0; x < tmp.length; x++){
        tmp[x] = data[x + skipBytes];
    }
    return tmp;
}
Any idea what I am doing wrong?
Answer 0: (score: 1)
Simplify.
writeAsUTF8(dest, data);
    try {
        // The BOM is U+FEFF; encoded as UTF-8 it is the three bytes EF BB BF
        int BOM_LENGTH = "\uFEFF".getBytes(StandardCharsets.UTF_8).length;
        if (data.length < BOM_LENGTH
                || !new String(data, 0, BOM_LENGTH, StandardCharsets.UTF_8).equals("\uFEFF")) {
            BOM_LENGTH = 0;
        }
        FileOutputStream outStream = new FileOutputStream(out);
        outStream.write(data, BOM_LENGTH, data.length - BOM_LENGTH);
        outStream.close();
    }
    catch(Exception ex){
        ex.printStackTrace();
    }
This checks whether a BOM (U+FEFF) is present. It would be simpler to read everything into a String:
String xml = new String(data, StandardCharsets.UTF_8);
xml = xml.replaceFirst("^\uFEFF", "");
Using a Charset instead of a String encoding parameter means one exception fewer to catch: UnsupportedEncodingException (an IOException).
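The BOM check described above can be sketched as a small self-contained helper. This is only an illustration: the class name `BomSketch` and its method names are hypothetical, and it handles the UTF-8 BOM only (UTF-16 or UTF-32 files would need their own byte patterns).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; the names are illustrative and not from the original code.
final class BomSketch {
    // U+FEFF encoded as UTF-8
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns 3 if data starts with the UTF-8 BOM bytes, otherwise 0. */
    static int utf8BomSize(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return 0;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return 0;
            }
        }
        return UTF8_BOM.length;
    }

    /** Byte-level variant: returns a copy of data without a leading UTF-8 BOM. */
    static byte[] stripUtf8Bom(byte[] data) {
        return Arrays.copyOfRange(data, utf8BomSize(data), data.length);
    }

    /** String-level variant: decode as UTF-8, then drop a leading U+FEFF. */
    static String decodeWithoutBom(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

The byte-level variant avoids decoding at all, which is enough when the input is already UTF-8; the String variant is the one to pick if the text has to be inspected or re-encoded anyway.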
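Detecting the XML encoding:
// Decode as ISO-8859-1 first: every byte maps to a char, so the ASCII declaration stays readable
String xml = new String(data, StandardCharsets.ISO_8859_1);
// Replace the whole document by the value of the encoding attribute, if there is one
String encoding = xml.replaceFirst(
        "(?s)^.*<\\?xml.*encoding=([\"'])([\\w-]+)\\1.*\\?>.*$",
        "$2");
// No match leaves the string unchanged: fall back to UTF-8
if (encoding.equals(xml)) {
    encoding = "UTF-8";
}
xml = new String(data, encoding);
xml = xml.replaceFirst("^\uFEFF", "");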
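The detection idea above can also be written with a `Matcher` that only looks at the start of the file. This is a rough sketch under a few assumptions: the class name `XmlEncodingSketch`, the 200-byte window, and the fallback to UTF-8 are my own choices, and it only works for ASCII-compatible encodings (a UTF-16 file would not match, since its declaration is not readable as single bytes).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original code.
final class XmlEncodingSketch {
    // Matches encoding="..." or encoding='...' inside the XML declaration
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*([\"'])([\\w.-]+)\\1");

    /** Returns the declared charset, or UTF-8 if none is declared or it is unknown. */
    static Charset detect(byte[] data) {
        // The declaration sits at the very start, so a short prefix is enough (assumed: 200 bytes)
        int len = Math.min(data.length, 200);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(2));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes with the detected charset and drops a leading U+FEFF if present. */
    static String decode(byte[] data) {
        String xml = new String(data, detect(data));
        return xml.startsWith("\uFEFF") ? xml.substring(1) : xml;
    }
}
```

Working on a prefix avoids building a second full copy of a large document just to read its first line.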
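Answer 1: (score: 0)
Why do you want to remove the BOM bytes? You only need to read the file into a String using the file's encoding, and then write that String to a file using UTF-8 encoding.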
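A minimal sketch of that read-then-write approach, assuming the source encoding is already known and the file fits in memory. The class name `TranscodeSketch` and the method `toUtf8` are hypothetical. One caveat: Java's UTF-8 decoder does not drop a leading BOM but hands it back as the character U+FEFF, so the sketch removes it explicitly.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the names are illustrative and not from the original code.
final class TranscodeSketch {
    /** Reads src using srcCharset and writes it to dest as UTF-8 without a BOM. */
    static void toUtf8(Path src, Path dest, Charset srcCharset) throws IOException {
        String text = new String(Files.readAllBytes(src), srcCharset);
        // A BOM in a UTF-8 source survives decoding as U+FEFF; drop it here
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        Files.write(dest, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

A call such as `TranscodeSketch.toUtf8(src.toPath(), dest.toPath(), StandardCharsets.UTF_8)` would then replace the `copy` method from the question.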
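Answer 2: (score: 0)
I can't figure out what is wrong with your code. I had the same problem some time ago and solved it with the code below. First, the following function reads the file while skipping its first character (the BOM). Of course, that only makes sense if you are sure that all of your files have a BOM.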
public byte[] load (File inputFile, int lines) throws Exception {
    try (BufferedReader reader
            = new BufferedReader(
                new InputStreamReader(
                    new FileInputStream(inputFile), "UTF-8")))
    {
        // Discard the Byte Order Mark (decoded as the single character U+FEFF)
        int firstChar = reader.read();
        String line = null;
        int lineCount = 0;
        StringBuilder builder = new StringBuilder();
        while( lineCount <= lines && (line = reader.readLine()) != null ) {
            lineCount += 1;
            builder.append(line + "\n");
        }
        // builder is scoped to the try block, so return from inside it
        return builder.toString().getBytes(StandardCharsets.UTF_8);
    }
}
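You can rewrite the function above so that it writes the data back to another file in UTF-8. I occasionally use the following method to convert a file on disk from ISO-8859-1 to UTF-8: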
public static void convertToUTF8 (Path p) throws Exception {
    Path docPath = p;
    Path docPathUTF8 = docPath; // same path: the file is overwritten in place
    // Holds the whole file in memory; inputs above 100 million chars will overflow it
    CharBuffer cb = CharBuffer.allocate(100 * 1000 * 1000);
    try (InputStreamReader in = new InputStreamReader(new FileInputStream(docPath.toFile()), StandardCharsets.ISO_8859_1)) {
        int c = -1;
        while ( (c = in.read()) != -1 ) {
            cb.put((char) c);
        }
    }
    // The input is fully read and closed before the output stream truncates the file
    try (OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(docPathUTF8.toFile()), StandardCharsets.UTF_8)) {
        out.write(cb.array(), 0, cb.position());
    }
}
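For large files, the same conversion can be done in a streaming fashion with a small fixed buffer instead of holding the whole file in memory. The following is only a sketch: the class name `StreamingConvertSketch` is hypothetical, and it assumes that `src` and `dest` are different paths, because opening the writer truncates `dest` right away. Note also that `Files.newBufferedReader` throws an exception on malformed input rather than silently substituting characters, which cannot happen for ISO-8859-1 but can for other source charsets.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the names are illustrative and not from the original code.
final class StreamingConvertSketch {
    /** Copies src (decoded with srcCharset) to dest as UTF-8; src and dest must differ. */
    static void convert(Path src, Path dest, Charset srcCharset) throws IOException {
        try (Reader in = Files.newBufferedReader(src, srcCharset);
             Writer out = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            // Copy in chunks; read returns -1 at end of input
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```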