Question

I am writing a function that should detect the charset in use and then convert the text to UTF-8. I am using juniversalchardet, a Java port of Mozilla's universalchardet.
Here is my code:

private List<List<String>> setProperEncoding(List<List<String>> input) {
    try {

        // Detect used charset
        UniversalDetector detector = new UniversalDetector(null);

        int position = 0;
        while ((position < input.size()) & (!detector.isDone())) {
            String row = null;
            for (String cell : input.get(position)) {
                row += cell;
            }
            byte[] bytes = row.getBytes();
            detector.handleData(bytes, 0, bytes.length);
            position++;
        }
        detector.dataEnd();

        Charset charset = Charset.forName(detector.getDetectedCharset());
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println("Detected charset: " + charset);

        // rewrite input using proper charset
        List<List<String>> newLines = new ArrayList<List<String>>();
        for (List<String> row : input) {
            List<String> newRow = new ArrayList<String>();
            for (String cell : row) {
                //newRow.add(new String(cell.getBytes(charset)));
                ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset));
                CharBuffer cb = charset.decode(bb);
                bb = utf8.encode(cb);
                newRow.add(new String(bb.array()));
            }
            newLines.add(newRow);
        }

        return newLines;

    } catch (Exception e) {
        e.printStackTrace();
        return input;
    }
}

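My problem is that when I read a file containing, for example, characters from the Polish alphabet, letters such as ł, ą, ć and similar ones are replaced with ? and other strange things. What am I doing wrong?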

EDIT: For compilation I am using Eclipse.

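The method parameter is the result of reading a MultipartFile. I simply use a FileInputStream to get every line and then split each line on some separator (it is prepared for xls, xlsx and csv files). Nothing special there.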

Answer 1

First, you have your data stored somewhere in binary form. For simplicity, I assume it comes from an InputStream.

You want to write the output as UTF-8 text, which I suppose can go to an OutputStream.
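Detection has to run on those raw bytes rather than on already decoded strings, because decoding with the wrong charset is lossy. Here is a minimal sketch of that effect, using only the JDK. The class name LossyDecodeDemo and the helper roundTrip are made up for illustration, and the sketch assumes the JDK provides the ISO-8859-2 charset (common JDK builds do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {
    // Hypothetical helper: decode the bytes with a charset, then encode
    // the resulting String back to bytes with the same charset.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // The letter "ł" encoded as ISO-8859-2 is the single byte 0xB3.
        byte[] raw = {(byte) 0xB3};

        // Decoded with the right charset, the byte survives the round trip.
        byte[] ok = roundTrip(raw, Charset.forName("ISO-8859-2"));
        System.out.println(Arrays.equals(ok, raw));   // prints "true"

        // Decoded as UTF-8, 0xB3 is malformed input and becomes U+FFFD,
        // so the original byte can no longer be recovered from the String.
        byte[] bad = roundTrip(raw, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(bad, raw));  // prints "false"
    }
}
```

This is why the charset should be detected, and the decoding done, before the file content is split into strings.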

I suggest you create an AutoDetectInputStream:

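import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class AutoDetectInputStream extends InputStream {
    private final InputStream is;
    private final byte[] sampleData = new byte[4096];
    private final int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read a sample; read() returns -1 on an empty stream
        sampleLen = Math.max(0, is.read(sampleData));
    }

    public Charset getCharset() {
        // detect the charset from the pre-read sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        // getDetectedCharset() returns a charset name, or null if nothing was detected
        String name = detector.getDetectedCharset();
        // falling back to the platform default is an assumption; pick what suits your data
        return name != null ? Charset.forName(name) : Charset.defaultCharset();
    }

    @Override
    public int read() throws IOException {
        // replay the sample first; mask so bytes >= 0x80 are not returned as negative ints
        if (sampleIndex < sampleLen) {
            return sampleData[sampleIndex++] & 0xFF;
        }
        return is.read();
    }

    @Override
    public void close() throws IOException {
        is.close();
    }
}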

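The second task is quite simple: once the bytes have been decoded, Java holds the text as Unicode strings (UTF-16 internally, not UTF-8), so a plain OutputStreamWriter created with the UTF-8 charset takes care of the encoding. So, here is your code: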
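The decode-then-encode step itself can be sketched on a byte array, without any streams. The class name TranscodeDemo and the helper toUtf8 are made up for illustration, and the source charset is hard-coded here instead of being detected:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Hypothetical helper: decode the raw bytes with the source charset,
    // then encode the resulting String as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);       // bytes -> characters
        return text.getBytes(StandardCharsets.UTF_8); // characters -> UTF-8 bytes
    }

    public static void main(String[] args) {
        // "łąć" in ISO-8859-2: one byte per letter.
        byte[] latin2 = {(byte) 0xB3, (byte) 0xB1, (byte) 0xE6};

        byte[] utf8 = toUtf8(latin2, Charset.forName("ISO-8859-2"));

        // Each of these letters needs two bytes in UTF-8.
        System.out.println(utf8.length); // prints "6"
    }
}
```

Always pass the charset explicitly: new String(bytes) and getBytes() without a charset argument use the platform default, which is a common source of the ? characters described in the question.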

// open input with Detector stream
// we use BufferedReader so we could read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));

// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, Charset.forName("UTF-8"));

// copy the whole file
String line;
while ((line = rdr.readLine()) != null) {
    // readLine() strips the line terminator, so write one back
    utf8Writer.append(line).append('\n');
}

// close streams
rdr.close();
utf8Writer.flush();
utf8Writer.close();

So, in the end, your whole txt file is transcoded to UTF-8.

Note that the buffer size should be big enough to feed the UniversalDetector.
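A related detail: a single InputStream.read(byte[]) call may return fewer bytes than the buffer can hold, so a large buffer alone does not guarantee a large sample. A minimal sketch of a fill loop, using only the JDK; the class name SampleReader and the helper readSample are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SampleReader {
    // Hypothetical helper: keep reading until the sample buffer is full
    // or the stream ends. Returns the number of bytes actually stored.
    static int readSample(InputStream in, byte[] sample) throws IOException {
        int filled = 0;
        while (filled < sample.length) {
            int n = in.read(sample, filled, sample.length - filled);
            if (n < 0) {
                break; // end of stream before the buffer was full
            }
            filled += n;
        }
        return filled;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10]);
        byte[] sample = new byte[4096];
        // The stream holds only 10 bytes, so only 10 are stored.
        System.out.println(readSample(in, sample)); // prints "10"
    }
}
```

A loop like this could replace the single read call that pre-reads the sample in the constructor.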
