How do I correctly read an Arabic dataset in Java?

Date: 2019-06-20 10:36:32

Tags: java text encoding utf-8 arabic

Scenario: I want to read an Arabic dataset that is encoded in UTF-8. The words on each line are separated by spaces.


Problem: When I read each line and print it, the output is:

  

??????? ?? ???? ?? ???


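Question: How do I read the file and print each line? For more information, here is my Arabic dataset, and the part of my source code that reads the data looks like this: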

private ContextCountsImpl extractContextCounts(Map<Integer, String> phraseMap) throws IOException {
        Reader reader;
        reader = new InputStreamReader(new FileInputStream(inputFile), "utf-8");
        BufferedReader rdr = new BufferedReader(reader);
        while (rdr.ready()) {
            String line = rdr.readLine();
            System.out.println(line);
            List<String> phrases = splitLineInPhrases(line);
            //any process on this file
        }
}

1 Answer:

Answer 0 (score: 0)

I was able to read the file using UTF-8. Could you try it this way?

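import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadArabic {
    public static void main(String[] args) {
        // try-with-resources closes the reader and the underlying stream automatically.
        // The charset is passed explicitly; leaving it out falls back to the platform default.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream("arabic.txt"), StandardCharsets.UTF_8))) {
            String line;
            // readLine() returns null once the end of the file is reached
            while ((line = bufferedReader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            System.err.println(e.getMessage()); // report any I/O failure
        }
    }
}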
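
If the file is decoded correctly but the console still shows question marks, the cause may be on the output side rather than the input side: System.out encodes text with a default encoding that cannot always represent Arabic. The following is a minimal sketch under that assumption (the question does not confirm it). The class name PrintArabic and the helpers readLines and utf8Stream are hypothetical names introduced for illustration; arabic.txt is the file name used in the answer above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PrintArabic {

    // Hypothetical helper: reads every line of a file, decoding it as UTF-8.
    public static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Hypothetical helper: wraps an output stream so that printed text is
    // encoded as UTF-8 instead of the default encoding. Auto-flush is enabled.
    public static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the console itself is configured to display UTF-8.
        PrintStream out = utf8Stream(System.out);
        for (String line : readLines(Paths.get("arabic.txt"))) {
            out.println(line);
        }
    }
}
```

This only changes the bytes that Java writes. The terminal or IDE console must also be set to UTF-8 and use a font that contains Arabic glyphs, otherwise correct bytes can still be rendered incorrectly.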
