Question

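I need to parse UTF-8 input (from a text file) character by character, where "character" means a complete UTF-8 character (a Unicode code point), not a Java char.

What approach should I use?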

Answer 1

For example:

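import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.IntStream;

// if you want to work line by line, use Files.readAllLines()
// if you use Guava, there's also Guava's Files.toString() for reading the whole file into a String
byte[] bytes = Files.readAllBytes(Paths.get("test.txt"));
String text = new String(bytes, StandardCharsets.UTF_8);

// one int per Unicode code point; surrogate pairs are already combined
IntStream codePoints = text.codePoints();

// do something with the code points (here: print each one's numeric value)
codePoints.forEach(codePoint -> System.out.println(codePoint));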
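
Since the snippet above prints each code point as a number, it may help to see how a code point can be turned back into text. The following is only a sketch: the class name CodePointDemo and the method splitByCodePoint are made up for illustration and are not part of the answer.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, named here only for illustration.
final class CodePointDemo {
    // Splits a String into one String per Unicode code point.
    // A supplementary character (outside the BMP) stays together
    // as a single two-char element instead of being cut in half.
    static List<String> splitByCodePoint(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }
}
```

For a string such as "a\uD83D\uDE00b" (the letter a, U+1F600, the letter b), length() reports 4 chars while this helper yields 3 elements, which is the distinction the question is asking about.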

Answer 2

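You can do this with an InputStreamReader and its read() method. read() returns an int, but note that this int holds a single Java char (a UTF-16 code unit), which is not always a full code point: a character outside the Basic Multilingual Plane arrives as two consecutive surrogate values that have to be combined. More information here: http://docs.oracle.com/javase/tutorial/i18n/text/stream.html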

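FileInputStream fis = new FileInputStream("test.txt");
InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8); // needs java.nio.charset.StandardCharsets
// Use isr.read() to read one char (UTF-16 code unit) at a time; it returns -1 at end of input.
// Close isr when done, for example with try-with-resources.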
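
Because Reader.read() hands back UTF-16 code units, getting whole code points from a reader takes one extra step: combining a high surrogate with the low surrogate that follows it. Below is a minimal sketch of that step, not a definitive implementation. The class CodePointReader and the method readCodePoint are hypothetical names, and the sketch assumes the reader supports mark/reset (for example a BufferedReader wrapped around the InputStreamReader).

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper class, named here only for illustration.
final class CodePointReader {
    // Returns the next Unicode code point, or -1 at end of input.
    // Assumption: the reader supports mark/reset (e.g. a BufferedReader).
    static int readCodePoint(Reader reader) throws IOException {
        int first = reader.read();
        if (first < 0) {
            return -1; // end of input
        }
        char high = (char) first;
        if (!Character.isHighSurrogate(high)) {
            return first; // BMP character (or a stray low surrogate), returned as-is
        }
        reader.mark(1); // remember the position in case the next char is not a low surrogate
        int second = reader.read();
        if (second >= 0 && Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint(high, (char) second);
        }
        if (second >= 0) {
            reader.reset(); // un-read the char so the next call sees it
        }
        return first; // unpaired high surrogate, returned as-is
    }
}
```

A caller would wrap the stream, for instance new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8)), and call readCodePoint in a loop until it returns -1.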
