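I am trying to read Unicode code points from a text file in Java. The InputStreamReader class returns the contents of the stream one int at a time, which I hoped would do what I want, but it does not combine surrogate pairs.
My test program: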
import java.io.*;
import java.nio.charset.*;

class TestChars {
    public static void main(String args[]) {
        InputStreamReader reader =
            new InputStreamReader(System.in, StandardCharsets.UTF_8);
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
        }
    }
}
It behaves as follows:
$ java TestChars
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code d83c is `HIGH SURROGATES D83C', ?.
Code df55 is `LOW SURROGATES DF55', ?.
Code a is `LINE FEED (LF)',
.
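My problem is that the surrogate pair that makes up the pizza emoji is read as two separate values. I would like to read the symbol into a single int and be done with it.
Question: Is there a Reader-like class that automatically combines surrogate pairs into code points as it reads? (And, ideally, throws an exception if the input is malformed.)
I know I could combine the pairs myself, but I would rather avoid reinventing the wheel.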
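For reference, the two values d83c and df55 in the output above are the UTF-16 encoding of the single code point U+1F355. The following is a minimal sketch of that round trip using only `Character.toChars` and `Character.toCodePoint`; the class and method names (`SurrogateDemo`, `split`, `join`) are made up for illustration.

```java
// Hypothetical demo class: shows how one supplementary code point maps to
// two UTF-16 code units and back.
class SurrogateDemo {
    // Splits a code point into its UTF-16 code units (1 or 2 chars).
    static char[] split(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Joins a high and a low surrogate back into one code point.
    static int join(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int pizza = 0x1F355; // SLICE OF PIZZA
        char[] units = split(pizza);
        System.out.printf("U+%X is encoded as %d code unit(s)%n", pizza, units.length);
        for (char unit : units) {
            System.out.printf("  unit %x%n", (int) unit);
        }
        System.out.printf("joined again: U+%X%n", join(units[0], units[1]));
    }
}
```

A Reader hands back these code units one at a time, which is why the program above sees two separate values for one symbol.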
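Answer 0 (score: 2)
If you take advantage of the fact that String has a method that returns a stream of code points, you do not have to deal with surrogate pairs yourself: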
import java.io.*;

class cptest {
    public static void main(String[] args) {
        try (BufferedReader br =
                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {
            // lines() yields each line without its terminator;
            // codePoints() combines surrogate pairs into single ints.
            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }

    // Prints one code point as a decimal number followed by the character.
    private static void print(int cp) {
        String s = new String(Character.toChars(cp));
        System.out.println("Character " + cp + ": " + s);
    }
}
This produces:
$ java cptest <<< "keyboard ⌨. pizza 🍕"
Character 107: k
Character 101: e
Character 121: y
Character 98: b
Character 111: o
Character 97: a
Character 114: r
Character 100: d
Character 32:
Character 9000: ⌨
Character 46: .
Character 32:
Character 112: p
Character 105: i
Character 122: z
Character 122: z
Character 97: a
Character 32:
Character 127829: 🍕
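Two details of this approach are worth knowing. `BufferedReader.lines()` strips line terminators, so no LINE FEED shows up in the output, and `codePoints()` does not throw on malformed input: an unpaired surrogate is passed through as its own int. The sketch below, assuming it is acceptable to buffer the whole input in memory, keeps line terminators and adds an explicit check. `CodePointsDemo`, `readAllCodePoints` and `hasUnpairedSurrogate` are hypothetical names, not library API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the answer above.
class CodePointsDemo {
    // Reads the whole Reader into memory and returns its code points,
    // line terminators included.
    static int[] readAllCodePoints(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.codePoints().toArray();
    }

    // True if any value is a surrogate that codePoints() could not pair up.
    static boolean hasUnpairedSurrogate(int[] codePoints) {
        for (int cp : codePoints) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The escapes below spell U+1F355 as a surrogate pair.
        int[] cps = readAllCodePoints(new StringReader("pizza \uD83C\uDF55\n"));
        for (int cp : cps) {
            System.out.printf("Code %x is %s%n", cp, Character.getName(cp));
        }
        System.out.println("unpaired surrogate present: " + hasUnpairedSurrogate(cps));
    }
}
```

To read standard input instead, pass `new InputStreamReader(System.in, StandardCharsets.UTF_8)` to `readAllCodePoints`.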
Answer 1 (score: 1)
You can wrap a Reader instance in a simple class that decodes surrogate pairs:
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

// Wraps a Reader and returns whole code points instead of UTF-16 code units.
public class CodepointStream implements Closeable {
    private Reader reader;

    public CodepointStream(Reader reader) {
        this.reader = reader;
    }

    // Returns the next code point, or -1 at end of stream.
    public int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0)
            return unit0; // EOF
        if (!Character.isHighSurrogate((char) unit0))
            return unit0; // not the start of a pair (a lone low surrogate also ends up here)
        int unit1 = reader.read();
        if (unit1 < 0)
            return unit1; // EOF in the middle of a pair
        if (!Character.isLowSurrogate((char) unit1))
            throw new RuntimeException("Invalid surrogate pair");
        return Character.toCodePoint((char) unit0, (char) unit1);
    }

    @Override
    public void close() throws IOException {
        reader.close();
        reader = null;
    }
}
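The wrapper above returns a lone low surrogate unchanged and reports EOF when the stream ends after a high surrogate. If every malformed sequence should raise an error, as the question asks, a stricter variant might look like the sketch below. `StrictCodePointReader` is a hypothetical name, and throwing `IOException` rather than `RuntimeException` is a design choice made here, not something taken from the answer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stricter variant: every unpaired surrogate is an error.
class StrictCodePointReader implements AutoCloseable {
    private final Reader reader;

    StrictCodePointReader(Reader reader) {
        this.reader = reader;
    }

    // Returns the next code point, or -1 at end of stream.
    int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0) {
            return -1; // clean EOF
        }
        char c0 = (char) unit0;
        if (Character.isLowSurrogate(c0)) {
            throw new IOException("Low surrogate without preceding high surrogate");
        }
        if (!Character.isHighSurrogate(c0)) {
            return unit0; // ordinary BMP character
        }
        int unit1 = reader.read();
        if (unit1 < 0 || !Character.isLowSurrogate((char) unit1)) {
            throw new IOException("High surrogate without following low surrogate");
        }
        return Character.toCodePoint(c0, (char) unit1);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    public static void main(String[] args) throws IOException {
        // The escapes below spell U+1F355 as a surrogate pair.
        try (StrictCodePointReader in =
                new StrictCodePointReader(new StringReader("pizza \uD83C\uDF55"))) {
            for (int cp = in.read(); cp != -1; cp = in.read()) {
                System.out.printf("Code %x is %s%n", cp, Character.getName(cp));
            }
        }
    }
}
```

Because the error is a checked `IOException`, callers that already handle I/O failures from the underlying Reader pick up malformed pairs in the same place.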
The main function needs only a small change:
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
    public static void main(String args[]) {
        CodepointStream reader = new CodepointStream(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
        }
    }
}
Your output then becomes:
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE', .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE', .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE', .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)',
.
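On the "throw an exception if the input is malformed" part of the question: an InputStreamReader constructed from a Charset replaces malformed byte sequences with U+FFFD instead of failing. One way to get an exception at the byte level is to pass in a CharsetDecoder configured with `CodingErrorAction.REPORT`. The sketch below assumes UTF-8 input; `StrictUtf8Demo`, `strictUtf8Reader` and `decodesCleanly` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: a Reader that fails on malformed UTF-8 bytes.
class StrictUtf8Demo {
    // Builds a Reader whose decoder reports errors instead of replacing them.
    static Reader strictUtf8Reader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Returns true if the bytes decode as UTF-8 without any error.
    static boolean decodesCleanly(byte[] bytes) {
        try (Reader reader = strictUtf8Reader(new ByteArrayInputStream(bytes))) {
            while (reader.read() != -1) {
                // Drain the stream; a decoding error surfaces as an IOException.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "pizza \uD83C\uDF55".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'p', (byte) 0xFF, 'z'}; // 0xFF never occurs in valid UTF-8
        System.out.println("good input decodes cleanly: " + decodesCleanly(good));
        System.out.println("bad input decodes cleanly: " + decodesCleanly(bad));
    }
}
```

A reader built this way can be handed to either of the approaches above, for example `new CodepointStream(StrictUtf8Demo.strictUtf8Reader(System.in))`.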