Question

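The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that also contains only 16 bits. Which function reads complete Unicode symbols, including those in the SMP, SIP, TIP, SSP and PUA?

Update

I am asking how to read a single Unicode symbol (or code point) from an input stream. I neither have an integer array nor do I want to read a line.

It is possible to build a code point with Character.toCodePoint(), but this function requires a char. On the other hand, reading a char is not possible, since read() returns an int. My best workaround so far is this, but it still contains unsafe casts: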

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How can this be done better?

Update 2

Another version that returns a String but still uses casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Can not read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

Remaining question: are the casts safe, and if not, how can they be avoided?

Answer 1

My best workaround so far is this, but it still contains unsafe casts

The only unsafe thing about the code you've presented is that ch16 may be -1 if input has reached EOF. If you check for this condition first, then the other (char) casts are guaranteed to be safe, because Reader.read() is specified to return either -1 or a value within the range of char (0 - 0xFFFF).

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16;
  else {
    int loSurr = input.read();
    if (loSurr < 0 || !Character.isLowSurrogate((char)loSurr))
      return ch16; // or possibly throw an exception
    else
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}

This still isn't ideal: you really need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate. In that case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about:

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if (secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if (!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    } else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Or you could wrap the original reader in a PushbackReader and use that to push the second char back.
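The PushbackReader idea is only described in words, so here is a minimal sketch of how it might look. The class name PushbackCodePoints and the helper readAll are hypothetical and exist only for illustration; the sketch assumes the default one-char pushback buffer is enough, which holds because at most one char is ever un-read.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are illustrative and not from the answer.
class PushbackCodePoints {

    // Reads one code point, or returns -1 at EOF.
    // An unpaired surrogate is returned as-is.
    static int readCodePoint(PushbackReader input) throws IOException {
        int first = input.read();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first;
        }
        int second = input.read();
        if (second < 0) {
            return first; // EOF right after a lone high surrogate
        }
        if (!Character.isLowSurrogate((char) second)) {
            input.unread(second); // push back so the next call sees this char
            return first;
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    // Convenience for experiments: decodes a whole String through the reader.
    static List<Integer> readAll(String text) {
        List<Integer> result = new ArrayList<>();
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            for (int cp = readCodePoint(in); cp >= 0; cp = readCodePoint(in)) {
                result.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Unlike the mark()/reset() variant, this does not depend on markSupported(), at the cost of having to route every read through the same PushbackReader instance.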

Answer 2

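Full Unicode can be represented in UTF-8 as a sequence of bytes, and in UTF-16 as a sequence of byte pairs ("Java chars"). From a String, the full Unicode code points can be extracted as follows: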

int[] codePoints = { 0x1d11e };
String s = new String(codePoints, 0, codePoints.length);

for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    i += Character.charCount(cp);
}
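On Java 8 or later, the same iteration can also be written with String.codePoints(), which yields an IntStream of code points and handles surrogate pairs internally. The following is a small sketch; the class name CodePointStreamDemo and the method codePointsOf are made up for this example.

```java
// Hypothetical demo class; names are illustrative only.
class CodePointStreamDemo {

    // Returns the code points of s, combining surrogate pairs.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "hi" followed by the G-Clef, built from its code point.
        String s = "hi" + new String(Character.toChars(0x1D11E));
        System.out.println("chars: " + s.length());
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The string above has four chars but only three code points, because the G-Clef occupies two chars.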

For a file consisting mostly of basic Latin characters, UTF-8 seems fine.

The following reads a full standard Unicode file (in UTF-8 format):

try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        ... do some thing with a Unicode line ...
    }
} catch (FileNotFoundException e) {
    System.err.println("No file: " + file.getPath());
} catch (IOException e) {
    ...
}
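To connect the line-based loop with code points, each line can be walked code point by code point once it has been read. The sketch below counts the code points outside the BMP; the class LineCodePoints and its methods are hypothetical, and the input is taken as a Reader so that the same code works for a file or an in-memory string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; names are illustrative only.
class LineCodePoints {

    // Counts code points above U+FFFF (supplementary planes) in the input.
    static long countSupplementary(Reader source) throws IOException {
        long count = 0;
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            count += line.codePoints()
                         .filter(Character::isSupplementaryCodePoint)
                         .count();
        }
        return count;
    }

    // Convenience wrapper for in-memory text.
    static long countInText(String text) {
        try {
            return countSupplementary(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real file, a reader obtained from Files.newBufferedReader(path, StandardCharsets.UTF_8) could be passed to countSupplementary instead of the StringReader.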

A function that produces a Java String from one (or more) Unicode code points:

String s = unicodeToString(0x1d11e);
String hello = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

public static String unicodeToString(int... codePoints) {
    return new String(codePoints, 0, codePoints.length);
}

How to read the Unicode G-Clef (U+1D11E) from a file?
