Question

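I recently came across the codePointAt method of String in Java. I also found a few other codePoint methods: codePointBefore, codePointCount, and so on. They clearly have something to do with Unicode, but I don't understand them.

Now I'd like to know when and how to use codePointAt and similar methods.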

Answer 1

Short answer: it gives you the Unicode code point that starts at the specified index in the String, i.e. the "Unicode number" of the character at that position.
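As a minimal sketch of the short answer, the snippet below formats the code point found at a given index in the usual U+XXXX notation. The class name `CodePointAtBasics` and the helper `describe` are made up for illustration; only `String.codePointAt` and `String.format` come from the JDK.

```java
final class CodePointAtBasics {
    /** Formats the code point that starts at {@code index} as U+XXXX (hypothetical helper). */
    static String describe(String s, int index) {
        int cp = s.codePointAt(index); // an int, because a code point may not fit in a char
        return String.format("U+%04X", cp);
    }

    public static void main(String[] args) {
        // 'A', 'é', and U+1F600 written as its two UTF-16 code units
        String s = "A\u00E9" + "\uD83D\uDE00";
        System.out.println(describe(s, 0)); // U+0041
        System.out.println(describe(s, 1)); // U+00E9
        System.out.println(describe(s, 2)); // U+1F600
    }
}
```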

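Longer answer: Java was created when 16 bits (a.k.a. a char) were enough to hold any Unicode character that existed (that range is now known as the Basic Multilingual Plane, or BMP). Later, Unicode was extended to include characters with code points above U+FFFF, i.e. 2¹⁶ and beyond. This means that a char can no longer hold every possible Unicode code point.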

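UTF-16 was the solution: it stores the "old" Unicode code points in 16 bits (i.e. exactly one char) and all the new ones in 32 bits (i.e. two char values). Those two 16-bit values are called a "surrogate pair". Strictly speaking, a char now holds a "UTF-16 code unit" rather than a "Unicode character" as it used to.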
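To make the surrogate pair idea concrete, here is a small sketch that lists the UTF-16 code units Java uses for a given code point. `SurrogatePairDemo` and `utf16Units` are names invented for this example; `Character.toChars`, `Character.isHighSurrogate`, and `Character.isLowSurrogate` are standard JDK methods.

```java
final class SurrogatePairDemo {
    /** Returns the UTF-16 code units of a code point as space-separated hex (hypothetical helper). */
    static String utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint); // length 1 inside the BMP, 2 outside it
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x00E9));  // 00E9
        System.out.println(utf16Units(0x1F600)); // D83D DE00

        char[] pair = Character.toChars(0x1F600);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
    }
}
```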

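All the "old" methods (which handle only char) still work as long as you don't use any of the "new" Unicode characters (or don't really care about them). But if you do care about the new characters (or simply need complete Unicode support), you need to use the "codepoint" versions, which actually support all possible Unicode code points.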
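One way to see the difference between the char-based and code-point-based views is to walk a string both forwards and backwards by code point. This is only a sketch: `CodePointIteration`, `forward`, and `backward` are hypothetical names, while `codePointAt`, `codePointBefore`, `codePointCount`, and `Character.charCount` are the real JDK methods.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointIteration {
    /** Collects the code points of s from first to last. */
    static List<Integer> forward(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // step over 1 or 2 chars
        }
        return result;
    }

    /** Collects the code points of s from last to first. */
    static List<Integer> backward(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i); // code point ending just before index i
            result.add(cp);
            i -= Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a" + "\uD83D\uDE00" + "b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());                      // 4
        System.out.println(s.codePointCount(0, s.length())); // 3
        System.out.println(forward(s));                      // [97, 128512, 98]
        System.out.println(backward(s));                     // [98, 128512, 97]
    }
}
```

On Java 8 and later, `s.codePoints()` returns an `IntStream` of the same values as the forward loop.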

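Note: a well-known example of Unicode characters outside the BMP (i.e. ones that only work with the code point variants) is emoji: even the simple Grinning Face, U+1F600, cannot be represented in a single char.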

Answer 2

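Code points support characters above 65535, which is Character.MAX_VALUE.

If your text contains such high characters, you have to work with code points, i.e. int values, instead of char.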
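The limits mentioned above can be checked directly. The sketch below (the class `CharVersusInt` and method `fitsInChar` are illustrative names) shows why an int is needed: narrowing a supplementary code point to char silently keeps only the low 16 bits.

```java
final class CharVersusInt {
    /** True if the code point lies in the BMP and therefore fits in one char. */
    static boolean fitsInChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600;

        System.out.println((int) Character.MAX_VALUE);  // 65535
        System.out.println(Character.MAX_CODE_POINT);   // 1114111
        System.out.println(fitsInChar('A'));            // true
        System.out.println(fitsInChar(grinningFace));   // false

        char truncated = (char) grinningFace;           // narrowing drops the high bits
        System.out.println(Integer.toHexString(truncated)); // f600
    }
}
```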

It does this by supporting UTF-16, which uses one or two 16-bit char values per character, and combining them into an int.
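The conversion from two 16-bit values to one int follows the UTF-16 surrogate arithmetic. The sketch below spells it out by hand and compares it with the JDK's `Character.toCodePoint`. `SurrogateMath` and `combine` are hypothetical names, and the helper assumes its arguments really are a valid high and low surrogate.

```java
final class SurrogateMath {
    /** Combines a high and a low surrogate into a code point (assumes valid surrogates). */
    static int combine(char high, char low) {
        // Each surrogate carries 10 payload bits; the result is offset by 0x10000.
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD83D';
        char low = '\uDE00';
        System.out.println(Integer.toHexString(combine(high, low)));               // 1f600
        System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // 1f600
    }
}
```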

AFAIK, this is generally only needed for the Supplementary Multilingual and Supplementary Ideographic characters that were added more recently, such as non-traditional Chinese.
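For reference, the plane a code point belongs to is simply its value divided by 0x10000: plane 0 is the BMP, plane 1 the Supplementary Multilingual Plane, and plane 2 the Supplementary Ideographic Plane. A rough sketch, with `PlaneDemo` and `planeOf` as made-up names:

```java
final class PlaneDemo {
    /** Returns the Unicode plane number (0 to 16) of a valid code point. */
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf('A'));     // 0: Basic Multilingual Plane
        System.out.println(planeOf(0x1F600)); // 1: Supplementary Multilingual Plane
        System.out.println(planeOf(0x20000)); // 2: Supplementary Ideographic Plane
        System.out.println(Character.isSupplementaryCodePoint(0x20000)); // true
    }
}
```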

Answer 3

The following code sample helps clarify codePointAt:

    public class CodePointDemo {
        public static void main(String[] args) {
            // '1', a supplementary character (here U+1F600, stored as two chars), then '3'
            String myStr = "1" + "\uD83D\uDE00" + "3";
            System.out.println(myStr.length()); // prints 4, because the emoji takes two chars
            System.out.println(myStr.codePointCount(0, myStr.length())); // prints 3, counting whole code points

            int result = myStr.codePointAt(0);
            System.out.println(Character.toChars(result)); // prints 1

            result = myStr.codePointAt(1);
            System.out.println(Character.toChars(result)); // prints the emoji (console permitting): index 1 is a high surrogate, so the whole pair is read

            result = myStr.codePointAt(2);
            System.out.println(Character.toChars(result)); // index 2 is the low surrogate on its own, which usually displays as "?"

            result = myStr.codePointAt(3);
            System.out.println(Character.toChars(result)); // prints 3
        }
    }

Answer 4

In short: rarely, as long as you are using the default character set in Java :) But for a more detailed explanation, try the following posts:

Comparing a char to a code-point? http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html http://javarevisited.blogspot.com/2012/01/java-string-codepoint-get-unicode.html

Hope this helps clarify things for you :)

What exactly does String.codePointAt do?
