Question

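For me, the appendCodePoint(...) method of java.lang.StringBuilder behaves in an unexpected way.

For Unicode code points above Character.MAX_VALUE (which need 4 bytes to encode in UTF-8, my Eclipse workspace setting), it behaves strangely.

I append a String's Unicode code points one by one to a StringBuilder, but the resulting output looks different. I suspect that the call to Character.toSurrogates(codePoint, value, count) in AbstractStringBuilder#appendCodePoint(...) causes this, but I don't know how to work around it.

My code: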

    // returns random string in range of unicode code points 0x2F800 to 0x2FA1F
    // e.g. 
    String s = getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(length);
    System.out.println(s);

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < getCodePointCount(s); i++) {
        sb.appendCodePoint(s.codePointAt(i));
    }
    // prints some of the CJK characters, but between them there is a '?'

    // e.g. ???????????????
    System.out.println(sb.toString());

    // returns random string in range of unicode code points 0x20000 to 0x2A6DF
    // e.g. 
    s = getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(length);
    // prints the CJK characters correctly
    System.out.println(s);

    sb = new StringBuilder();
    for (int i = 0; i < getCodePointCount(s); i++) {
        sb.appendCodePoint(s.codePointAt(i));
    }

    // prints some of the CJK characters, but between them there is a '?'
    // e.g. ???????????????
    System.out.println(sb.toString());

Using these helpers:

public static int getCodePointCount(String s) {
    return s.codePointCount(0, s.length());
}

public static String getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(int length) {
    return getRandomStringOfMaxLengthInRange(length, 0x20000, 0x2A6DF);
}

public static String getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(int length) {
    return getRandomStringOfMaxLengthInRange(length, 0x2F800, 0x2FA1F);
}

private static String getRandomStringOfMaxLengthInRange(int length, int from, int to) {

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {

        // try to find a valid character MAX_TRIES times
        for (int j = 0; j < MAX_TRIES; j++) {

            int unicodeInt = from + random.nextInt(to - from);

            if (Character.isValidCodePoint(unicodeInt) &&
                    (Character.isLetter(unicodeInt) || Character.isDigit(unicodeInt) ||
                    Character.isWhitespace(unicodeInt))) {
                sb.appendCodePoint(unicodeInt);
                break;
            }

        }

    }

    return  new String(sb.toString().getBytes(), "UTF-8");
}

Answer 1

You are iterating over the code points incorrectly. You should use the strategy proposed by Jonathan Feinberg:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}

Or, since Java 8:

s.codePoints().forEach(/* do something */);

Note the Javadoc of String#codePointAt(int):

Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length() - 1.

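You are iterating from 0 to codePointCount, but the index passed to codePointAt is a char index, not a code point index. If the char at that index is not part of a high-low surrogate pair, it is returned on its own, and your index should advance by only 1. Otherwise the index should advance by 2 (Character#charCount(int) takes care of this), because you got back the code point corresponding to the whole pair.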
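As a minimal sketch of both approaches above, the following copies a string code point by code point into a StringBuilder. The class name CodePointCopy, the two method names, and the test string are made up for illustration; only the JDK calls come from the answer.

```java
class CodePointCopy {

    // Copies s code point by code point, advancing the char index by
    // 1 for BMP characters and by 2 for surrogate pairs.
    public static String copyByOffset(String s) {
        StringBuilder sb = new StringBuilder();
        for (int offset = 0; offset < s.length(); ) {
            int codePoint = s.codePointAt(offset);
            sb.appendCodePoint(codePoint);
            offset += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    // Same copy using the code point stream available since Java 8.
    public static String copyByStream(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "a", U+2F8EB, "b", U+20000 (hypothetical sample input)
        String s = "a" + new String(Character.toChars(0x2F8EB))
                 + "b" + new String(Character.toChars(0x20000));
        System.out.println(s.length());                      // 6 chars
        System.out.println(s.codePointCount(0, s.length())); // 4 code points
        System.out.println(copyByOffset(s).equals(s));       // true
        System.out.println(copyByStream(s).equals(s));       // true
    }
}
```

Both variants round-trip the input because the index (or the stream) moves over whole code points rather than over individual chars.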

Answer 2

Change your loop from this:

for (int i = 0; i < getCodePointCount(s); i++) {

to this:

for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {

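In Java, a char is a single UTF-16 value. A supplementary code point occupies two chars in a String.

But your loop advances through the String one char index at a time. This means you read each supplementary code point twice: the first time, you read both of its UTF-16 surrogate chars; the second time, you read and append only the low surrogate char.

Consider a string that contains just one code point, 0x2f8eb. The Java String representing that code point actually contains: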

"\ud87e\udceb"
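Those two chars follow from the standard UTF-16 surrogate arithmetic. The sketch below recomputes them by hand and compares the result with Character.toChars; the class name SurrogateMath and the method toSurrogatePair are hypothetical names used only for this illustration.

```java
class SurrogateMath {

    // Splits a supplementary code point into its UTF-16 surrogate pair by hand.
    public static char[] toSurrogatePair(int codePoint) {
        int v = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (v >> 10));     // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x2F8EB);
        System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]); // d87e dceb
        // The JDK's own conversion yields the same two chars.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x2F8EB))); // true
    }
}
```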

If you loop over every char index, your loop effectively does this:

sb.appendCodePoint(0x2f8eb);    // codepoint found at index 0
sb.appendCodePoint(0xdceb);     // codepoint found at index 1
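To see the effect end to end, here is a small sketch that runs the question's loop and a char-index loop bounded by s.length() on the same input. The class name SurrogateDemo, both method names, and the two-code-point test string are assumptions made for this example.

```java
class SurrogateDemo {

    // The question's loop: i counts code points but is used as a char index.
    public static String copyByCodePointCount(String s) {
        StringBuilder sb = new StringBuilder();
        int count = s.codePointCount(0, s.length());
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(s.codePointAt(i));
        }
        return sb.toString();
    }

    // Char-index loop: i jumps one whole code point per iteration.
    public static String copyByCharIndex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            sb.appendCodePoint(s.codePointAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two supplementary code points: U+2F8EB and U+2F8EC.
        String s = "\ud87e\udceb\ud87e\udcec";
        String broken = copyByCodePointCount(s);
        // broken holds U+2F8EB followed by the lone low surrogate U+DCEB
        System.out.println(broken.length());                            // 3
        System.out.println(Integer.toHexString(broken.codePointAt(2))); // dceb
        System.out.println(copyByCharIndex(s).equals(s));               // true
    }
}
```

A lone low surrogate such as U+DCEB is not a valid character on its own, which is why it typically shows up as '?' when printed.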

StringBuilder#appendCodePoint(int) unexpected behavior

2 answers: