Question

So I know about String#codePointAt(int), but it is indexed by char offset, not by code point offset.

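I'm thinking about trying something like this:

Use String#charAt(int) to get the char at an index
Test whether the char is in the high-surrogates range
- If so, use String#codePointAt(int) to get the code point, and increment the index by 2
- If not, use the given char value as the code point, and increment the index by 1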

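But my concerns are:

I'm not sure whether code points that naturally fall in the high-surrogates range are stored as two char values or one
This seems like an awfully expensive way to iterate over characters
Somebody must have come up with something better.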

Answer 1

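Yes, Java uses a UTF-16-esque encoding for the internal representation of Strings, and yes, it encodes characters outside the Basic Multilingual Plane (BMP) using the surrogate scheme.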
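To make the surrogate scheme concrete, here is a small sketch. The class name SurrogateDemo, the helper charsUsed and the sample string are only illustrative and not part of the original answer. It suggests that a character outside the BMP occupies two char values, while a code point that itself lies in the surrogate range is still stored as a single char, which is why always advancing the index by 2 after seeing a high surrogate is risky.

```java
public class SurrogateDemo {
    // Hypothetical helper: how many char values are needed to store one code point.
    public static int charsUsed(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it is stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600));
        System.out.println(s.length());                       // count of char values
        System.out.println(s.codePointCount(0, s.length()));  // count of code points
        System.out.println(Character.isHighSurrogate(s.charAt(1)));
        // A code point inside the surrogate range (here U+D800) fits in one char.
        System.out.println(charsUsed(0xD800));
    }
}
```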

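If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String: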

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}

Answer 2

Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

Or use a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}

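These approaches are probably more expensive than Jonathan Feinberg's solution, but they are quicker to read and write, and the performance difference is usually insignificant.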
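As a concrete illustration of the stream approach, the following sketch formats every code point of a string in U+XXXX notation. The class CodePointStreamDemo and the method describe are made-up names for this example. Only CharSequence#codePoints and the standard stream API are assumed.

```java
import java.util.stream.Collectors;

public class CodePointStreamDemo {
    // Hypothetical helper: renders each code point of s as U+XXXX, space separated.
    public static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (stored as a surrogate pair), then 'b'.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(describe(s)); // prints: U+0061 U+1F600 U+0062
    }
}
```

The surrogate pair comes out as one element of the stream, so no manual index bookkeeping is needed.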

Answer 3

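Thought I'd add a workaround that works with foreach loops (ref), and that you can easily convert to Java 8's new String#codePoints method once you move to Java 8:

You can use it in a foreach loop like this: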

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here is the helper method:

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0;
        public boolean hasNext() {
          return nextIndex < string.length();
        }
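        public Integer next() {
          // Honor the Iterator contract once the string is exhausted.
          if (!hasNext()) {
            throw new java.util.NoSuchElementException();
          }
          int result = string.codePointAt(nextIndex);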
          nextIndex += Character.charCount(result);
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}

Or, if you just want to convert a string into a list of int code points (which may use more RAM than the approach above):

 public static List<Integer> stringToCodePoints(String in) {
    if( in == null)
      throw new NullPointerException("got null");
    List<Integer> out = new ArrayList<Integer>();
    final int length = in.length();
    for (int offset = 0; offset < length; ) {
      final int codepoint = in.codePointAt(offset);
      out.add(codepoint);
      offset += Character.charCount(codepoint);
    }
    return out;
  }

Thankfully, using "codePoints" safely handles the surrogate pairs of UTF-16 (Java's internal string representation).
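For reference, this is roughly what the two helpers might look like after moving to Java 8. It is a sketch that assumes CharSequence#codePoints is available. The class name CodePointHelpers is hypothetical, and the method names simply mirror the ones used above.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CodePointHelpers {
    // Assumed Java 8+ counterpart of the Iterable helper: each call to
    // iterator() opens a fresh stream, so the Iterable can be reused.
    public static Iterable<Integer> codePoints(final String string) {
        return () -> string.codePoints().iterator();
    }

    // Assumed Java 8+ counterpart of stringToCodePoints.
    public static List<Integer> stringToCodePoints(String in) {
        if (in == null) {
            throw new NullPointerException("got null");
        }
        return in.codePoints().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        for (int codePoint : codePoints(s)) {
            System.out.println(Integer.toHexString(codePoint));
        }
        System.out.println(stringToCodePoints(s).size());
    }
}
```

Existing foreach call sites stay unchanged, since the Iterable version keeps the same signature.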

Answer 4

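Iterating over code points has been filed as a feature request at Sun.

See Sun Bug Entry

There is also an example there of how to iterate over the code points of a String.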

How do I iterate over the Unicode code points of a Java String?
