Question

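Another UTF-8 related problem. Chinese characters encoded to UTF-8 in Java sometimes come out 3 bytes long. I don't know why; I thought the code points of all Chinese characters were 2 bytes wide. But when I try to detect this manually, that doesn't seem to be the case either. Is there a way to detect the byte width (non-zero bytes) of a UTF-8 character?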

import java.io.UnsupportedEncodingException;

public class a {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "我是一1"; //expected 7 actually 6
        String s1 = "一1";
        String s2 = "1";

        //String r1 = "\\p{InCJK_Compatibility}";
        //String r1 = "\\p{InCJK_Compatibility_Ideographs}";
        //String r1 = "\\p{Han}"; //unfortunately not supported in java6

        int cnt = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            if( (codepoint & 0xFF) > 0 ) cnt++;
            if( (codepoint & 0xFF00) > 0 ) cnt++;
            if( (codepoint & 0xFF0000) > 0 ) cnt++;
            if( (codepoint & 0xFF000000) > 0 ) cnt++;
            offset += Character.charCount(codepoint);
        }

        System.out.println( cnt );
    }
}

Answer 1

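A UTF-8 character can be one to four bytes long. One way to find the UTF-8 size of a char (or string) is to convert it to a byte array and check the array's length, if that is what you are asking for: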

myString.getBytes(Charset.forName("UTF-8")).length;
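As a small illustration of the one-liner above, here is a sketch that applies it to the three strings from the question. The class name `Utf8ByteLength` and the method `utf8Length` are made up for this example, and it uses `StandardCharsets.UTF_8` (Java 7+) in place of `Charset.forName("UTF-8")`; both refer to the same charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the original answer.
class Utf8ByteLength {

    // Number of bytes the whole string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The question's sample strings, written with escapes so the
        // encoding of the source file itself does not matter.
        String s = "\u6211\u662F\u4E001";   // three Chinese characters plus '1'
        String s1 = "\u4E001";              // one Chinese character plus '1'
        String s2 = "1";

        System.out.println(utf8Length(s));   // 3 + 3 + 3 + 1 = 10
        System.out.println(utf8Length(s1));  // 3 + 1 = 4
        System.out.println(utf8Length(s2));  // 1
    }
}
```

Each of these Chinese characters lies between U+0800 and U+FFFF, which is why UTF-8 needs three bytes for it even though the code point value itself fits in two.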

Answer 2

This should show the length of each character in the string when encoded as UTF-8:

    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        int l = new String(Character.toChars(cp)).getBytes("UTF-8").length;
        System.out.println(l);
        i += Character.charCount(cp);
    }

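To count the number of bytes the code point value x itself occupies (its width up to the highest set bit, which is not the same as its UTF-8 length), we can use this formula: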

int l = (31 - Integer.numberOfLeadingZeros(x)) / 8 + 1;
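To make the difference between the formula above and the UTF-8 encoded length concrete, the following sketch computes both for a few code points. The class and method names are hypothetical, and `utf8Bytes` assumes its argument is a valid, non-surrogate code point; it uses the standard UTF-8 range boundaries rather than anything stated in the answer.

```java
// Hypothetical class name, for illustration only.
class CodePointWidths {

    // The formula from the answer: bytes needed to hold the code point's
    // numeric value, based on the position of its highest set bit.
    static int significantBytes(int x) {
        return (31 - Integer.numberOfLeadingZeros(x)) / 8 + 1;
    }

    // UTF-8 encoded length from the standard range boundaries.
    // Assumes codePoint is a valid, non-surrogate code point.
    static int utf8Bytes(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        int[] samples = { 0x31, 0x4E00, 0x6211, 0x1F600 };
        for (int cp : samples) {
            System.out.printf("U+%04X value bytes=%d utf8 bytes=%d%n",
                    cp, significantBytes(cp), utf8Bytes(cp));
        }
    }
}
```

For U+4E00 the value occupies 2 bytes while the UTF-8 form takes 3. The question's own counter reports only 1 for it, because the low byte of 0x4E00 is zero and is skipped.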

Answer 3

Unicode is a numbering of characters, called code points, in a range (U+0000 to U+10FFFF) that fits in three bytes.

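UTF-16 (UTF-16LE and UTF-16BE) uses two bytes per code unit, but code points above U+FFFF need a surrogate pair (4 bytes). A Java char is a single 16-bit UTF-16 code unit, so one char still cannot represent every Unicode code point.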
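A short sketch of what this means in practice: a code point outside the Basic Multilingual Plane takes two chars in a Java String, so `length()` and the code point count differ. The class and method names here are hypothetical, and U+1F600 is just an arbitrary example of a supplementary code point.

```java
// Hypothetical class name, for illustration only.
class SurrogateDemo {

    // How many Java chars (UTF-16 code units) a code point needs: 1 or 2.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of code points in the string, as opposed to s.length().
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 written as its surrogate pair
        String s = "\uD83D\uDE00";
        System.out.println(s.length());           // 2 chars
        System.out.println(codePoints(s));        // 1 code point
        System.out.println(charsNeeded(0x4E00));  // 1
        System.out.println(charsNeeded(0x1F600)); // 2
    }
}
```

This is why the loops in this thread advance with `Character.charCount(cp)` instead of a plain `i++`.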

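UTF-8 uses one byte for pure ASCII (0..127, 7 bits). For higher code points it splits the bits of the code point across several bytes whose high bits are fixed. In every byte of such a multi-byte sequence the highest bit is 1, so these bytes can never be mistaken for ASCII characters.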
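The bit splitting described above can be sketched by hand. This is an illustrative encoder only, with hypothetical names, and it assumes a valid, non-surrogate code point; real code would normally just call `getBytes`, which the `main` method uses as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class name; a hand-rolled encoder for illustration only.
class ManualUtf8 {

    // Splits the code point's bits across 1 to 4 bytes with fixed prefixes:
    // 0xxxxxxx | 110xxxxx 10xxxxxx | 1110xxxx 10xxxxxx 10xxxxxx | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    // Assumes cp is a valid, non-surrogate code point.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        int cp = 0x4E00;
        byte[] manual = encode(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : manual) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());           // E4 B8 80
        System.out.println(Arrays.equals(manual, library));  // true
    }
}
```

The 16 bits of U+4E00 do not fit into the 11 payload bits of the two-byte form, so the three-byte form is used, which matches what the question observed.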

// requires: import java.nio.charset.StandardCharsets;
int byteCount(int codePoint) {
    int[] codePoints = new int[] { codePoint };
    String s = new String(codePoints, 0, codePoints.length);
    int byteCount = s.getBytes(StandardCharsets.UTF_8).length;
    return byteCount;
}

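This Java code is self-explanatory. The StandardCharsets class (Java 7 and later) contains Charset constants for the encodings that are standard, that is, always available in every Java distribution. Hence no UnsupportedEncodingException needs to be handled.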

Java: detecting code point width in mixed-character strings
