Question

I am trying to get the numeric values of the ASCII characters mentioned in a referenced table.

String str = "™æ‡©Æ";
for(int i = 0; i < str.length() ; i++) {
    char c = str.charAt(i);
    int code = (int) c;
    System.out.println(c + ":" +code);
}

Output:

™:8482
æ:230
‡:8225
©:169
Æ:198

My question is: why are the values for '™' and '‡' not 153 and 135, respectively? And, if possible, how can I get those values?

Answer 1

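Characters with values above 127 are not ASCII characters; it is more accurate to call them Unicode characters. Extended ASCII is not ASCII either. You are better off consulting a Unicode table.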

Also note that Java uses Unicode internally, not ASCII. In fact, it uses UTF-16 most of the time.

You can refer to this and to the List of Unicode characters.
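
The UTF-16 point can be checked directly. Below is a minimal sketch (the class name Utf16Units is made up for illustration) showing that casting a char to int yields its UTF-16 code unit, which is why '™' prints as 8482 (U+2122) rather than a single-byte code page value.

```java
class Utf16Units {
    public static void main(String[] args) {
        char tm = '\u2122';            // the trade mark sign
        System.out.println((int) tm);  // 8482, i.e. 0x2122

        // A character outside the Basic Multilingual Plane needs two chars
        // (a surrogate pair), even though it is a single code point.
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.length());                            // 2 code units
        System.out.println(smiley.codePointCount(0, smiley.length()));  // 1 code point
    }
}
```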

Answer 2

ASCII assigns values to only 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters). The first 128 Unicode code points are identical to ASCII.
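
As a quick check of that overlap, here is a small sketch (AsciiSubset and asciiMatchesUnicode are illustrative names, not from the answer) that decodes every byte from 0 to 127 as US-ASCII and compares the result with the code point of the same value.

```java
import java.nio.charset.StandardCharsets;

class AsciiSubset {
    // Returns true if every byte 0..127, decoded as US-ASCII, yields the
    // single character whose code point equals the byte value.
    static boolean asciiMatchesUnicode() {
        for (int i = 0; i < 128; i++) {
            String s = new String(new byte[] {(byte) i}, StandardCharsets.US_ASCII);
            if (s.length() != 1 || s.charAt(0) != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(asciiMatchesUnicode()); // true
    }
}
```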

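Unicode is a computing industry standard designed to encode the characters used in the world's written languages consistently and uniquely. The Unicode standard expresses characters in hexadecimal.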
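
To connect the decimal values printed in the question with the hexadecimal notation used in Unicode tables, a sketch like the following may help. The helper name toUPlus is hypothetical.

```java
class CodePointHex {
    // Formats a code point in the conventional U+XXXX notation.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        String str = "\u2122\u00E6\u2021\u00A9\u00C6"; // ™æ‡©Æ
        // Prints lines such as "8482 = U+2122" and "230 = U+00E6".
        str.codePoints().forEach(cp ->
            System.out.println(cp + " = " + toUPlus(cp)));
    }
}
```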

Unicode has two common formats: UTF-8, which uses 1-4 bytes per value (so for the first 128 characters UTF-8 is exactly identical to ASCII), and UTF-16, which uses 2 or 4 bytes.
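
The byte counts can be observed by encoding a few sample characters. This is only a sketch: the class and method names are illustrative, and UTF_16BE is chosen so that no byte-order mark inflates the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte-order mark is added to the output.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A", "\u00E6", "\u2122", new String(Character.toChars(0x1F600))
        };
        for (String s : samples) {
            System.out.println(s + ": UTF-8 " + utf8Length(s)
                + " bytes, UTF-16 " + utf16Length(s) + " bytes");
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, and 2, 2, 2 and 4 bytes in UTF-16.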

Answer 3

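Although I have not looked at the Javadocs for a converter, I did create an example that shows why ASCII and Java's Unicode are not readily compatible. Here I convert each Unicode character to a byte array, and then to a string representing that byte array. I suggest not relying on the Java classes, and instead creating an array of ASCII equivalents and referring to that array for output.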

  // Requires: import java.nio.charset.StandardCharsets; import java.util.Arrays;
  public void showChars()
    {
        int end = 8192;
        for (int i = 0; i < end; ++i)
        {
            char c = (char) i;
            // Encode the single character as UTF-8; the bytes print as signed values.
            // Using StandardCharsets avoids the checked UnsupportedEncodingException.
            byte[] data = Character.toString(c).getBytes(StandardCharsets.UTF_8);
            String byteStr = Arrays.toString(data);
            System.out.println(i + " char is " + c + " or " + byteStr);
        }
    }
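
The "array of ASCII equivalents" idea could look roughly like the sketch below. It is a hypothetical, deliberately incomplete lookup table covering only the five characters from the question, and it assumes the wanted codes are the Windows-1252 values (153 and 135 for '™' and '‡'). The names Cp1252Table and codeOf are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class Cp1252Table {
    // Hypothetical, incomplete table: only the characters from the question.
    private static final Map<Character, Integer> CODES = new HashMap<>();
    static {
        CODES.put('\u2122', 153); // ™
        CODES.put('\u00E6', 230); // æ
        CODES.put('\u2021', 135); // ‡
        CODES.put('\u00A9', 169); // ©
        CODES.put('\u00C6', 198); // Æ
    }

    // Returns the table entry, or -1 if the character is not in the table.
    static int codeOf(char c) {
        return CODES.getOrDefault(c, -1);
    }

    public static void main(String[] args) {
        for (char c : "\u2122\u00E6\u2021\u00A9\u00C6".toCharArray()) {
            System.out.println(c + ":" + codeOf(c));
        }
    }
}
```

A hand-written table has to be maintained for every character that matters, which is the trade-off against letting a charset encoder do the mapping.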

Answer 4

To answer the second question that was asked:

final String str = "™æ‡©Æ";

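// Encode the whole string as Windows-1252: one byte per character here.
final byte[] cp1252Bytes = str.getBytes("windows-1252");
for (final byte b: cp1252Bytes) {
    // Java bytes are signed, so mask with 0xFF to get the value in 0..255.
    final int code = b & 0xFF;
    System.out.println(code);
}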

Associating the code with each text element is more work.

final String str = "™æ‡©Æ";

final int length = str.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = str.codePointAt(offset);
    final int codepointLength = Character.charCount(codepoint);
    final String codepointString = str.substring(offset, offset + codepointLength);
    System.out.println(codepointString);
    final byte[] cp1252Bytes = codepointString.getBytes("windows-1252");
    for(final byte code : cp1252Bytes) {
        System.out.println(code  & 0xFF);
    }
    offset += codepointLength;
}

This is a bit simpler with Java 8's String.codePoints() method:

final String str = "™æ‡©Æ";

str.codePoints()
    .mapToObj(i ->  new String(Character.toChars(i)))
    .forEach(c -> { 
        try {
            System.out.println(
                String.format("%s %s", 
                    c, 
                    unsignedBytesToString(c.getBytes("Windows-1252"))));
        } catch (Exception e) {
            e.printStackTrace();
        }
    });
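
The snippet above calls unsignedBytesToString, which is not shown in the answer. One possible shape for it is sketched below. The implementation is a guess at the intent (print each byte as a value in 0..255), and the sketch also assumes the runtime provides the windows-1252 charset, which is not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.util.StringJoiner;

class UnsignedBytes {
    // Hypothetical helper: renders each byte as an unsigned value, e.g. "[153]".
    static String unsignedBytesToString(byte[] bytes) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (byte b : bytes) {
            joiner.add(Integer.toString(b & 0xFF)); // mask to drop the sign
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this runtime.
        Charset cp1252 = Charset.forName("windows-1252");
        "\u2122\u00E6\u2021\u00A9\u00C6".codePoints()
            .mapToObj(cp -> new String(Character.toChars(cp)))
            .forEach(s -> System.out.println(
                s + " " + unsignedBytesToString(s.getBytes(cp1252))));
    }
}
```

Passing a Charset object to getBytes, instead of a charset name, also removes the need for the try/catch around the checked exception.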
