Encodings that differ from ASCII even for letters

Asked: 2016-04-18 08:21:15

Tags: java encoding character-encoding ascii codepages

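Is there any character encoding that is reasonably common on consumer devices (as opposed to mainframes) and maps the characters A-Za-z0-9 differently from ASCII?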

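I am currently thinking about a Java application, so I wonder whether a casual user of some Java software in some country could plausibly end up with a defaultCharset for which "AZaz09".getBytes() returns something different from "AZaz09".getBytes("UTF-8"). I am trying to work out whether I have to address compatibility problems that could arise from differing behavior in this respect.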
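The comparison described above can be sketched for the current machine as follows. This is only a minimal illustration: the class name `DefaultCharsetCheck` and the helper `matchesUtf8` are hypothetical names introduced here, and the result says nothing about other users' machines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original question.
public class DefaultCharsetCheck {
    static final String SAMPLE = "AZaz09";

    // True when the given charset encodes SAMPLE to the same bytes as UTF-8.
    static boolean matchesUtf8(Charset cs) {
        return Arrays.equals(SAMPLE.getBytes(cs),
                             SAMPLE.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() is what the no-argument getBytes() uses.
        Charset def = Charset.defaultCharset();
        System.out.println(def.name() + " matches UTF-8 for \"" + SAMPLE + "\": "
                + matchesUtf8(def));
    }
}
```

Passing an explicit charset to `getBytes`, as `matchesUtf8` does, avoids depending on the platform default in the first place.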

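I know that, historically, EBCDIC is the prime example of an ASCII-incompatible encoding. But is it used on any recent consumer devices, or only on IBM mainframes and vintage computers? Does the legacy of EBCDIC live on in the common encodings of some countries?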

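I also know that UTF-16 is not ASCII-compatible, and that it is fairly common on Windows to have files encoded that way. But as far as I know, that always concerns only file content, not the default application locale. Could a user configure a Windows machine to use UTF-16 as its system code page without breaking at least half of its applications?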
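To make the UTF-16 incompatibility concrete, here is a small sketch that prints the bytes Java produces for a two-letter string. The class name `Utf16Demo` and the `hex` helper are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
public class Utf16Demo {
    // Formats bytes as space-separated, two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "AZ";
        // prints "UTF-8:    41 5A"
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
        // prints "UTF-16LE: 41 00 5A 00"
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // Java's "UTF-16" encoder writes a big-endian byte order mark first:
        // prints "UTF-16:   FE FF 00 41 00 5A"
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```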

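As far as I know, all the pre-Unicode multi-byte encodings used in Asia still map the ASCII range 00-7F to ASCII-compatible characters and digits. Are there any Asian encodings still in use that use more than one byte for all of their code points? Or perhaps on other continents?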
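One rough way to probe the claim about the 00-7F range for a particular charset is to decode those 128 bytes and compare the result with the US-ASCII decoding. This is a sketch under assumptions: `AsciiRangeCheck` and `decodesAsciiRangeLikeAscii` are hypothetical names, and the check is stricter than the question needs, since it also covers control characters and punctuation, not just letters and digits.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
public class AsciiRangeCheck {
    // True when bytes 0x00..0x7F decode to the same string as under US-ASCII.
    static boolean decodesAsciiRangeLikeAscii(Charset cs) {
        byte[] range = new byte[0x80];
        for (int i = 0; i < range.length; i++) {
            range[i] = (byte) i;
        }
        String expected = new String(range, StandardCharsets.US_ASCII);
        // new String(byte[], Charset) substitutes a replacement character for
        // malformed input instead of throwing, so a mismatch just returns false.
        return expected.equals(new String(range, cs));
    }

    public static void main(String[] args) {
        for (Charset cs : Charset.availableCharsets().values()) {
            if (!decodesAsciiRangeLikeAscii(cs)) {
                System.out.println(cs.name() + " decodes 00-7F differently from ASCII");
            }
        }
    }
}
```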

1 answer:

Answer 0: (score: 3)

Here is a simple program to find out. It is up to you to decide whether the charsets that fail are common enough.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingTest {
    public static void main(String[] args) {
        String s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        // Reference bytes: for these characters UTF-8 is identical to ASCII.
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        // Try every charset this JVM knows about.
        for (Charset cs : Charset.availableCharsets().values()) {
            try {
                byte[] b2 = s.getBytes(cs);
                if (!Arrays.equals(b, b2)) {
                    System.out.println(cs.displayName() + " doesn't give the same result");
                }
            }
            catch (Exception e) {
                // Some charsets cannot encode at all (for example, decode-only ones).
                System.out.println(cs.displayName() + " throws an exception");
            }
        }
    }
}

The results on my machine are:

IBM-Thai doesn't give the same result
IBM01140 doesn't give the same result
IBM01141 doesn't give the same result
IBM01142 doesn't give the same result
IBM01143 doesn't give the same result
IBM01144 doesn't give the same result
IBM01145 doesn't give the same result
IBM01146 doesn't give the same result
IBM01147 doesn't give the same result
IBM01148 doesn't give the same result
IBM01149 doesn't give the same result
IBM037 doesn't give the same result
IBM1026 doesn't give the same result
IBM1047 doesn't give the same result
IBM273 doesn't give the same result
IBM277 doesn't give the same result
IBM278 doesn't give the same result
IBM280 doesn't give the same result
IBM284 doesn't give the same result
IBM285 doesn't give the same result
IBM290 doesn't give the same result
IBM297 doesn't give the same result
IBM420 doesn't give the same result
IBM424 doesn't give the same result
IBM500 doesn't give the same result
IBM870 doesn't give the same result
IBM871 doesn't give the same result
IBM918 doesn't give the same result
ISO-2022-CN throws an exception
JIS_X0212-1990 doesn't give the same result
UTF-16 doesn't give the same result
UTF-16BE doesn't give the same result
UTF-16LE doesn't give the same result
UTF-32 doesn't give the same result
UTF-32BE doesn't give the same result
UTF-32LE doesn't give the same result
x-IBM1025 doesn't give the same result
x-IBM1097 doesn't give the same result
x-IBM1112 doesn't give the same result
x-IBM1122 doesn't give the same result
x-IBM1123 doesn't give the same result
x-IBM1364 doesn't give the same result
x-IBM300 doesn't give the same result
x-IBM833 doesn't give the same result
x-IBM834 doesn't give the same result
x-IBM875 doesn't give the same result
x-IBM930 doesn't give the same result
x-IBM933 doesn't give the same result
x-IBM935 doesn't give the same result
x-IBM937 doesn't give the same result
x-IBM939 doesn't give the same result
x-JIS0208 doesn't give the same result
x-JISAutoDetect throws an exception
x-MacDingbat doesn't give the same result
x-MacSymbol doesn't give the same result
x-UTF-16LE-BOM doesn't give the same result
X-UTF-32BE-BOM doesn't give the same result
X-UTF-32LE-BOM doesn't give the same result
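
Many of the IBM entries in this list are EBCDIC code pages. To see how one of them differs, the following sketch encodes three characters with IBM037. The class name `EbcdicPeek` and the helper `encodeIfSupported` are hypothetical names, and the code assumes nothing about whether IBM037 ships with a given JRE, so it checks first.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original answer.
public class EbcdicPeek {
    // Encodes text with the named charset, or returns null if this JRE lacks it.
    static byte[] encodeIfSupported(String charsetName, String text) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] bytes = encodeIfSupported("IBM037", "Aa0");
        if (bytes == null) {
            System.out.println("IBM037 is not available in this JRE");
            return;
        }
        // In EBCDIC code page 037, 'A' is C1, 'a' is 81 and '0' is F0,
        // versus 41, 61 and 30 in ASCII.
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```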