Question

根据此代码：

# coding=utf-8
import sys
import chardet

print(sys.getdefaultencoding())

a = 'abc'

print(type(a))
print(chardet.detect(a))

b = a.decode('ascii')

print(type(b))


c = '中文'

print(type(c))
print(chardet.detect(c))


m = b.encode('utf-8')
print(type(m))
print(chardet.detect(m))

n = u'abc'

print(type(n))

x = n.encode(encoding='utf-8')

print(type(x))
print(chardet.detect(x))

我使用 utf-8 对n进行编码，但结果仍显示结果为ascii。

所以我想知道， utf-8 ， ascii 和 unicode

我用python2运行。

===================结果=========================== ======

=======================最终结果====================== =======

Answer 1

UTF-8实际上是一种可变宽度编码，而ASCII字符将直接映射到UTF-8中。

由于您的UTF-8字符串仅包含 ASCII字符，因此字符串是ASCII和UTF-8字符串。

这种视觉效果可能会有所帮助：

>>> c = '中文abc中文'
>>>
>>>
>>> c
'中文abc中文'
>>> c.encode(encoding="UTF-8")
b'\xe4\xb8\xad\xe6\x96\x87abc\xe4\xb8\xad\xe6\x96\x87'

注意UTF-8字符串中的“abc”是如何只是单字节的？它们仍然与ascii对应的字节相同！

Answer 2

UTF-8 encoding是字符0-127（unicode代码点U + 0000到U + 007F）是相应的ascii字符并以相同的方式编码。 charset.detect因此自然地反驳了一个只包含ascii编码的字符的字符串，因为它实际上是......

python 3 is there only for retrocompatibility, and is the same as normal string notation中的u'...'表示法。因此u'abc'与'abc'相同。

Answer 3

这是因为Unicode和UTF-8的设计师都非常出色并且设法实现了令人印象深刻的向后兼容性。

它始于Latin-1字符集，它定义了256个字符，其中前128个字符直接取自ASCII。每个字符都适合一个字节。

Unicode构建了一个扩展字符集，它首先声明前256个代码点是Latin-1中的字符。这意味着前128个代码点保留了它们在ASCII中的相同数值。

然后是UTF-8，它使用了可变比特长度编码。通过设置每个字节的高位来表示占用多个字节的字符。这意味着高位清零的字节都是单字节字符。由于ASCII也有高位清除，这意味着这些字符的编码在ASCII和UTF-8之间相同！

为什么在utf-8中编码仍会产生ascii？

3 个答案: