How do I find out the charset of a character?

Time: 2016-11-04 12:10:40

Tags: character-encoding

A Python script fails when it tries to encode a string that is supposed to be iso-8859-1:

>>> 'à'.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0300' in position 1: ordinal not in range(256)

How can I find out which character this is? When I encode it in utf-8, I get:

>>> 'à'.encode('utf-8')
b'a\xcc\x80'

That is a followed by \xcc\x80. I can look up \xcc\x80 in the UTF-8 table at http://www.utf8-chartable.de/unicode-utf8-table.pl?start=768&names=-&utf8=string-literal

But is it utf-8? And if it is utf-8, why can't this string be encoded in iso-8859-1?
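
The table lookup described above can also be done from Python itself. The following is a minimal sketch that assumes only the standard unicodedata module; the helper name describe is hypothetical and is not part of the original post:

```python
import unicodedata

def describe(text):
    """Return (code point, UTF-8 bytes, Unicode name) for each character in text."""
    return [
        ('U+%04X' % ord(ch), ch.encode('utf-8'), unicodedata.name(ch, '???'))
        for ch in text
    ]

# 'a\u0300' is the two-code-point string that triggers the traceback above
for row in describe('a\u0300'):
    print(row)
```

The second row shows that \xcc\x80 is just the UTF-8 encoding of the code point U+0300, which lies above 255 and therefore has no iso-8859-1 byte.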

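1 answer:

Answer 0: (score: 0)

It is not quite clear where the character in your post comes from. It is in fact a combining character sequence (the letter a followed by U+0300 COMBINING GRAVE ACCENT), so you need to normalize it before encoding. The following Python script, which uses the unicodedata module, is self-explanatory and solves your problem: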

import sys, platform
print (sys.stdout.encoding, platform.python_version())
print ()

import unicodedata
agraveChar = '\xe0'      # precomposed U+00E0, typed as Alt+0224 (Windows, US keyboard)
agraveDeco = 'a\u0300'   # decomposed 'a' + U+0300, as copied from your post

# print the Unicode name of every code point in each string
print('agraveChar', agraveChar, agraveChar.encode('utf-8'))
for ch in agraveChar:
    print(ch, unicodedata.name(ch, '???'))

print('agraveDeco', agraveDeco, agraveDeco.encode('utf-8'))
for ch in agraveDeco:
    print(ch, unicodedata.name(ch, '???'))


print ('decomposition(agraveChar)', unicodedata.decomposition(agraveChar))
print ('\nagraveDeco normalized:\n')
print ("NFC  to utf-8", unicodedata.normalize("NFC" , agraveDeco).encode('utf-8'))
print ("NFC  to latin", unicodedata.normalize("NFC" , agraveDeco).encode('iso-8859-1'))
print ("NFKC to utf-8", unicodedata.normalize("NFKC", agraveDeco).encode('utf-8'))
print ("NFKC to latin", unicodedata.normalize("NFKC", agraveDeco).encode('iso-8859-1'))

Output:

==> D:\test\Python\40422359.py
UTF-8 3.5.1

agraveChar à b'\xc3\xa0'

à LATIN SMALL LETTER A WITH GRAVE

agraveDeco à b'a\xcc\x80'

a LATIN SMALL LETTER A
̀ COMBINING GRAVE ACCENT

decomposition(agraveChar) 0061 0300

agraveDeco normalized:

NFC  to utf-8 b'\xc3\xa0'
NFC  to latin b'\xe0'
NFKC to utf-8 b'\xc3\xa0'
NFKC to latin b'\xe0'

==>
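
The normalization step shown above can be wrapped in a small helper. This is only a sketch: to_latin1 is a hypothetical name, and it assumes that NFC composition is acceptable for your data.

```python
import unicodedata

def to_latin1(text):
    """Compose combining sequences (NFC), then encode as iso-8859-1.

    Still raises UnicodeEncodeError for characters that have no
    iso-8859-1 equivalent even after composition.
    """
    return unicodedata.normalize('NFC', text).encode('iso-8859-1')

print(to_latin1('a\u0300'))   # decomposed input, as in the question
print(to_latin1('\xe0'))      # input that is already precomposed
```

Both calls should produce the same single byte, since NFC leaves an already composed string unchanged.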