How to convert between Cl\u00e9s and Cle\u0301s for Clés in Python 2.7.10
Answer 0 (score: 2)
The unicodedata.normalize function converts a Unicode string to its fully composed (NFC) or fully decomposed (NFD) form.
>>> import unicodedata as ud
>>> d = u'Cle\u0301s'
>>> c = u'Cl\u00e9s'
>>> ud.normalize('NFC',c) # no change, already composed form
u'Cl\xe9s' # Note: repr shows the shortest escape available (\xe9, not \u00e9).
>>> ud.normalize('NFC',d) # changes to composed form
u'Cl\xe9s'
>>> ud.normalize('NFD',c) # changes to decomposed form
u'Cle\u0301s'
>>> ud.normalize('NFD',d) # no change, already decomposed form
u'Cle\u0301s'
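As a minimal sketch of how this can be used for comparisons, both strings can be normalized to the same form before testing equality. The helper name same_text is hypothetical and not part of unicodedata; the code is written so that it should run on both Python 2.7 and Python 3.

```python
import unicodedata


def same_text(a, b, form='NFC'):
    """Return True if two Unicode strings are equal after normalization.

    same_text is a hypothetical helper name; `form` is passed straight
    to unicodedata.normalize ('NFC' or 'NFD' for this use case).
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = u'Cl\u00e9s'      # e-acute as a single code point
decomposed = u'Cle\u0301s'   # plain e followed by a combining acute accent

print(composed == decomposed)            # False: different code points
print(same_text(composed, decomposed))   # True: equal once normalized
```

Which form is used does not matter for the comparison, as long as both sides get the same one.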
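If you are starting with byte strings that contain those escape sequences as literal text, the following converts them to Unicode strings first: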
>>> db = 'Cle\u0301s'
>>> cb = 'Cl\u00e9s'
>>> d = db.decode('unicode_escape')
>>> c = cb.decode('unicode_escape')
>>> d
u'Cle\u0301s'
>>> c
u'Cl\xe9s'
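The decode and normalize steps can be combined, roughly as sketched here. The helper name decode_escaped is hypothetical, and the sketch assumes the input is a byte string holding literal \uXXXX escapes (written with an explicit b prefix and doubled backslash so the same literal works on Python 2.7 and Python 3).

```python
import unicodedata


def decode_escaped(raw, form='NFC'):
    """Decode a byte string holding literal \\uXXXX escapes, then normalize.

    decode_escaped is a hypothetical helper name. `raw` must be bytes
    (a plain str on Python 2), e.g. b'Cle\\u0301s'.
    """
    text = raw.decode('unicode_escape')
    return unicodedata.normalize(form, text)


db = b'Cle\\u0301s'   # backslash, 'u', four hex digits: 10 bytes
cb = b'Cl\\u00e9s'    # 9 bytes

print(decode_escaped(db) == decode_escaped(cb))   # True
print(len(decode_escaped(db, 'NFD')))             # 5
```

Note that unicode_escape treats any non-escape bytes as Latin-1, so this sketch is only appropriate when the non-ASCII characters really arrive as escapes.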
Answer 1 (score: 0)
Thanks a million to @MarkRansom for debugging this with me; I now have what I wanted!
print uni
>> Clés
print v1.lower()
>> cl\u00e9s
print v2.lower()
>> cle\u0301s
print len(unicodedata.normalize('NFD', v1.lower().decode('UTF-8')))
>> 9
print len(unicodedata.normalize('NFC', v2.lower().decode('UTF-8')))
>> 10
print len(v1.lower().decode("unicode_escape"))
>> 4
print len(v2.lower().decode("unicode_escape"))
>> 5
print len(unicodedata.normalize('NFD', v1.lower().decode("unicode_escape")))
>> 5
print len(unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
>> 4
print len(v1.lower().decode("unicode_escape"))
>> 4
print (v1.lower().decode("unicode_escape") == unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
>> True
Obviously lower() and upper() are not a good idea for most people, but they work for me because I expect more or less the same words from two different processes.
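One possible way to package the comparison above is sketched here, assuming v1 and v2 are byte strings holding literal \uXXXX escapes (as the lengths 9 and 10 suggest). The helper name comparison_key is hypothetical. Unlike the transcript, it decodes before lowercasing, so lower() operates on the real characters rather than on the escape text; that ordering is a design choice of this sketch, not something the original requires.

```python
import unicodedata


def comparison_key(raw):
    """Build a comparison key from an escaped byte string.

    comparison_key is a hypothetical helper name. It decodes the
    \\uXXXX escapes first, then lowercases, then normalizes to NFC.
    """
    text = raw.decode('unicode_escape')
    return unicodedata.normalize('NFC', text.lower())


v1 = b'Cl\\u00e9s'    # composed form, mixed case
v2 = b'CLE\\u0301S'   # decomposed form, upper case

print(comparison_key(v1) == comparison_key(v2))   # True
```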