Comparing special characters in Python

Time: 2018-12-19 11:13:59

Tags: python python-3.x python-2.7 character-encoding

I have a string whose value is 'Opérations'. My script reads a file and does some comparisons. When comparing strings, the string that I copied from the source and pasted into my Python script does not equal the same string that I get when reading the file in my script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8, I notice the difference.

  • b'Ope\xcc\x81rations'
  • b'Op\xc3\xa9rations'

My question is: what do I do to ensure that the special character in my Python script is the same as the one in the file content when comparing such strings?

1 answer:

Answer 0 (score: 0)

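Good to know:

You are dealing with two types of strings: byte strings and unicode strings. Each has a method that converts it into the other type. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. This means: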

unicode.encode() ----> bytes

bytes.decode() ----> unicode
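
As a minimal sketch of that round trip, assuming Python 3 (where str holds code points and bytes holds raw bytes), the variable names below are only illustrative:

```python
# Python 3: str holds Unicode code points, bytes holds raw bytes.
text = 'Op\u00e9rations'            # precomposed 'é' (U+00E9)

data = text.encode('utf-8')         # str -> bytes
assert data == b'Op\xc3\xa9rations'

restored = data.decode('utf-8')     # bytes -> str
assert restored == text
```

Encoding and decoding with the same codec is lossless, so the round trip returns the original string unchanged. It does not, however, change which code points the string is made of.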

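UTF-8 is by far the most popular encoding for storing and transmitting Unicode. It uses a variable number of bytes for each code point. The higher the code point value, the more bytes it needs in UTF-8.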
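
To illustrate the variable width, here is a small sketch (Python 3 assumed, sample characters chosen arbitrarily) that counts the UTF-8 bytes needed for a few code points:

```python
# A few code points of increasing value: 'e', 'é', combining acute accent,
# the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ['e', '\u00e9', '\u0301', '\u20ac', '\U0001F600']

# Map each code point (as U+XXXX) to the number of bytes UTF-8 uses for it.
byte_counts = {'U+%04X' % ord(ch): len(ch.encode('utf-8')) for ch in samples}

for code_point, count in byte_counts.items():
    print(code_point, count)
```

Note that both 'é' (U+00E9) and the combining accent (U+0301) take two bytes each, which is why the decomposed spelling 'e' + U+0301 ends up one byte longer than the precomposed one.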

To the point:

If you redefine the strings as both byte strings and unicode strings, as follows:

a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\u0301rations'  # 'e' followed by a combining acute accent (U+0301)

b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xe9rations'  # precomposed 'é' (U+00E9)

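You will see:

# Python 3 syntax; on Python 2, first add: from __future__ import print_function
print('a_byte length is: ', len(a_byte.decode("utf-8")))
#print('a_unicode length is: ', len(a_unicode.encode("utf-8")))

print('b_byte length is: ', len(b_byte.decode("utf-8")))
#print('b_unicode length is: ', len(b_unicode.encode("utf-8")))

Output:

a_byte length is:  11
b_byte length is:  10

So you can see that they are not the same: the first string decodes to 11 code points ('e' followed by a separate combining accent), while the second decodes to 10 (a single precomposed 'é').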
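
To see exactly where the two strings diverge, one option is to list the code points by name. This is only a sketch, assuming Python 3; describe is a hypothetical helper, not something from the question:

```python
import unicodedata

a_text = b'Ope\xcc\x81rations'.decode('utf-8')
b_text = b'Op\xc3\xa9rations'.decode('utf-8')

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch)) for ch in text]

# The strings agree on 'O' and 'p' and differ from index 2 onward.
print(describe(a_text)[2:4])
print(describe(b_text)[2:3])
```

The first slice shows LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, while the second shows the single code point LATIN SMALL LETTER E WITH ACUTE.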

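My solution:

If you don't want to get confused, you can use repr(). When decoded and printed, both strings display as Opérations, but:

print(repr(a_byte), repr(b_byte))

will print (on Python 3; Python 2 omits the b prefix):

b'Ope\xcc\x81rations' b'Op\xc3\xa9rations'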

You can also normalize the Unicode before comparing, as in @Daniel's answer, like this:

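from unicodedata import normalize
from functools import partial

# Bytes as read from the file: 'e' followed by a combining acute accent (U+0301)
a_byte = b'Ope\xcc\x81rations'
# NFC composes 'e' + U+0301 into the single code point U+00E9
norm = partial(normalize, 'NFC')
your_string = norm(a_byte.decode('utf8'))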
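
Putting it together, the comparison itself could look like the sketch below. It assumes Python 3 and that both sides are already decoded to str; same_text is a hypothetical helper name, and the two sample values stand in for the literal in the script and the text read from the file:

```python
import unicodedata

def same_text(left, right, form='NFC'):
    """Compare two str values after normalizing both to the same form."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)

from_script = 'Op\u00e9rations'                       # precomposed 'é'
from_file = b'Ope\xcc\x81rations'.decode('utf-8')     # 'e' + combining accent

assert from_script != from_file          # a plain comparison fails
assert same_text(from_script, from_file)  # a normalized comparison succeeds
```

Normalizing both operands, rather than only the one read from the file, avoids depending on which form the editor happened to save into the script.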