Python intersection of a UTF-8 list and a UTF-8 string

Time: 2015-03-29 14:41:13

Tags: python utf-8 ascii intersection

This code works when I use a list of ASCII letters and an ASCII string, but I can't get it to work with the UTF-8 characters below.

# -*- coding: utf-8 -*-
asa = ["ā","ē","ī","ō","ū","ǖ","Ā","Ē","Ī","Ō","Ū","Ǖ",
"á","é","í","ó","ú","ǘ","Á","É","Í","Ó","Ú","Ǘ",
"ǎ","ě","ǐ","ǒ","ǔ","ǚ","Ǎ","Ě","Ǐ","Ǒ","Ǔ","Ǚ",
"à","è","ì","ò","ù","ǜ","À","È","Ì","Ò","Ù","Ǜ"]
[x.decode('utf-8') for x in asa]
print list(set(asa) & set("ō"))

2 answers:

Answer 0 (score: 2)

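You need to put your character in a list. A string is itself an iterable, and in Python 2 the byte string "ō" is two bytes long ('\xc5\x8d' in UTF-8), so set("ō") splits it into two single-byte elements instead of treating it as one character: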

>>> list("ō")
['\xc5', '\x8d']
>>> print list(set(asa) & set(["ō"]))
['\xc5\x8d']
>>> print list(set(asa) & set(["ō"]))[0]
ō
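
The same point can be checked in a version-independent way. The following is a minimal sketch, not the answerer's code: the names `encoded_chars`, `target`, `pieces`, and `common` are hypothetical, only a few of the question's characters are used, and `bytearray` is used so that the individual bytes show up as integers on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical names for illustration; a short subset of the question's list.
encoded_chars = [c.encode("utf-8") for c in (u"ā", u"ē", u"ō", u"ū")]
target = u"ō".encode("utf-8")  # UTF-8 encoding of "ō": the two bytes 0xC5 0x8D

# Iterating over the encoded string yields its individual bytes, not the character.
# bytearray makes the pieces integers on both Python 2 and Python 3.
pieces = list(bytearray(target))

# Wrapping the encoded string in a list keeps it as a single set element.
common = set(encoded_chars) & set([target])

print(len(target), pieces, len(common))
```

Here `set([target])` has exactly one element, so the intersection finds the match, whereas iterating over `target` directly only ever produces its separate bytes.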

Answer 1 (score: 1)

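Your first set is meant to contain elements of the form "ō".decode('utf-8') (type unicode), which is equivalent to u"ō". Note that this only holds if the result of the list comprehension is assigned back to asa; as written, the decoded list is discarded.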

The second set contains byte strings such as "ō" (type str), so the elements do not compare equal and the intersection is empty.

Consider:

>>> 'a' == u'a'
True
>>> 'ō' == u'ō'
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
>>> list('ō')
['\xc5', '\x8d']
>>> list(u'ō')
[u'\u014d']
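
One way to act on this is to make both sides of the intersection the same text type. The sketch below is an illustration under assumptions, not code from the question: `raw`, `decoded`, `target`, `common`, and `mismatch` are hypothetical names, and the byte strings are built with `.encode("utf-8")` so that the example behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical stand-in for the question's list: UTF-8 encoded byte strings.
raw = [c.encode("utf-8") for c in (u"ā", u"ō", u"ǖ")]

# Decode every element and keep the result; a bare list comprehension
# whose value is not assigned changes nothing.
decoded = [b.decode("utf-8") for b in raw]

target = u"ō"  # text (unicode) string, one code point

# Text compared with text: the character is found.
common = set(decoded) & set([target])

# Bytes compared with text: nothing matches.
mismatch = set(raw) & set([target])

print(len(common), len(mismatch))
```

The design choice is simply to decode once at the boundary and use unicode strings everywhere afterwards, so that set membership compares like with like.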