In Python 3

Time: 2016-03-29 12:06:04

Tags: python python-3.x

I am getting an error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk

I am parsing data, and some emoji end up in the array. I need to turn data = "this variable contains some emoji'sツ" into data = "this variable contains some emoji's".

How can I remove these characters from the data, or otherwise handle this situation in Python 3?

2 answers:

Answer 0: (score: 3)

If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do exactly that:

data = "this variable contains some emoji'sツ"
data = ''.join(c for c in data if c <= '\uFFFF')
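To see the filter act on a character that really is outside the BMP, here is a small sketch. The helper name strip_non_bmp and the sample string are made up for illustration. Note that ツ (U+30C4) is itself inside the BMP, so this filter leaves it in place; only code points above U+FFFF, such as U+1F600, are dropped.

```python
def strip_non_bmp(text):
    """Drop every code point above U+FFFF (hypothetical helper name)."""
    return ''.join(c for c in text if c <= '\uFFFF')

# Illustrative sample: one non-BMP emoji (U+1F600) and one BMP character (U+30C4).
sample = "emoji's \U0001F600 and \u30c4"
cleaned = strip_non_bmp(sample)

# ascii() shows the remaining code points without relying on the terminal encoding.
print(ascii(cleaned))
```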

Your string may be in decomposed form, so you may need to normalize it to composed form first, so that the non-BMP characters are recognizable:

import unicodedata

data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
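If deleting characters loses too much, one alternative is to substitute a placeholder so that Tk still has something to display. This is only a sketch under assumptions: the function name replace_non_bmp is hypothetical, and U+FFFD (the Unicode replacement character) is just one possible stand-in.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(text, placeholder='\uFFFD'):
    """Swap each non-BMP code point for a placeholder (hypothetical helper)."""
    return _NON_BMP.sub(placeholder, text)

result = replace_non_bmp("ok \U0001F600!")
print(ascii(result))
```

Passing placeholder='' gives the same result as the join-based filter, so the choice between dropping and replacing can be made at the call site.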

Answer 1: (score: -1)

>>> import string
>>> printable = set(string.printable)
>>> ''.join(filter(lambda x: x in printable, data))
"this variable contains some emoji's"

For the BMP, read: removing emojis from a string in Python
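One thing to keep in mind with the string.printable filter is that it is much stricter than a BMP filter: it keeps only printable ASCII, so accented letters and other BMP text are removed along with the emoji. A minimal sketch of that behaviour, with a made-up helper name and sample string:

```python
import string

printable = set(string.printable)

def keep_printable_ascii(text):
    """Keep only characters found in string.printable (hypothetical helper)."""
    return ''.join(ch for ch in text if ch in printable)

# U+00E9 is inside the BMP, but it is not ASCII, so it is dropped as well.
ascii_only = keep_printable_ascii("caf\u00e9 \U0001F600")
print(repr(ascii_only))
```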