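I'm working with the Twitter API and fetching tweets that contain a specific word (right now it is 'flafel'). Everything works fine except for this tweet:
b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'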
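I use
print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))
to view the tweets, but it raises UnicodeEncodeError every time. If I remove decode() from that line, as in
print ("Tweet info: {}".format(str(tweet.text).encode('utf-8')))
I can see the actual tweet shown above, but I want to convert the \xf0\x9f\x98\x82 part to a str. I have tried every combination of decode and encode I could think of. How can I solve this?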
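Edit: I just visited that user's Twitter account to see what the non-ASCII part is, and it turns out to be a smiley.
Is it possible to convert that smiley?
Edit2: the code is: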
...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q = "flafel",
                           result_type = "recent",
                           include_entities = True,
                           lang = "en").items():
    print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))
Answer 0 (score: 1)
This problem can occur when you try to use the Unicode character \U0001f602 on Windows. Python 3 is perfectly able to convert it from UTF-8 to full Unicode, but Windows is unable to display it.
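To see why the encode/decode pair in the question makes no difference, here is a minimal sketch. The literal `text` is only a stand-in for `tweet.text` (an assumption; the real value comes from tweepy). It shows that a round trip through UTF-8 gives back an equal str, so the exception has to come from print writing to the display, not from the conversion.

```python
# Stand-in for tweet.text, which is already a str in Python 3 (assumed value).
text = 'hindi yan masarap."\U0001f602'

# Encoding to UTF-8 and decoding again returns an equal str: a no-op.
round_tripped = text.encode('utf-8').decode('utf-8')
same = (round_tripped == text)

# The emoji is a single code point that UTF-8 stores as four bytes.
raw = text.encode('utf-8')
emoji_bytes = raw[-4:]
code_point = hex(ord(text[-1]))

# repr() of bytes and these strings is pure ASCII, so printing is safe anywhere.
print(same, emoji_bytes, code_point)
```

The four bytes are the `\xf0\x9f\x98\x82` seen in the question, and the code point is the `\U0001f602` discussed here.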
I tried this code in different ways on a Windows 7 box:
>>> b = b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'
>>> u = b.decode('utf8')
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
>>> print(u)
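Here is what happened:
In a Tk-based shell (IDLE): UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 139-139: Non-BMP character not supported in Tk
In a console with a non-Unicode code page: UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 139: character maps to <undefined>
(For the attentive reader, BMP means Basic Multilingual Plane.)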
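For the 'charmap' case, one possible workaround is to let the codec escape whatever the output encoding cannot represent instead of raising. This is a sketch only: `printable` is a hypothetical helper name, and `cp1252` is just an assumed example of a legacy Windows code page.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    rewritten as backslash escapes (hypothetical helper)."""
    # Fall back to ASCII if stdout reports no encoding (e.g. when redirected).
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # 'backslashreplace' is a standard codec error handler for encoding.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

u = 'hindi yan masarap."\U0001f602'
print(printable(u, 'cp1252'))  # the emoji is printed as an escape sequence
```

Calling `printable(u)` without the second argument uses whatever encoding `sys.stdout` reports, so text the console can already display passes through unchanged.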
In a console using the UTF-8 code page (chcp 65001) I got no error, but the display was strange:
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitr
ess, a Pinay, tells me not to get it cos "hindi yan masarap."😂'
>>> print(u)
And when I'm thinking about getting the chili sauce on my flafel and the waitres
s, a Pinay, tells me not to get it cos "hindi yan masarap."😂
>>>
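My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like the Windows Tk version does not support this character, and neither does any console code page (except 65001, which just tries to display the individual UTF-8 bytes!).
TL;DR: the problem lies neither in core Python processing nor in the UTF-8 converter, but only in the system conversion used to display the character '\U0001f602'.
Fortunately, since core Python has no problem with it, you can easily replace the offending '\U0001f602' with, for example, ':D', using nothing more than str.replace (after the code above):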
>>> print (u.replace(U'\U0001f602', ':D'))
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":D
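If more than one emoji needs an ASCII stand-in, the same idea extends to a lookup table. The sketch below is illustrative only: the `EMOTICONS` table and its choice of emoticons are assumptions, not part of the original answer. It relies on `str.translate`, which accepts a dict keyed by code point.

```python
# Hypothetical mapping from emoji code points to ASCII emoticons.
EMOTICONS = {
    0x1F602: ':D',    # FACE WITH TEARS OF JOY
    0x1F600: ':)',    # GRINNING FACE
    0x1F622: ":'(",   # CRYING FACE
}

def to_emoticons(text):
    # str.translate maps each code point through the dict; values may be
    # strings of any length, and unmapped characters are left as they are.
    return text.translate(EMOTICONS)

print(to_emoticons('hindi yan masarap."\U0001f602'))
```

Emoji that are not in the table are left untouched, so they would still fail to display on a console that cannot encode them.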
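If you want special handling for all characters outside the BMP, it is enough to know that the highest code point in the BMP is 0xFFFF. So you could use code like this: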
import io

def convert(t):
    with io.StringIO() as fd:
        for c in t:  # replace all chars outside the BMP with a !
            fd.write(c if ord(c) < 0x10000 else '!')
        return fd.getvalue()
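The same filtering can also be written as a single regular expression. This is an alternative sketch, not the answer's own method; `strip_non_bmp` is a hypothetical name, and it assumes Python 3.3 or later, where a str is indexed by full code points.

```python
import re

# Matches any single code point above U+FFFF, i.e. outside the BMP.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def strip_non_bmp(text, replacement='!'):
    # Replace every non-BMP character with the given replacement string.
    return NON_BMP.sub(replacement, text)

print(strip_non_bmp('hindi yan masarap."\U0001f602'))
```

Passing `replacement=''` drops the characters instead of marking them.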
Answer 1 (score: 1)
As I mentioned in the comments, you can use the standard unicodedata module to get the names of Unicode code points. Here is a short demo:
import unicodedata as ud

test = ('And when I\'m thinking about getting the chili sauce on my flafel and the '
        'waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001F602')

def convert_special(c):
    if c > '\uffff':
        c = ':{}:'.format(ud.name(c).lower().replace(' ', '_'))
    return c

def convert_string(s):
    return ''.join([convert_special(c) for c in s])

for s in (test, 'Some special symbols \U0001F30C, ©, ®, ™, \U0001F40D, \u2323'):
    print('{}\n{}\n'.format(s.encode('unicode-escape'), convert_string(s)))
Output
b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:
b'Some special symbols \\U0001f30c, \\xa9, \\xae, \\u2122, \\U0001f40d, \\u2323'
Some special symbols :milky_way:, ©, ®, ™, :snake:, ⌣
Another option is to test whether a character belongs to the Unicode "Symbol_Other" category. We can do that by replacing the test
if c > '\uffff':
in convert_special with
if ud.category(c) == 'So':
When we do that, we get this output:
b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:
b'Some special symbols \\U0001f30c, \\xa9, \\xae, \\u2122, \\U0001f40d, \\u2323'
Some special symbols :milky_way:, :copyright_sign:, :registered_sign:, :trade_mark_sign:, :snake:, :smile:
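One caveat with the `c > '\uffff'` test is that `ud.name(c)` raises ValueError for code points that have no name, such as private-use characters. The sketch below is a hypothetical merge of the two tests with a fallback; the `symbols_too` parameter and the escape-style fallback are assumptions added for illustration.

```python
import unicodedata as ud

def convert_special(c, symbols_too=False):
    """Replace non-BMP characters (and optionally 'So' symbols) by :name:."""
    special = c > '\uffff' or (symbols_too and ud.category(c) == 'So')
    if not special:
        return c
    # The second argument is returned instead of raising ValueError
    # when the code point has no name.
    name = ud.name(c, None)
    if name is None:
        # Fall back to a printable escape such as \U000f0000.
        return '\\U{:08x}'.format(ord(c))
    return ':{}:'.format(name.lower().replace(' ', '_'))

def convert_string(s, symbols_too=False):
    return ''.join(convert_special(c, symbols_too) for c in s)

print(convert_string('hindi yan masarap."\U0001f602'))
```

With `symbols_too=True` the function also rewrites BMP symbols such as the copyright sign, matching the second output shown above.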