Question

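I am parsing tweets and need to encode the text, because without encoding I get an exception. But when I use 'utf-8', it not only adds a b prefix to the console output, it also stops me from accessing parts of the string. What can I do to fix this, or which other encoding should I try?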

Here is an example of what happens.

>>> a="newyear"
>>> b=a.encode("utf-8")
>>> a
'newyear'
>>> b
b'newyear'
>>> a[0]
'n'
>>> b[0]
110

My parser code is below:

tweets=soup.findAll("p", {"class":"TweetTextSize"})  

n=0
for tweet in tweets:
    n+=1
    print(n)
    a=tweet.text 
    b=a.encode("utf-8")   
    print(b)   # works fine, but returns a bytestring with an extra b prefix,
    # and I can't get b[0] as a character
    print(b.decode("utf-8")) # doesn't work:
    # UnicodeEncodeError: 'charmap' codec can't encode character '\u2026'

    #uncommented try section works, but it replaces "bad" tweets with ops, 
    #which I'd rather avoid
    # try:
        # print(tweet.text)
    # except:
        # print("OPS")

So I could handle the exception, but I would like to know whether there is another way.

I am using Python 3.

Answer 1

You are confused about when to encode and when to decode.

If you have a byte string, you decode it to turn it into unicode:

a = b"a string"  # a bytes literal; in Python 3 only bytes has .decode()
b = a.decode('utf8')
# b is now a str (Unicode text)

If you have unicode, you encode it to turn it into an encoded byte string:

a=u"\u00b0C"
b = a.encode('utf8')
# b is now encoded as a byte string
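
As a rough illustration of both directions in Python 3 (the variable names are made up for this example): indexing a bytes object yields an integer, which is why b[0] printed 110 in the question, while slicing it, or decoding it first, gives something you can treat as text.

```python
text = "newyear\u2026"             # str: a sequence of Unicode code points
data = text.encode("utf-8")        # bytes: the UTF-8 encoding of that text

first_byte = data[0]               # indexing bytes gives an int (the byte value of "n")
first_slice = data[0:1]            # slicing bytes gives bytes
round_trip = data.decode("utf-8")  # decoding gives back a str
first_char = round_trip[0]         # indexing str gives a one-character str

print(first_byte, first_slice, first_char)
```

So if you need characters or string slices, keep working with the str and only encode at the point where bytes are actually required.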

I suspect you are getting a byte string from Twitter, so you probably need:

b = a.decode('utf8')
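
The UnicodeEncodeError in the question names the 'charmap' codec, which suggests the failure may happen when print encodes the str for the console rather than when the tweet is decoded. A minimal sketch under that assumption: `to_printable` is a hypothetical helper, not part of the original code, that replaces characters the target encoding cannot represent while keeping the result a str. The "ascii" argument is only there to make the replacement visible.

```python
import sys

def to_printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Assumption: fall back to the console's encoding, or UTF-8 if it is unknown.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

tweet_text = "Happy new year\u2026"       # stand-in for tweet.text
safe = to_printable(tweet_text, "ascii")  # the ellipsis is not ASCII, so it is replaced
print(safe)
print(safe[0])                            # still a str, so indexing gives a character
```

On Python 3.7 and later, calling sys.stdout.reconfigure(encoding="utf-8") before printing is another option, provided the terminal can actually display UTF-8.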

