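How do I decode a string in 'utf-8'?

Date: 2018-12-30 01:00:22

Tags: python python-3.x unicode encoding utf-8

I am using tweepy to capture some tweets in Portuguese and saving them to a csv file. All of the tweet text we saved contains escaped special characters, and now I cannot convert it back to the correct form.

My code for capturing the tweets is: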

csvFile = open('ua.csv', 'a')
csvWriter = csv.writer(csvFile)
for tweet in tweepy.Cursor(api.user_timeline,id=usuario,count=10,
                           lang="en",
                           since="2018-12-01").items():
    csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])

I am reading the results like this:

test = pd.read_csv('ua.csv', header=None)
test.columns = ["date", "text"]
result = test['text'][0]
print(result)
'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'

The result I need is:

print(result)
'Aproveita essa promoção aqui!'

I tried the following code to convert it:

print(result.decode('utf-8'))

and got the following error message:

AttributeError: 'str' object has no attribute 'decode'

Where am I going wrong?

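3 Answers:

Answer 0: (score: 1)

The problem is that you are creating a bytes object when you call .encode on the tweet text, and you don't need to do that.

The csv.writer object coerces whatever you pass to it into a string, so a bytes object is written out as its repr, b'...'.

Observe: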

In [1]: import csv

In [2]: s = 'Aproveita essa promoção aqui!'

In [3]: print(s)
Aproveita essa promoção aqui!

In [4]: print(s.encode())
b'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'

In [5]: with open('test.txt', 'a') as f:
   ...:     writer = csv.writer(f)
   ...:     writer.writerow([1, 3.4, 'Aproveita essa promoção aqui!'.encode()])
   ...:

In [6]: !cat test.txt
1,3.4,b'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'

So just use:

csvWriter.writerow([tweet.created_at, tweet.text])
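
If you already have a file full of rows written the old way, re-capturing the tweets may not be an option. A minimal sketch of one way to recover the text, assuming each damaged cell holds the repr of a bytes object exactly as in the test.txt output above (the helper name recover_text is hypothetical, not part of csv or pandas):

```python
import ast


def recover_text(cell: str) -> str:
    """Turn the text "b'...'" (the repr of a bytes object) back into a str.

    Assumption: the cell was produced by writing a UTF-8 encoded bytes
    object through csv.writer. Anything else is returned unchanged.
    """
    if not cell.startswith(("b'", 'b"')):
        return cell
    try:
        # literal_eval only parses Python literals, so it is safe on file data
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return cell


damaged = r"b'Aproveita essa promo\xc3\xa7\xc3\xa3o aqui!'"
print(recover_text(damaged))
```

With the DataFrame from the question, something like test['text'].apply(recover_text) should apply the same repair to the whole column, provided the cells really have that b'...' shape.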

Answer 1: (score: 0)

pandas read_csv has an encoding parameter:

Encoding to use for UTF when reading/writing (e.g. 'utf-8').
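
A small sketch of passing that parameter, assuming pandas is installed; the file name sample.csv and the row contents are made up for illustration. Note that encoding only controls how the bytes in the file are decoded, so it would not by itself undo text that was saved as a b'...' repr.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, written as real UTF-8 text
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write('2018-12-01,Aproveita essa promoção aqui!\n')

    # Tell pandas explicitly which encoding to decode the file with
    df = pd.read_csv(path, encoding='utf-8', header=None,
                     names=['date', 'text'])

text = df['text'][0]
print(text)
```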

Answer 2: (score: 0)

Open the file with the encoding you want to use. Don't encode the text manually (Zen of Python: explicit is better than implicit):

# newline='' per csv documentation
# encoding='utf-8-sig' if you plan on using Excel to read the csv, else 'utf8' is fine.
with open('ua.csv','a',encoding='utf-8-sig',newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    for tweet in tweepy.Cursor(api.user_timeline,id=usuario,count=10,
                               lang="en",
                               since="2018-12-01").items():
        csvWriter.writerow([tweet.created_at, tweet.text])

Here is a working example:

import csv
import pandas as pd

with open('ua.csv','w',encoding='utf-8-sig',newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['timestamp','Aproveita essa promoção aqui!'])

test = pd.read_csv('ua.csv', encoding='utf-8-sig', header=None)
print(test)

Output:

           0                              1
0  timestamp  Aproveita essa promoção aqui!
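
To see what 'utf-8-sig' changes compared with plain 'utf-8', here is a rough sketch using only the standard library. The file name bom_demo.csv and the row contents are placeholders. It writes a file with 'utf-8-sig', checks for the byte order mark (BOM) at the start, and then reads the file back with each codec.

```python
import codecs
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bom_demo.csv')  # placeholder file name

    # 'utf-8-sig' writes a BOM at the start of the file, then plain UTF-8
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f).writerow(['promoção', 'aqui'])

    with open(path, 'rb') as f:
        has_bom = f.read().startswith(codecs.BOM_UTF8)

    # Decoding as plain 'utf-8' keeps the BOM as U+FEFF in the first field
    with open(path, encoding='utf-8', newline='') as f:
        row_plain = next(csv.reader(f))

    # Decoding as 'utf-8-sig' strips the BOM again
    with open(path, encoding='utf-8-sig', newline='') as f:
        row_sig = next(csv.reader(f))

print(has_bom)
print(row_plain[0] == '\ufeffpromoção')
print(row_sig[0] == 'promoção')
```

So if you write with 'utf-8-sig', read with 'utf-8-sig' as well, as the working example above does; otherwise the first cell picks up an invisible extra character.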