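I have tried all sorts of ways to read the (sample) tweets in this file. The Unicode character Victory Hand (U+270C) just doesn't want to parse. Here is a sample of the data.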
399491624029274112,Kyle aka K-LO,I unlocked 2 Xbox Live achievements in WWE 2K14! http://t.co/wRIxZTjYWg,False,0,Raptr,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,0
399491626584014848,Dots Group LLC,GeekWire Radio: Amazon vs. author Xbox One first take and favorite iPad apps - GeekWire http://t.co/jbbryoHpHe,False,0,IFTTT,,,,,2013,11,10,11,0,0,0,0,1,0,0,0,0,2
399491630149169152,BETTINGGENIUS!,RT @xJohn69: Sergio Ramos giveaway!; XBOX + PS3; ; -RT; -Follow me and @NeillWagers; -S/Os appreciated; ; Goodluck http://t.co/D997faGSB5,False,0,Twitter for iPad,,,,,2013,11,10,11,0,1,1,0,1,0,0,0,0,2
399491635735953408,Princess of TV,Toy Story of Terror is amaze balls. Thanks Xbox for the free NowTV #disneyweekend,False,0,Twitter for iPhone,,,,,2013,11,10,11,0,2,0,0,1,0,0,0,0,2
399491654136369152,Sam Hambre,'9 Things You Should Know Before Buying a PlayStation 4' http://t.co/Q3Ma1R83cF,False,0,Buffer,,,,,2013,11,10,11,0,7,0,1,0,0,0,0,0,0
399491655780167680,Rhi ✌,@Escape2theMoon that's done what? im not on rn obvs i dont even have access to an xbox :c ?,False,0,web,399490703761223680,Escape2theMoon,1404625770,,2013,11,10,11,0,7,0,0,1,0,0,0,0,0
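You can see the victory hand in the second field of the last tweet.
What I want to do is build one long string from all of the tweets. It sounds simple, yet I can't even get this script to work: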
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import csv
current_file = codecs.open("C:/myfile.csv", encoding="utf-8")
data = csv.reader(current_file, delimiter=",")
tweets = ""
for record in data:
    tweets = tweets + " " + record[2].encode('utf-8', errors='replace')
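I have tried many permutations of imports, encoding, concatenation, converting to unicode, and so on, but I can't get past the victory hand. The error I get most often is: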
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-114-fd9b136abd74> in <module>()
----> 1 for record in data:
      2     tweets = tweets + ' ' + record[2].encode('utf-8', 'replace')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u270c' in position 23: ordinal not in range(128)
What am I doing wrong? How can I join all of these tweets into one string without Unicode problems?
Answer 0 (score: 2)
The problem is that csv.reader tries to convert the unicode input back to ASCII. Note this from the csv docs:
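This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.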
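To see where "position 23" in the traceback comes from, here is a small illustrative sketch. It assumes that an explicit `encode("ascii")` call is a fair stand-in for the implicit conversion that happens inside csv.reader; the variable names are made up for this example. The victory hand sits at index 23 of the last sample row, which is exactly where ASCII encoding gives up, while a UTF-8 round trip keeps the character intact.

```python
# Start of the last sample row; U+270C (victory hand) follows "Rhi ".
line = u"399491655780167680,Rhi \u270c,@Escape2theMoon"

try:
    # Explicit stand-in (an assumption) for the implicit
    # unicode -> byte string conversion done for csv.reader.
    line.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# Encoding as UTF-8 and decoding again preserves the character,
# which is the idea behind the recipe from the docs.
utf8_bytes = line.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```

After this runs, `failure.start` holds the index of the offending character and `round_tripped` equals `line`.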
I suggest you use this recipe from the docs examples:
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
With the unicode_csv_reader helper, your code could look like this (slightly modified to use a closure and a join instead of a loop):
import codecs
from operator import itemgetter

tweets_fname = "C:/myfile.csv"
with codecs.open(tweets_fname, encoding="utf-8") as current_file:
    data = unicode_csv_reader(current_file, delimiter=",")
    # join the third field (the tweet text) of every row
    tweets = u' '.join(map(itemgetter(2), data))
encoded_tweets = tweets.encode('utf8', 'replace')
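If moving to Python 3 is an option (an assumption; everything above targets Python 2), the csv module reads text directly and the recipe should not be needed. Below is a minimal sketch of that route. The `join_tweets` helper and the abbreviated sample rows are made up for illustration and are not part of the original data or of any library.

```python
import csv
import os
import tempfile


def join_tweets(path, column=2):
    """Join one column of every CSV row into a single space-separated string."""
    # newline="" is what the csv docs recommend for files given to csv.reader.
    with open(path, encoding="utf-8", newline="") as csv_file:
        reader = csv.reader(csv_file, delimiter=",")
        return " ".join(row[column] for row in reader)


# Abbreviated, hypothetical rows shaped like the sample data.
sample = (
    "399491654136369152,Sam Hambre,PlayStation 4 link,False\n"
    "399491655780167680,Rhi \u270c,no access to an xbox :c,False\n"
)

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".csv", delete=False
) as tmp:
    tmp.write(sample)

try:
    tweets = join_tweets(tmp.name)           # third field: tweet text
    names = join_tweets(tmp.name, column=1)  # second field: user name
finally:
    os.remove(tmp.name)
```

Here `tweets` is an ordinary `str`, so it only needs encoding when it is written out as bytes.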