import csv
import io

def openFile(fileName):
    try:
        # io.open with an encoding yields unicode lines, not byte strings
        trainFile = io.open(fileName, "r", encoding="utf-8")
    except IOError as e:
        print ("File could not be opened: {}".format(e))
    else:
        trainData = csv.DictReader(trainFile)
        print trainData
        return trainData

def computeTFIDF(trainData):
    bodyList = []
    print "Inside computeTFIDF"
    for row in trainData:
        for key, value in row.iteritems():
            print key, unicode(value, "utf-8", "ignore")
    print "Done"
    return

if __name__ == "__main__":
    print "Main"
    trainData = openFile("../Data/TrainSample.csv")
    print "File Opened"
    computeTFIDF(trainData)
Error:
Traceback (most recent call last):
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 62, in <module>
    computeTFIDF(trainData)
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 42, in computeTFIDF
    for row in trainData:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 215: ordinal not in range(128)
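TrainSample.csv: a CSV file with 4 columns (including a header row)
Operating system: Windows 7 64-bit
Using Python 2.x
I don't know what is going wrong here. I asked for encoding errors to be ignored, but the same error is still thrown.
I think the error is raised before control ever reaches the decoding call.
Can anyone tell me where I am going wrong?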
Answer 0 (score: 4)
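The Python 2 csv module does not handle Unicode input.
Open the file in binary mode and decode after parsing it as CSV. This is safe for the UTF-8 codec, because newlines, delimiters and quotes all encode to a single byte.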
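To illustrate why parsing the raw bytes first is safe, here is a small sketch (the function name and sample strings are made up for illustration, not taken from the question). It checks that the UTF-8 bytes of a non-ASCII character never fall in the ASCII range, so a comma, quote or newline byte can only come from a real comma, quote or newline. It then shows that splitting the encoded line on a comma byte gives the same fields as splitting the decoded text.

```python
# -*- coding: utf-8 -*-
# Sketch only: names and sample data are hypothetical.

def multibyte_chars_avoid_ascii_bytes(text):
    """Return True if no multi-byte UTF-8 character in text contains a byte below 0x80."""
    for ch in text:
        raw = bytearray(ch.encode("utf-8"))
        if len(raw) > 1 and any(b < 0x80 for b in raw):
            return False
    return True

# A line containing curly quotes (the u'\u201c' from the traceback) and an accent.
line = u"id,\u201cquoted title\u201d,caf\u00e9"
raw_line = line.encode("utf-8")

# Split the bytes on the delimiter first, decode each field afterwards.
fields_from_bytes = [part.decode("utf-8") for part in raw_line.split(b",")]

# Decode first, split afterwards: the result is the same.
fields_from_text = line.split(u",")
```

This only demonstrates the property for UTF-8. Encodings such as UTF-16 do not have it, so the same shortcut would not be safe there.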
The csv module documentation includes a UnicodeReader wrapper class in its examples section that does the decoding for you; it is easily adapted to the DictReader class:
import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        # Decode each value with the configured encoding after csv has parsed the bytes
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self
Use this with a file opened in binary mode:
def openFile(fileName):
    try:
        trainFile = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)
Answer 1 (score: 0)
I can't comment on Martijn's answer. With a small upgrade the solution worked very well for me, so I am leaving it here for others:
@attr.s
class A(object):
    a_dict = attr.ib(factory=Dict, type=Dict[str, A], validator=optional(instance_of(Dict)))
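One thing is that Python 2.6 and earlier do not support dict comprehensions.
Another is that the dictionary can hold values of other types that the unicode function cannot handle, so it is worth catching TypeError for null or numeric values.
Another thing that drove me crazy: it does not work when you open the file with an encoding! A plain type=Dict is enough.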
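The first two points above can be sketched roughly as follows. The helper names decode_value and decode_row are hypothetical, and this version swaps the "catch TypeError" idea for an explicit isinstance check, which I find easier to read. It builds the dict from a generator expression instead of a dict comprehension, so it does not need Python 2.7 syntax, and it leaves None, numbers and other non-string values untouched.

```python
# Sketch only: decode_value and decode_row are hypothetical helper names.

def decode_value(value, encoding="utf-8"):
    """Decode byte strings; return None, numbers and other values unchanged."""
    # On Python 2.6+ `bytes` is an alias for `str`; on older versions use `str` here.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def decode_row(row, encoding="utf-8"):
    """Return a copy of row with every byte-string value decoded."""
    # dict() over a generator expression avoids dict comprehension syntax.
    return dict((key, decode_value(value, encoding)) for key, value in row.items())

# A row like the ones DictReader yields from a binary file, with a missing field (None).
sample_row = {"Id": b"1", "Title": b"\xe2\x80\x9chi\xe2\x80\x9d", "Tags": None, "Score": 3}
decoded_row = decode_row(sample_row)
```

In the UnicodeDictReader above, the next method could then return decode_row(row, self.encoding) instead of the dict comprehension.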