"Invalid start byte" Unicode error when opening a CSV file

Time: 2014-12-09 02:57:40

Tags: python csv unicode

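Please help. I have been struggling with this for a while, hitting one problem after another. All I want is a loop that opens every CSV file in a folder. Here is the loop: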

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'

for file in os.listdir (folder):
    with codecs.open(file, mode='rU', encoding='utf-8') as f:
        m=min(int(line[1]) for line in csv.reader(f))
        f.seek(0)
        for line in csv.reader(f):
            if int(line[1])==m:
                print line

Here is the error:

Traceback (most recent call last):
  File "findfirsttrigram.py", line 11, in <module>
    m=min(int(line[1]) for line in csv.reader(f))
  File "findfirsttrigram.py", line 11, in <genexpr>
    m=min(int(line[1]) for line in csv.reader(f))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x87 in position 0: invalid start byte

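I got here because I first had a "Null Byte" error, which I fixed with this post: "Line contains NULL byte" in CSV reader (Python)

Then I got an integer error, which I fixed with the post "an integer is required" when open()'ing a file as utf-8?

Then I got an error that said 'UnicodeException: UTF-16 stream does not start with BOM', which I fixed with this post: utf-16 file seeking in python. how?

Then I realized that the csv module needs UTF-8, so here I am.

But I have finally run out of existing questions to lean on. I can't figure out what is going on. Please help.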

3 answers:

Answer 0: (score: 1)

I'm not sure why, but this finally worked:

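import os
import unicodecsv

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams3/'

for file in os.listdir(folder):
    # os.listdir returns bare names, so join each one with the folder path
    with open(os.path.join(folder, file), mode='rU') as f:
        try:
            # errors='replace' substitutes U+FFFD for bytes that are not valid UTF-8
            m = min(int(line[1]) for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'))
        except Exception:
            # skip files whose second column cannot be read as integers
            print "one no work"
            continue
        f.seek(0)
        # reuse the same decoding options for the second pass
        for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'):
            if int(line[1]) == m:
                print line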
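As for why it works, two things changed compared with the loop in the question: the file name is joined with the folder path, and undecodable bytes are replaced instead of raising. Here is a minimal Python 3 sketch of the same idea using only the standard library. The helper names rows_with_min and scan_folder and the '.csv' suffix filter are my own assumptions for illustration, not part of the code above.

```python
import csv
import os


def rows_with_min(path, column=1, encoding='utf-8'):
    """Return the rows of a CSV file whose `column` holds the smallest integer."""
    # errors='replace' turns each undecodable byte into U+FFFD instead of raising.
    # newline='' is what the csv module documentation asks for on Python 3.
    with open(path, encoding=encoding, errors='replace', newline='') as f:
        # Skip blank or short rows that have no value in `column`.
        rows = [row for row in csv.reader(f) if len(row) > column]
    if not rows:
        return []
    smallest = min(int(row[column]) for row in rows)
    return [row for row in rows if int(row[column]) == smallest]


def scan_folder(folder):
    """Yield (file name, matching rows) for every .csv file in `folder`."""
    for name in sorted(os.listdir(folder)):
        # Assumption: only files ending in .csv are of interest.
        if not name.endswith('.csv'):
            continue
        # os.listdir returns bare names, so join them with the folder first.
        yield name, rows_with_min(os.path.join(folder, name))
```

Keep in mind that errors='replace' hides the problem rather than fixing it: if the files are really in some other encoding, the replaced characters are lost. A non-numeric value in the chosen column will still raise ValueError.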

Answer 1: (score: 0)

Maybe try os.walk and loop over the files it returns?

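import codecs
import os

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for subdir, dirs, files in os.walk(folder):
    for file in files:
        # os.walk yields bare file names, so join each one with its directory
        with codecs.open(os.path.join(subdir, file), mode='rU', encoding='utf-16-be') as f:
            pass  # your code here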

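Answer 2: (score: 0)

Clearly, your file is not encoded in UTF-8. Try a different encoding. If you are on Windows, 'mbcs' will use the default encoding of your Windows installation.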
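For context, byte 0x87 has the bit pattern 10000111, which UTF-8 reserves for continuation bytes, so it can never begin a character. That is exactly what "invalid start byte" means. One way to try other encodings is to test a short list of candidates against the raw bytes. This is only a rough Python 3 sketch: the function name first_working_encoding and the default candidate list are assumptions for illustration, and you should put whatever encodings are plausible for your data in the list.

```python
def first_working_encoding(path, candidates=('utf-8', 'cp1252', 'mac_roman')):
    """Return the first encoding in `candidates` that decodes the whole file, else None."""
    with open(path, 'rb') as f:
        raw = f.read()
    for name in candidates:
        try:
            # Strict decoding: raises on any byte sequence the codec rejects.
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A successful decode does not prove the guess is right. Single-byte codecs such as mac_roman accept every possible byte, so they belong at the end of the list, and it is worth printing a few decoded lines to check that the text looks sensible.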