Error reading CSV file: 'utf-8' codec can't decode byte

Time: 2016-05-02 07:55:00

Tags: python csv merge

I get an error while reading the CSV files when I run code that merges two CSV files (essentially an inner join). My code:

import csv
import pandas as pd
s1 = pd.read_csv(".../noun.csv")
s2 = pd.read_csv(".../verb.csv")
merged = s1.merge(s2, on=("userID", "sentID"), how="inner")
merged.to_excel(".../merge1.xlsx", index=False)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte

A sample of my data:

verb file

userID  sentID  verb
['3477'  1     ['am', 'were', 'having', 'attended', 'stopped']
['3477'  2     ['felt', 'thrusting']

noun file
userID  sentID  Sentences
['3477'   1    Thursday,
['3477'   1    November

1 answer:

Answer 0 (score: 0)

You can use a library that tries to detect the encoding, such as cchardet:

pip install cchardet

If you are on Python 2.x, you also need the backported csv library. It supports Unicode natively, whereas Python 2's built-in csv module does not:

pip install backports.csv

Then, in your code, you can do the following:

import cchardet
import io
from backports import csv

# detect encoding
with io.open(filename, mode="rb") as f:
    data = f.read()
detect = cchardet.detect(data)
encoding_ = detect['encoding']
# retrieve data
with io.open(filename, encoding=encoding_) as csvfile:
    reader = csv.reader(csvfile, ...)
...

I'm not familiar with pandas, but you should be able to do something like this:

# retrieve data
s1 = pd.read_csv(".../noun.csv", encoding=encoding_)
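
If installing a detection library is not an option, a rough alternative is to try a short list of likely encodings in order. The sketch below is only an illustration under assumptions: guess_encoding is a hypothetical helper, not part of pandas or cchardet, and the candidate list is a guess based on the fact that byte 0x92 is the right single quotation mark in Windows-1252, a common encoding for CSV files exported on Windows.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Hypothetical helper: this is a heuristic, not real detection.
    latin-1 accepts every byte, so it only acts as a last resort.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample bytes containing 0x92, the byte named in the error message
sample = b"Bob\x92s notes"
encoding_ = guess_encoding(sample)
print(encoding_)  # cp1252
```

In practice you would read the file's bytes with open(path, "rb") and pass the returned name to pd.read_csv through its encoding argument, as in the answer above. Check a few rows afterwards, since a wrong guess can decode without error and still produce the wrong characters.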