I get an error while reading the CSV files when I run code that merges (essentially an inner join) two CSV files. My code:
import csv
import pandas as pd
s1 = pd.read_csv(".../noun.csv")
s2 = pd.read_csv(".../verb.csv")
merged = s1.merge(s2, on=("userID", "sentID"), how="inner")
merged.to_excel(".../merge1.xlsx", index=False)
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte
A sample of my data:
verb file
userID sentID verb
['3477' 1 ['am', 'were', 'having', 'attended', 'stopped']
['3477' 2 ['felt', 'thrusting']
noun file
userID sentID Sentences
['3477' 1 Thursday,
['3477' 1 November
Answer 0 (score: 0)
You can use a library that tries to detect the encoding, such as cchardet:
pip install cchardet
If you are on Python 2.x, you also need the backported csv library. It supports Unicode natively, whereas Python 2's built-in csv module does not:
pip install backports.csv
Then, in your code, you can do the following:
import cchardet
import io
from backports import csv
# detect encoding
with io.open(filename, mode="rb") as f:
    data = f.read()
detect = cchardet.detect(data)
encoding_ = detect['encoding']
# retrieve data
with io.open(filename, encoding=encoding_) as csvfile:
    reader = csv.reader(csvfile, ...)
    ...
I am not familiar with pandas, but you could do something like this:
# retrieve data
s1= pd.read_csv(".../noun.csv", encoding=encoding_)
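If installing a detection library is not an option, a rough fallback is to try a short list of candidate encodings and keep the first one that decodes the raw bytes. The sketch below assumes the file is either UTF-8 or Windows-1252; that second candidate is only a guess, based on byte 0x92 being the right single quotation mark in Windows-1252. The helper name `pick_encoding`, the candidate list, and the sample bytes are all hypothetical and not part of the original code.

```python
def pick_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `raw` without error.

    `candidates` is an assumed list, ordered from strictest to most lenient.
    Returns None when no candidate can decode the bytes.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return name
    return None


# Hypothetical sample with byte 0x92 at position 5, as in the error message.
sample = b"Today\x92s"
encoding_ = pick_encoding(sample)
print(encoding_)
```

With a real file, read it in binary mode first (as in the detection step above) and pass the bytes to `pick_encoding`; the result can then be given to `pd.read_csv(".../noun.csv", encoding=encoding_)`. A successful decode only proves the bytes are valid in that encoding, not that it is the right one, so it is worth checking a few decoded rows by eye.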