Pandas error when reading a CSV containing double quotes

Time: 2021-07-22 19:02:30

Tags: python pandas csv

I have read all the related threads (for example this, this, and this) but could not find a working solution.

I have an input CSV file like this:

ItemId,Content                                                      
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I have tried several approaches but could not get any of them to work. I want to read this CSV file into a DataFrame like this:

ItemId    Content
--------  -------------------------------------------------------------------------------
i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I am using the following code (Python 3.9):

df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')

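As far as I can tell, the commas inside the dictionary column, including the ones inside quotes, are treated as regular delimiters, so the following error is raised: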

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
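
The mismatch in that message can be reproduced without pandas. The following is a minimal sketch that assumes the standard library `csv` module splits these lines the same way the pandas tokenizer does; the `sample` string simply repeats the input shown above, and the variable names are illustrative only.

```python
import csv
import io

# The same three lines as the input file shown above.
sample = (
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

# The Content value starts with "{" rather than a quote character, so the
# quotes inside it do not protect the commas and every comma ends a field.
rows = list(csv.reader(io.StringIO(sample)))
field_counts = [len(row) for row in rows]
print(field_counts)
```

The third line splits into more fields than the second because its title contains two extra commas, which is the inconsistency the parser complains about.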

Is it possible to produce the expected result? Thanks.

2 Answers:

Answer 0: (score: 2)

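The problem is that the commas in the Content column are interpreted as delimiters. You can work around this by using pd.read_fwf and manually setting the character positions at which to split. Note that this relies on every ItemId being exactly 8 characters long: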

df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])  

Result:

     ItemId                                                                          Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

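Answer 1: (score: 0)

I don't think you can read this file with pandas directly, because the delimiter appears multiple times inside a single value. However, if you read it with plain Python and do a little processing, you should be able to turn it into a pandas DataFrame: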

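import pandas as pd

def splitValues(x):
    # Split on the first comma only; everything after it is the Content value.
    index = x.find(',')
    return x[:index], x[index+1:].strip()

# Read the header line first, then split each remaining line into two fields.
with open('file.csv') as data:
    columns = next(data).strip().split(',')
    df = pd.DataFrame(columns=columns, data=[splitValues(row) for row in data])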

Output:

     ItemId                                                                          Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
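
As a variation on the approach above, here is a minimal sketch that splits each line on the first comma with `str.split(',', 1)` and then, as an extra step not asked for in the question, parses the Content strings with `json.loads`. It assumes every Content value is valid JSON and uses an in-memory `io.StringIO` in place of the real file; the names `raw`, `rows`, and `parsed` are illustrative only.

```python
import io
import json

import pandas as pd

# In-memory stand-in for the CSV file from the question.
raw = io.StringIO(
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

# The first line holds the column names.
header = next(raw).strip().split(',')

# maxsplit=1 keeps every comma after the first one inside the Content value.
rows = [line.strip().split(',', 1) for line in raw if line.strip()]
df = pd.DataFrame(rows, columns=header)

# Optional: turn each Content string into a dict (assumes valid JSON).
parsed = df['Content'].apply(json.loads)
print(df)
print(parsed[1]['Title'])
```

Keeping Content as a string in `df` matches the layout requested in the question; `parsed` is only useful if the individual keys are needed later.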