I have read all the related topics, such as this, this, and this, but could not find a workable solution.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I have tried several different approaches but could not get any of them to work. I want to read this csv file into a DataFrame like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
using the following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
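As far as I can tell, the commas inside the dictionary column and the commas inside the quoted strings are treated as regular separators, so the following error is raised: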
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce the expected result? Thanks.
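The field counts in the error above can be reproduced without pandas. The following is a minimal sketch using the standard-library `csv` module with the same delimiter and quote character. It only illustrates how a comma-delimited tokenizer sees the sample rows; it is not a trace of what the pandas C parser does internally.

```python
import csv
import io

# Sample rows copied from the question.
sample = (
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

# Same delimiter and quote character as in the read_csv call.
reader = csv.reader(io.StringIO(sample), delimiter=',', quotechar='"')

# Count how many fields each line is split into.
field_counts = [len(row) for row in reader]
print(field_counts)  # [2, 4, 6]
```

The header yields 2 fields, the first data row 4, and the second data row 6, which matches the "Expected 4 fields in line 3, saw 6" message. The quotes do not protect the commas because the `Content` value starts with `{` rather than with a quote character.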
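Answer 0 (score: 2)
The problem is that the commas in the Content column are interpreted as separators. You can work around this by using pd.read_fwf and manually setting the character positions at which to split: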
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
| | ItemId | Content |
|---|---|---|
| 0 | i0000008 | {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"} |
| 1 | i0000010 | {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"} |
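To see what the `colspecs` values in this answer select, here is a small sketch that applies the same half-open `[start, end)` ranges to one of the sample lines with plain string slicing. It illustrates the character positions only and is not how `read_fwf` is implemented.

```python
# Data line copied from the question.
line = 'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}'

# colspecs from the answer; each pair is a half-open [start, end) range.
colspecs = [(0, 8), (9, 100)]

# Slice the line at the same positions.
item_id, content = (line[start:end] for start, end in colspecs)
print(item_id)
print(content)

# Index 8 is the separating comma, which neither range covers.
print(repr(line[8]))
```

This works for the sample because every `ItemId` is exactly 8 characters long and every line is shorter than 100 characters. If the ids vary in length, or a `Content` value runs past column 100, the fixed positions would cut the data in the wrong place, so the ranges would need to be adjusted for the real file.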
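Answer 1 (score: 0)
I don't think you can read this file with pandas directly, because the delimiter appears several times inside a single value. However, if you read it with plain Python and do a little processing, you should be able to convert it into a pandas DataFrame: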
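import pandas as pd

def splitValues(x):
    # Split only at the first comma; everything after it is the Content value.
    index = x.find(',')
    return x[:index], x[index+1:].strip()

# Read the header line first, then split each remaining line into two fields.
with open('file.csv') as data:
    columns = next(data).strip().split(',')
    df = pd.DataFrame(columns=columns, data=[splitValues(row) for row in data])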
Output:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
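The first-comma split in this answer can also be written with `str.partition`. The sketch below is a variation on the same idea, not the answerer's code: it uses the sample lines from the question in place of the file, and the `json.loads` check is an added assumption that every `Content` value is valid JSON.

```python
import json

# Sample lines copied from the question; in practice they would be read from the file.
lines = [
    'ItemId,Content',
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}',
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}',
]

header = lines[0].split(',')

rows = []
for line in lines[1:]:
    # partition splits only at the first comma, so commas inside Content survive.
    item_id, _, content = line.partition(',')
    rows.append((item_id, content.strip()))

# Optional sanity check: parse each Content value as JSON and pull out the title.
titles = [json.loads(content)['Title'] for _, content in rows]
print(titles)
```

`rows` and `header` can then be passed to `pd.DataFrame(rows, columns=header)` to get the same two-column DataFrame as above.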