I have a CSV file that I want to read with the pandas library in Python.
Here are the file's header and first row.
content,topic,class,NRC-Affect-Intensity-anger_Score,NRC-Affect-Intensity-fear_Score,NRC-Affect-Intensity-sadness_Score,NRC-Affect-Intensity-joy_Score
'@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.',kindle2,positive,0,0,0,0
It is comma-separated and has 7 fields. When I try to read this file, I get an error:
zz= pd.read_csv('proc_data.csv', sep=',')
Error tokenizing data. C error: Expected 8 fields in line 14, saw 9
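I think it is complaining about the commas inside the first column (the part between the ' characters).
Is it possible to read this file correctly?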
head -15 less proc_data.csv
head: less: No such file or directory
==> proc_data.csv <==
content,topic,class,NRC-Affect-Intensity-anger_Score,NRC-Affect-Intensity-fear_Score,NRC-Affect-Intensity-sadness_Score,NRC-Affect-Intensity-joy_Score
'@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.',kindle2,positive,0,0,0,0
'Reading my kindle2... Love it... Lee childs is good read.',kindle2,positive,0,0,0,1.375
'Ok, first assesment of the #kindle2 ...it fucking rocks!!!',kindle2,positive,0,0,0,0
'@kenburbary You\'ll love your Kindle2. I\'ve had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)',kindle2,positive,0,0,0.594,1.125
'@mikefish Fair enough. But i have the Kindle2 and I think it\'s perfect :)',kindle2,positive,0,0,0,0.719
'@richardebaker no. it is too big. I\'m quite happy with the Kindle2.',kindle2,positive,0,0,0,0.788
'Fuck this economy. I hate aig and their non loan given asses.',aig,negative,0.828,0.484,0.656,0
'Jquery is my new best friend.',jquery,positive,0,0,0,0.471
'Loves twitter',twitter,positive,0,0,0,0
'how can you not love Obama? he makes jokes about himself.',obama,positive,0,0,0,0.828
'Check this video out -- President Obama at the White House Correspondents\' Dinner ',obama,neutral,0,0,0,0.109
'@Karoli I firmly believe that Obama/Pelosi have ZERO desire to be civil. It\'s a charade and a slogan, but they want to destroy conservatism',obama,negative,0,0,0,0.484
'House Correspondents dinner was last night whoopi, barbara & sherri went, Obama got a standing ovation',obama,positive,0,0,0.078,0
'Watchin Espn..Jus seen this new Nike Commerical with a Puppet Lebron..sh*t was hilarious...LMAO!!!',nike,positive,0,0,0,0.672
Answer 0 (score: 1)
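You are splitting the columns on commas, but commas can also appear inside the strings.
This is usually handled by the quotechar parameter of read_csv, which defaults to quotechar='"'. Your CSV file uses only single quotes, however, so you need to change it to quotechar="'".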
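That runs into another problem, though: there are apostrophes inside the strings, each escaped with a preceding backslash. By default, the escapechar parameter of pd.read_csv is set to None, so you have to set that parameter as well.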
Putting it together, we end up with:
pd.read_csv('proc_data.csv', sep=',', quotechar="'", escapechar='\\')
Note that the escapechar (the backslash) itself needs to be escaped here.
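As a quick check of the two parameters, here is a minimal sketch that reads a few rows in the same style from an in-memory string instead of proc_data.csv. The shortened column names and the trimmed tweet texts are made up for illustration; only quotechar and escapechar matter here.

```python
import io
import pandas as pd

# Hypothetical sample in the same style as the file: single-quoted text,
# commas inside the text, and apostrophes escaped with a backslash.
# A raw string keeps the backslash exactly as it would appear on disk.
RAW = r"""content,topic,class,anger,fear,sadness,joy
'Not that the DX is cool, but the 2 is fantastic.',kindle2,positive,0,0,0,0
'I think it\'s perfect :)',kindle2,positive,0,0,0,0.719
'Loves twitter',twitter,positive,0,0,0,0
"""

# quotechar="'" makes commas inside '...' part of the field;
# escapechar='\\' makes \' a literal apostrophe instead of a closing quote.
df = pd.read_csv(io.StringIO(RAW), sep=',', quotechar="'", escapechar='\\')

print(df.shape)
print(df['content'].tolist())
```

Each row should come back with 7 fields, with the comma and the apostrophe preserved inside the content column.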
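If you do not care much about individual rows and just want to read as many of them as possible, you can add a keyword that skips malformed lines instead of raising: on_bad_lines='warn' in pandas 1.3 and later (error_bad_lines=False in older versions). Then use the warnings to work out whether those lines can be fixed or need to be dropped.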
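A minimal sketch of the skip-and-warn approach, assuming pandas 1.3 or later, where the keyword is on_bad_lines (older releases spelled it error_bad_lines=False). The three-column sample is hypothetical and only serves to produce one row with too many fields.

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has one field too many.
RAW = """a,b,c
1,2,3
4,5,6,7
8,9,10
"""

# on_bad_lines='warn' skips rows with too many fields and reports each
# skipped row, instead of raising a tokenizing error.
df = pd.read_csv(io.StringIO(RAW), sep=',', on_bad_lines='warn')

print(len(df))
```

The skipped line numbers show up in the warning output, which makes it easy to inspect those rows in the original file afterwards.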