I have inherited a few hundred CSV files that I want to import into a pandas DataFrame. They are formatted like this:
username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281
To pull this into a pandas DataFrame, I tried:
tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)
and received this error:
ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11
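I think this is because there is an unescaped quote inside the field:
ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...
So I tried: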
tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)
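and got a new error (I assume because there is a ; inside that field):
Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http://tinyurl.com/n8ozeg5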
ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
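I cannot regenerate these CSV files. What I would like to know is: how can I preprocess/fix them so that they are well-formed (i.e. with the quotes inside fields escaped)? Or is there a way to read them directly into a DataFrame even with the unescaped quotes?
Answer 0 (score: -1)
I would clean the data before reading it into pandas. Here is my solution to your current problem.
Edit:
This removes every ; that appears inside double quotes
(based on this answer)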
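import re

# Quoted field: opens after a ; (or line start), closes before a ; (or line end)
field = re.compile(r'(?<![^;])".*?"(?=;|$)')
with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        # Remove every ; inside the quoted field
        o.write(field.sub(lambda m: m.group(0).replace(";", ""), line))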
Original:
with open("fileIn.txt") as f, open("fileOut.csv", "w") as o:
    for line in f:
        # Drop every "; " (semicolon plus space), assumed to occur only inside tweet text
        o.write(line.replace("; ", ""))
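If you would rather keep the semicolons and quotes in the tweet text than delete them, another option is to rely on the column layout instead of the quoting. The following is only a sketch under some assumptions that I have not verified against your files: every record sits on a single line, the four columns before text and the five columns after it never contain a ;, and the text column is the only one with stray quotes. The names split_row and repair are made up for this example. Each line is split four times from the left and five times from the right, whatever remains in the middle is treated as the text, and the row is written back out with the csv module so that inner quotes come out properly escaped.

```python
import csv
import io

N_BEFORE = 4  # username, date, retweets, favorites come before the text column
N_AFTER = 5   # geo, mentions, hashtags, id, permalink come after it


def split_row(line):
    """Split one raw line into 10 fields; the text column is whatever is left in the middle."""
    line = line.rstrip("\r\n")
    head = line.split(";", N_BEFORE)       # 4 leading fields + the remainder
    tail = head[-1].rsplit(";", N_AFTER)   # text + 5 trailing fields
    fields = head[:N_BEFORE] + tail
    # Strip one pair of wrapping quotes per field; inner quotes are kept as-is.
    return [f[1:-1] if len(f) >= 2 and f[0] == f[-1] == '"' else f for f in fields]


def repair(src, dst):
    """Copy rows from src to dst, letting csv.writer quote and escape each field."""
    writer = csv.writer(dst, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    for line in src:
        if line.strip():                   # skip blank lines
            writer.writerow(split_row(line))


if __name__ == "__main__":
    sample = (
        "username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink\n"
        ';2011-03-02 11:04;0;0;"ICYMI: "a; b...";;;;"42";https://example.com/1\n'
    )
    out = io.StringIO()
    repair(io.StringIO(sample), out)
    print(out.getvalue())
```

For real files you would open the input and output with open(..., newline="") and pass the file objects to repair. The repaired file should then load with pd.read_csv(path, sep=';') and the default quoting. Note that parse_dates=True only tries to parse the index; to parse the date column, pass parse_dates=['date'].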