I have a CSV file with a couple of problems:
- mismatched quotes
- commas inside those mismatched quotes
This makes the data a nightmare to read.
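I have already read "reading badly formed csv in R - mismatched quotes".
I read my file with
rawData = read.csv(curFile, stringsAsFactors=FALSE, header=TRUE, quote="")
and get the error discussed here: R Programming: "More Columns than Column Names"
I think this happens because the quotes are mismatched, but using read.csv(quote="") still gives me that error. Removing quote="" lets me read the file (no "more columns than column names" error), but the data is still read incorrectly.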
"@realdonaldtrump","870440000","870442502","Louis Tonelli","L00byLou26","364","292","",0,0,"Wed Mar 23 03:03:18 +0000 2016","RT @realDonaldTrump: Incompetent Hillary, despite the horrible attack in Brussels today, wants borders to be weak and open-and let the Musl&","7.1247e+17","712474777378820097","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",NA,NA,NA,NA,NA,"7.1247e+17","712473816614772736","Wed Mar 23 02:59:29 +0000 2016","Donald J. Trump","realDonaldTrump","New York, NY","7259400","41","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>"
Phone</a>"
"@realdonaldtrump","4831200000","4831194209","Chris Mattingly","_chrismattingly","605","194","Missouri, USA",0,0,"Wed Mar 23 03:03:18 +0000 2016","@realDonaldTrump <- Favorite buffoonish reply: ""Be careful, or [insert stock threat]"". How's the ""libel"" suit going? https://twitter.com","7.1247e+17","712474777064181761","<a href=""http://twitter.com/#!/download/ipad"" rel=""nofollow"">Twitter for iPad</a>",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"@realdonaldtrump","4799600000","4799556991","Leann Rehm Lawrence","rehm_leann","101","295","",0,0,"Tue Mar 22 21:24:15 +0000 2016","RT @TrumpDynastyUSA: KINDRED SPIRITSBrought to TEARS.
LOVE & HONOR the ""Apple of G-D's EYE!""
Deuteronomy 32:9-10
@ElianaBenador @realDona&","7.1239e+17","712389451679342593","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",NA,NA,NA,NA,NA,"7.1238e+17","712384968836718593","Tue Mar 22 21:06:26 +0000 2016","Lionhearted1","TrumpDynastyUSA","United States","5153","5140","<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>"
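For reasons beyond my control, this is how the data appears in the file.
What I should be "reading" here is 3 rows/observations.
Each line that starts with "@realdonaldtrump" is a new observation.
If there is anything below such a line, it should be part of the observation above it, just with an embedded \n or \r. That may not matter, though, since the file is comma-separated.
However, it does cause a problem when a comma falls in between.
The bad-quote problem is easy to see on line "2".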
Answer 0 (score: 2)
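Try fread from the data.table package (see page 31 of the help file). It automatically compares multiple rows to try to identify mismatched quotes and commas of the kind you describe. It isn't perfect, but it tends to work a lot better than read.csv.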
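It also supports reading only certain row ranges, so with some trial and error, if you can identify the naughty rows, you can skip them in the initial fread and then process them separately, assuming there aren't too many of them.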
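As a rough sketch of the row-range idea, fread's skip and nrows arguments can be used to read the region before a bad row and the region after it separately. The helper name read_row_range, the file name tweets.csv, and the row numbers in the usage comments are all hypothetical; the real boundaries have to be found by trial and error. Note that skip counts physical lines, so a record with embedded newlines occupies more than one line.

```r
# Hypothetical helper: skip `n_skip` physical lines, then read `n_rows` rows.
# Requires the data.table package to be installed.
read_row_range <- function(path, n_skip, n_rows) {
  data.table::fread(path, skip = n_skip, nrows = n_rows, header = FALSE)
}

# Usage (hypothetical file name and row numbers):
# good_head <- read_row_range("tweets.csv", n_skip = 1,   n_rows = 100)  # rows before the bad one
# good_tail <- read_row_range("tweets.csv", n_skip = 103, n_rows = Inf)  # rows after it
# all_good  <- data.table::rbindlist(list(good_head, good_tail))
```

Reading with header = FALSE keeps the two chunks consistent with each other; the column names can be assigned afterwards from the file's first line.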
Pre-processing your data with Perl or PHP, to identify and correct the mismatched quotes before reading it into R, may be your best option.
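The same kind of pre-processing can also be sketched in base R rather than Perl or PHP. The sketch below relies on the rule stated in the question (every real record starts with "@realdonaldtrump", and anything else is a continuation of the record above). It only flags records with an odd number of double quotes; it does not repair them. The function names and the file name are hypothetical.

```r
# Hypothetical helper: glue continuation lines onto the record above them.
# Assumes every real record starts with `marker` (taken from the question's data).
# Lines before the first marker (e.g. a header) are kept together as one record.
merge_continuation_lines <- function(lines, marker = '"@realdonaldtrump"') {
  record_id <- cumsum(startsWith(lines, marker))
  merged <- vapply(split(lines, record_id), paste, character(1), collapse = " ")
  unname(merged)
}

# Hypothetical helper: TRUE where a record holds an odd number of double quotes.
# Escaped quotes ("") count as two, so they do not change the parity.
has_unbalanced_quotes <- function(records) {
  n_quotes <- lengths(regmatches(records, gregexpr('"', records, fixed = TRUE)))
  n_quotes %% 2 == 1
}

# Usage (hypothetical file name):
# records <- merge_continuation_lines(readLines("tweets.csv"))
# bad     <- has_unbalanced_quotes(records)
# clean   <- read.csv(text = records[!bad], stringsAsFactors = FALSE, header = TRUE)
# records[bad]  # inspect and fix these by hand
```

Joining continuation lines with a space discards the embedded line breaks, which the question suggests may not matter; a different separator could be substituted if they need to be preserved.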