R - Badly formed data - mismatched quotes, CSV file

Time: 2016-03-29 15:06:17

Tags: regex r csv

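I have a CSV file with a few problems:

- Mismatched quotes

- Commas inside those mismatched quotes.

This makes reading the data a nightmare.

I have already read "reading badly formed csv in R - mismatched quotes"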

I am reading my file with:
rawData = read.csv(curFile, stringsAsFactors=FALSE, header=TRUE, quote="")

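as suggested here: R Programming: "More Columns than Column Names"

I think the cause is the mismatched quotes, but read.csv with quote="" still gives me that error. Removing quote="" lets me read the file (no "more columns than column names" error), but the data is still read incorrectly.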
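One way to see where the column count goes wrong is to count fields per line with and without quote handling. This is only a sketch on a small made-up file (the rows below are hypothetical, not taken from the real data); it uses count.fields from base R's utils package.

```r
# Hypothetical stand-in file: a header, one clean row, one row with a
# comma inside a quoted field, and one row with an extra field.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'a,b,c',
  '"x","y","z"',
  '"x","y, still y","z"',
  '"x","y","z","extra"'
), tmp)

# Fields per line when double quotes are honoured.
n_quoted <- count.fields(tmp, sep = ",", quote = "\"")

# Fields per line when quoting is disabled, as with quote = "".
n_raw <- count.fields(tmp, sep = ",", quote = "")

# Lines whose field count differs from the header are the suspects.
suspect_quoted <- which(n_quoted != n_quoted[1])
suspect_raw <- which(n_raw != n_raw[1])
```

With quote = "" the comma inside the quoted field is counted as a separator, which is consistent with the "more columns than column names" error persisting when quoting is turned off.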

"@realdonaldtrump","870440000","870442502","Louis  Tonelli","L00byLou26","364","292","",0,0,"Wed Mar 23 03:03:18 +0000 2016","RT @realDonaldTrump: Incompetent Hillary, despite the horrible attack in Brussels today, wants borders to be weak and open-and let the Musl&","7.1247e+17","712474777378820097","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",NA,NA,NA,NA,NA,"7.1247e+17","712473816614772736","Wed Mar 23 02:59:29 +0000 2016","Donald J. Trump","realDonaldTrump","New York, NY","7259400","41","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>" 
Phone</a>"
"@realdonaldtrump","4831200000","4831194209","Chris Mattingly","_chrismattingly","605","194","Missouri, USA",0,0,"Wed Mar 23 03:03:18 +0000 2016","@realDonaldTrump &lt;- Favorite buffoonish reply: ""Be careful, or [insert stock threat]"". How's the ""libel"" suit going? https://twitter.com","7.1247e+17","712474777064181761","<a href=""http://twitter.com/#!/download/ipad"" rel=""nofollow"">Twitter for iPad</a>",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"@realdonaldtrump","4799600000","4799556991","Leann Rehm      Lawrence","rehm_leann","101","295","",0,0,"Tue Mar 22 21:24:15 +0000 2016","RT     @TrumpDynastyUSA: KINDRED SPIRITSBrought to TEARS.
LOVE &amp; HONOR the ""Apple of G-D's EYE!""
Deuteronomy 32:9-10
@ElianaBenador @realDona&","7.1239e+17","712389451679342593","<a     href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for     Android</a>",NA,NA,NA,NA,NA,"7.1238e+17","712384968836718593","Tue Mar 22     21:06:26 +0000 2016","Lionhearted1","TrumpDynastyUSA","United     States","5153","5140","<a href=""http://twitter.com"" rel=""nofollow"">Twitter     Web Client</a>"

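For reasons beyond my control, this is how the data appears in the file.

What I should be "reading" here is 3 rows/observations.

Each line that begins with "@realdonaldtrump" is a new observation.

Anything below such a line should be part of the observation above it, just with an embedded \n or \r. That may not matter by itself, since the data is comma separated.

However, it does cause problems when commas fall in between.

The bad-quote problem is easy to see in line "2".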
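The rule described above (a line starting with "@realdonaldtrump" opens a new observation, and anything below belongs to the one above) can be sketched in base R as follows. The sample lines are made up for illustration, and the sketch only regroups physical lines into records; it does not repair mismatched quotes on its own.

```r
# Hypothetical raw lines; in practice something like
# raw_lines <- readLines(curFile, warn = FALSE)
raw_lines <- c(
  '"@realdonaldtrump","1","first tweet"',
  '"@realdonaldtrump","2","second tweet, line one',
  'line two of the second tweet"',
  '"@realdonaldtrump","3","third tweet"'
)

# TRUE where a physical line opens a new observation.
starts <- startsWith(raw_lines, '"@realdonaldtrump"')

# Running record number: continuation lines inherit the number above them.
record_id <- cumsum(starts)

# Glue each record's lines together, replacing the embedded newline with a space.
records <- unname(vapply(split(raw_lines, record_id), paste,
                         character(1), collapse = " "))

# Parse the regrouped records; quotes are balanced in this toy input.
parsed <- read.csv(text = records, header = FALSE, stringsAsFactors = FALSE)
```

If the real file has a header line, it would end up in group 0 under this scheme and would need to be split off before parsing.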

1 Answer:

Answer 0: (score: 2)

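Try package data.table's fread (See page 31 of help file). It automatically compares multiple lines to try to identify mismatched quotes and commas of the kind you describe. It isn't perfect, but it tends to work much better than read.csv.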
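As a minimal sketch of that suggestion, the snippet below writes a tiny, well-formed stand-in file (hypothetical data, not the file from the question) and reads it with fread. It only shows the basic call; how well fread recovers from genuinely mismatched quotes depends on the file and on the data.table version.

```r
library(data.table)

# Hypothetical stand-in file with a comma inside a quoted field.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'user,text,source',
  '"a","hello, world","web"',
  '"b","plain text","ipad"'
), tmp)

# fread inspects a sample of lines to settle on separator and quoting;
# sep and quote are spelled out here only to make the intent explicit.
dt <- fread(tmp, header = TRUE, sep = ",", quote = "\"")
dim(dt)
```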

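It also supports reading only certain ranges of rows, so with some trial and error, if you can identify the naughty rows, you can skip them in an initial fread and then handle them separately, assuming there aren't too many of them.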
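A rough sketch of that skip-and-handle-separately approach, using fread's skip and nrows arguments. The file contents and the position of the bad line are made up; with real data the bad line numbers would have to be found by trial and error first.

```r
library(data.table)

# Hypothetical file in which physical line 4 is the troublesome row.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,text",
  "1,ok",
  "2,ok",
  "3,BAD ROW",
  "4,ok",
  "5,ok"
), tmp)

bad_line <- 4L  # physical line number, counting the header as line 1

# Data rows before the bad line (the header accounts for one line).
head_part <- fread(tmp, header = TRUE, nrows = bad_line - 2L)

# Data rows after the bad line: skip everything up to and including it.
tail_part <- fread(tmp, header = FALSE, skip = bad_line,
                   col.names = names(head_part))

clean <- rbind(head_part, tail_part)

# The skipped line stays available as raw text for manual repair.
bad_raw <- readLines(tmp)[bad_line]
```

For several bad rows the same idea applies piecewise, which is why it is only practical when there are few of them.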

Pre-processing your data with Perl or PHP to identify and correct the mismatched quotes before reading it into R may be your best option.