Techniques for finding bad data in read.csv in R

Time: 2013-04-02 18:26:59

Tags: r

I am reading in a data file that looks like this:

userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"john.smith@gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"

The actual file, however, has around 20000 records. I use the following R code to read it in:

user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")

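The reason I have quote="" is that without it the import stops prematurely, and I end up with only 9569 observations. I don't understand exactly why quote="" overcomes this problem, but it seems to.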

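Except that it introduces other problems that I then have to 'fix'. The first one I noticed is that the dates end up as strings that include the quote characters, and they refuse to convert to actual dates when I call as.Date() on them.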

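Now, I could fix up the strings and hack my way through. But I would rather understand what I am doing. Can someone explain:

  1. Why does quote="" fix the 'bad data'?
  2. What is a best-practice technique for figuring out what is causing read.csv to stop early? (If I just look at the input data at +/- the indicated line, I don't see anything amiss.)
  3. Here are the lines near the "problem". I don't see the damage; do you?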

    "16888","user1","user1@gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
    "16889","user2","user2@gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
    "16890","user3","user3@gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
    "16891","user4","user4@gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
    "16892","user5","user5@gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
    

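*Update*

This is trickier. Even though the total number of rows imported is 9569, the last few rows of the result correspond to the last few rows of the data. So I surmise that something happened during the import that caused a lot of rows to be skipped. In fact, 15914 - 9569 = 6345 records are missing. When I have quote="" in there, I get 15914.

So my question can be amended: is there a way to get read.csv to report the rows it decides not to import?

*Update 2*

@Dwin, I had to remove na.strings="\\N" because the count.fields function doesn't accept it. With that removed, I get this output, which looks interesting, but I don't understand it.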

    3     4    22    23    24 
    1    83 15466   178     4 
    

Your second command produces a lot of data (and stops when max.print is reached). But the first row is this:

    [1]  2  4  2  3  5  3  3  3  5  3  3  3  2  3  4  2  3  2  2  3  2  2  4  2  4  3  5  4  3  4  3  3  3  3  3  2  4
    

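I don't understand this, if the output is supposed to show how many fields there are in each input record. Clearly the first lines all have more than 2, 4, 2, etc. fields... I feel like I'm getting close, but I'm still confused!

2 Answers:

Answer 0: (score: 4)

One problem I spotted (thanks to data.table) is the missing quote (") after John Smith. Could this be a problem in other lines of your file as well?

If I add the 'missing' quote after John Smith, the file reads in fine.

I saved this data to data.txt: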

userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","john.smith@gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","user1@gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2@gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3@gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4@gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5@gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"

Here is the code. Both fread and read.csv work fine.

library(data.table)

# fread() from data.table; "\\N" marks missing values in the dump
dat1 <- fread("data.txt", header = TRUE, na.strings = "\\N")
dat1

# Base R equivalent
dat2 <- read.csv("data.txt", header = TRUE, na.strings = "\\N")
dat2
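To check whether other lines have the same missing-quote problem, one rough approach is to count the double-quote characters on each raw line and flag the lines where the count is odd. The following is only a sketch: the three sample lines and the helper name `count_quotes` are made up for illustration, and it assumes that no field legitimately contains a lone quote character.

```r
# Hypothetical sample: line 2 is missing the closing quote after John Smith.
lines <- c(
  'userId,fullName,email',
  '"2","John Smith,"john.smith@gmail.com"',
  '"16888","user1","user1@gmail.com"'
)

# Hypothetical helper: number of double-quote characters on each line.
# gregexpr() returns -1 when there is no match, so count positive positions.
count_quotes <- function(x) {
  vapply(gregexpr('"', x, fixed = TRUE),
         function(m) sum(m > 0L), integer(1))
}

n_quotes <- count_quotes(lines)

# A line with an odd number of quotes has an unbalanced quote.
bad <- which(n_quotes %% 2L == 1L)
bad
lines[bad]
```

For the real file, `lines` would instead come from something like readLines("~/Desktop/dbdump/users.txt"), and `bad` then holds the line numbers worth inspecting by eye.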

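Answer 1: (score: 4)

The count.fields function can be very useful for identifying where to look for malformed data.

This gives a tabulation of the number of fields per line, ignoring quoting, which may be a problem if there are embedded commas: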

table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") ) 

This gives a tabulation that ignores both quotes and "#" (octothorpe) as a comment character:

table( count.fields("~/Desktop/dbdump/users.txt",  quote="", comment.char="") )
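The two calls above differ in more than comment.char: count.fields defaults to sep = "" (split on whitespace) and comment.char = "#". A small sketch of how each argument changes the counts, using a made-up temporary file (the column names and values are illustrative only, not taken from the real dump):

```r
# Hypothetical file: one password contains '#', timestamps contain a space.
path <- tempfile(fileext = ".txt")
writeLines(c(
  'id,name,password,created',
  '"1","user1","abc","2008-01-19 08:47:45"',
  '"2","user2","pass#word","2008-01-19 10:34:11"'
), path)

# Default sep = "": fields are split on whitespace, so only the space
# inside each timestamp acts as a separator.
by_space <- count.fields(path, quote = "", comment.char = "")

# sep = ",": fields are split on commas; '#' is treated as ordinary text.
by_comma <- count.fields(path, sep = ",", quote = "", comment.char = "")

# Default comment.char = "#": everything after '#' on a line is dropped.
with_comments <- count.fields(path, sep = ",", quote = "")

rbind(by_space, by_comma, with_comments)
```

Small per-line counts such as 2, 3 or 4 on a file with 22 comma-separated columns are what whitespace splitting would produce, so passing sep = "," explicitly is usually what is wanted for this kind of file.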

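After seeing what you report for the first tabulation (most of the lines have the desired count), you can get the positions of the lines whose field count is not 22, using the comma and no-quote settings: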

which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)
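Once the offending positions are known, it can help to print those raw lines next to their field counts. A minimal sketch on a made-up three-column file (the file contents and the names `expected` and `report` are assumptions for illustration); it also assumes the file has no blank lines, because count.fields skips blank lines by default and its positions would then no longer match readLines:

```r
# Hypothetical file: the third line has an embedded comma in the name field.
path <- tempfile(fileext = ".txt")
writeLines(c(
  'id,name,email',
  '"1","user1","user1@example.com"',
  '"2","Smith, John","user2@example.com"',
  '"3","user3","user3@example.com"'
), path)

expected <- 3L  # number of columns the file should have (22 in the question)

# Field count per line, with quoting and comment handling switched off.
n <- count.fields(path, sep = ",", quote = "", comment.char = "")
bad <- which(n != expected)

# Show the suspicious lines together with their counts.
raw <- readLines(path)
report <- data.frame(line = bad, fields = n[bad], text = raw[bad])
report
```

Because quote = "" disables quoting, a line flagged here may simply contain a legitimate comma inside a quoted field, so the report is a list of candidates to inspect rather than a list of errors.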

Sometimes the problem can be solved with fill=TRUE, if the only difficulty is missing commas at the ends of lines.
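As a small illustration of what fill=TRUE does, here is a sketch on made-up inline data where one row is a field short. Note that read.csv already defaults to fill = TRUE, so the argument mainly matters when calling read.table directly, where it defaults to FALSE.

```r
# Hypothetical data: the second data row is missing its last field.
txt <- c("a,b,c",
         "1,2,3",
         "4,5",
         "6,7,8")

# With fill = TRUE the short row is padded with NA.
filled <- read.table(text = txt, sep = ",", header = TRUE, fill = TRUE)
filled

# Without fill, read.table stops with an error; capture its message.
strict <- tryCatch(
  read.table(text = txt, sep = ",", header = TRUE),
  error = function(e) conditionMessage(e)
)
strict
```

Padding hides the problem rather than fixing it, so it is still worth checking which rows were short.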