Using fread

Time: 2015-05-28 19:36:52

Tags: r data.table

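I have a large CSV file with 19 columns of character/numeric data.

When I run fread, I get an error message saying that one of my numeric columns is being converted to character because a field has the value "". I opened the data in a text editor and found the root of the problem. In one row, a character column appears as: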

"""PARENTS"", ""Y.M."", AND ""EXPECTING"""

which corresponds to the string:

"PARENTS", "Y.M.", AND "EXPECTING"

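where:

  • the first quote is the string protector (the opening delimiter)
  • each doubled quote in between stands for a single literal quote
  • the last quote is the closing string protector.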
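The decoding described in the list can be sketched in a few lines of base R. This is only an illustration of the quoting convention, not how fread works internally; the variable names are made up for the example.

```r
# The raw field exactly as it appears in the file
raw_field <- '"""PARENTS"", ""Y.M."", AND ""EXPECTING"""'

# Drop the opening and closing string protectors
inner <- substr(raw_field, 2, nchar(raw_field) - 1)

# Collapse every doubled quote into one literal quote
decoded <- gsub('""', '"', inner, fixed = TRUE)

cat(decoded, "\n")
# "PARENTS", "Y.M.", AND "EXPECTING"
```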

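From what I have seen before, fread converts "" to \". The problem in this case is that the string also contains commas. They are interpreted as separators, which throws off my column order and pushes the later character columns into my numeric fields.

Is there a way to prevent this, or should I use another package?

Note: I have looked around for a solution and get the impression that "" + fread is a common source of frustration, but I have not seen an example with the added complication of commas.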
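To make the column shift concrete, here is a small sketch comparing a naive split on commas with a quote-aware split of the problem row. It uses base R's strsplit and scan rather than fread, so it only illustrates the effect and says nothing about fread's internals.

```r
# The problem row from the file, as one string
line <- '"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"'

# Naive split: every comma counts as a separator, including the two inside column D
naive <- strsplit(line, ",", fixed = TRUE)[[1]]

# Quote-aware split: commas inside a quoted field are kept as data
aware <- scan(text = line, what = "", sep = ",", quote = "\"", quiet = TRUE)

length(naive)  # 21
length(aware)  # 19
```

The two extra pieces from the naive split are what push the later character columns into the numeric ones.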

To reproduce:

Put the following in a txt file:

A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S
"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"

Read the data:

library(data.table)

# 19 columns in total: 5 character, 2 numeric, 12 character
DT <- fread("myfile.csv",
            colClasses = c(rep("character", 5),
                           rep("numeric", 2),
                           rep("character", 12)),
            sep = ",")

1 answer:

Answer 0 (score: 1)

A fix was recently made to fread() in the current development version, v1.9.5, and this is what I get:

require(data.table) #v1.9.5+
fread('A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S
"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"')

#            A                                             B                        C
# 1: 168263291 Gruner & Jahr Printing and Publishing Company Parents Ym and Expecting
#                                           D E F G        H         I
# 1: ""PARENTS"", ""Y.M."", AND ""EXPECTING"" 0 0 3 73130201 055302756
#                     J         K  L         M         N            O     P
# 1: Quad/Graphics Inc. 013034588 02 093671063 000000000 Unclassified 94133
#                Q          R             S
# 1: San Francisco CALIFORNIA UNITED STATES

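fread() now handles embedded quotes more robustly, strips whitespace by default (new strip.white argument, default = TRUE), and has also gained an encoding argument. See the README on the project page for the latest news.

If your problem is still not resolved, please let us know (here or on the project page) with a reproducible example.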
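If upgrading is not an option, one possible fallback is base R's read.csv, which treats a doubled quote inside a quoted field as a literal quote. The sketch below reads the sample from the question through the text argument; reading every column as character is an assumption made here to keep the example simple, and read.csv will be noticeably slower than fread on a large file.

```r
# Sample data from the question: header line plus the problem row
csv_text <- paste(
  'A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S',
  '"168263291","Gruner & Jahr Printing and Publishing Company","Parents Ym and Expecting","""PARENTS"", ""Y.M."", AND ""EXPECTING""",0,0,3,"73130201","055302756","Quad/Graphics Inc.","013034588","02","093671063","000000000","Unclassified","94133","San Francisco","CALIFORNIA","UNITED STATES"',
  sep = "\n"
)

# Read everything as character so leading zeros (e.g. "02") are preserved
df <- read.csv(text = csv_text, colClasses = "character",
               stringsAsFactors = FALSE)

ncol(df)   # number of columns parsed
df$D       # column D with its embedded quotes and commas
```

For a file on disk, the same call would take the file path as its first argument instead of text.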