How can I use fread() as readLines(), without automatic column detection?

Asked: 2015-10-03 07:25:03

Tags: r data.table

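I have a 5 GB .dat file (more than 10 million lines). Each line is formatted like, for example, aaaa bb cccc0123 xxx kkkkkkkkkkkkkkaaaaabbbcccc01234xxxkkkkkkkkkkkkkk. Because readLines() performs poorly on large files, I tried fread() to read it instead, but got an error: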

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") : 
  Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() as readLines(), without automatic column detection? Or is there another way to solve this problem?

1 answer:

Answer 0: (score: 22)

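Here's a trick. You can use a sep value that you know is not in the file. Doing so forces fread() to read each whole line as a single column. We can then drop that column down to an atomic vector (shown as [[1L]] below). Here is an example on a csv where I use ? as the sep. This way it behaves like readLines(), only much faster.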

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

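Other rare characters you could try for sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It is just a matter of finding a sep value that does not occur in the file.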

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,," 
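Rather than comparing the printed output by eye, the two results can also be compared programmatically. The following is a minimal sketch on a small made-up file (the temporary file and its contents are assumptions for illustration, not the Batting.csv data above):

```r
library(data.table)

# Hypothetical sample file, written to a temporary path for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,alice,90", "2,bob,85"), tmp)

# "?" does not occur in the file, so each line lands in a single column,
# which [[1L]] then extracts as a plain character vector.
via_fread <- fread(tmp, sep = "?", header = FALSE)[[1L]]
via_readLines <- readLines(tmp)

# Check that both approaches returned the same character vector.
same <- identical(via_fread, via_readLines)
print(same)
```

Note that fread() applies its own defaults (for example, whitespace stripping and NA handling), so on messier input the two results may not match exactly.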

Note: as @Cath mentioned in the comments, you can also use the newline character \n as the sep value.
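If you are unsure which rare character is safe to use, one option is to test a few candidates against a sample of the file first. The sketch below uses a hypothetical helper, pick_sep(), which is not part of data.table. It only inspects the first n lines, so a character missing from the sample could still appear later in a large file:

```r
# Hypothetical helper: return the first candidate character that does not
# appear in the first `n` lines of `path`, or NA if all candidates were found.
pick_sep <- function(path, candidates = c("?", "^", "@", "#", "="), n = 1000L) {
  sample_lines <- readLines(path, n = n)
  for (ch in candidates) {
    # fixed = TRUE treats the candidate as a literal string, not a regex.
    if (!any(grepl(ch, sample_lines, fixed = TRUE))) return(ch)
  }
  NA_character_
}

# Made-up sample file in which "?" and "^" both occur.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a?b,c", "d,e^f", "g,h"), tmp)
sep_guess <- pick_sep(tmp)
```

The result could then be passed on as fread(tmp, sep = sep_guess, header = FALSE)[[1L]].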