Question

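I have a very large data file (several gigabytes), and if I try to open it in R, I get an out-of-memory error.

I need to read the file line by line and run some analysis on it. I found an earlier question on this topic in which the file is read n lines at a time, jumping to particular lines in clumps. I used Nick Sabbe's answer and made some modifications to fit my needs.

Suppose I have the following test.csv file (a sample of the file):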

A    B    C
200 19  0.1
400 18  0.1
300 29  0.1
800 88  0.1
600 80  0.1
150 50  0.1
190 33  0.1
270 42  0.1
900 73  0.1
730 95  0.1

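I want to read the file contents line by line and perform the analysis, so I created the following loop based on the code posted by Nick Sabbe. I have two problems: 1) The header is printed every time a new line is printed. 2) R's index column "X" is also printed, even though I am deleting this column.

Here is the code I am using: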

test<-function(){
 prev<-0

for(i in 1:100){
  j<-i-prev
  test1<-read.clump("file.csv",j,i)
  print(test1)
  prev<-i

}
}
####################
# Code by Nick Sabbe
###################
read.clump <- function(file, lines, clump, readFunc=read.csv,
                   skip=(lines*(clump-1))+ifelse((header) & (clump>1) & (!inherits(file, "connection")),1,0),
                   nrows=lines,header=TRUE,...){
if(clump > 1){
colnms<-NULL
if(header)
{
  colnms<-unlist(readFunc(file, nrows=1, header=F))
  #print(colnms)
}
p = readFunc(file, skip = skip,
             nrows = nrows, header=FALSE,...)
if(! is.null(colnms))
{
  colnames(p) = colnms
}
} else {
 p = readFunc(file, skip = skip, nrows = nrows, header=header)
}
p$X<-NULL   # Note: Here I'm setting the index to NULL
return(p)
}

The output I get:

       A       B    C
1      200      19   0.1
  NA   1       1     1
1  2   400     18   0.1
  NA   1       1    1
1  3   300     29   0.1
  NA   1       1    1
1  4   800     88   0.1
  NA   1       1    1
1  5   600     80   0.1

I want to get rid of these extra rows in the output:

 NA   1       1     1

Also, is there a way to make the for loop stop when the end of the file is reached, like EOF in other languages?

Answer 1

Maybe something like this can help you:

inputFile <- "foo.txt"
con  <- file(inputFile, open = "r")
while (length(oneLine <- readLines(con, n = 1)) > 0) {
  myLine <- unlist((strsplit(oneLine, ",")))
  print(myLine)
} 
close(con)

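Or use scan to avoid the split, as @MatthewPlourde suggested.

I use scan here: I skip the header, and quiet = TRUE suppresses the message reporting how many items were read.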

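con <- file(inputFile, open = "r")
invisible(readLines(con, n = 1))  # skip the header once; skip=1 inside the loop would drop a line on every call
while (length(myLine <- scan(con, what = "numeric", nlines = 1, sep = ',', quiet = TRUE)) > 0) {
   ## here I just print, but this is where you would process your line
   print(as.numeric(myLine))
}
close(con)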
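
To tie this back to the two things asked in the question (the header showing up repeatedly, and stopping at the end of the file), here is a minimal base-R sketch that is not part of the original answer. It assumes a comma-separated file with a single header line. The names read_in_chunks, chunk_size and process are hypothetical, chosen only for illustration. The header is read once from an open connection, and the loop ends when readLines returns a zero-length vector, which is how end of file shows up in R.

```r
# Hypothetical helper (not from the original thread): apply `process`
# to successive chunks of a comma-separated file with one header line.
read_in_chunks <- function(path, chunk_size = 1000, process = print) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)

  # Read the header exactly once, so it is never parsed as data again.
  header_line <- readLines(con, n = 1)
  col_names <- scan(text = header_line, what = character(),
                    sep = ",", quiet = TRUE)

  total <- 0
  # readLines() returns character(0) at end of file, which ends the loop.
  while (length(lines <- readLines(con, n = chunk_size)) > 0) {
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    process(chunk)
    total <- total + nrow(chunk)
  }
  invisible(total)  # number of data rows seen
}

# Small self-contained demo on a temporary file.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "200,19,0.1", "400,18,0.1", "300,29,0.1"), demo_file)
read_in_chunks(demo_file, chunk_size = 1)
```

With chunk_size = 1 this behaves like line-by-line reading; a larger value is likely to be faster on a multi-gigabyte file because fewer calls are made.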

Answer 2

I suggest you check out the chunked and disk.frame packages. Both have functions for reading CSV files.

disk.frame::csv_to_disk.frame may be the function you are looking for.

Reading a large file line by line in R without the header
