Question

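I have a folder that contains many subfolders. I wrote the following code to collect the paths of all the files, import them into R, and bind them together, which generally works. The problem is that some files have a different number of columns, which causes an error. My question is: how can I add a counter to the third line of my code? I basically want to see where the counter stops and then manually remove the files whose column count differs from the others. Thanks.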

file_names <- list.files(path="D:/ABCDE", recursive=TRUE)
setwd("D:/ABCDE")
all_dta <- do.call(rbind, lapply(file_names, function(x) read.csv(file=x,header = FALSE)))

Answer 1

To add a counter to lapply, I would do this:

file_names <- list.files(path="D:/ABCDE", recursive=TRUE)
idx <- seq_along(file_names) # this will serve as your 'counter'
lapply(idx, function(i) {print(file_names[i]); read.csv(file=file_names[i], header = FALSE)}) # prints each file name as it is read; when the loop stops, the last name printed is the faulty file
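One caveat: read.csv usually succeeds on each individual file, and it is the later rbind that fails when column counts differ. A minimal sketch of a counter that also reports the column count per file is shown here, assuming all files fit in memory as a list. The temporary demo files and the names `demo_dir`, `dta_list`, `n_cols`, `expected`, and `odd_files` are made up for illustration.

```r
# Hypothetical demo data: three small CSV files, one with an extra column
demo_dir <- file.path(tempdir(), "counter_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1,2,3", "4,5,6"), file.path(demo_dir, "a.csv"))
writeLines(c("1,2,3,4", "5,6,7,8"), file.path(demo_dir, "b.csv"))
writeLines(c("7,8,9"), file.path(demo_dir, "c.csv"))

file_names <- list.files(path = demo_dir, recursive = TRUE, full.names = TRUE)

# Read every file, printing a running counter as we go
dta_list <- lapply(seq_along(file_names), function(i) {
  message(i, " of ", length(file_names), ": ", file_names[i])
  read.csv(file = file_names[i], header = FALSE)
})
names(dta_list) <- basename(file_names)

# Column count per file; the most common count is assumed to be the expected one
n_cols <- sapply(dta_list, ncol)
expected <- as.integer(names(which.max(table(n_cols))))

# Files whose column count differs from the expected one
odd_files <- names(n_cols)[n_cols != expected]
print(odd_files)
```

Treating the most common column count as the "right" one is only a heuristic; if you already know the expected number of columns, compare against that instead.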

However, as an alternative that both tells you which files are faulty and naturally skips them, I would do this:

wanted=c()
for(f in file_names){
    first_line=system(paste0('head -n 1 ',shQuote(f)),intern=T) # asks the shell to print the first line of the file (needs a Unix-like `head`); intern=T captures the output so it can be assigned to a variable
    if(nchar(first_line) > quota){ # quota is a threshold you set yourself; nchar counts characters, so this is only a rough proxy for the number of columns
        wanted=c(wanted,f)
    }
}

You could also do the above with sapply.
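A sapply version might look like the sketch below. It is not the loop above verbatim: it assumes `quota` is the exact expected number of columns rather than a threshold, and it counts columns by reading one row with read.csv instead of calling `head`, so it does not depend on a Unix shell. The demo files and `demo_dir` are hypothetical.

```r
# Hypothetical demo data: one file with 3 columns, one with 2
demo_dir <- file.path(tempdir(), "sapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1,2,3", "4,5,6"), file.path(demo_dir, "a.csv"))
writeLines(c("1,2", "3,4"), file.path(demo_dir, "b.csv"))

file_names <- list.files(path = demo_dir, recursive = TRUE, full.names = TRUE)
quota <- 3  # assumption: the expected number of columns

# Number of columns found in the first row of each file
n_cols <- sapply(file_names, function(f) {
  ncol(read.csv(file = f, header = FALSE, nrows = 1))
})

wanted   <- file_names[n_cols == quota]  # files to keep
unwanted <- file_names[n_cols != quota]  # files to inspect by hand
```

Reading a single row keeps this cheap even for large files, but it only inspects the first line, just as the `head` approach does.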

Then:

all_dta <- do.call(rbind, lapply(wanted, function(x) read.csv(file=x,header = FALSE)))

If you want to know which files are problematic (i.e., which files do not have enough columns), just find the files that fall below the quota:

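unwanted=c()
for(f in file_names){
    first_line=system(paste0('head -n 1 ',shQuote(f)),intern=T) # asks the shell to print the first line of the file (needs a Unix-like `head`); intern=T captures the output so it can be assigned to a variable
    if(nchar(first_line) < quota){ # same quota as above; files whose first line is shorter than the threshold are flagged
        unwanted=c(unwanted,f)
    }
}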

Answer 2

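A simpler solution is to modify the call so that it skips the first line, and to combine all the files with dplyr::bind_rows(), which fills columns missing from a file with NA instead of raising an error: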

file_names <- list.files(path="D:/ABCDE", recursive=TRUE)
setwd("D:/ABCDE")
all_dta <- do.call(dplyr::bind_rows, lapply(file_names, function(x) read.table(file=x,header = FALSE, sep = ',', skip = 1)))

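The only catch is that you need to set the column names yourself. You can read a single line to get the names, or set them by hand if there are not many columns.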
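Reading one line to recover the names could be sketched as follows, assuming the first line of every file is a header row with the same names. The demo files, `demo_dir`, and `col_names` are hypothetical, and base rbind is used so the sketch runs without dplyr; dplyr::bind_rows can be swapped in when column counts differ.

```r
# Hypothetical demo data: two files that share the header "id,value"
demo_dir <- file.path(tempdir(), "names_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id,value", "1,10", "2,20"), file.path(demo_dir, "a.csv"))
writeLines(c("id,value", "3,30"), file.path(demo_dir, "b.csv"))

file_names <- list.files(path = demo_dir, recursive = TRUE, full.names = TRUE)

# Read only the first line of the first file to get the column names
col_names <- unlist(
  read.table(file_names[1], header = FALSE, sep = ",",
             nrows = 1, stringsAsFactors = FALSE),
  use.names = FALSE
)

# Read each file without its first line, stack the rows, then apply the names
all_dta <- do.call(rbind, lapply(file_names, function(x) {
  read.table(file = x, header = FALSE, sep = ",", skip = 1)
}))
names(all_dta) <- col_names
```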

Adding a counter to lapply (do.call) - R
