Question

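What I have: ~100 txt files, each with 9 columns and >100,000 rows. What I want: a combined file with only 2 of the columns but all of the rows. This should then be transposed, giving an output of >100,000 columns & 2 rows.

I've created the function below to go systematically through the files in a folder, pull out the data I want, and then, after each file, join it to the original template.

Problem: This works fine on my small test files, but when I try it on the large files I run into memory allocation issues. My 8 GB of RAM just isn't enough, and I assume part of that is down to how I wrote my code.

My question: Is there a way to loop through the files and then join them all at the end to save processing time?

Also, if this is the wrong place for this kind of thing, what is a better forum for getting input on WIP code?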

## Script to pull in genotype txt files, transpose them, delete commented rows
## & header rows, and then put files together.

library(plyr)

## Define function
Process_Combine_Genotype_Files <- function(
        inputdirectory = "Rdocs/test", outputdirectory = "Rdocs/test", 
        template = "Rdocs/test/template.txt",
        filetype = ".txt", vars = ""
        ){

## List the files in the directory & put together their path
        filenames <- list.files(path = inputdirectory, pattern = "*.txt")
        path <- paste(inputdirectory,filenames, sep="/")


        combined_data <- read.table(template,header=TRUE, sep="\t")

## for-loop: for every file in directory, do the following
        for (file in path){

## Read genotype txt file as a data.frame
                # Use the file's base name (path stripped) as the column label
                currentfilename <- basename(file)

                data  <- read.table(file, header=TRUE, sep="\t", fill=TRUE)

                #subset just the first two columns (Probe ID & Call Codes)
                #will need to modify this for Genotype calls....
                data.calls  <- data[,1:2]

                #Change column names & row names
                colnames(data.calls)  <- c("Probe.ID", currentfilename)
                row.names(data.calls) <- data[,1]


## Join file to previous data.frame
                combined_data <- join(combined_data,data.calls,type="full")


## End for loop
        }
## Merge all files
        combined_transcribed_data  <- t(combined_data)
        print(combined_transcribed_data[-1,-1])
        outputfile  <- paste(outputdirectory,"Genotypes_combined.txt", sep="/")        
        write.table(combined_transcribed_data[-1,-1],outputfile, sep="\t")

## End function
}

Thanks in advance.

Answer 1

Try:

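library(data.table)
# full.names = TRUE so fread() gets complete paths; pattern is a regular expression, not a glob
filenames <- list.files(path = inputdirectory, pattern = "\\.txt$", full.names = TRUE)
# select = the columns you want to keep, e.g. the first two (Probe ID and call codes)
data_list <- lapply(filenames, fread, select = 1:2)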

Now you have a list of all your data. Assuming all the txt files have the same column structure, you can combine them via:

data <- rbindlist(data_list)

Transpose the data:

t(data)

(Credit to @Jakob H for the select argument in fread)
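Putting those steps together, here is a minimal sketch of the "read everything first, combine once at the end" approach. The demo folder, file names, and column names (Probe.ID, Call, Extra) are made up for illustration. The final step uses dcast() instead of t(), which is my substitution: it gives one row per file and one column per probe, and it assumes each probe appears at most once per file.

```r
library(data.table)

# Hypothetical demo data: three small tab-separated files in a temp folder
demo_dir <- file.path(tempdir(), "genotype_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  demo <- data.frame(Probe.ID = paste0("P", 1:5),
                     Call = sample(c("AA", "AB", "BB"), 5, replace = TRUE),
                     Extra = runif(5))
  write.table(demo, file.path(demo_dir, paste0("sample", i, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

filenames <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read only the first two columns of every file; nothing is joined inside the loop
data_list <- lapply(filenames, fread, select = 1:2)
names(data_list) <- basename(filenames)

# Stack once at the end; idcol records which file each row came from
long <- rbindlist(data_list, idcol = "file")

# Reshape to one row per file and one column per probe
wide <- dcast(long, file ~ Probe.ID, value.var = "Call")
```

Because each file is reduced to two columns as it is read, and the combining happens in a single rbindlist() call, this should avoid the repeated full joins inside the loop that the question's function performs.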

Answer 2

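If speed or working memory is the concern, then I recommend using Unix to do the merging. In general, Unix tools are faster than R for this kind of job. Further, tools such as cut and sed do not need to load all of the information into RAM; they stream through the file a piece at a time. Consequently, they are effectively not memory bound here. If you don't know Unix but plan to manipulate large files frequently in the future, then learn Unix. It is simple to learn and very powerful. I will do an example with csv files.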

Generating the CSV files in R

for (i in 1:10){
  write.csv(matrix(rpois(1e5*10,1),1e5,10), paste0('test',i,'.csv'))
}

In the Shell (i.e. on a Mac) / Terminal (i.e. on a Linux box) / Cygwin (i.e. on Windows)

cut -f 2,3 -d , test1.csv > final.csv # take columns 2 and 3 from test1.csv, header included
for f in test[2-9].csv test10.csv; do cut -f 2,3 -d , "$f" | sed 1d; done >> final.csv # drop each later file's header before appending

Note: if you have Rtools installed, you can run all of these Unix commands from R with the system function.
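As a rough sketch of what that could look like, the snippet below generates small stand-in files in a scratch folder and then calls cut and sed through system(). It assumes a Unix-alike where system() hands the string to a shell; on Windows, system() does not handle redirection or pipes, so shell() would likely be needed instead. The second command is written as a loop so that the header line of every appended file is dropped, not just the first one.

```r
# Work in a scratch folder so nothing in the current directory is overwritten
old_wd <- setwd(tempdir())

# Small stand-ins for test1.csv ... test10.csv (100 rows each instead of 1e5)
for (i in 1:10) {
  write.csv(matrix(rpois(100 * 10, 1), 100, 10), paste0("test", i, ".csv"))
}

# On Unix-alikes each string is run by the shell, so redirection,
# globbing and loops behave as they do in a terminal
system("cut -f 2,3 -d , test1.csv > final.csv")
system('for f in test[2-9].csv test10.csv; do cut -f 2,3 -d , "$f" | sed 1d; done >> final.csv')

# Expect one header line plus 10 files x 100 rows
n_lines <- length(readLines("final.csv"))

setwd(old_wd)
```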

To transpose, read final.csv into R and transpose it there.
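A minimal sketch of that last step, using a tiny made-up stand-in for final.csv (the file names and the integer column types are assumptions for illustration):

```r
# Stand-in for final.csv: a header plus a few rows of two columns
final_path <- file.path(tempdir(), "final_demo.csv")
writeLines(c('"V1","V2"', "0,1", "2,0", "1,3"), final_path)

# Declaring colClasses up front spares read.csv from guessing types on a large file
final <- read.csv(final_path, colClasses = c("integer", "integer"))

# t() converts the data frame to a matrix: 2 rows, one column per input row
final_t <- t(final)
dim(final_t)

# Write the transposed result; row names (V1, V2) label the two output rows
write.table(final_t, file.path(tempdir(), "final_transposed.txt"),
            sep = "\t", quote = FALSE, col.names = FALSE)
```

With real data the combined file is only two columns wide, so reading it in one go is much lighter than holding all nine columns of every source file in memory.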

Update

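I timed the code above. It took 0.4 seconds to run. So doing this for 100 files rather than just 10 will likely take about 4 seconds. I have not timed the R code. The Unix and R approaches may well perform similarly with only 10 files, but with 100+ files your computer will likely become memory bound and R may crash.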

Quickly combining and transposing many fixed-format dataset files
