I am trying to read a large CSV file as a data.table, split it into 64 chunks based on the field 'sample_name', and apply a function 'myfunction' to each chunk in parallel.
library(data.table)
library(plyr)
library(doMC)
registerDoMC(5) #assign 5 cores
#read large csv file with 6485845 rows, 13 columns
dt = fread('~/data/samples.csv')
#example subset of dt (I am showing only 3 columns)
#sample_name snpprobeset_id snp_strand
#C00060 exm1002141 +
#C00060 exm1002260 -
#C00060 exm1002276 +
#C00075 exm1002434 -
#C00075 exm1002585 -
#C00150 exm1002721 -
#C00150 exm1004566 -
#C00154 exm100481 +
#C00154 exm1004821 -
#function definition (it has to exist before the ddply call below runs)
#arg1 is the chunk for one sample_name; ddply passes it in as a data.frame
myfunction <- function(arg1)
{
#arg1 <- data.table(arg1)
sname <- unique(arg1$sample_name) #one sample name per chunk, e.g. C00060
#write columns 9,11,12 to a tab-delimited bed file named '<sample_name>.bed', e.g. C00060.bed, C00075.bed and so on. 64 bed files for 64 chunks are written out.
write.table(arg1[,c(9,11,12)],paste0("~/Desktop/",sname,".bed"),row.names=FALSE,quote=FALSE,sep="\t",col.names=FALSE)
#build the system command for bam-readcount (bioinformatics program)
#note: the command uses relative paths, so it assumes the working directory is the one holding the .bed and .bam files
p1 <- paste0(sname,".bed")
p2 <- paste("bam-readcount -b 20 -f hg19.fa -l",p1,sep=" ")
p3 <- paste0(sname,".bam")
p4 <- paste(p2,p3,sep=" ")
p5 <- paste0(sname,"_output.txt")
p6 <- paste(p4,p5,sep=" > ")
system(p6) #execute system command
#executes something like this, for sample_name=C00060
#bam-readcount -b 20 -f hg19.fa -l C00060.bed C00060.bam > C00060_output.txt
#read back in C00060_output.txt file
#manipulate the file..multiple steps
#write output to another file
}
#split into 64 chunks based on column 'sample_name'.
#each chunk is passed as an argument to the function 'myfunction'
ddply(dt,.(sample_name),myfunction,.parallel=TRUE)
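Here, when I split my data.table 'dt' by 'sample_name' using ddply(), it is split into data frames rather than data tables. So I was thinking of converting each data frame to a data.table once it has been passed into the function (the first line of the function definition) and then doing the rest of the processing on the data.table. Is there a better and more efficient alternative?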
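For illustration, here is a minimal sketch of one possible alternative that keeps every chunk as a data.table. It assumes data.table 1.9.8 or later (which provides split() with a `by` argument) and a Unix-like system, since parallel::mclapply forks worker processes. The toy table and the worker `process_chunk` are hypothetical stand-ins for the real CSV and for 'myfunction'. The worker only builds the bam-readcount command string and does not execute it.

```r
library(data.table)
library(parallel)

# Toy stand-in for the real CSV (hypothetical rows mirroring the sample shown above)
dt <- data.table(
  sample_name    = c("C00060", "C00060", "C00075", "C00150"),
  snpprobeset_id = c("exm1002141", "exm1002260", "exm1002434", "exm1002721"),
  snp_strand     = c("+", "-", "-", "-")
)

# Hypothetical worker: receives one chunk, which is already a data.table
process_chunk <- function(chunk) {
  stopifnot(is.data.table(chunk))
  sname <- chunk$sample_name[1]
  # Build the command string only; a real worker would pass it to system()
  cmd <- sprintf("bam-readcount -b 20 -f hg19.fa -l %s.bed %s.bam > %s_output.txt",
                 sname, sname, sname)
  data.table(sample_name = sname, n_rows = nrow(chunk), command = cmd)
}

# split() on a data.table with 'by' returns a named list of data.tables
chunks <- split(dt, by = "sample_name")

# Fork workers on Unix-alikes; on Windows mc.cores must be 1
results <- mclapply(chunks, process_chunk, mc.cores = 2)

# Combine the per-chunk summaries back into a single data.table
summary_dt <- rbindlist(results)
print(summary_dt)
```

Because split() here hands each worker a data.table directly, no conversion step is needed inside the function. Whether this is actually faster than ddply for 64 chunks would have to be measured on the real data.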