Question

I am using the data.table solution from here: Duplicate entry pooling while averaging values in neighbouring columns

dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse=",")), 
          by=c("ID2", "chrom", "strand", "txStart", "txEnd")]

dt.out <- dt.out[ ,list(ID=paste(ID, collapse=","), ID2=paste(ID2, collapse=","), 
                       txStart=min(txStart), txEnd=max(txEnd)), 
                       by=c("probe", "chrom", "strand", "newCol")]

Dataset:

ID      ID2         probe       chrom   strand txStart  txEnd  newCol
Rest_3  uc001aah.4  8044649     chr1    0      14361    29370  1.02
Rest_4  uc001aah.4  7911309     chr1    0      14361    29370  1.30  
Rest_5  uc001aah.4  8171066     chr1    0      14361    29370  2.80         
Rest_6  uc001aah.4  8159790     chr1    0      14361    29370  4.12 

Rest_17 uc001abw.1  7896761     chr1    0      861120   879961 1.11
Rest_18 uc001abx.1  7896761     chr1    0      871151   879961 3.12

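I added this for loop so that each newCol cell, which holds the collapsed values from the first dt.out, is replaced by the mean of those values. However, the loop takes a very long time to run. Is there a faster way?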

for(i in 1:NROW(dt.out)){
  con <- textConnection(dt.out[i,grep("newCol", colnames(dt.out))])
  data <- read.csv(con, sep=",", header=FALSE)
  close(con)
  dt.out[i,grep("newCol", colnames(dt.out))]<- as.numeric(rowMeans(data)) 

}

Answer 1

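Compared with the data in the other question, newCol seems to be an extra column. I assume that, after obtaining the first dt.out, you want to take the mean of the collapsed values in newCol?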

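You can do this directly by replacing newCol with the result of sapply(strsplit(.)): split each collapsed string on the comma, convert the pieces to numeric, and take their mean. Because the assignment is vectorised over the whole column, no row-by-row loop is needed. Basically, after obtaining the first dt.out, do this: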

dt.out[ , newCol := sapply(strsplit(newCol, ","), function(x) mean(as.numeric(x)))]
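
As a minimal sketch of how this plays out end to end, the snippet below rebuilds the first four rows of the question's dataset as a toy data.table (an assumption made purely for illustration), runs the first collapsing step from the question, and then applies the line above. It assumes the data.table package is installed.

```r
library(data.table)

# Toy data modeled on the first four rows of the question's dataset
# (built by hand here for illustration only)
dt <- data.table(
  ID      = c("Rest_3", "Rest_4", "Rest_5", "Rest_6"),
  ID2     = "uc001aah.4",
  probe   = c(8044649, 7911309, 8171066, 8159790),
  chrom   = "chr1",
  strand  = 0,
  txStart = 14361,
  txEnd   = 29370,
  newCol  = c(1.02, 1.30, 2.80, 4.12)
)

# Step 1 (from the question): collapse every non-grouping column
# into a single comma-separated string per group
dt.out <- dt[, lapply(.SD, function(x) paste(x, collapse = ",")),
             by = c("ID2", "chrom", "strand", "txStart", "txEnd")]

# Step 2 (from the answer): split each collapsed string, convert the
# pieces to numeric, and average them. With no row filter, := replaces
# the whole column, so newCol changes from character to numeric.
dt.out[, newCol := sapply(strsplit(newCol, ","),
                          function(x) mean(as.numeric(x)))]

print(dt.out)
```

After step 2, newCol holds one numeric mean per group, while ID and probe stay as collapsed strings. The second aggregation from the question can then be run on dt.out unchanged.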
