ddply is extremely slow; how can I speed up the computation? R code modification requested:

Time: 2017-06-26 00:02:10

Tags: r performance plyr

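I have a data set (delisle) and the code below, which takes days to run on my large data matrix. What is an alternative to ddply? Could someone guide or help me?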

TYPE  SAMPLE probeA probeB probeC 
CatA  52 1.2 3.2 3.4
CatA  52 2.2 4.2 3.4
CatA  58 1.5 6.5 7.8
CatA  58 8.3 6.5 9.5
CatA  94 1.5 4.3 6.4
CatB  52 2.2 2.2 3.4
CatB  58 2.5 4.5 6.8
CatB  58 6.2 6.0 5.3
CatB  94 2.5 5.3 6.4

For each probe, I use ddply to compute the fold change between CatA and CatB within each SAMPLE.

The output should be:

SAMPLE probe FC

52  probeA  mean(CatA)/mean(CatB)
52  probeB  mean(CatA)/mean(CatB)
58  probeA  mean(CatA)/mean(CatB)
58  probeB  mean(CatA)/mean(CatB)
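
As a concrete check of the definition, here is a minimal sketch for one cell of that output (SAMPLE 52, probeA), using the sample rows above. The variable names `catA`, `catB`, and `fc` are illustrative only and do not appear in the original code.

```r
# probeA values for SAMPLE 52, copied from the sample table
catA <- c(1.2, 2.2)  # the two CatA rows
catB <- c(2.2)       # the single CatB row

# Fold change as defined in the question: mean(CatA) / mean(CatB)
fc <- mean(catA) / mean(catB)
print(fc)            # about 0.773, i.e. 1.7 / 2.2
```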

On large data (20K rows and 5K columns), my code is EXTREMELY SLOW:

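 library(plyr)

 probenames <- colnames(delisle)

 # Loop over the probe columns (columns 1 and 2 are TYPE and SAMPLE)
 for (i in 3:ncol(delisle))
 {
   probe <- probenames[i]

   # Fold changes for one SAMPLE subset `gs`
   Stats <- function(gs) {
     typeA.sub <- gs[which(gs$TYPE == "CatA"), ]
     typeB.sub <- gs[which(gs$TYPE == "CatB"), ]
     fc.AB <- mean(typeA.sub[, i]) / mean(typeB.sub[, i])
     fc.AC =   # left blank in the original post
     fc.BC =   # left blank in the original post
     data.frame(probe, fc.AB, fc.AC, fc.BC)
   }
   # One ddply call and one file append per probe column
   output <- ddply(.data = delisle, .variables = "SAMPLE", .progress = progress_text(style = 3), Stats)
   write.table(output, "SAMPLETYPE.txt", quote = FALSE, sep = "\t", append = TRUE, col.names = FALSE)
 }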

1 answer:

Answer 0 (score: 0)

Does this give you the expected result quickly?

library(tidyverse)
d %>% 
  select(-probeC) %>% 
  gather(key, value, -TYPE, -SAMPLE) %>% 
  group_by(SAMPLE, key, TYPE) %>% 
  summarise(a = mean(value)) %>% 
  spread(TYPE, a) %>% 
  mutate(res = CatA/CatB)

Source: local data frame [6 x 5]
Groups: SAMPLE, key [6]
  SAMPLE    key  CatA  CatB       res
   <int>  <chr> <dbl> <dbl>     <dbl>
1     52 probeA   1.7  2.20 0.7727273
2     52 probeB   3.7  2.20 1.6818182
3     58 probeA   4.9  4.35 1.1264368
4     58 probeB   6.5  5.25 1.2380952
5     94 probeA   1.5  2.50 0.6000000
6     94 probeB   4.3  5.30 0.8113208
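
For a wide table such as the 20K x 5K one described in the question, another option is to stay in base R and compute the group means for all probe columns at once with `rowsum()`, instead of looping over columns or reshaping to long format. The following is only a sketch under assumptions: the data frame is rebuilt from the question's sample rows and named `d` as in the answer, the helper `group_means` is a made-up name, and any speed gain on the real data is an expectation that has not been measured here.

```r
# Sample data from the question, rebuilt as a data frame named `d`
d <- data.frame(
  TYPE   = c("CatA", "CatA", "CatA", "CatA", "CatA", "CatB", "CatB", "CatB", "CatB"),
  SAMPLE = c(52, 52, 58, 58, 94, 52, 58, 58, 94),
  probeA = c(1.2, 2.2, 1.5, 8.3, 1.5, 2.2, 2.5, 6.2, 2.5),
  probeB = c(3.2, 4.2, 6.5, 6.5, 4.3, 2.2, 4.5, 6.0, 5.3),
  probeC = c(3.4, 3.4, 7.8, 9.5, 6.4, 3.4, 6.8, 5.3, 6.4)
)

# Numeric matrix of probe values: rows are observations, columns are probes
probes <- as.matrix(d[, -(1:2)])

# Hypothetical helper: per-SAMPLE means of every probe column for one TYPE
group_means <- function(type) {
  keep   <- d$TYPE == type
  sums   <- rowsum(probes[keep, , drop = FALSE], group = d$SAMPLE[keep])
  counts <- as.vector(table(d$SAMPLE[keep])[rownames(sums)])
  sums / counts  # counts recycle down each column, so row i is divided by counts[i]
}

meanA <- group_means("CatA")
meanB <- group_means("CatB")

# Keep only samples present in both types, aligned by row name
common <- intersect(rownames(meanA), rownames(meanB))
fc <- meanA[common, , drop = FALSE] / meanB[common, , drop = FALSE]

# Long format with one row per SAMPLE/probe pair, as in the requested output
out <- data.frame(
  SAMPLE = rep(rownames(fc), times = ncol(fc)),
  probe  = rep(colnames(fc), each = nrow(fc)),
  FC     = as.vector(fc)
)
out <- out[order(out$SAMPLE, out$probe), ]
print(out)
```

Unlike the answer above, this sketch keeps probeC; drop that column from `probes` if it is not wanted. The fold-change matrix `fc` has one row per SAMPLE and one column per probe, so it can also be written out directly with a single `write.table()` call rather than appending once per probe.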