Question

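Can I take advantage of data.table's inherent speed to get faster row-wise t.test results with variable column names? My current code is below; it takes several seconds per 1,000 rows.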

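library(data.table)

slow.diffexp <- function(dt, samples1, samples2) {
  # seq_len() avoids iterating over c(1, 0) when dt has no rows
  for (i in seq_len(nrow(dt))) {
    # Print progress every 1000 rows
    if (i %% 1000 == 0) {
      cat(i, "\n")
    }
    a <- t.test(dt[i, samples1, with=FALSE],
                dt[i, samples2, with=FALSE])
    # Write the results into row i by reference
    set(dt, i, "tt.p.value", a$p.value)
    set(dt, i, "tt.mean1", a$estimate[1])
    set(dt, i, "tt.mean2", a$estimate[2])
  }
}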

test.dt <- data.table(V1=sample(1000, 100000, replace=TRUE));
for (i in 2:20) {
  colname <- paste0("V", i);
  test.dt[ , (colname):=sample(1000, 100000, replace=TRUE)];
}
samples1 <- sample(names(test.dt), size=10);
samples2 <- setdiff(names(test.dt), samples1);
slow.diffexp(test.dt, samples1, samples2);

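I have looked at the following related posts:

Paired t-test for each row of a data table: has a solution, but can it be made faster?
Doing t.test for columns for each row in data set: does not use data.table.

I am using set() because I have the impression that, for data.frames, set is faster than <- ...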

Answer 1

This does not explicitly use data.table, but it should be much faster than the for loop:

set.seed(700)
test.dt <- data.table(V1=sample(1000, 100000, replace=TRUE));
for (i in 2:20) {
  colname <- paste0("V", i);
  test.dt[ , (colname):=sample(1000, 100000, replace=TRUE)];
}
samples1 <- sample(names(test.dt), size=10);
samples2 <- setdiff(names(test.dt), samples1);

system.time(myList<-apply(test.dt, 1, function(x) t.test(x[samples1], x[samples2])))
# user  system elapsed 
# 18.44    0.00   18.47 

# Extract results by name rather than by position in the htest object
test.dt$tt.p.value <- sapply(myList, function(x) x$p.value)
test.dt$tt.mean1 <- sapply(myList, function(x) x$estimate[[1]])
test.dt$tt.mean2 <- sapply(myList, function(x) x$estimate[[2]])

test.dt[1:10, 19:23, with = FALSE]

V19 V20 tt.p.value tt.mean1 tt.mean2
962 536    0.98203    460.8    463.9
882 767    0.06294    657.4    416.0
371 111    0.73440    463.1    502.8
173 720    0.57195    595.9    513.3
126 404    0.86948    602.8    619.5
 14  16    0.63462    315.7    377.3
870 384    0.03670    377.7    626.6
142 997    0.19836    623.2    442.8
  4 193    0.99891    628.4    628.2
250 888    0.35232    590.9    498.5

The original approach is roughly 10 times slower (it takes slightly longer to process 1/10 of the rows):

system.time(slow.diffexp(test.dt[1:10000], samples1, samples2))
# user  system elapsed 
# 22.12    0.00   22.17
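
If calling t.test() once per row is still too slow, one option is to skip it and compute the same quantities with vectorized row statistics. The following is a minimal sketch, not part of the original answer. The function name fast.diffexp and the demo.* objects are hypothetical. It assumes no missing values and reproduces only the defaults of t.test() (two-sided, unpaired, unequal variances, i.e. the Welch test).

```r
library(data.table)

# Hypothetical helper: Welch two-sample t-test for all rows at once.
# Assumes every column in samples1/samples2 is numeric with no NAs.
fast.diffexp <- function(dt, samples1, samples2) {
  m1 <- as.matrix(dt[, samples1, with = FALSE])
  m2 <- as.matrix(dt[, samples2, with = FALSE])
  n1 <- ncol(m1)
  n2 <- ncol(m2)

  mean1 <- rowMeans(m1)
  mean2 <- rowMeans(m2)

  # Unbiased per-row variances; subtracting a length-nrow vector from a
  # matrix recycles down the columns, so each row is centred on its own mean.
  var1 <- rowSums((m1 - mean1)^2) / (n1 - 1)
  var2 <- rowSums((m2 - mean2)^2) / (n2 - 1)

  # Squared standard errors of the two group means
  se2.1 <- var1 / n1
  se2.2 <- var2 / n2

  t.stat <- (mean1 - mean2) / sqrt(se2.1 + se2.2)

  # Welch-Satterthwaite degrees of freedom
  dof <- (se2.1 + se2.2)^2 / (se2.1^2 / (n1 - 1) + se2.2^2 / (n2 - 1))

  # Two-sided p-value. Rows where both groups are constant give NaN here,
  # whereas t.test() would stop with an error.
  p.value <- 2 * pt(-abs(t.stat), dof)

  # Add the result columns by reference, as in the question
  set(dt, j = "tt.p.value", value = p.value)
  set(dt, j = "tt.mean1", value = mean1)
  set(dt, j = "tt.mean2", value = mean2)
  invisible(dt)
}

# Small demonstration on data shaped like the question's test.dt
set.seed(700)
demo.dt <- as.data.table(matrix(sample(1000, 1000 * 20, replace = TRUE), ncol = 20))
demo.s1 <- sample(names(demo.dt), size = 10)
demo.s2 <- setdiff(names(demo.dt), demo.s1)
fast.diffexp(demo.dt, demo.s1, demo.s2)

# Spot-check the first row against t.test()
ref <- t.test(unlist(demo.dt[1, demo.s1, with = FALSE]),
              unlist(demo.dt[1, demo.s2, with = FALSE]))
all.equal(demo.dt$tt.p.value[1], ref$p.value)
```

The sketch trades generality for speed: options such as paired = TRUE, var.equal = TRUE or a one-sided alternative would each need their own formula, so it is worth checking a few rows against t.test() before relying on it.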
