Grouped operation relative to a "baseline" group for all groups, with multiple observations per group

Time: 2016-09-08 13:21:11

Tags: r data.table dplyr

Starting with data that contains multiple observations for each group, like this:

set.seed(1)
my.df <- data.frame(
  timepoint = rep(c(0, 1, 2), each= 3),
  counts = round(rnorm(9, 50, 10), 0)
)
> my.df
  timepoint counts
1         0     44
2         0     52
3         0     42
4         1     66
5         1     53
6         1     42
7         2     55
8         2     57
9         2     56

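To perform a summary calculation for each timepoint relative to timepoint == 0, I need to pass, for each group, the vector of counts for timepoint == 0 and the vector of counts for that group (e.g. timepoint == 1) to an arbitrary function, for example: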

NonsenseFunction <- function(x, y){
  (mean(x) - mean(y)) / (1 - mean(y))
}
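To make the intended call concrete, here is a minimal base R sketch of what "pass the baseline vector and the group vector" means, first for one group and then for every group. The names baseline, group1, single_result and all_results are only illustrative and are not part of the original post.

```r
# Data and function as defined in the question.
set.seed(1)
my.df <- data.frame(
  timepoint = rep(c(0, 1, 2), each = 3),
  counts = round(rnorm(9, 50, 10), 0)
)
NonsenseFunction <- function(x, y) {
  (mean(x) - mean(y)) / (1 - mean(y))
}

# Baseline vector: all counts observed at timepoint == 0.
baseline <- my.df$counts[my.df$timepoint == 0]

# A single call, here for the group timepoint == 1.
group1 <- my.df$counts[my.df$timepoint == 1]
single_result <- NonsenseFunction(group1, baseline)

# The same call for every group: split counts by timepoint and pass the
# fixed baseline vector as the second argument each time.
all_results <- sapply(split(my.df$counts, my.df$timepoint),
                      NonsenseFunction, y = baseline)
```

Because the baseline vector is passed whole, nothing here requires the groups to have the same number of observations.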

I can get the required output from this table using dplyr:

library(dplyr)
my.df %>%
  group_by(timepoint) %>%
  mutate(rep = paste0("r", 1:n())) %>%
  left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
  group_by(timepoint.x) %>%
  summarise(result = NonsenseFunction(counts.x, counts.y))

or data.table:

library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
  , NonsenseFunction(counts.x, counts.y), by = timepoint.x]

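but this only works if the number of observations is the same in every group. In any case, the observations are not matched across groups, so using the temporary rep variable feels like a hack.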

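For the more general case, where I need to pass the vector of baseline values and the vector of the group's values to an arbitrary (more complex) function, is there an idiomatic data.table or dplyr way to do a grouped operation that uses the baseline group for all groups?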

3 Answers:

Answer 0: (score: 3)

Here is the straightforward data.table approach:

my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]

This may fetch my.dt[timepoint==0, counts] again and again, once for each group. You can save that value ahead of time:

v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]

...or, if you don't want to add v to the environment, perhaps

with(list(v = my.dt[timepoint==0, counts]), 
  my.dt[, f(counts, v), by=timepoint]
)
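As a hedged illustration of the approach above with the question's NonsenseFunction in place of f, the sketch below uses the question's counts with the last row dropped, so that the groups have different sizes. The column name result and the object res are assumptions made for this example.

```r
library(data.table)

NonsenseFunction <- function(x, y) {
  (mean(x) - mean(y)) / (1 - mean(y))
}

# Counts from the question minus the last row: timepoint 2 now has two
# observations while the other groups have three.
my.dt <- data.table(
  timepoint = c(0, 0, 0, 1, 1, 1, 2, 2),
  counts    = c(44, 52, 42, 66, 53, 42, 55, 57)
)

# Extract the baseline vector once, outside the grouped call.
v <- my.dt[timepoint == 0, counts]

# One row per timepoint; each group's counts are compared with v.
res <- my.dt[, .(result = NonsenseFunction(counts, v)), by = timepoint]
```

No rep helper column or merge is needed, which is why unequal group sizes are not a problem here.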

Answer 1: (score: 1)

You can pass the vector from the group you are interested in as a constant in the second argument.

my.df %>%
    group_by(timepoint) %>%
    mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))

Or, if you want to create it beforehand:

constant = my.df$counts[my.df$timepoint == 0]
my.df %>%
    group_by(timepoint) %>%
    mutate(response = NonsenseFunction(counts, constant))
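If one row per timepoint is wanted, as in the question's own output, the same idea should work with summarise() in place of mutate(). This is a sketch under that assumption rather than part of the original answer; baseline and res are illustrative names, and the data are the question's counts written out literally.

```r
library(dplyr)

NonsenseFunction <- function(x, y) {
  (mean(x) - mean(y)) / (1 - mean(y))
}

# Counts as printed in the question.
my.df <- data.frame(
  timepoint = rep(c(0, 1, 2), each = 3),
  counts    = c(44, 52, 42, 66, 53, 42, 55, 57, 56)
)

# Baseline vector taken from the ungrouped data frame.
baseline <- my.df$counts[my.df$timepoint == 0]

# summarise() collapses each group to a single row, so the function is
# evaluated once per timepoint against the full baseline vector.
res <- my.df %>%
  group_by(timepoint) %>%
  summarise(result = NonsenseFunction(counts, baseline))
```

With mutate() the scalar result is instead repeated on every row of its group.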

Answer 2: (score: 0)

You could try,

library(dplyr)
my.df %>% 
    mutate(new = mean(counts[timepoint == 0])) %>% 
    group_by(timepoint) %>% 
    summarise(result = NonsenseFunction(counts, new))

# A tibble: 3 × 2
#  timepoint    result
#      <dbl>     <dbl>
#1         0 0.0000000
#2         1 0.1398601
#3         2 0.2097902