Question

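I need to use ddply to apply several functions to several columns of my data frame. When I refer to a column by name (RV in the example below), my splitting variables (Group and Round below) work: I get the mean for each combination of Round and Group.

I need to do this for 20 columns, so I am thinking of writing a for loop and passing the column index.

When I use a column index (e.g. df[[1]], which is "RV" in my data frame), Group and Round are ignored and the grand mean is returned for every combination of Round and Group.

I also tried passing the column name, as in new.df3, but Round and Group are ignored again.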

library(plyr)

df <- data.frame("RV" = 1:5, "Group" = c("a","b","b","b","a"), "Round" = c("2","1","1","2","1"))

# this works and a separate mean for each combination of "Group" and "Round" is calculated 
new.df <- ddply(df, c("Group", "Round"), summarise,
            mean= mean(RV))

# this does not work and the grand mean is returned for all combinations of "Group" and "Round" 
new.df2 <- ddply(df, c("Group", "Round"), summarise,
            mean= mean(df[[1]]))

# this does not work and the grand mean is returned for all combinations of "Group" and "Round"     
new.df3 <- ddply(df, c("Group", "Round"), summarise,
             mean= mean(df[,colnames(df[1])]))

I tried lapply and ran into the same problem. Any suggestions as to why this happens and how to fix it?

Answer 1

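Using dplyr:

library(dplyr)
df$RV_1 <- df$RV * 2 # second numeric column for the demo

# one row per Group/Round combination, sorted to match the row order group_by() produces
result <- df %>% distinct(Group, Round) %>% arrange(Group, Round)

for (i in 1:2) { # 1:2 because the data set has only two numeric columns
  # numeric positions in summarise_at() count the non-grouping columns only
  grp_mean <- df %>%
    group_by(Group, Round) %>%
    summarise_at(i, mean, na.rm = TRUE)

  result <- cbind(result, grp_mean[, 3])
}

result

  Group Round  RV RV_1
1     a     1 5.0   10
2     a     2 1.0    2
3     b     1 2.5    5
4     b     2 4.0    8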
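
If dplyr 1.0 or later is available, the loop may not be needed at all. The following is only a sketch under that assumption: it uses across() with where(is.numeric) to average every numeric non-grouping column in one call, and it rebuilds df and the illustrative RV_1 column so that it runs on its own.

```r
library(dplyr)

df <- data.frame(RV = 1:5,
                 Group = c("a", "b", "b", "b", "a"),
                 Round = c("2", "1", "1", "2", "1"))
df$RV_1 <- df$RV * 2 # illustrative second numeric column

# Average every numeric column within each Group/Round combination.
# Grouping columns are excluded from across() automatically.
result <- df %>%
  group_by(Group, Round) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")

result
```

Because the means are computed and labelled in a single grouped call, there is no separate table of group keys that has to be kept in the same row order.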

Answer 2

As good a package as plyr is, you could move to its latest iteration, dplyr. There, the code would be

library(dplyr)

v <- vars(RV) # add all your variables here
new.df <- df %>%
  group_by(Group, Round) %>%
  summarize_at(v, mean) # pass the function directly; funs() is deprecated since dplyr 0.8.0

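With this approach you can put all your variables into v and get the mean of every variable for each combination of Group and Round. The pipe operator (%>%) looks strange the first time you see it, but it helps simplify code. It takes the output of the previous function and passes it as the first argument of the next function. That makes the steps easy to read: take df, group it by Group and Round, then summarise it.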
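
To make the pipe description concrete, here is a small sketch (assuming dplyr 1.0 or later, for the .groups argument) that writes the same computation once with %>% and once as nested calls. The names piped and nested are only illustrative.

```r
library(dplyr)

df <- data.frame(RV = 1:5,
                 Group = c("a", "b", "b", "b", "a"),
                 Round = c("2", "1", "1", "2", "1"))

# Piped form: each step's output becomes the first argument of the next call.
piped <- df %>%
  group_by(Group, Round) %>%
  summarise(mean = mean(RV), .groups = "drop")

# Equivalent nested form, read from the inside out.
nested <- summarise(group_by(df, Group, Round),
                    mean = mean(RV), .groups = "drop")

# Compare the two results.
all.equal(as.data.frame(piped), as.data.frame(nested))
```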

If you really want to stick with plyr, there is a solution there too:

new.df <- ddply(df, c("Group", "Round"), summarise,
  RV_mean = mean(RV),
  var2_mean = mean(var2) # add more variables just like this (var2 is a placeholder for another column)
)

We can also use your list approach:

new.df2 <- ddply(df, .(Group, Round), function(data_subset) { # note alternative way to reference Group and Round
  as.data.frame(llply(data_subset[,c("RV"), drop = FALSE], mean)) # add your variables here
})

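Note that inside ddply I always refer to the subset of the data frame that is passed to the function call; I never refer to df. df always refers to the original data frame, not the subset you are trying to work with.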
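
The scoping point can be checked directly, and the same idea extends to many columns without a for loop. This is a minimal sketch assuming plyr is installed: cols is a hypothetical vector of column names, RV_1 is an extra column added only for illustration, and wrong and right are illustrative names.

```r
library(plyr)

df <- data.frame(RV = 1:5,
                 Group = c("a", "b", "b", "b", "a"),
                 Round = c("2", "1", "1", "2", "1"))
df$RV_1 <- df$RV * 2 # illustrative second column, not in the question's data

# Inside summarise, `df` is not a column of the current subset, so R finds
# the global data frame and mean(df[[1]]) is the grand mean in every group.
wrong <- ddply(df, c("Group", "Round"), summarise, mean = mean(df[[1]]))

# Hypothetical list of columns to summarise; extend it to all 20 as needed.
cols <- c("RV", "RV_1")

# The anonymous function receives one Group/Round subset at a time (`d`),
# so indexing `d` by name (or by position) respects the split.
right <- ddply(df, c("Group", "Round"), function(d) {
  as.data.frame(as.list(colMeans(d[, cols, drop = FALSE])))
})

wrong
right
```

The only difference between the two calls is which object gets indexed: the global df in the first, the per-group subset d in the second.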

R ddply ignores the splitting factors when a column index is used
