R, dplyr: assign the number of occurrences at several group_by() levels as columns

Date: 2015-01-16 16:00:52

Tags: r count dplyr reshape2

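require(plyr)
require(dplyr)

# Note: R >= 3.6.0 changed the default algorithm of sample(), so this seed
# may produce different rows than the ones printed below.
set.seed(8)
df <- data.frame(
  group = sample(c("A", "B"), 10, replace = TRUE),
  subgroup = sample(c("a", "b", "c"), 10, replace = TRUE),
  value = runif(10, -1, 1)
)
df %>% arrange(group, subgroup)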

gives:

   group subgroup      value
1      A        a -0.1841505
2      A        a  0.3265360
3      A        a -0.8045035
4      A        b -0.5526222
5      B        a  0.2238653
6      B        a  0.0552373
7      B        b  0.2297515
8      B        b -0.5700525
9      B        b  0.6347312
10     B        c  0.9550054

I can flag whether each value is high or low, e.g.:

df2 <- df %>%
  mutate(reg = ifelse(value > 0, "high", "low"))
df2

gives:

   group subgroup      value  reg
1      A        b -0.5526222  low
2      A        a -0.1841505  low
3      B        b  0.2297515 high
4      B        b -0.5700525  low
5      A        a  0.3265360 high
6      B        c  0.9550054 high
7      A        a -0.8045035  low
8      B        a  0.2238653 high
9      B        a  0.0552373 high
10     B        b  0.6347312 high

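Question: I would like to get the columns low.group, high.group, low.subgroup and high.subgroup, indicating how many high and low values were found at the group level (I am thinking of dplyr's group_by(group) with n(), and maybe also summarise()) and at the group + subgroup level (group_by(group, subgroup)). This would yield a data frame of 6 rows × 6 columns (the combinations of A / B and a / b / c, with the columns group, subgroup, low.group, high.group, low.subgroup, high.subgroup). The first row should be (A, a, 3, 1, 2, 1), the second (A, b, 3, 1, 1, 0), and so on. I can do the counting, e.g. with:

df2 %>% group_by(group, reg) %>% mutate(n.group = n())

But how do I split n.group into the two columns low.group and high.group? The same problem exists for the subgroups.

I am sure plyr, dplyr or reshape2 have functionality that combines counting and summarising, but how?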

Update: here is the hand-made result I would like to get:

(the result table is missing from this copy)

2 Answers:

Answer 0: (score: 2)

A bit verbose, but it seems to do what is expected:

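library(dplyr)
library(tidyr)
df %>%
  # replace the numeric value by its high/low flag
  mutate(value = ifelse(value > 0, "high", "low")) %>%
  # count per group + subgroup + flag
  group_by(group, subgroup, value) %>%
  mutate(sub = n()) %>%
  # count per group + flag
  group_by(group, value) %>%
  mutate(grp = n()) %>%
  ungroup() %>%
  # dplyr >= 0.5 drops the other columns unless .keep_all = TRUE
  distinct(group, subgroup, value, .keep_all = TRUE) %>%
  # long format, then one column per flag.level combination
  gather(key, val, sub:grp) %>%
  unite(x, value:key, sep = ".") %>%
  spread(x, val, fill = 0)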

#Source: local data frame [5 x 6]
#
#  group subgroup high.grp high.sub low.grp low.sub
#1     A        a        1        1       3       2
#2     A        b        0        0       3       1
#3     B        a        5        2       0       0
#4     B        b        5        2       1       1
#5     B        c        5        1       0       0

Note that the combination A-c does not occur in the sample data, so it does not appear in the output.
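
In the output above, the group-level columns are 0 wherever a (group, subgroup, flag) combination is absent, for example high.grp for A-b, although group A does contain one high value. This is a side effect of fill = 0. The following is a minimal sketch of one way around it, not part of the original answer: count each level separately and join the results. It assumes a tidyr version that provides complete(); the names flagged, grp_counts, sub_counts and result are illustrative, and the data are typed in from the question's printed output instead of being regenerated from the seed.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the printed output in the question.
df <- data.frame(
  group    = c("A", "A", "B", "B", "A", "B", "A", "B", "B", "B"),
  subgroup = c("b", "a", "b", "b", "a", "c", "a", "a", "a", "b"),
  value    = c(-0.5526222, -0.1841505, 0.2297515, -0.5700525, 0.3265360,
               0.9550054, -0.8045035, 0.2238653, 0.0552373, 0.6347312),
  stringsAsFactors = FALSE
)

flagged <- df %>%
  mutate(reg = ifelse(value > 0, "high", "low"))

# Group-level counts: one row per group.
grp_counts <- flagged %>%
  group_by(group) %>%
  summarise(low.group  = sum(reg == "low"),
            high.group = sum(reg == "high"))

# Subgroup-level counts: one row per observed group/subgroup pair.
sub_counts <- flagged %>%
  group_by(group, subgroup) %>%
  summarise(low.subgroup  = sum(reg == "low"),
            high.subgroup = sum(reg == "high")) %>%
  ungroup()

result <- sub_counts %>%
  # add unobserved pairs such as A-c, with subgroup counts of zero
  complete(group, subgroup,
           fill = list(low.subgroup = 0L, high.subgroup = 0L)) %>%
  # repeat the group-level counts on every row of that group
  left_join(grp_counts, by = "group") %>%
  select(group, subgroup, low.group, high.group, low.subgroup, high.subgroup)

print(result)
```

On the sample data this produces the 6 × 6 layout asked for in the question, with rows such as (A, a, 3, 1, 2, 1) and (A, b, 3, 1, 1, 0). Dropping the complete() step leaves only the five observed combinations.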

Answer 1: (score: 0)

A variant of docendo discimus's solution, using more reshape2 and less tidyr:

library(dplyr)
library(tidyr)
library(stringr)
library(reshape2)

df %>%
  # replace the numeric value by its high/low flag
  mutate(value = ifelse(value > 0, "high", "low")) %>%
  # count per group + subgroup + flag
  group_by(group, subgroup, value) %>%
  mutate(sub = n()) %>%
  # count per group + flag
  group_by(group, value) %>%
  mutate(grp = n()) %>%
  gather(key, val, sub:grp) %>%
  # build the future column names, e.g. "high.sub"
  mutate(val.key = str_c(value, ".", key)) %>%
  distinct() %>%
  dcast(group + subgroup ~ val.key, value.var = "val", fill = 0)
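
Since this answer leans on reshape2, it may be worth noting that dcast() can do the counting itself when given fun.aggregate = length. The sketch below is an illustration under that assumption and not the answerer's code: it casts once per level and merges the two tables with base R's merge(). The names sub_counts, grp_counts and result are illustrative, and the data are copied from the question's printed output.

```r
library(reshape2)

# Sample data copied from the printed output in the question.
df2 <- data.frame(
  group    = c("A", "A", "B", "B", "A", "B", "A", "B", "B", "B"),
  subgroup = c("b", "a", "b", "b", "a", "c", "a", "a", "a", "b"),
  value    = c(-0.5526222, -0.1841505, 0.2297515, -0.5700525, 0.3265360,
               0.9550054, -0.8045035, 0.2238653, 0.0552373, 0.6347312),
  stringsAsFactors = FALSE
)
df2$reg <- ifelse(df2$value > 0, "high", "low")

# Count rows per group/subgroup and flag; length() does the counting.
sub_counts <- dcast(df2, group + subgroup ~ reg,
                    value.var = "value", fun.aggregate = length)
names(sub_counts)[3:4] <- paste0(names(sub_counts)[3:4], ".subgroup")

# Same idea at the group level.
grp_counts <- dcast(df2, group ~ reg,
                    value.var = "value", fun.aggregate = length)
names(grp_counts)[2:3] <- paste0(names(grp_counts)[2:3], ".group")

# Attach the group-level counts to every subgroup row.
result <- merge(sub_counts, grp_counts, by = "group")
print(result)
```

Because the group-level table is merged in rather than spread with a fill value, every row of a group carries that group's full counts. Only observed group/subgroup pairs appear, so A-c is still absent here.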