Question

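I have just started using dplyr, and I have the following two problems. They should be easy to solve with group_by, but I can't work them out. My data looks like this: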

data <- data.frame(cbind("year" = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
                     "institution" = c("a", "a", "b", "a", "a", "a", "b", "b"),
                     "branch.num" = c(1, 2, 1, 1, 1, 2, 1, 2)))

data
#  year institution branch.num
#1 2010           a          1
#2 2010           a          2
#3 2010           b          1
#4 2011           a          1
#5 2012           a          1
#6 2012           a          2
#7 2012           b          1
#8 2012           b          2

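The data is hierarchically structured: institutions are the top level, and each institution can have several branches, numbered starting from 1.

Problem 1: I want to select only the rows for branches that have a value in every year. In the example data that is only branch 1 of institution a, so the selection should be rows 1, 4 and 5.

Problem 2: I want to know the average number of branches per institution over the years. In the example that is (2 + 1 + 2) / 3 = 1.67 for institution a and (1 + 0 + 2) / 3 = 1 for institution b.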

Answer 1

Here is a solution:

Problem #1:

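library(dplyr)
nYears <- n_distinct(data$year)  # number of distinct years in the whole data set
# keep only the branches that appear in every year
data %>%
  group_by(institution, branch.num) %>%
  filter(n_distinct(year) == nYears)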
Source: local data frame [3 x 3]
Groups: institution, branch.num [1]

    year institution branch.num
  (fctr)      (fctr)     (fctr)
1   2010           a          1
2   2011           a          1
3   2012           a          1

Problem #2:

data %>%
  group_by(institution, year) %>%
  summarise(nBranches = n_distinct(branch.num)) %>%  # branches per institution and year
  ungroup() %>%
  group_by(institution) %>%
  summarise(meanBranches = sum(nBranches) / nYears)  # divide by all years, so a missing year counts as 0
Source: local data frame [2 x 2]

  institution meanBranches
       (fctr)        (dbl)
1           a     1.666667
2           b     1.000000
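
The result for institution b depends on dividing by nYears rather than calling mean(). The following is a minimal sketch of that difference, not part of the original answer. It assumes dplyr 1.0 or later (for the .groups argument), and it rebuilds the example data without cbind() so that year and branch.num stay numeric. The names per_year, result, meanObserved and meanAllYears are made up for illustration.

```r
library(dplyr)

# Same example data as in the question, built without cbind()
# so that year and branch.num are not coerced to character/factor
data <- data.frame(
  year        = c(2010, 2010, 2010, 2011, 2012, 2012, 2012, 2012),
  institution = c("a", "a", "b", "a", "a", "a", "b", "b"),
  branch.num  = c(1, 2, 1, 1, 1, 2, 1, 2)
)

nYears <- n_distinct(data$year)

# Number of branches per institution and year (only years with rows appear)
per_year <- data %>%
  group_by(institution, year) %>%
  summarise(nBranches = n_distinct(branch.num), .groups = "drop")

result <- per_year %>%
  group_by(institution) %>%
  summarise(
    meanObserved = mean(nBranches),          # averages only the years that have rows
    meanAllYears = sum(nBranches) / nYears   # treats years without rows as 0 branches
  )

result
```

Institution b has no rows for 2011, so mean() averages over two years only, while sum(nBranches) / nYears matches the (1 + 0 + 2) / 3 calculation asked for in the question. For institution a the two columns agree, because it has rows in all three years.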
