Determining subgroup indices

Time: 2016-05-16 20:21:59

Tags: r grouping dplyr

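I have a large data frame containing groups and subgroups. I want to determine the index of each subgroup within its group, as shown in the OUTPUT column of the following data frame: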

df <- data.frame(
  Group = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
  OUTPUT = c(1,1,2,2,2,1,1,2,2)
)

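I have tried several approaches without success. I would like to use dplyr, but I can't figure out how to solve this. The following code returns an unexpected result.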

require(dplyr)

df <- df %>%
  group_by(Group) %>%
  mutate(
    OUTPUT_2 = dplyr::id(Subgroup)
  )

#df
#   Group Subgroup OUTPUT_2
#  (fctr)   (fctr)    (int)
#1      A        a        8
#2      A        a        8
#3      A        b        8
#4      A        b        8
#5      A        b        8
#6      B        a        4
#7      B        a        4
#8      B        b        4
#9      B        b        4

I feel I'm close, but not quite there. Can anyone help?

3 answers:

Answer 0: (score: 2)

Here is a data.table solution that does not aggregate:

dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]

This should be much faster than the aggregation-based approaches.
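The one-liner above assumes a data.table named `dt` already exists. A minimal self-contained sketch, assuming the sample data from the question (rebuilt here with character columns for brevity), might look like this:

```r
library(data.table)

# Sample data from the question (assumption: character columns instead of factors)
dt <- data.table(
  Group    = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Subgroup = c("a", "a", "b", "b", "b", "a", "a", "b", "b")
)

# Visit the rows in Subgroup order; within each Group, increase the counter
# each time a Subgroup value is seen for the first time. `:=` adds the
# column by reference, so dt itself is modified.
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)), by = .(Group)]

print(dt)
```

Because of `order(Subgroup)`, the index follows the sorted order of Subgroup rather than the order of first appearance. For the sample data the two coincide.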

Answer 1: (score: 2)

We can use the factor route with dplyr:

library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(OUTPUT = as.numeric(factor(Subgroup, levels = unique(Subgroup))))
#   Group Subgroup OUTPUT
#  <fctr>   <fctr>  <dbl>
#1      A        a      1
#2      A        a      1
#3      A        b      2
#4      A        b      2
#5      A        b      2
#6      B        a      1
#7      B        a      1
#8      B        b      2
#9      B        b      2

Another option is to match the elements of 'Subgroup' against the unique elements of 'Subgroup' after grouping by 'Group'.
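The code for the match-based option did not survive in this copy of the answer. The following is a sketch of what it might look like, not necessarily the answerer's exact code. The name `res_match` is made up for illustration.

```r
library(dplyr)

# Sample data from the question (without the hand-written OUTPUT column)
df <- data.frame(
  Group    = factor(c("A", "A", "A", "A", "A", "B", "B", "B", "B")),
  Subgroup = factor(c("a", "a", "b", "b", "b", "a", "a", "b", "b"))
)

# Within each Group, look up each Subgroup value's position among the
# distinct Subgroup values, taken in order of first appearance
res_match <- df %>%
  group_by(Group) %>%
  mutate(OUTPUT = match(Subgroup, unique(Subgroup))) %>%
  ungroup()

print(res_match)
```

Unlike the factor route, `match()` returns integers directly, so no `as.numeric()` conversion is needed.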

Answer 2: (score: 1)

library(data.table)
dt = as.data.table(df) # or setDT to convert in place

unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
#   Group Subgroup idx OUTPUT
#1:     A        a   1      1
#2:     A        a   1      1
#3:     A        b   2      2
#4:     A        b   2      2
#5:     A        b   2      2
#6:     B        a   1      1
#7:     B        a   1      1
#8:     B        b   2      2
#9:     B        b   2      2

Translating this to dplyr should be straightforward.
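The answer leaves the dplyr translation to the reader. One possible translation, offered as a sketch rather than as the answerer's own code, numbers the distinct (Group, Subgroup) pairs within each group and joins the result back onto the original rows. The names `idx_tbl` and `res_join` are made up for illustration.

```r
library(dplyr)

# Sample data from the question (without the hand-written OUTPUT column)
df <- data.frame(
  Group    = factor(c("A", "A", "A", "A", "A", "B", "B", "B", "B")),
  Subgroup = factor(c("a", "a", "b", "b", "b", "a", "a", "b", "b"))
)

# One row per (Group, Subgroup) pair, numbered within each Group
idx_tbl <- df %>%
  distinct(Group, Subgroup) %>%
  group_by(Group) %>%
  mutate(idx = row_number()) %>%
  ungroup()

# Attach the index to every original row; left_join keeps df's row order
res_join <- df %>%
  left_join(idx_tbl, by = c("Group", "Subgroup"))

print(res_join)
```

Here `distinct()` plays the role of `unique(dt[, .(Group, Subgroup)])`, and `row_number()` the role of `1:.N`.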

Building on the idea of using factors from aosmith's comment, another approach is:

dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]

This creates a factor with the correct levels for each group; converting it to integer gives the index you are after.