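I have a large data frame containing groups and subgroups. I want to determine the index of each subgroup within its group, as shown in the OUTPUT column of the following data frame: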
df <- data.frame(
  Group = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
  OUTPUT = c(1,1,2,2,2,1,1,2,2)
)
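I have tried several approaches without success. I would like to work with dplyr, but I can't figure out how to solve this. The following code returns an unexpected result.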
library(dplyr)
df <- df %>%
  group_by(Group) %>%
  mutate(
    OUTPUT_2 = dplyr::id(Subgroup)
  )
#df
# Group Subgroup OUTPUT_2
# (fctr) (fctr) (int)
#1 A a 8
#2 A a 8
#3 A b 8
#4 A b 8
#5 A b 8
#6 B a 4
#7 B a 4
#8 B b 4
#9 B b 4
I feel I'm close, but not quite there. Can anyone help?
Answer 0 (score: 2)
Here is a data.table solution that does not use aggregation:
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]
This should be much faster than aggregation-based approaches.
Answer 1 (score: 2)
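The one-liner above refers to a `dt` object that the answer never defines. A minimal sketch, assuming `dt` is simply the question's data frame rebuilt as a data.table (without the precomputed OUTPUT column), might look like this:

```r
library(data.table)

# Assumption: dt holds the same Group/Subgroup data as the question's df.
dt <- data.table(
  Group    = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b"))
)

# Visit rows in Subgroup order; within each Group, bump the counter
# every time a Subgroup value is seen for the first time.
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)), by = .(Group)]

print(dt)
```

Because the rows are visited in sorted order, the index follows the sort order of Subgroup rather than the order of first appearance. The two coincide for this data.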
We can take the factor route with dplyr:
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(OUTPUT = as.numeric(factor(Subgroup, levels = unique(Subgroup))))
# Group Subgroup OUTPUT
# <fctr> <fctr> <dbl>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
Another option is match, which looks up each 'Subgroup' value among the unique elements of 'Subgroup' after grouping by 'Group'.
Answer 2 (score: 1)
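The code for the match variant did not survive in this copy of the answer. A minimal sketch of what it could look like, assuming the same `df` as in the question (rebuilt here without the OUTPUT column so the block is self-contained):

```r
library(dplyr)

# Assumption: same Group/Subgroup data as the question's df.
df <- data.frame(
  Group    = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b"))
)

# Within each Group, match() returns the position of every Subgroup value
# in the vector of that group's unique Subgroup values (first-appearance order).
result <- df %>%
  group_by(Group) %>%
  mutate(OUTPUT = match(Subgroup, unique(Subgroup))) %>%
  ungroup()

print(result)
```

Unlike the factor route, this returns an integer column directly, with no intermediate factor to build.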
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
# Group Subgroup idx OUTPUT
#1: A a 1 1
#2: A a 1 1
#3: A b 2 2
#4: A b 2 2
#5: A b 2 2
#6: B a 1 1
#7: B a 1 1
#8: B b 2 2
#9: B b 2 2
Translating this to dplyr should be straightforward.
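One possible dplyr translation of the unique-then-join idea is sketched below. It is not the answerer's own code; the names `idx_table` and `result` are made up for illustration, and `df` is rebuilt without the OUTPUT column.

```r
library(dplyr)

# Assumption: same Group/Subgroup data as the question's df.
df <- data.frame(
  Group    = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b"))
)

# Step 1: one row per (Group, Subgroup) pair, numbered within each Group.
idx_table <- df %>%
  distinct(Group, Subgroup) %>%
  group_by(Group) %>%
  mutate(idx = row_number()) %>%
  ungroup()

# Step 2: join the lookup table back onto the original rows.
result <- df %>%
  left_join(idx_table, by = c("Group", "Subgroup"))

print(result)
```

`left_join` keeps the rows of `df` in their original order, so `idx` lines up with the input.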
Following the idea of using factors from aosmith's comment, another approach is:
dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]
This creates a factor with the correct levels for each group, and its integer codes are the index you are after.