I have a data frame with two columns, an ID and a category name:
X1 X2
1234 Metal
1234 Metal
1234 Plastic
1234 Plastic
1234 Glass
1235 Metal
1235 Metal
1235 Plastic
1235 Plastic
1235 Glass
1236 Glass
1236 Glass
1236 Metal
1236 Metal
1236 Plastic
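I want to find the most frequent combinations across the whole dataset, together with their counts, for combinations of 2 (for a larger dataset I would want combinations of 3 or 4):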
Metal, Plastic 2
Glass, Metal 1
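I tried first generating all possible combinations of X2 within each ID (X1), so that I could then use dplyr to summarize them and pick out the top combinations. Unfortunately, my dataset is too large for this to run efficiently. Any ideas for a simpler, faster way to solve this?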
Answer 0 (score: 0)
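Here is my attempt at what I think you are trying to do. You can change the top_n argument. I let a category combine with itself (for example Metal with Metal, coming from two different rows); if that is not what you want, you can add an additional filter.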
library(dplyr)
df %>%
mutate(ID = row_number()) %>%
inner_join(., ., by = c('X1' = 'X1')) %>%
filter(ID.x != ID.y) %>% # shouldn't count as combo with itself
group_by(X2.x, X2.y) %>%
summarize(n = n()) %>%
ungroup() %>%
top_n(5, n) %>%
arrange(desc(n))
# A tibble: 7 x 3
X2.x X2.y n
<chr> <chr> <int>
1 Metal Plastic 10
2 Plastic Metal 10
3 Glass Metal 8
4 Metal Glass 8
5 Glass Plastic 6
6 Metal Metal 6
7 Plastic Glass 6
# Tie results in more than 5 rows for top_n()
df <- data.table::fread("X1 X2
1234 Metal
1234 Metal
1234 Plastic
1234 Plastic
1234 Glass
1235 Metal
1235 Metal
1235 Plastic
1235 Plastic
1235 Glass
1236 Glass
1236 Glass
1236 Metal
1236 Metal
1236 Plastic")
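The additional filter mentioned above could look like the following. This is a minimal sketch, not the answerer's code: it assumes that a pair should consist of two different categories and that order does not matter, so Metal/Plastic and Plastic/Metal are counted once. The name pair_counts is made up for illustration.

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(
  X1 = rep(c(1234, 1235, 1236), each = 5),
  X2 = c("Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Glass", "Glass", "Metal", "Metal", "Plastic"),
  stringsAsFactors = FALSE
)

pair_counts <- df %>%
  # Self-join on the ID so every row is paired with every row of the same ID
  # (recent dplyr versions warn about the many-to-many join; that is expected here)
  inner_join(., ., by = "X1") %>%
  # Assumption: keep only pairs of two different categories, and keep each
  # unordered pair once by requiring alphabetical order
  filter(X2.x < X2.y) %>%
  count(X2.x, X2.y, name = "n") %>%
  arrange(desc(n))

pair_counts
```

Because X2.x < X2.y already rules out a row being paired with itself, the row-number column is not needed in this variant. The self-join still grows quadratically with the number of rows per ID, so it does not address the performance concern for very large IDs.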
Answer 1 (score: 0)
Input
df
# X1 X2
# 1 1234 Metal
# 2 1234 Metal
# 3 1234 Plastic
# 4 1234 Plastic
# 5 1234 Glass
# 6 1235 Metal
# 7 1235 Metal
# 8 1235 Plastic
# 9 1235 Plastic
# 10 1235 Glass
# 11 1236 Glass
# 12 1236 Glass
# 13 1236 Metal
# 14 1236 Metal
# 15 1236 Plastic
Count of each X2 element for every unique value of X1
result <- table(cbind.data.frame(df$X1, df$X2))
result
# df$X2
# df$X1 Glass Metal Plastic
# 1234 1 2 2
# 1235 1 2 2
# 1236 2 2 1
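The most frequent X2 elements for every unique X1 (here that is the top two, because two categories tie for the maximum in each ID)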
final <- apply(result,1, function(x) names(which(x == max(x))))
final
# df$X1
# 1234 1235 1236
# [1,] "Metal" "Metal" "Glass"
# [2,] "Plastic" "Plastic" "Metal"
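To get from the per-ID result to the counts asked for in the question, the top categories of each ID can be pasted into one label and tabulated. The sketch below is one possible way to do that, not part of the original answer; k, top_combo and combo_counts are hypothetical names, and it assumes that "combination" means the k most frequent categories of an ID, with ties at the cut-off broken by sort order.

```r
# Same sample data as in the question
df <- data.frame(
  X1 = rep(c(1234, 1235, 1236), each = 5),
  X2 = c("Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Glass", "Glass", "Metal", "Metal", "Plastic"),
  stringsAsFactors = FALSE
)

# Per-ID frequency table: one row per X1, one column per X2
result <- table(df$X1, df$X2)

k <- 2  # assumed combination size; set to 3 or 4 for larger datasets

# For each ID, take the k most frequent categories, sort them alphabetically
# so the label does not depend on order, and paste them into one string
top_combo <- apply(result, 1, function(x) {
  top <- names(sort(x, decreasing = TRUE))[seq_len(k)]
  paste(sort(top), collapse = ", ")
})

# Count how many IDs share each combination, most frequent first
combo_counts <- sort(table(top_combo), decreasing = TRUE)
combo_counts
```

On the sample data this gives "Metal, Plastic" for two IDs and "Glass, Metal" for one, which matches the expected output in the question. Unlike which(x == max(x)), this always returns exactly k categories per ID, and it avoids the self-join, since it only needs one frequency table.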