How do I find the most common pairs of values in R?

Time: 2018-07-18 19:46:04

Tags: r

I have a data frame with two columns, an ID (X1) and a category name (X2):

     X1     X2
    1234   Metal
    1234   Metal
    1234   Plastic
    1234   Plastic
    1234   Glass
    1235   Metal
    1235   Metal
    1235   Plastic
    1235   Plastic
    1235   Glass
    1236   Glass
    1236   Glass
    1236   Metal
    1236   Metal
    1236   Plastic

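I want to find the most frequent combinations across the whole data set, together with their counts, for combinations of 2 (for larger data sets I would want combinations of 3 or 4):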

    Metal, Plastic     2
    Glass, Metal       1

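I first tried generating all possible combinations of X2 within each ID (X1), so that I could then use dplyr to summarise them and pick out the top combinations. Unfortunately, my data set is too large for this to run efficiently. Any ideas for a simpler and faster way to solve this?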

2 answers:

Answer 0 (score: 0)

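Here is an attempt at what I think you are after. You can change the top_n argument. This version allows a category to pair with itself (for example Metal with Metal, from two different rows); if that is not what you want, you can add an additional filter.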

library(dplyr)

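df %>%
  mutate(ID = row_number()) %>%       # tag each row so it can be told apart after the join
  inner_join(., ., by = "X1") %>%     # self-join: every pair of rows that share an ID
  filter(ID.x != ID.y) %>%            # a row shouldn't count as a combo with itself
  group_by(X2.x, X2.y) %>%
  summarize(n = n()) %>%              # count each ordered (X2.x, X2.y) pair
  ungroup() %>%
  top_n(5, n) %>%                     # top 5 by count; ties are kept
  arrange(desc(n))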

# A tibble: 7 x 3
  X2.x    X2.y        n
  <chr>   <chr>   <int>
1 Metal   Plastic    10
2 Plastic Metal      10
3 Glass   Metal       8
4 Metal   Glass       8
5 Glass   Plastic     6
6 Metal   Metal       6
7 Plastic Glass       6

# Tie results in more than 5 rows for top_n()
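
The "additional filter" mentioned above could look like the following. This is a minimal sketch, not the answerer's own code: it assumes that each unordered pair should be reported once and that same-category pairs should be dropped, which a single `X2.x < X2.y` comparison achieves. It also swaps in `count()` and `slice_max()` (available in dplyr 1.0 and later) for the `group_by()`/`summarize()`/`top_n()` steps. The name `pair_counts` is made up for illustration. Recent dplyr versions may warn about the many-to-many join; the warning is expected here.

```r
library(dplyr)

# Sample data from the question, rebuilt here so the block runs on its own
df <- data.frame(
  X1 = rep(c(1234, 1235, 1236), each = 5),
  X2 = c("Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Glass", "Glass", "Metal", "Metal", "Plastic"),
  stringsAsFactors = FALSE
)

pair_counts <- inner_join(df, df, by = "X1") %>%  # every pair of rows sharing an ID
  filter(X2.x < X2.y) %>%                         # keep each unordered pair once, drop same-category pairs
  count(X2.x, X2.y, name = "n") %>%               # number of row pairs per category pair
  slice_max(n, n = 5)                             # top 5 by count, ties kept

pair_counts
```

On the sample data this leaves three pairs: Metal/Plastic (10), Glass/Metal (8) and Glass/Plastic (6). The counts are still counts of row pairs, so duplicated rows within an ID inflate them.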

Data

df <- data.table::fread("X1     X2
1234   Metal
1234   Metal
1234   Plastic
1234   Plastic
1234   Glass
1235   Metal
1235   Metal
1235   Plastic
1235   Plastic
1235   Glass
1236   Glass
1236   Glass
1236   Metal
1236   Metal
1236   Plastic")
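
The question also asks about combinations of 3 or 4 categories, and the self-join does not extend to those easily. One possible generalisation, which is not part of the original answer, is to deduplicate the categories within each ID and enumerate the combinations with base R's `combn()`. The sketch below assumes that a combination should be counted once per ID in which it occurs, which is a different definition from the row-pair counts above. The names `k`, `cats_by_id`, `combos` and `combo_counts` are made up for illustration.

```r
# Sample data from the question, rebuilt here so the block runs on its own
df <- data.frame(
  X1 = rep(c(1234, 1235, 1236), each = 5),
  X2 = c("Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Glass", "Glass", "Metal", "Metal", "Plastic"),
  stringsAsFactors = FALSE
)

k <- 2  # combination size; set to 3 or 4 for larger combinations

# Distinct categories per ID, sorted so each combination gets one canonical label
cats_by_id <- lapply(split(df$X2, df$X1), function(x) sort(unique(x)))

# All size-k combinations within each ID, collapsed to labels such as "Glass, Metal"
combos <- unlist(lapply(cats_by_id, function(x) {
  if (length(x) < k) return(character(0))  # this ID has too few categories
  combn(x, k, FUN = paste, collapse = ", ")
}), use.names = FALSE)

# Number of IDs in which each combination occurs, most frequent first
combo_counts <- sort(table(combos), decreasing = TRUE)
combo_counts
```

Because duplicates are removed before any combinations are built, the work per ID depends on the number of distinct categories rather than the number of rows, which may help with a large data set. In the sample data every ID contains all three categories, so each pair occurs in all three IDs.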

Answer 1 (score: 0)

Input

df
#      X1      X2
# 1  1234   Metal
# 2  1234   Metal
# 3  1234 Plastic
# 4  1234 Plastic
# 5  1234   Glass
# 6  1235   Metal
# 7  1235   Metal
# 8  1235 Plastic
# 9  1235 Plastic
# 10 1235   Glass
# 11 1236   Glass
# 12 1236   Glass
# 13 1236   Metal
# 14 1236   Metal
# 15 1236 Plastic

Count of each X2 element for every unique X1 value

 result <- table(cbind.data.frame(df$X1, df$X2))
 result
 #       df$X2
 # df$X1  Glass Metal Plastic
 #   1234     1     2       2
 #   1235     1     2       2
 #   1236     2     2       1

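Print the most frequent X2 elements for each unique X1 (every element tied for the maximum count is returned, which here happens to be two per ID)
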
 final <- apply(result,1, function(x) names(which(x == max(x))))
 final
 #  df$X1
 #    1234      1235      1236   
 # [1,] "Metal"   "Metal"   "Glass"
 # [2,] "Plastic" "Plastic" "Metal"
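
To get from the per-ID result above to the counts asked for in the question, the top categories of each ID can be collapsed into one label and tabulated. This is a sketch under the assumption that "combination" means the set of categories tied for the highest count within an ID. It uses `table(df$X1, df$X2)` as a shorter way of building the same contingency table, and the names `top_sets` and `top_counts` are made up for illustration.

```r
# Sample data from the question, rebuilt here so the block runs on its own
df <- data.frame(
  X1 = rep(c(1234, 1235, 1236), each = 5),
  X2 = c("Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Metal", "Metal", "Plastic", "Plastic", "Glass",
         "Glass", "Glass", "Metal", "Metal", "Plastic"),
  stringsAsFactors = FALSE
)

# Rows are IDs, columns are categories, cells are counts
result <- table(df$X1, df$X2)

# For each ID, the categories tied for the highest count, as one sorted label
top_sets <- apply(result, 1, function(x) {
  paste(sort(names(x)[x == max(x)]), collapse = ", ")
})

# How many IDs share each top combination, most frequent first
top_counts <- sort(table(top_sets), decreasing = TRUE)
top_counts
```

On the sample data this gives "Metal, Plastic" twice and "Glass, Metal" once, matching the output requested in the question. Collapsing to a label also avoids relying on `apply()` returning a matrix, which it only does when every ID has the same number of tied categories.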