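Finding patterns of combinations in R

Time: 2016-09-22 04:53:22

Tags: r pattern-matching combinations

I have a data frame, and my goal is to find patterns in the combinations of var1 by ID. If two IDs have at least 3 categories in common, I set "Yes" and then record which IDs share the combination.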

ID1: I have 4 unique categories (A,B,C,D)
ID2: I have 4 unique categories (B,C,D,F)
ID3: I have 3 unique categories (A,B,C)
ID4: I have 2 unique categories (A,B)
ID5: I have 3 unique categories (C,D,F)

We can see that ID1 and ID2 have at least 3 categories in common (B,C,D), ID1 and ID3 share (A,B,C), and ID2 and ID5 share 3 (C,D,F). So 4 IDs get "Yes" and only ID4 gets "No".

ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
      "A","B","D","D","C","C","F")
df <- data.frame(ID,var1)
    ID var1
1   1    A
2   1    B
3   1    C
4   1    A
5   1    D
6   2    D
7   2    C
8   2    D
9   2    B
10  2    F
11  3    A
12  3    B
13  3    C
14  3    C
15  4    A
16  4    B
17  5    D
18  5    D
19  5    C
20  5    C
21  5    F
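
The per-ID category lists described above can be checked directly. The following is a minimal sketch in base R; the names cats, n_unique and shared_2_5 are illustrative and not part of the original post.

```r
# Rebuild the example data so the snippet runs on its own.
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
          "A","B","D","D","C","C","F")
df <- data.frame(ID, var1)

# One sorted vector of distinct categories per ID (illustrative name).
cats <- lapply(split(as.character(df$var1), df$ID),
               function(v) sort(unique(v)))

# Number of distinct categories per ID.
n_unique <- lengths(cats)

# Categories that two IDs have in common, here ID 2 and ID 5.
shared_2_5 <- intersect(cats[["2"]], cats[["5"]])
```

Here n_unique should give 4, 4, 3, 2, 3 for IDs 1 to 5, and shared_2_5 should contain C, D and F.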

My desired output would be:

   ID var1 var2 var3
1   1    A  Yes  1-2
2   1    B  Yes  1-2
3   1    C  Yes  1-2
4   1    A  Yes  1-2
5   1    D  Yes  1-2
6   2    D  Yes  1-2
7   2    C  Yes  1-2
8   2    D  Yes  1-2
9   2    B  Yes  1-2
10  2    F  Yes  1-2
11  3    A  Yes  1-3
12  3    B  Yes  1-3
13  3    C  Yes  1-3
14  3    C  Yes  1-3
15  4    A   No    4
16  4    B   No    4
17  5    D  Yes  2-5
18  5    D  Yes  2-5
19  5    C  Yes  2-5
20  5    C  Yes  2-5
21  5    F  Yes  2-5

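Thanks in advance.

1 answer:

Answer 0 (score: 2)

The problem basically comes down to building an adjacency table based on shared membership; see, for example, Working with Bipartite/Affiliation Network Data in R. To do this, we create a table from the data (after removing duplicates) and then take the cross product.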

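# Drop duplicate (ID, var1) rows so each category counts once per ID.
dd <- unique(df)
# Rows are IDs, columns are categories; each cell is 0 or 1.
tab <- table(dd)
# Equivalent to tab %*% t(tab): categories shared by each pair of IDs.
dd <- crossprod(t(tab))
diag(dd) <- 0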
#    ID
# ID  1 2 3 4 5
#   1 0 3 3 2 2
#   2 3 0 2 1 3
#   3 3 2 0 2 1
#   4 2 1 2 0 0
#   5 2 3 1 0 0
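
To see why the cross product counts shared categories, it can help to compare one entry against a direct set intersection. This is only a sketch under the same example data; shared, cats and n_1_2 are illustrative names, and the product is written out as tab %*% t(tab) instead of crossprod(t(tab)).

```r
# Rebuild the example data so the snippet runs on its own.
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
          "A","B","D","D","C","C","F")
df <- data.frame(ID, var1)

# 0/1 membership table: rows are IDs, columns are categories.
tab <- table(unique(df))

# Entry [i, j] sums, over categories, the product of two 0/1 indicators,
# so it counts the categories present in both ID i and ID j.
shared <- tab %*% t(tab)
diag(shared) <- 0

# Cross-check one entry against a direct set intersection.
cats <- split(as.character(df$var1), df$ID)
n_1_2 <- length(intersect(cats[["1"]], cats[["2"]]))
```

Both shared["1", "2"] and n_1_2 should equal 3, which corresponds to the categories B, C and D.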

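The table above shows how many categories each pair of IDs shares. Now we just need to go through the rows; for each row, I pick the first ID whose value is at least 3 (matched).

matched <- apply(dd >= 3, MARGIN = 1, function(x) which(x)[1])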
#  1  2  3  4  5 
#  2  1  1 NA  2 
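
Taking only the first match means ID 1 is paired with 2 even though it also shares 3 categories with 3. If every partner is of interest, something like the sketch below could be used. It is an assumption about what might be useful and not part of the original answer; the matrix is copied from the output above, and shared and all_matches are illustrative names.

```r
# Pairwise shared-category counts copied from the table above
# (diagonal already set to 0).
ids <- as.character(1:5)
shared <- matrix(c(0, 3, 3, 2, 2,
                   3, 0, 2, 1, 3,
                   3, 2, 0, 2, 1,
                   2, 1, 2, 0, 0,
                   2, 3, 1, 0, 0),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(ID = ids, ID = ids))

# For each ID, collect every other ID sharing at least 3 categories.
# lapply keeps the result a list even when the lengths differ.
all_matches <- lapply(seq_len(nrow(shared)), function(i) {
  colnames(shared)[shared[i, ] >= 3]
})
names(all_matches) <- rownames(shared)
```

With this data, all_matches[["1"]] should contain "2" and "3", while all_matches[["4"]] is empty.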

So "1" matches "2", "2" matches "1", "3" matches "1", "4" has no match, and "5" matches "2". Finish by manipulating this output to get the desired final product:

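# Pair each ID with its first match, then label the pair row by row.
out <- apply(cbind(as.numeric(names(matched)), matched), MARGIN = 1, function(x) {
  if (any(is.na(x))) {
    # No match: var3 is just the ID itself.
    data.frame(var2 = "No", var3 = x[1])
  } else {
    # Match: var3 is the sorted pair, e.g. "1-2".
    data.frame(var2 = "Yes", var3 = paste(sort(x), collapse = "-"))
  }
})
# Stack the per-ID data frames; the list names become the ID column.
out <- plyr::ldply(out, .id = "ID")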

merge(df, out, all.x = TRUE)
#    ID var1 var2 var3
# 1   1    A  Yes  1-2
# 2   1    B  Yes  1-2
# 3   1    C  Yes  1-2
# 4   1    A  Yes  1-2
# 5   1    D  Yes  1-2
# 6   2    D  Yes  1-2
# 7   2    C  Yes  1-2
# 8   2    D  Yes  1-2
# 9   2    B  Yes  1-2
# 10  2    F  Yes  1-2
# 11  3    A  Yes  1-3
# 12  3    B  Yes  1-3
# 13  3    C  Yes  1-3
# 14  3    C  Yes  1-3
# 15  4    A   No    4
# 16  4    B   No    4
# 17  5    D  Yes  2-5
# 18  5    D  Yes  2-5
# 19  5    C  Yes  2-5
# 20  5    C  Yes  2-5
# 21  5    F  Yes  2-5
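
If depending on plyr is not desirable, the last step can probably be done with base R alone. The sketch below is an alternative and not the answer's own code; it hard-codes the matched vector shown above, and partner, has_match and res are illustrative names.

```r
# Rebuild the example data so the snippet runs on its own.
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
          "A","B","D","D","C","C","F")
df <- data.frame(ID, var1)

# First match per ID, copied from the output above (NA means no match).
matched <- c("1" = 2, "2" = 1, "3" = 1, "4" = NA, "5" = 2)

ids <- as.numeric(names(matched))
partner <- unname(matched)
has_match <- !is.na(partner)

# Label matched pairs as "low-high"; unmatched IDs keep their own ID.
var3 <- ifelse(has_match,
               paste(pmin(ids, partner), pmax(ids, partner), sep = "-"),
               as.character(ids))

out <- data.frame(ID = ids,
                  var2 = ifelse(has_match, "Yes", "No"),
                  var3 = var3,
                  stringsAsFactors = FALSE)

# Attach the labels to every row of the original data.
res <- merge(df, out, by = "ID", all.x = TRUE)
```

The vectorised ifelse replaces the row-wise apply, so no list of small data frames has to be stacked afterwards.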