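I have a data frame, and my goal is to find patterns in the combinations of var1 by ID. If two IDs have at least 3 categories in common, we set "Yes" and then record which IDs share the combination.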
ID1: I have 4 unique categories (A,B,C,D)
ID2: I have 4 unique categories (B,C,D,F)
ID3: I have 3 unique categories (A,B,C)
ID4: I have 2 unique categories (A,B)
ID5: I have 3 unique categories (C,D,F)
We can see that ID1 and ID2 have at least 3 categories in common (B,C,D), ID1 and ID3 have (A,B,C), and ID2 and ID5 have 3 in common (C,D,F). So 4 IDs get "Yes" and only ID4 gets "No".
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
"A","B","D","D","C","C","F")
df <- data.frame(ID,var1)
ID var1
1 1 A
2 1 B
3 1 C
4 1 A
5 1 D
6 2 D
7 2 C
8 2 D
9 2 B
10 2 F
11 3 A
12 3 B
13 3 C
14 3 C
15 4 A
16 4 B
17 5 D
18 5 D
19 5 C
20 5 C
21 5 F
My desired output would be:
ID var1 var2 var3
1 1 A Yes 1-2
2 1 B Yes 1-2
3 1 C Yes 1-2
4 1 A Yes 1-2
5 1 D Yes 1-2
6 2 D Yes 1-2
7 2 C Yes 1-2
8 2 D Yes 1-2
9 2 B Yes 1-2
10 2 F Yes 1-2
11 3 A Yes 1-3
12 3 B Yes 1-3
13 3 C Yes 1-3
14 3 C Yes 1-3
15 4 A No 4
16 4 B No 4
17 5 D Yes 2-5
18 5 D Yes 2-5
19 5 C Yes 2-5
20 5 C Yes 2-5
21 5 F Yes 2-5
Thanks in advance.
Answer 0 (score: 2)
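The problem is essentially one of building an adjacency matrix based on shared membership; see, e.g., Working with Bipartite/Affiliation Network Data in R. To do this, we create a table from the data (after removing duplicates) and then take its cross product.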
dd <- unique(df)
tab <- table(dd)
dd <- crossprod(t(tab))
diag(dd) <- 0
# ID
# ID 1 2 3 4 5
# 1 0 3 3 2 2
# 2 3 0 2 1 3
# 3 3 2 0 2 1
# 4 2 1 2 0 0
# 5 2 3 1 0 0
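As a cross-check on the table above, the same counts can be obtained by intersecting the unique categories of each pair of IDs directly. This is only a sketch: the names cats, shared and adj are introduced here for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question.
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
          "A","B","D","D","C","C","F")
df <- data.frame(ID, var1, stringsAsFactors = FALSE)

# Unique categories per ID, as a named list of character vectors.
cats <- lapply(split(df$var1, df$ID), unique)

# Hypothetical helper: number of categories two IDs have in common.
shared <- function(a, b) length(intersect(cats[[a]], cats[[b]]))

# Pairwise counts for all IDs; the diagonal is zeroed to ignore self-matches.
ids <- names(cats)
adj <- outer(ids, ids, Vectorize(shared))
dimnames(adj) <- list(ID = ids, ID = ids)
diag(adj) <- 0
adj
```

The cross product does the same thing in one step: each row of the ID-by-category table is a 0/1 vector, so the dot product of two rows counts the categories both IDs contain.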
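The table above shows how many categories each pair of IDs shares. Now we only need to go through the rows; for each row, I pick the first ID whose value is at least 3 (matched).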
matched <- apply(dd >= 3, MARGIN = 1, function(x) which(x == TRUE)[1])
# 1 2 3 4 5
# 2 1 1 NA 2
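Note that which() returns column positions, which coincide with the IDs here only because the IDs are 1 to 5. The step also keeps just the first partner, so the fact that ID 1 matches ID 3 as well as ID 2 is dropped. A minimal sketch of listing every partner by name follows; all_matches is a hypothetical name, and the matrix is typed in from the table above so the snippet runs on its own.

```r
# Shared-category counts copied from the table above (diagonal already zero).
ids <- as.character(1:5)
dd <- matrix(c(0, 3, 3, 2, 2,
               3, 0, 2, 1, 3,
               3, 2, 0, 2, 1,
               2, 1, 2, 0, 0,
               2, 3, 1, 0, 0),
             nrow = 5, byrow = TRUE,
             dimnames = list(ID = ids, ID = ids))

# First partner with at least 3 shared categories, or NA if there is none.
matched <- apply(dd >= 3, MARGIN = 1, function(x) which(x)[1])

# Every partner with at least 3 shared categories, looked up by column name.
all_matches <- lapply(ids, function(i) colnames(dd)[dd[i, ] >= 3])
names(all_matches) <- ids
all_matches
```

The rest of this answer uses the first-match vector, since that is what the desired output in the question asks for.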
So "1" matches "2", "2" matches "1", "3" matches "1", "4" has no match, and "5" matches "2". Finish by manipulating this output to get the desired final product:
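# Build one small data frame per ID: "No" plus the ID itself when there is
# no match, otherwise "Yes" plus the sorted pair of IDs joined by "-".
out <- apply(cbind(as.numeric(names(matched)), matched), MAR = 1, function(x) {
  if (any(is.na(x))) {
    data.frame(var2 = "No", var3 = x[1])
  } else {
    data.frame(var2 = "Yes", var3 = paste(sort(x), collapse = "-"))
  }
})
# Stack the per-ID pieces and attach them to the original rows.
out <- plyr::ldply(out, .id = "ID")
merge(df, out, all.x = TRUE)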
# ID var1 var2 var3
# 1 1 A Yes 1-2
# 2 1 B Yes 1-2
# 3 1 C Yes 1-2
# 4 1 A Yes 1-2
# 5 1 D Yes 1-2
# 6 2 D Yes 1-2
# 7 2 C Yes 1-2
# 8 2 D Yes 1-2
# 9 2 B Yes 1-2
# 10 2 F Yes 1-2
# 11 3 A Yes 1-3
# 12 3 B Yes 1-3
# 13 3 C Yes 1-3
# 14 3 C Yes 1-3
# 15 4 A No 4
# 16 4 B No 4
# 17 5 D Yes 2-5
# 18 5 D Yes 2-5
# 19 5 C Yes 2-5
# 20 5 C Yes 2-5
# 21 5 F Yes 2-5
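For readers who would rather avoid the plyr dependency, the whole pipeline can be sketched in base R. This is an illustrative variant and not the original answer's code: partner, out and result are names chosen here, partners are looked up by name instead of by position, and the var3 label assumes the IDs are numeric.

```r
# Example data from the question.
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,5,5,5,5,5)
var1 <- c("A","B","C","A","D","D","C","D","B","F","A","B","C","C",
          "A","B","D","D","C","C","F")
df <- data.frame(ID, var1, stringsAsFactors = FALSE)

# ID-by-ID matrix of shared category counts, with self-matches zeroed.
tab <- table(unique(df))
dd <- crossprod(t(tab))
diag(dd) <- 0

# First partner sharing at least 3 categories, or NA when there is none.
ids <- rownames(dd)
partner <- unname(vapply(ids, function(i) {
  hit <- ids[dd[i, ] >= 3]
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1)))

# One row per ID: var2 flags a match, var3 names the pair (or the lone ID).
# Assumption: IDs are numeric, so the pair can be ordered with pmin/pmax.
a <- as.numeric(ids)
b <- as.numeric(partner)
out <- data.frame(
  ID = a,
  var2 = ifelse(is.na(partner), "No", "Yes"),
  var3 = ifelse(is.na(partner), ids, paste(pmin(a, b), pmax(a, b), sep = "-")),
  stringsAsFactors = FALSE
)

# Attach the per-ID labels to every original row.
result <- merge(df, out, by = "ID", all.x = TRUE)
result
```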