My data is shown below. I want to split it into several parts based on the IDs.
df1<- structure(list(Ids1 = 1:7, string1 = structure(c(3L, 2L, 4L,
1L, 1L, 1L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", "hdydg", "hishsgd,gugddf",
"ydis"), class = "factor"), Ids2 = c(1L, 3L, 4L, 9L, 10L, NA,
NA), string2 = structure(c(4L, 6L, 2L, 3L, 5L, 1L, 1L), .Label = c("",
"gdyijq,udyhfs", "gqdtr", "hishsgd,gugddf", "nlrshf", "ydis"), class = "factor")), .Names = c("Ids1",
"string1", "Ids2", "string2"), class = "data.frame", row.names = c(NA,
-7L))
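First, I want to build df.1, keeping only the rows whose IDs appear in both Ids1 and Ids2, and counting how many of the comma-separated items in string1 also appear in string2.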
Ids1 string1 ids2 string2 Similar
1 hishsgd,gugddf 1 hishsgd,gugddf 2
3 ydis 3 ydis 1
4 gdyijq,udyhfs,gqdtr 4 gdyijq,udyhfs 2
I tried this:
df.1 <- df1[which(df1$Ids1 == df1$Ids2), ]
but it gives me only the first row and nothing else.
Next, I want the IDs that exist only in Ids1 and not in Ids2:
Ids1 string1
2 hdydg
5 gdyijq,udyhfs,gqdtr
6 gdyijq,udyhfs,gqdtr
7 gdyijq,udyhfs,gqdtr
I tried this, but it does not work either:
df.2<- df1[which(df1$Ids1 != df1$Ids2), ]
Finally, I want to keep the IDs that exist only in Ids2 and not in Ids1:
Ids1 string1
9 gqdtr
10 nlrshf
I tried this, but it does not work either:
df.3<- df1[which(df1$Ids2 != df1$Ids1), ]
Answer 0 (score: 1)
Here is one solution I could come up with, based on joins from the dplyr package:
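library(dplyr)
# df.1: IDs present in both Ids1 and Ids2, with the matching string1 and string2
df.1 <- inner_join(select(df1, Ids1, string1), select(df1, Ids2, string2), by = c('Ids1' = 'Ids2'))
# Similar: number of comma-separated items of string1 that also occur in string2
df.1$Similar <- apply(df.1[, -1], 1, function(x) sum(unlist(strsplit(x[1], ',')) %in% unlist(strsplit(x[2], ','))))
# df.2: IDs found only in Ids1
df.2 <- anti_join(select(df1, Ids1, string1), select(df1, Ids2, string2), by = c('Ids1' = 'Ids2'))
# df.3: IDs found only in Ids2; the rows where Ids2 is NA are dropped afterwards
df.3 <- anti_join(select(df1, Ids2, string2), select(df1, Ids1, string1), by = c('Ids2' = 'Ids1'))
df.3 <- df.3[complete.cases(df.3), ]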
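The Similar count can also be looked at in isolation. The following is only a sketch, assuming the strings are plain comma-separated tokens with no surrounding whitespace; `count_shared` is a hypothetical helper name used for illustration and is not part of dplyr. The sample strings are the three matched pairs from the question.

```r
# Hypothetical helper: for each pair (a[i], b[i]), count how many
# comma-separated tokens of a[i] also occur in b[i].
count_shared <- function(a, b) {
  # as.character() guards against factor input such as the columns of df1
  tokens_a <- strsplit(as.character(a), ",", fixed = TRUE)
  tokens_b <- strsplit(as.character(b), ",", fixed = TRUE)
  # Compare the token vectors pairwise, one pair per row
  mapply(function(x, y) sum(x %in% y), tokens_a, tokens_b)
}

s1 <- c("hishsgd,gugddf", "ydis", "gdyijq,udyhfs,gqdtr")
s2 <- c("hishsgd,gugddf", "ydis", "gdyijq,udyhfs")
similar <- count_shared(s1, s2)
print(similar)
```

Under those assumptions, `df.1$Similar <- count_shared(df.1$string1, df.1$string2)` should give the same column as the `apply()` call, without converting the data frame to a character matrix first.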
You can also build df.2 and df.3 a bit differently, as follows:
df.2 <- df1[!df1$Ids1 %in% df1$Ids2, c('Ids1', 'string1')]
df.3 <- df1[!df1$Ids2 %in% df1$Ids1, c('Ids2', 'string2')]
df.3 <- df.3[complete.cases(df.3), ]
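As background on why the `==` attempt in the question returns only one row: `==` compares the two columns position by position, whereas `%in%` tests whether each value occurs anywhere in the other column. A minimal sketch using just the ID values from the question (the variable names here are illustrative):

```r
ids1 <- 1:7
ids2 <- c(1L, 3L, 4L, 9L, 10L, NA, NA)

# Element-wise: row i of ids1 is compared only with row i of ids2,
# so only positions where both columns hold the same value match.
same_row <- which(ids1 == ids2)

# Membership: each ID is looked up anywhere in the other vector.
in_both   <- ids1[ids1 %in% ids2]
only_ids1 <- ids1[!ids1 %in% ids2]
# NA is never found in ids1, so it has to be excluded explicitly.
only_ids2 <- ids2[!ids2 %in% ids1 & !is.na(ids2)]

print(same_row)
print(in_both)
print(only_ids1)
print(only_ids2)
```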