I have a large data frame whose records have duplicate IDs across multiple groups, like this:
ID GROUP
-- ------
1 GROUPA
1 GROUPB
3 GROUPA
3 GROUPC
3 GROUPC
2 GROUPB
How can I get a count of the records unique to each group, as well as the number of IDs that overlap between groups? Something like:
# Unique To Group Overlap with others Uniques not in group
------ --------------- ------------------- -------------------
GROUPA 1 2 1
GROUPB 1 1 2
GROUPC 1 1 2
So the overlap is determined by ID.
Currently I am thinking of doing this in a loop:
GROUPA = df[which(df$Group == 'A'), ]
for (id in df$id) {
if is.element(id, GROUPA):
GroupACount <- GroupACount+1
etc
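I'm not sure how to do the overlap. Is there a better way, perhaps with apply and %in%?
Thanks in advance
Answer 0: (score: 1)
In the spirit of offering a quick-and-dirty answer that may still be useful to you, here is an imperfect solution using a for() loop. Perhaps someone else can improve on it by vectorizing.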
# slight expansion of your sample data
d <- data.frame(id = c(1, 1, 3, 3, 3, 2, 4, 2, 3),
                group = c("A", "B", "A", "C", "C", "B", "D", "E", "E"))
# create empty storage data frame
myDF <- data.frame()
# loop over the groups
for (i in unique(d$group)) {
  # get IDs of this group, and IDs of all other groups
  groupIDs <- unique(d$id[d$group == i])
  otherIDs <- unique(d$id[d$group != i])
  # number of IDs found only in this group
  test1 <- groupIDs %in% otherIDs
  uniques_in_group <- sum(!test1)
  # number of IDs overlapping with other groups
  overlaps <- sum(test1)
  # number of unique IDs not in group
  test2 <- otherIDs %in% groupIDs
  uniques_not_in_group <- sum(!test2)
  # build data frame, one row per group
  myDF_i <- data.frame(group = i, uniques_in_group, overlaps, uniques_not_in_group)
  myDF <- rbind(myDF, myDF_i)
}
myDF
# group uniques_in_group overlaps uniques_not_in_group
# 1 A 0 2 2
# 2 B 0 2 2
# 3 C 0 1 3
# 4 D 1 0 3
# 5 E 0 2 2
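As for the vectorizing mentioned above, here is one possible sketch in base R. It is not the answerer's code: the helper names `pairs`, `n_groups`, `shared` and `res` are made up for illustration. It assumes an ID "overlaps" whenever it appears in more than one group. The idea is to reduce the data to distinct (id, group) pairs, count how many groups each ID belongs to with ave(), and then sum per group with tapply().

```r
# same expanded sample data as in the loop version
d <- data.frame(id = c(1, 1, 3, 3, 3, 2, 4, 2, 3),
                group = c("A", "B", "A", "C", "C", "B", "D", "E", "E"))

# keep one row per distinct (id, group) combination
pairs <- unique(d[, c("id", "group")])

# for every pair, the number of distinct groups its id appears in
n_groups <- ave(seq_len(nrow(pairs)), pairs$id, FUN = length)

# an id is shared if it shows up in more than one group
shared <- n_groups > 1

# per-group counts of shared and non-shared ids
overlaps <- tapply(shared, pairs$group, sum)
uniques  <- tapply(!shared, pairs$group, sum)

res <- data.frame(group = names(overlaps),
                  uniques_in_group = as.vector(uniques),
                  overlaps = as.vector(overlaps))

# ids that do not occur in the group at all
res$uniques_not_in_group <- length(unique(d$id)) -
  res$uniques_in_group - res$overlaps
res
```

On this sample data it should give the same counts as the loop, with the groups listed in sorted order.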
Answer 1: (score: 1)
Here is a dplyr solution using the sample data provided by @jogal.
library(dplyr)
d %>%
  # ids: total number of distinct IDs; row: row index;
  # countInOthers: how many rows in other groups carry this row's id
  mutate(ids = length(unique(id)),
         row = row_number(),
         countInOthers = sapply(row, function(r) sum(id[group != group[r]] == id[r]))) %>%
  group_by(group) %>%
  summarize(UniqueInGroup = length(unique(id[countInOthers == 0])),
            OverlapWithOthers = length(unique(id[countInOthers > 0])),
            UniquesNotInGroup = ids[1] - UniqueInGroup - OverlapWithOthers)
# group UniqueInGroup OverlapWithOthers UniquesNotInGroup
#1 A 0 2 2
#2 B 0 2 2
#3 C 0 1 3
#4 D 1 0 3
#5 E 0 2 2
For the sample data in the question, the result is as follows:
# group UniqueInGroup OverlapWithOthers UniquesNotInGroup
#1 GROUPA 0 2 1
#2 GROUPB 1 1 1
#3 GROUPC 0 1 2
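The figures for the question's data can also be cross-checked without dplyr. The following is only a sketch, along the lines of the "apply and %in%" idea raised in the question. It assumes the question's table is rebuilt by hand as a data frame called `qdf` with lower-case columns `id` and `group`. Those names, along with the helper `count_one`, are hypothetical and not from the original posts.

```r
# the question's sample data, typed in by hand (names are assumptions)
qdf <- data.frame(
  id    = c(1, 1, 3, 3, 3, 2),
  group = c("GROUPA", "GROUPB", "GROUPA", "GROUPC", "GROUPC", "GROUPB")
)

# one vector of distinct IDs per group
ids_by_group <- lapply(split(qdf$id, qdf$group), unique)
all_ids <- unique(qdf$id)

# counts for a single group name g
count_one <- function(g) {
  own    <- ids_by_group[[g]]
  others <- unique(unlist(ids_by_group[names(ids_by_group) != g]))
  shared <- own %in% others
  c(UniqueInGroup     = sum(!shared),
    OverlapWithOthers = sum(shared),
    UniquesNotInGroup = length(setdiff(all_ids, own)))
}

# sapply() returns one column per group; transpose to one row per group
res_q <- t(sapply(names(ids_by_group), count_one))
res_q
```

The result is a small matrix with group names as row names, so a single figure can be read with, for example, `res_q["GROUPB", "UniqueInGroup"]`.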