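Getting unique and overlapping counts of values in an R data frame

Time: 2014-06-16 08:49:53

Tags: r loops dataframe unique

I have a large data frame whose records contain IDs that repeat across multiple groups, like this: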

ID   GROUP               
--   ------
1    GROUPA                      
1    GROUPB                      
3    GROUPA                      
3    GROUPC                      
3    GROUPC                      
2    GROUPB                      

How can I get the count of IDs unique to each group, as well as the number of IDs that overlap between groups? Something like:

#         Unique To Group    Overlap with others     Uniques not in group
------    ---------------    -------------------     -------------------
GROUPA          1                    2                     1
GROUPB          1                    1                     2
GROUPC          1                    1                     2

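So the overlap is determined by ID:

  • If an ID occurs in only one GROUP, it is unique to that group
  • If an ID is repeated in other groups, it is an overlap
  • If an ID does not exist in the group but exists in other groups, it is a unique not in the group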
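
The three rules above map directly onto set operations. Here is a minimal sketch for a single group, assuming the data frame is called `df` with columns `ID` and `GROUP` as in the sample; the helper name `count_for_group` is hypothetical and only for illustration:

```r
# Sample data from the question (column names ID and GROUP as shown there)
df <- data.frame(
  ID    = c(1, 1, 3, 3, 3, 2),
  GROUP = c("GROUPA", "GROUPB", "GROUPA", "GROUPC", "GROUPC", "GROUPB")
)

# Hypothetical helper: classify the distinct IDs relative to one group
count_for_group <- function(df, g) {
  in_group  <- unique(df$ID[df$GROUP == g])  # distinct IDs inside the group
  in_others <- unique(df$ID[df$GROUP != g])  # distinct IDs in all other groups
  c(
    unique_to_group = length(setdiff(in_group, in_others)),   # rule 1
    overlap         = length(intersect(in_group, in_others)), # rule 2
    not_in_group    = length(setdiff(in_others, in_group))    # rule 3
  )
}

res_b <- count_for_group(df, "GROUPB")
print(res_b)
```

Calling the helper once per value of `unique(df$GROUP)` would produce one row of the desired table per group.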

At the moment I am thinking of doing this in a loop:

GROUPA <- df[df$GROUP == "GROUPA", ]
GroupACount <- 0
for (id in df$ID) {
  if (is.element(id, GROUPA$ID)) {
    GroupACount <- GroupACount + 1
  }
}
# etc.

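I don't know how to do the overlap. Is there a better way, perhaps with apply and %in%?

Thanks in advance

2 answers:

Answer 0 (score: 1)

In the spirit of offering a quick and dirty answer that may still be useful to you, here is an imperfect solution using a for() loop. Perhaps someone else can improve on it by vectorizing.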

# slight expansion of your sample data
d <- data.frame(id = c(1, 1, 3, 3, 3, 2, 4, 2, 3),
                group = c("A", "B", "A", "C", "C", "B", "D", "E", "E"))

# create empty storage data frame
myDF <- data.frame()

# loop over each distinct group
for (i in unique(d$group)) {
    # get IDs of this group, and IDs of all other groups
    groupIDs <- unique(d$id[d$group == i])
    otherIDs <- unique(d$id[d$group != i])

    # number of IDs found only in this group
    test1 <- groupIDs %in% otherIDs
    uniques_in_group <- sum(!test1)

    # number of IDs overlapping with other groups
    overlaps <- sum(test1)

    # number of IDs that occur only outside this group
    test2 <- otherIDs %in% groupIDs
    uniques_not_in_group <- sum(!test2)

    # append this group's row to the result
    myDF_i <- data.frame(group = i, uniques_in_group, overlaps, uniques_not_in_group)
    myDF <- rbind(myDF, myDF_i)
}

myDF

#   group uniques_in_group overlaps uniques_not_in_group
# 1     A                0        2                    2
# 2     B                0        2                    2
# 3     C                0        1                    3
# 4     D                1        0                    3
# 5     E                0        2                    2
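
The loop can be vectorized, as the answer suggests. The following is one possible sketch rather than the answerer's own method: it builds an id-by-group membership matrix with `table()` and derives all three counts from row and column sums. The names `member`, `groups_per_id` and `res` are introduced here for illustration only.

```r
# Same expanded sample data as in the answer above
d <- data.frame(
  id    = c(1, 1, 3, 3, 3, 2, 4, 2, 3),
  group = c("A", "B", "A", "C", "C", "B", "D", "E", "E")
)

# Membership matrix: rows are ids, columns are groups, TRUE if the id occurs there
member <- unclass(table(d$id, d$group)) > 0

# Number of distinct groups each id appears in (one value per row of member)
groups_per_id <- rowSums(member)

# groups_per_id is recycled down each column, so it lines up with the id rows
res <- data.frame(
  group                = colnames(member),
  uniques_in_group     = unname(colSums(member & groups_per_id == 1)),
  overlaps             = unname(colSums(member & groups_per_id > 1)),
  uniques_not_in_group = unname(colSums(!member))
)
print(res)
```

Because every id belongs to at least one group, the ids missing from a group's column are exactly the ids that occur only in other groups, which is why `colSums(!member)` gives the last column.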

Answer 1 (score: 1)

Here is a dplyr solution using the sample data provided by @jogal.

library(dplyr)

# ids: total number of distinct ids; n: row index
# countInOthers: number of rows in other groups that share this row's id
d %>% mutate(ids = length(unique(id)),
             n = 1:n(),
             countInOthers = sapply(n, function(currentn) {
               sum(id[group != group[currentn]] == id[n == currentn])
             })) %>%
  group_by(group) %>%
  summarize(UniqueInGroup = length(unique(id[countInOthers == 0])),
            OverlapWithOthers = length(unique(id[countInOthers > 0])),
            UniquesNotInGroup = ids[1] - UniqueInGroup - OverlapWithOthers)

#  group  UniqueInGroup OverlapWithOthers UniquesNotInGroup
#1     A             0                 2                 2
#2     B             0                 2                 2
#3     C             0                 1                 3
#4     D             1                 0                 3
#5     E             0                 2                 2

For the sample data in the question, the result is:

#   group UniqueInGroup OverlapWithOthers UniquesNotInGroup
#1 GROUPA             0                 2                 1
#2 GROUPB             1                 1                 1
#3 GROUPC             0                 1                 2