Count the number of unique character elements in a column by several different (sub)groupings (columns)

Asked: 2018-04-13 13:30:59

Tags: r group-by dplyr

Here is an example data set.

test_data <- structure(list(ID = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("P39190", 
"U93491", "X28348", "Z93930"), class = "factor"), Sex = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), Group = structure(c(2L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("C83Z", "CAP_1", "P000"), class = "factor"), 
    Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L), .Label = c("A", 
    "B", "C"), class = "factor")), .Names = c("ID", "Sex", "Group", 
"Category"), class = "data.frame", row.names = c(NA, -36L))

head(test_data, n = 10)
       ID Sex Group Category
1  Z93930   M CAP_1        A
2  Z93930   M CAP_1        A
3  Z93930   M  C83Z        A
4  Z93930   M  C83Z        A
5  Z93930   M  C83Z        A
6  Z93930   M  C83Z        A
7  X28348   F  C83Z        B
8  X28348   F  C83Z        B
9  X28348   F CAP_1        B
10 X28348   F CAP_1        B

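I want to count the number of unique elements (distinct ID values) at three levels:

  1. The number of unique elements in each "Category"
  2. The number of unique elements in each "Category", grouped by "Group"
  3. The number of unique elements in each "Group", grouped by "Sex"

I can of course achieve this with base R and a bit of dplyr: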

    library(dplyr)

    # Outer loop: one pass per Category, reporting its number of distinct IDs
    categories <- unique(test_data$Category)
    for (i in seq_along(categories)) {
        temp <- test_data %>% dplyr::filter(Category == categories[i])
        message(as.character(categories[i]), ": ", length(unique(temp$ID)))

        # Inner loop: distinct IDs per Group within the category, then split by Sex
        groups <- unique(temp$Group)
        for (k in seq_along(groups)) {
            temp_grp <- temp %>% dplyr::filter(Group == groups[k])
            message("\n ├──", as.character(groups[k]), ": ", length(unique(temp_grp$ID)))
            message("\n\t", "F: ", length(unique(temp_grp$ID[temp_grp$Sex == "F"])))
            message("\n\t", "M: ", length(unique(temp_grp$ID[temp_grp$Sex == "M"])))
        }
    }
    

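But this is messy and not very clever.

Is there a function in R that does this in a cleaner, more efficient way, ideally producing its output as a data frame?

My impression is that dplyr::group_by is made for this kind of task, but I cannot work out how to apply it to subgroups.

The following code: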

    test_data %>% dplyr::group_by(Category) %>% summarise(n = n_distinct(ID))
    

accomplishes the first task (point 1 above). But I could not handle points 2 and 3 in the same way.

SOLUTION:

    test_data %>% dplyr::group_by(Category, Group, Sex) %>% summarise(n = n_distinct(ID))
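One caveat about grouping by all three columns at once: it gives only the finest level, and the coarser counts cannot be recovered by adding those numbers up, because an ID that occurs in several subgroups would be counted more than once. A minimal sketch of that effect follows; the data frame toy and its values are made up for illustration (they are not the question's data), and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical stand-in data: ID "a" occurs in two groups of category "A"
toy <- data.frame(
  ID       = c("a", "a", "b", "b", "c"),
  Sex      = c("M", "M", "F", "F", "F"),
  Group    = c("g1", "g2", "g1", "g1", "g2"),
  Category = c("A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Finest level: one row per Category/Group/Sex combination
by_all <- toy %>%
  group_by(Category, Group, Sex) %>%
  summarise(n = n_distinct(ID), .groups = "drop")

# Coarsest level, computed directly from the raw rows
by_category <- toy %>%
  group_by(Category) %>%
  summarise(n = n_distinct(ID), .groups = "drop")

# Summing the fine-grained counts counts repeated IDs once per subgroup
summed <- by_all %>%
  group_by(Category) %>%
  summarise(n_summed = sum(n), .groups = "drop")

# Side-by-side comparison of the direct and the summed counts
print(left_join(by_category, summed, by = "Category"))
```

For that reason it seems safer to run one group_by/summarise call per level on the raw data rather than to derive levels 1 and 2 from the three-column result.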

2 Answers:

Answer 0 (score: 2)

If I understand your question, you were not far off at all. The idea is simply to group by two columns at once: group_by(col1, col2)

For point 2:

test_data %>% dplyr::group_by(Category, Group) %>% summarise(n = n_distinct(ID))

Source: local data frame [9 x 3]
Groups: Category [?]
  Category  Group     n
    <fctr> <fctr> <int>
1        A   C83Z     1
2        A  CAP_1     1
3        A   P000     2
4        B   C83Z     1
5        B  CAP_1     1
6        B   P000     1
7        C   C83Z     1
8        C  CAP_1     1
9        C   P000     2
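The "Groups: Category [?]" line in this printout indicates that the result is still grouped by Category, since summarise() removes only the last grouping variable. If a plain data frame is wanted, the remaining grouping can be dropped explicitly. A small sketch, using a made-up data frame toy rather than the question's data:

```r
library(dplyr)

# Hypothetical stand-in data, not the question's data set
toy <- data.frame(
  ID       = c("a", "a", "b", "b", "c"),
  Group    = c("g1", "g2", "g1", "g1", "g2"),
  Category = c("A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(Category, Group) %>%
  summarise(n = n_distinct(ID))

group_vars(res)        # the result is still grouped by Category

flat <- ungroup(res)   # drop the remaining grouping
group_vars(flat)       # no grouping variables left

as.data.frame(flat)    # base data frame, if a tibble is not wanted
```

Recent dplyr versions (1.0.0 and later) also accept summarise(..., .groups = "drop"), which should have the same effect in a single step.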

For point 3:

test_data %>% dplyr::group_by(Group, Sex) %>% summarise(n = n_distinct(ID))
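No output is shown for point 3, so here is a self-contained sketch that can be run to inspect it. It rebuilds the question's 36 rows in compact form with rep(); that rebuild uses character columns instead of factors and is my own shorthand, not the asker's code. The .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Compact rebuild of the question's 36 rows (character columns, not factors)
test_data <- data.frame(
  ID       = rep(c("Z93930", "X28348", "Z93930", "X28348", "P39190", "U93491"),
                 times = c(6, 6, 7, 4, 11, 2)),
  Sex      = rep(c("M", "F", "M", "F", "M"), times = c(6, 6, 7, 15, 2)),
  Group    = rep(c("CAP_1", "C83Z", "CAP_1", "C83Z", "P000",
                   "CAP_1", "C83Z", "CAP_1", "P000"),
                 times = c(2, 6, 4, 4, 4, 7, 2, 4, 3)),
  Category = rep(c("A", "B", "A", "B", "C", "A", "C"),
                 times = c(6, 6, 7, 4, 11, 1, 1)),
  stringsAsFactors = FALSE
)

# Distinct IDs per Sex within each Group, returned as an ungrouped table
by_group_sex <- test_data %>%
  group_by(Group, Sex) %>%
  summarise(n = n_distinct(ID), .groups = "drop")

print(as.data.frame(by_group_sex))
```

The result has one row per Group/Sex combination, which is the data-frame form the question asks for.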

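Answer 1 (score: 1)

If I understand correctly, you can use dplyr::count for all three cases. Note that count() tallies rows per group, so the result equals the number of unique IDs only when each ID occurs once within a group.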

test_data %>% dplyr::count(Category)
test_data %>% dplyr::count(Group, Category)
test_data %>% dplyr::count(Sex, Group)
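Since count() counts rows, one way to get unique IDs with it is to reduce the data to distinct group/ID combinations first. The sketch below contrasts the two on a made-up data frame toy (not the question's data); combining distinct() with count() is my suggestion, not part of the answer above.

```r
library(dplyr)

# Hypothetical stand-in data: ID "a" appears three times in category "A"
toy <- data.frame(
  ID       = c("a", "a", "a", "b", "c"),
  Category = c("A", "A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# count() alone tallies rows per Category
rows_per_cat <- toy %>% count(Category)

# Keeping one row per Category/ID pair first makes count() tally unique IDs
ids_per_cat <- toy %>%
  distinct(Category, ID) %>%
  count(Category)

print(rows_per_cat)
print(ids_per_cat)
```

The same pattern should extend to the other two cases by listing the extra grouping column in both calls, for example distinct(Group, Category, ID) followed by count(Group, Category).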