Here is an example data set.
test_data <- structure(list(ID = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("P39190",
"U93491", "X28348", "Z93930"), class = "factor"), Sex = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), Group = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("C83Z", "CAP_1", "P000"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("ID", "Sex", "Group",
"Category"), class = "data.frame", row.names = c(NA, -36L))
head(test_data, n = 10)
       ID Sex Group Category
1  Z93930   M CAP_1        A
2  Z93930   M CAP_1        A
3  Z93930   M  C83Z        A
4  Z93930   M  C83Z        A
5  Z93930   M  C83Z        A
6  Z93930   M  C83Z        A
7  X28348   F  C83Z        B
8  X28348   F  C83Z        B
9  X28348   F CAP_1        B
10 X28348   F CAP_1        B
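I want to count the number of unique IDs at three nested levels: (1) per Category, (2) per Group within each Category, and (3) per Sex within each Group of each Category.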
I can of course do this with base R and a bit of dplyr:
library(dplyr)
# Level 1: unique IDs per Category
for (i in seq_along(unique(test_data$Category))) {
  temp <- test_data %>% dplyr::filter(Category == unique(test_data$Category)[i])
  message(paste0(unique(test_data$Category)[i], ": ", length(unique(temp$ID))))
  # Level 2: unique IDs per Group within this Category
  for (k in seq_along(unique(temp$Group))) {
    temp_grp <- temp %>% dplyr::filter(Group == unique(temp$Group)[k])
    message(paste0("\n ├──", unique(temp$Group)[k],
                   ": ", length(unique(temp_grp$ID))))
    # Level 3: unique IDs per Sex within this Group
    message(paste0("\n\t", "F: ", length(unique(temp_grp$ID[temp_grp$Sex == "F"]))))
    message(paste0("\n\t", "M: ", length(unique(temp_grp$ID[temp_grp$Sex == "M"]))))
  }
}
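But this is messy and not very clever.
Is there a function in R that does this in a cleaner, more efficient way, preferably returning the output as a data frame?
My impression is that dplyr::group_by is made for tasks like this, but I can't figure out how to apply it to sub-groups.
The following code: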
test_data %>% dplyr::group_by(Category) %>% summarise(n = n_distinct(ID))
accomplishes the first task (point 1 above), but I can't get points 2 and 3 to work the same way.
SOLUTION:
test_data %>% dplyr::group_by(Category, Group, Sex) %>% summarise(n = n_distinct(ID))
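As a minimal sketch of what this call returns, the block below rebuilds the example data compactly with rep() and runs the three-level grouping. The name demo_data is hypothetical, and the columns are plain character vectors rather than factors, an assumption made for brevity. The .groups = "drop" argument assumes dplyr 1.0.0 or later and simply returns an ungrouped data frame.

```r
library(dplyr)

# Hypothetical compact stand-in for test_data: the same 36 rows,
# with character columns instead of factors.
ids <- rep(c("Z93930", "X28348", "Z93930", "X28348", "P39190", "U93491"),
           times = c(6, 6, 7, 4, 11, 2))
demo_data <- data.frame(
  ID = ids,
  Sex = ifelse(ids %in% c("Z93930", "U93491"), "M", "F"),
  Group = rep(c("CAP_1", "C83Z", "CAP_1", "C83Z", "P000",
                "CAP_1", "C83Z", "CAP_1", "P000"),
              times = c(2, 6, 4, 4, 4, 7, 2, 4, 3)),
  Category = rep(c("A", "B", "A", "B", "C", "A", "C"),
                 times = c(6, 6, 7, 4, 11, 1, 1)),
  stringsAsFactors = FALSE
)

# One row per (Category, Group, Sex) combination present in the data,
# with n holding the number of distinct IDs in that combination.
by_sex <- demo_data %>%
  group_by(Category, Group, Sex) %>%
  summarise(n = n_distinct(ID), .groups = "drop")

print(by_sex)
```

Only combinations that actually occur in the data appear in the result, so a Group with no female IDs has no "F" row rather than a row with n = 0.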
Answer 0 (score: 2)
If I understand your question correctly, you weren't far off at all. The idea is simply to group by two columns at once: group_by(col1, col2).
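To illustrate grouping by two columns at once, here is a small sketch on a made-up data frame. The names toy, col1, col2 and id are placeholders and not part of the question's data.

```r
library(dplyr)

# Made-up data: id 1 appears twice within the (a, x) combination.
toy <- data.frame(
  col1 = c("a", "a", "a", "b", "b"),
  col2 = c("x", "x", "y", "x", "x"),
  id   = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# One output row per (col1, col2) combination that occurs in the data.
res <- toy %>%
  group_by(col1, col2) %>%
  summarise(n = n_distinct(id))

print(res)
```

Repeated ids within a combination are counted once, so (a, x) should yield n = 1 while (b, x) yields n = 2.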
For point 2:
test_data %>% dplyr::group_by(Category, Group) %>% summarise(n = n_distinct(ID))
Source: local data frame [9 x 3]
Groups: Category [?]
Category Group n
<fctr> <fctr> <int>
1 A C83Z 1
2 A CAP_1 1
3 A P000 2
4 B C83Z 1
5 B CAP_1 1
6 B P000 1
7 C C83Z 1
8 C CAP_1 1
9 C P000 2
For point 3:
test_data %>% dplyr::group_by(Group, Sex) %>% summarise(n = n_distinct(ID))
Answer 1 (score: 1)
If I understand correctly, you can use dplyr::count for all three cases:
test_data %>% dplyr::count(Category)
test_data %>% dplyr::count(Group, Category)
test_data %>% dplyr::count(Sex, Group)
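One thing worth checking with this approach: count() tallies rows per group, which differs from the number of distinct IDs whenever an ID repeats within a group. The sketch below, on a made-up data frame (toy is hypothetical), compares the two and shows one possible way to get unique-ID counts by calling distinct() before count().

```r
library(dplyr)

# Made-up data: "id1" appears twice in Category "A".
toy <- data.frame(
  Category = c("A", "A", "A", "B"),
  ID       = c("id1", "id1", "id2", "id3"),
  stringsAsFactors = FALSE
)

# Rows per Category (repeated IDs are counted each time they appear).
row_counts <- toy %>% count(Category)

# Distinct IDs per Category: drop duplicate (Category, ID) pairs first.
id_counts <- toy %>% distinct(Category, ID) %>% count(Category)

print(row_counts)
print(id_counts)
```

The same pattern extends to more grouping columns by listing them in both distinct() and count().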