Count each level of one factor, then group by another factor

Time: 2019-03-29 21:18:58

Tags: r dplyr

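I want a data frame output with counts for 2 of the 4 levels ("Yes" and "No"). I can do this by subsetting and filtering on Yes or No, but I think there must be a better way using dplyr.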


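The following is what I assume I have to do, but I don't know how to make the spread function work for this particular variable. I don't mind if all 4 levels are included; I can drop a few columns afterwards.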

null.ta <- dbdata %>%
filter(MutGroup == "Null") %>%
group_by(ICD_Grouping) %>%
summarise(n()) %>%
spread(???????)

The data looks like this (dput output):

structure(list(ICD_Grouping = structure(c(50L, 50L, 33L, 33L, 
50L, 50L, 50L, 18L, 21L, 33L, 18L, 18L, 50L, 50L, 50L, 17L, 17L, 
17L, 17L, 17L, 17L, 50L, 50L, 50L, 50L, 18L, 18L, 16L, 50L, 50L, 
50L, 16L, 17L, 50L, 50L, 50L, 16L, 16L, 30L, 50L, 50L, 16L, 18L, 
17L, 50L, 50L, 50L, 50L, 50L, 50L, 21L, 30L, 21L, 18L, 21L, 21L, 
13L, 30L, 50L, 50L, 50L, 50L, 13L, 34L, 33L, 18L, 16L, 16L, 16L, 
16L, 18L, 10L, 34L, 37L, 34L, 34L, 18L, 33L, 33L, 18L, 18L, 37L, 
50L, 30L, 30L, 50L, 50L, 50L, 50L, 50L, 50L, 34L, 34L, 33L, 17L, 
14L, 19L, 33L, 18L, 18L, 18L, 50L, 30L, 30L, 30L, 34L, 18L, 18L, 
18L, 18L, 30L, 30L, 17L, 17L, 33L), .Label = c("", "C01-2", "C03-6", 
"C09-10", "C11", "C15", "C16", "C18-20", "C21", "C22", "C25", 
"C30-31", "C33-34", "C37-39", "C40-41", "C43", "C44", "C45", 
"C47/49", "C48", "C50", "C51", "C53", "C54-55", "C56", "C57-58", 
"C60", "C61", "C62", "C64", "C65-66/68", "C67", "C69", "C70", 
"C71", "C72", "C73", "C74-75", "C76.0", "C76.2", "C76.3", "C80", 
"C81", "C82-86", "C90.0", "C91.0", "C94.3/95", "D04", "D05", 
"D22", "D31", "D33", "D35"), class = "factor"), Immunohistochemistry = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 3L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 2L, 2L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 
2L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 2L, 4L, 2L, 4L, 4L, 4L, 4L, 3L, 
3L, 4L), .Label = c("", "N/A", "No", "Yes"), class = "factor")), row.names = c(NA, 
-115L), class = "data.frame")

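This is an example with random data, not this data. The output I want is something like a data frame containing the count of each factor level of Immunohistochemistry by ICD_Grouping.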

1 answer:

Answer 0: (score: 0)

If I understand correctly, we can do this with base R's table:

table(dbdata)

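table reports a result for every factor level, even levels that no longer occur in the data, so to keep the table a reasonable size we first use droplevels to remove the unused levels: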

table(droplevels(dbdata))

            Immunohistochemistry
ICD_Grouping N/A No Yes
      C22      0  0   1
      C33-34   0  0   2
      C37-39   1  0   0
      C43      0  2   7
      C44      1  2   8
      C45      2  0  17
      C47/49   1  0   0
      C50      0  1   4
      C64      0  0  10
      C69      7  0   2
      C70      1  0   6
      C73      0  1   1
      D22      8  0  30
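
The effect of droplevels is easier to see on a small factor. The following is a minimal sketch; the vector f and its values are made up for illustration and only reuse the level names from the question's Immunohistochemistry column.

```r
# Toy factor (hypothetical values): only "No" and "Yes" occur,
# but the factor still declares all four levels.
f <- factor(c("No", "Yes", "Yes"), levels = c("", "N/A", "No", "Yes"))

all_levels  <- table(f)              # one cell per declared level, unused ones are 0
used_levels <- table(droplevels(f))  # only levels that actually occur

names(all_levels)   # all four levels, including "" and "N/A"
names(used_levels)  # just "No" and "Yes"
```

Applied to a whole data frame, droplevels does the same for every factor column, which is why the table above has 13 rows instead of one row per declared ICD_Grouping level.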

The table can be converted to a data.frame with the same structure using:

table(droplevels(dbdata)) %>%
    as.data.frame.matrix() %>%
    tibble::rownames_to_column('ICD_Grouping')

Or, if you prefer a single pipeline from start to finish:

dbdata %>%
    droplevels() %>%
    table() %>%
    as.data.frame.matrix() %>%
    tibble::rownames_to_column('ICD_Grouping')

Both give the same data.frame:

   ICD_Grouping N/A No Yes
1           C22   0  0   1
2        C33-34   0  0   2
3        C37-39   1  0   0
4           C43   0  2   7
5           C44   1  2   8
6           C45   2  0  17
7        C47/49   1  0   0
8           C50   0  1   4
9           C64   0  0  10
10          C69   7  0   2
11          C70   1  0   6
12          C73   0  1   1
13          D22   8  0  30

In this form it is a proper data frame that can be used in any downstream processing, or joined to other data by the ICD_Grouping variable.
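
Since the question asks for a dplyr approach, the same counts can probably also be produced with count() followed by tidyr's spread(). This is a sketch under assumptions: the data frame toy and all its values are hypothetical, and the MutGroup column is taken from the question's code rather than from the posted dput output.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for dbdata; the values are made up for illustration.
toy <- data.frame(
  MutGroup = c("Null", "Null", "Null", "Null", "WT"),
  ICD_Grouping = factor(c("C43", "C43", "C44", "C44", "C43")),
  Immunohistochemistry = factor(c("Yes", "No", "Yes", "Yes", "Yes"),
                                levels = c("", "N/A", "No", "Yes"))
)

null_counts <- toy %>%
  filter(MutGroup == "Null") %>%                    # keep one mutation group
  count(ICD_Grouping, Immunohistochemistry) %>%     # long format: one row per observed pair
  spread(Immunohistochemistry, n, fill = 0)         # one column per level, 0 where missing

null_counts
```

Here count() creates the column n, and spread() uses Immunohistochemistry as the key and n as the value, so each observed level becomes its own column. Newer tidyr versions offer pivot_wider() for the same reshaping. Columns that are not needed can be dropped afterwards with select().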